michellab / Sire

Sire Molecular Simulations Framework
http://siremol.org
GNU General Public License v3.0
95 stars · 26 forks

CI failures #380

Closed · lohedges closed this issue 2 years ago

lohedges commented 2 years ago

I'm seeing fairly consistent build failures that exit with the following error:

Run conda-build -c conda-forge -c michellab Sire/recipes/sire
No numpy version specified in conda_build_config.yaml.  Falling back to default numpy value of 1.16
WARNING:conda_build.metadata:No numpy version specified in conda_build_config.yaml.  Falling back to default numpy value of 1.16
INFO:conda_build.variants:Adding in variants from internal_defaults
Adding in variants from internal_defaults
INFO:conda_build.variants:Adding in variants from /home/runner/work/Sire/Sire/Sire/recipes/sire/conda_build_config.yaml
Adding in variants from /home/runner/work/Sire/Sire/Sire/recipes/sire/conda_build_config.yaml
INFO:conda_build.metadata:Attempting to finalize metadata for sire
Attempting to finalize metadata for sire
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): conda.anaconda.org:443
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /conda-forge/linux-64/repodata.json HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /conda-forge/noarch/repodata.json HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /michellab/linux-64/repodata.json HTTP/1.1" 200 2162
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /michellab/noarch/repodata.json HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): repo.anaconda.com:443
DEBUG:urllib3.connectionpool:https://repo.anaconda.com:443 "GET /pkgs/main/linux-64/repodata.json HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://repo.anaconda.com:443 "GET /pkgs/main/noarch/repodata.json HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://repo.anaconda.com:443 "GET /pkgs/r/linux-64/repodata.json HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://repo.anaconda.com:443 "GET /pkgs/r/noarch/repodata.json HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://repo.anaconda.com:443 "GET /pkgs/main/channeldata.json HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://repo.anaconda.com:443 "GET /pkgs/r/channeldata.json HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /michellab/channeldata.json HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /conda-forge/channeldata.json HTTP/1.1" 200 None
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /conda-forge/noarch/sysroot_linux-64-2.17-h4a8ded7_13.tar.bz2 HTTP/1.1" 200 35770729
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /conda-forge/linux-64/make-4.3-hd18ef5c_1.tar.bz2 HTTP/1.1" 200 518896
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /conda-forge/linux-64/gcc_linux-64-11.2.0-h39a9532_9.tar.bz2 HTTP/1.1" 200 25093
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /conda-forge/linux-64/gxx_linux-64-11.2.0-hacbe6df_9.tar.bz2 HTTP/1.1" 200 24791
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /conda-forge/linux-64/cmake-3.23.0-h5432695_1.tar.bz2 HTTP/1.1" 200 17110272
DEBUG:urllib3.connectionpool:https://conda.anaconda.org:443 "GET /conda-forge/linux-64/git-2.35.1-pl5321h04cb727_0.tar.bz2 HTTP/1.1" 200 14001323
Collecting package metadata (repodata.json): ...working... done
/home/runner/work/_temp/657dfb64-de98-4bd1-9070-edb9e7a5628c.sh: line 1:  2432 Killed                  conda-build -c conda-forge -c michellab Sire/recipes/sire
Solving environment: ...working... 
Error: Process completed with exit code 137.

(The failures might be for a different OS or Python variant, but the message will be similar.)

I've searched online and it's not clear what's triggering the error. (Possibly a memory or networking issue on the VM.) At present I'm seeing the same error repeatedly when trying to re-run the only failed job for the most recent build, i.e. Linux and Python 3.7. (Normally the issue is intermittent, so a simple re-run fixes things.)

Just thought I'd report here so we have a log of it. I'll see if I can figure out what's going on and will update the workflow file if needed. (It could just be a case of waiting and trying again later.)

Cheers.

chryswoods commented 2 years ago

I saw a few of these too. I agree that it looks like an out of memory during the conda solve. It fixed itself after I waited a few hours...

lohedges commented 2 years ago

This has now failed about 10 times over the course of 24 hours or so. I'm not sure what's going on. I can't imagine that the re-runs are using the same runner. Maybe I'll have to see if a fresh commit solves the problem.

lohedges commented 2 years ago

It looks like we have a consistent failure for the Linux Python 3.7 build. See the most recent actions here. I'm still trying to figure out whether there is a simple solution to this issue.

lohedges commented 2 years ago

This issue reports a similar failure. (Similar DEBUG messages seen in the output.) In this case, the failure was the result of a silent segmentation fault that was triggered when memory ran out.

Looking at the GitHub runner docs the Linux and macOS images have 7GB and 14GB of RAM, respectively. I would assume that this would be plenty. If this was the issue, then it's weird that only the Python 3.7 Linux variant is failing. (I guess some base package within the Python 3.7 conda environment might have a memory issue, which is fixed in later variants.)

lohedges commented 2 years ago

I think the DEBUG messages are potentially misleading, since they also appear in the successful CI runs. The run that errors does so with exit code 137, which points strongly at a memory issue.
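For reference, exit code 137 decodes as 128 + 9, i.e. the process was terminated by SIGKILL, which is the signal the Linux OOM killer sends when it reaps a process. A quick local demonstration (nothing Sire-specific):

```shell
# A process killed with SIGKILL reports exit status 128 + 9 = 137,
# the same status seen when the kernel OOM killer kills conda-build.
sh -c 'kill -9 $$'
echo "exit status: $?"   # prints "exit status: 137"
```

This matches the "Killed" message in the log above, where the shell reports that the conda-build child process was killed rather than exiting on its own.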

lohedges commented 2 years ago

I've retried the build using both the Miniforge and Mambaforge installers (these can be enabled via the setup-miniconda action) and both fail with the same memory error. I think the problem is that the failure is triggered by the dependency resolution during the conda-build stage, i.e. even if you specify that mamba should be prioritised, it won't be used by conda-build, only for regular conda install style commands.

Not sure what to do about this. I'll poke around the docs to see whether it's possible to tweak the runner's memory settings, or to add swap space. One option would be to drop support for Python 3.7, although I'm not sure whether any users are tied to it for other reasons, e.g. if other packages in their environment are only available for that variant. @msuruzhon: What Python variant do you use internally?
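As a sketch of the swap-space idea (untested here, and the size, path, and step placement are all assumptions rather than anything from this run), a workflow step along these lines could run before the build:

```yaml
# Hypothetical workflow step: add 4 GB of swap on the Linux runner
# before invoking conda-build. Size and path are illustrative only.
- name: Add swap space
  if: runner.os == 'Linux'
  run: |
    sudo fallocate -l 4G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    free -h
```

This trades build speed for headroom: a swapping solver is slow, but slow beats being killed at exit 137.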

msuruzhon commented 2 years ago

Hi @lohedges, we still use Python 3.7 internally. It seems this version remains quite popular across other scientific libraries.

Could mamba not be installed and used as part of the CI? I find regular conda unusable for installing larger packages, so I am not surprised there is a memory issue.

lohedges commented 2 years ago

Yes, mamba can be installed and used by the action. The issue is that it isn't used by the conda-build command behind the scenes, which is what is used to build Sire and create the conda package. I wonder if there's a way to "trick" it into using mamba, e.g. by symlinking the conda binary, or something. I'll have a play around locally to see if I can get something to work.

msuruzhon commented 2 years ago

Ah yes, sorry, I didn't read properly. I have used boa with conda-build to do exactly that. It's technically "experimental", but when I tested it it was seamless, so you might want to try it. It basically uses mamba as the resolver. You can find it here: https://github.com/mamba-org/boa.
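For the record, the swap-in is small. Assuming boa is installed from conda-forge (it provides the `conda mambabuild` subcommand as a drop-in for `conda-build`), the CI step would change roughly like this (a sketch, not the exact workflow file):

```yaml
# Sketch of the CI change: install boa, then replace the conda-build
# invocation with the mamba-backed "conda mambabuild" drop-in.
- name: Build Sire package
  run: |
    conda install -y -c conda-forge boa
    conda mambabuild -c conda-forge -c michellab Sire/recipes/sire
```

The recipe itself needs no changes; only the build command differs.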

lohedges commented 2 years ago

Ah, yes, I forgot about boa, will try that now. Cheers!

lohedges commented 2 years ago

Great, using boa and conda mambabuild has got past the memory error. I'll let the CI run to completion then close this assuming all is okay, i.e. that it's possible to build BioSimSpace on top of the resulting packages.

Phew!

lohedges commented 2 years ago

Will need to debug since SireUnitTests are now failing against devel, e.g. with the following:

Traceback (most recent call last):
  File "/Users/runner/miniconda3/envs/sire_build/conda-bld/sire_1651680443805/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/runner/miniconda3/envs/sire_build/conda-bld/sire_1651680443805/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/share/Sire/test/SireUnitTests/unittests/SireMol/test_names_numbers.py", line 10, in test_names_numbers
    assert mol.nResidues() == len(mol.residues().names())
UnboundLocalError: local variable 'mol' referenced before assignment

@chryswoods: I think this is due to additions for the feat_web branch. (I also needed to add pytest to the test requirements section of our conda recipe.) Are any other changes to the tests imminent? Although your test has a check for whether an older version of Sire is being used, it still appears to fail. (Perhaps this is an issue with running the tests using sire_test, rather than pytest as you are presumably doing locally.) I also thought that tests for a feature should be placed in a matching feature branch on the SireUnitTests repo, or is that no longer the case? (I know that you mentioned moving them into the main Sire repo, but what should we do until then?)

Cheers.

chryswoods commented 2 years ago

Sorry about that. I thought I had masked things out correctly, but obviously not.

I have reverted the SireUnitTests repo back to its state from before I made changes for feat_web. I have copied the files I needed to another directory, and will bring those into the main repo when I issue the pull request for feat_web.

Yes, normally we should add tests to the corresponding feature branch. I was attempting the more advanced step of adding the tests in such a way that they don't run against an older version of Sire (in case someone wants to run the test suite against a version they installed themselves). That proved too complex, which is why I think we should move to keeping the tests in the main repo.

I think we should keep SireUnitTests though both for posterity, and also as a test that the code always supports the old API (run with sr.use_old_api()).

lohedges commented 2 years ago

Thanks for sorting this, I'll re-run the build now. To be honest, I'm not exactly sure why those tests failed, since the failure didn't occur for all of the new tests you added, despite the logic checking for the new API being the same in each case.

lohedges commented 2 years ago

The CI passed. I'll now test BioSimSpace, also building it with conda mambabuild for consistency.

lohedges commented 2 years ago

Closing as everything is working as expected :+1: