Closed · kavanase closed this 1 week ago
@Linux-cpp-lisp one minor thing to flag (doesn't really matter now, just for whenever the next release is, I guess). The changelog says it adheres to semantic versioning, which I think would make the next version 0.7.0, as functionality has been / is being added from this and other PRs. Just didn't want to change it myself without checking, as you may have other reasons for this!
@Linux-cpp-lisp I'm not totally sure what's causing the 2 test failures here. They all pass locally. This seems to be the main issue for at least one of them:
```
self = CompletedProcess(args=['nequip-train', 'conf.yaml', '--warn-unused'], returncode=1, stdout=b'', stderr=b'/opt/hostedto...ate\n    raise RuntimeError(\nRuntimeError: Failed to build object with prefix `dataset` using builder `NpzDataset`\n')

    def check_returncode(self):
        """Raise CalledProcessError if the exit code is non-zero."""
        if self.returncode:
>           raise CalledProcessError(self.returncode, self.args, self.stdout,
                                     self.stderr)
E       subprocess.CalledProcessError: Command '['nequip-train', 'conf.yaml', '--warn-unused']' returned non-zero exit status 1.

/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/subprocess.py:460: CalledProcessError
---------------------------- Captured stderr setup -----------------------------
/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/utils/_global_options.py:95: UserWarning: Do NOT manually set PYTORCH_JIT_USE_NNC_NOT_NVFUSER=0 unless you know exactly what you're doing!
  warnings.warn(
Torch device: cpu
Using existing file aspirin_ccsd.zip
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/utils/auto_init.py", line 243, in instantiate
    instance = builder(**positional_args, **final_optional_args)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/data/_dataset/_npz_dataset.py", line 81, in __init__
    super().__init__(
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/data/_dataset/_base_datasets.py", line 152, in __init__
    super().__init__(root=root, type_mapper=type_mapper)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/data/_dataset/_base_datasets.py", line 43, in __init__
    super().__init__(root=root, transform=type_mapper)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/utils/torch_geometric/dataset.py", line 88, in __init__
    self._download()
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/utils/torch_geometric/dataset.py", line 149, in _download
    self.download()
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/data/_dataset/_base_datasets.py", line 197, in download
    extract_zip(download_path, self.raw_dir)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/utils/torch_geometric/utils.py", line 55, in extract_zip
    f.extractall(folder)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/zipfile.py", line 1654, in extractall
    self._extract_member(zipinfo, path, pwd)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/zipfile.py", line 1704, in _extract_member
    os.mkdir(targetpath)
FileExistsError: [Errno 17] File exists: '/home/runner/work/nequip/nequip/benchmark_data/__MACOSX'
```
But I'm not sure where the `__MACOSX` folder is coming from, as the tests run on Ubuntu and I don't see this folder in the test data zip file online.
@kavanase re semantic versioning, good catch on the CHANGELOG. I've been sticking to a sort of "almost" semantic versioning, where the middle digit gets incremented for backwards-compatibility-breaking releases and the final digit for other releases. We should discuss whether this should be changed.
@kavanase re the failed test, I believe this occurs due to a race condition: the tests on GitHub Actions are run in parallel on two (if I remember right) workers, and if they both try to unzip the downloaded data at the same time (instead of one going first and the other using the cache), this can occur. It should self-resolve; I think the first time it came up it just wasn't worth dealing with, since re-running the tests usually resolves it by chance.
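For context, one common pattern for avoiding this kind of parallel-unzip race is to extract into a worker-private temporary directory and then atomically rename it into place, so a half-extracted directory is never visible to other workers. This is only a sketch of the general idea, not what nequip actually does; `extract_zip_atomic` is a hypothetical helper name:

```python
import os
import shutil
import tempfile
import zipfile


def extract_zip_atomic(zip_path, target_dir):
    """Sketch: race-tolerant zip extraction for parallel workers.

    Each worker extracts into its own temp dir, then renames it to the
    shared target. os.rename of a directory is atomic on a single POSIX
    filesystem, so exactly one worker "wins"; the others discard their copy.
    """
    if os.path.isdir(target_dir):
        return  # another worker already extracted it

    parent = os.path.dirname(os.path.abspath(target_dir))
    tmp = tempfile.mkdtemp(dir=parent)  # private scratch dir on same FS
    try:
        with zipfile.ZipFile(zip_path) as f:
            f.extractall(tmp)
        try:
            os.rename(tmp, target_dir)  # atomic publish
        except OSError:
            # Lost the race: target already exists, keep the winner's copy.
            shutil.rmtree(tmp)
    except BaseException:
        shutil.rmtree(tmp, ignore_errors=True)
        raise
```

A second call (or a second worker) simply sees the existing `target_dir` and returns, so the `FileExistsError` from two workers calling `extractall` on the same folder cannot occur.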
@Linux-cpp-lisp ok cool! For the failed test, I was thinking it was something like that.
For the semantic versioning, that makes sense, though I think it differs a bit from the standard semver format (where compatibility-breaking changes are meant for MAJOR versions):

> Given a version number MAJOR.MINOR.PATCH, increment the: MAJOR version when you make incompatible API changes; MINOR version when you add functionality in a backward compatible manner; PATCH version when you make backward compatible bug fixes.

Tbh it's probably a fairly minor point and I don't think one has to stick to a given format, but I guess it's best to either update the "this project adheres to Semantic Versioning (2.0.0)" statement in the CHANGELOG if using a different format, or swap to using its format?
@Linux-cpp-lisp I think all the remaining issues with this have now been addressed
Description & Motivation and Context

This PR implements some minor updates to the code:

- `n_train`/`n_val` can now be set as percentage strings. The behaviour of this is shown in the `_parse_n_train_n_val` function; in short, the floored `int` corresponding to the chosen percentage is taken, and if the percentage coverage sums to 100% but the flooring results in one frame being omitted, that `n_train`/`n_val` is increased by 1. E.g. for the test case of 8 frames in the full dataset, `n_train` = 70% = 5.6 frames and `n_val` = 30% = 2.4 frames, so the final `n_train` is 6 and `n_val` is 2. Tests added for this, and an example shown in the `full.yaml` config. Also tested on HPCs.
- Pre-commit fix (`flake8` no longer on GitLab) and formatting.
- In `train.py`, only attempt a restart when `trainer.pth` is present, not if the results folder exists (e.g. if training crashed during the first epoch, during data loading (due to memory...), hitting walltime etc. before the model was saved) – had a few different runs crash because of this.
- Avoids an unnecessary and verbose `e3nn` JIT warning in `nequip-train` outputs.

Types of changes

Checklist:

- … `black`.
- … (`docs/options`) has been updated with new or changed options. – I don't see this?
- … `CHANGELOG.md`.
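The percentage-handling behaviour described above can be sketched roughly as follows. This is a minimal standalone illustration of the floor-then-top-up logic, not the actual `_parse_n_train_n_val` implementation in nequip; the function signature and the tie-breaking choice (giving the extra frame to the split that lost more to flooring) are assumptions:

```python
import math


def parse_n_train_n_val(n_train, n_val, n_total):
    """Illustrative sketch: resolve n_train/n_val given as counts or
    percentage strings (e.g. "70%") against a dataset of n_total frames."""

    def as_frames(value):
        # "70%" -> 0.7 * n_total; plain numbers pass through unchanged.
        if isinstance(value, str) and value.endswith("%"):
            return float(value[:-1]) / 100.0 * n_total
        return float(value)

    raw_train, raw_val = as_frames(n_train), as_frames(n_val)
    n_train_i, n_val_i = math.floor(raw_train), math.floor(raw_val)

    # If the percentages cover exactly 100% but flooring dropped one
    # frame, add it back so no frame is left unused.
    if (
        math.isclose(raw_train + raw_val, n_total)
        and n_train_i + n_val_i == n_total - 1
    ):
        if raw_train - n_train_i >= raw_val - n_val_i:
            n_train_i += 1
        else:
            n_val_i += 1
    return n_train_i, n_val_i


# The example from the PR: 8 frames, 70%/30% -> 5.6/2.4 -> floored 5/2,
# coverage is 100% so the dropped frame goes back to n_train -> (6, 2).
```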