Closed · kavanase closed this 1 week ago
@Linux-cpp-lisp one minor thing to flag (doesn't really matter now, just for whenever the next release is, I guess). The changelog says it adheres to semantic versioning, which I think would make the next version 0.7.0, as functionality has been / is being added from this and other PRs. Just didn't want to change it myself without checking, as you may have other reasons for this!
@Linux-cpp-lisp I'm not totally sure what's causing the 2 test failures here. They all pass locally. This seems to be the main issue for at least one of them:
```
self = CompletedProcess(args=['nequip-train', 'conf.yaml', '--warn-unused'], returncode=1, stdout=b'', stderr=b'/opt/hostedto...ate\n    raise RuntimeError(\nRuntimeError: Failed to build object with prefix `dataset` using builder `NpzDataset`\n')

    def check_returncode(self):
        """Raise CalledProcessError if the exit code is non-zero."""
        if self.returncode:
>           raise CalledProcessError(self.returncode, self.args, self.stdout,
                                     self.stderr)
E       subprocess.CalledProcessError: Command '['nequip-train', 'conf.yaml', '--warn-unused']' returned non-zero exit status 1.

/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/subprocess.py:460: CalledProcessError
---------------------------- Captured stderr setup -----------------------------
/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/utils/_global_options.py:95: UserWarning: Do NOT manually set PYTORCH_JIT_USE_NNC_NOT_NVFUSER=0 unless you know exactly what you're doing!
  warnings.warn(
Torch device: cpu
Using existing file aspirin_ccsd.zip
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/utils/auto_init.py", line 243, in instantiate
    instance = builder(**positional_args, **final_optional_args)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/data/_dataset/_npz_dataset.py", line 81, in __init__
    super().__init__(
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/data/_dataset/_base_datasets.py", line 152, in __init__
    super().__init__(root=root, type_mapper=type_mapper)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/data/_dataset/_base_datasets.py", line 43, in __init__
    super().__init__(root=root, transform=type_mapper)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/utils/torch_geometric/dataset.py", line 88, in __init__
    self._download()
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/utils/torch_geometric/dataset.py", line 149, in _download
    self.download()
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/data/_dataset/_base_datasets.py", line 197, in download
    extract_zip(download_path, self.raw_dir)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nequip/utils/torch_geometric/utils.py", line 55, in extract_zip
    f.extractall(folder)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/zipfile.py", line 1654, in extractall
    self._extract_member(zipinfo, path, pwd)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/zipfile.py", line 1704, in _extract_member
    os.mkdir(targetpath)
FileExistsError: [Errno 17] File exists: '/home/runner/work/nequip/nequip/benchmark_data/__MACOSX'
```
But I'm not sure where the `__MACOSX` folder is coming from, as the tests run on Ubuntu and I don't see this folder in the test data zip file online.
@kavanase re semantic versioning, good catch on the CHANGELOG. I've been sticking to a sort of "almost" semantic versioning, where the middle digit gets incremented for backwards-compatibility-breaking releases and the final digit for other releases. We should discuss whether this should be changed.
@kavanase re the failed test, I believe this occurs due to a race condition: the tests on GitHub Actions are run in parallel on two (if I remember right) workers, and if they both try to unzip the downloaded data at the same time (instead of one going first and the other using the cache), this can occur. It should self-resolve; I think the first time it came up it just wasn't worth dealing with, since re-running the tests usually resolves it by chance.
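For context, one common pattern for avoiding this kind of parallel-unzip race is to extract into a worker-private temporary directory and then atomically rename it into place, so a half-extracted directory is never visible to other workers. This is only a sketch of the general idea, not what nequip actually does; `extract_zip_atomic` is a hypothetical helper name:

```python
import os
import shutil
import tempfile
import zipfile


def extract_zip_atomic(zip_path, target_dir):
    """Sketch: race-tolerant zip extraction for parallel workers.

    Each worker extracts into its own temp dir, then renames it to the
    shared target. os.rename of a directory is atomic on a single POSIX
    filesystem, so exactly one worker "wins"; the others discard their copy.
    """
    if os.path.isdir(target_dir):
        return  # another worker already extracted it

    parent = os.path.dirname(os.path.abspath(target_dir))
    tmp = tempfile.mkdtemp(dir=parent)  # private scratch dir on same FS
    try:
        with zipfile.ZipFile(zip_path) as f:
            f.extractall(tmp)
        try:
            os.rename(tmp, target_dir)  # atomic publish
        except OSError:
            # Lost the race: target already exists, keep the winner's copy.
            shutil.rmtree(tmp)
    except BaseException:
        shutil.rmtree(tmp, ignore_errors=True)
        raise
```

A second call (or a second worker) simply sees the existing `target_dir` and returns, so the `FileExistsError` from two workers calling `extractall` on the same folder cannot occur.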
@Linux-cpp-lisp ok cool! For the failed test, I was thinking it was something like that.
For the semantic versioning, that makes sense, though I think it differs a bit from the standard semver format (where compatibility-breaking changes are meant for MAJOR versions):

> Given a version number MAJOR.MINOR.PATCH, increment the: MAJOR version when you make incompatible API changes; MINOR version when you add functionality in a backward compatible manner; PATCH version when you make backward compatible bug fixes.

Tbh it's probably a fairly minor point and I don't think one has to stick to a given format, but I guess it's best to either update the "this project adheres to Semantic Versioning (2.0.0)" statement in the CHANGELOG if using a different format, or swap to using its format?
@Linux-cpp-lisp I think all the remaining issues with this have now been addressed
Description & Motivation and Context

This PR implements some minor updates to the code:

- `n_train`/`n_val` can now be set as percentage strings. The behaviour of this is shown in the `_parse_n_train_n_val` function; in short, the floored `int` corresponding to the chosen percentage is taken, and if the percentage coverage sums to 100% but the flooring results in one frame being omitted, that `n_train`/`n_val` is increased by 1. E.g. for the test case of 8 frames in the full dataset, `n_train` = 70% = 5.6 frames and `n_val` = 30% = 2.4 frames, so the final `n_train` is 6 and `n_val` is 2. Tests added for this, and an example shown in the `full.yaml` config. Also tested on HPCs.
- Pre-commit fix (`flake8` no longer on GitLab) and formatting.
- In `train.py`, only attempt a restart when `trainer.pth` is present, not if the results folder exists (e.g. if training crashed during the first epoch, during data loading (due to memory...), hitting walltime etc. before the model was saved) – had a few different runs crash because of this.
- Avoids an unnecessary and verbose `e3nn` JIT warning in `nequip-train` outputs.

Types of changes

Checklist:

- … `black`.
- … (`docs/options`) has been updated with new or changed options. – I don't see this?
- … `CHANGELOG.md`.
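The percentage-handling behaviour described above can be sketched roughly as follows. This is a minimal standalone illustration of the floor-then-top-up logic, not the actual `_parse_n_train_n_val` implementation in nequip; the function signature and the tie-breaking choice (giving the extra frame to the split that lost more to flooring) are assumptions:

```python
import math


def parse_n_train_n_val(n_train, n_val, n_total):
    """Illustrative sketch: resolve n_train/n_val given as counts or
    percentage strings (e.g. "70%") against a dataset of n_total frames."""

    def as_frames(value):
        # "70%" -> 0.7 * n_total; plain numbers pass through unchanged.
        if isinstance(value, str) and value.endswith("%"):
            return float(value[:-1]) / 100.0 * n_total
        return float(value)

    raw_train, raw_val = as_frames(n_train), as_frames(n_val)
    n_train_i, n_val_i = math.floor(raw_train), math.floor(raw_val)

    # If the percentages cover exactly 100% but flooring dropped one
    # frame, add it back so no frame is left unused.
    if (
        math.isclose(raw_train + raw_val, n_total)
        and n_train_i + n_val_i == n_total - 1
    ):
        if raw_train - n_train_i >= raw_val - n_val_i:
            n_train_i += 1
        else:
            n_val_i += 1
    return n_train_i, n_val_i


# The example from the PR: 8 frames, 70%/30% -> 5.6/2.4 -> floored 5/2,
# coverage is 100% so the dropped frame goes back to n_train -> (6, 2).
```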