mir-group / nequip

NequIP is a code for building E(3)-equivariant interatomic potentials
https://www.nature.com/articles/s41467-022-29939-5
MIT License

❓ [QUESTION] Restart run #343

Closed IZugec closed 1 year ago

IZugec commented 1 year ago

Hello,

I have a really huge dataset, so large that even with multiprocessing it takes a day and a half to two days to preprocess. Due to an unexpected crash on the node, I would now like to continue training starting from the best_model.pth weights. However, I would really like to avoid processing this huge dataset again.

I tried both initial_model_state / initialize_from_state and load_model_state / load_model_state.

However, when I started the initial training run the append key was false, so now when I set it to false again the error is:

Traceback (most recent call last):
  File "/home/user/.conda/envs/nequip_stress/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/user/.conda/envs/nequip_stress/lib/python3.10/site-packages/nequip/scripts/train.py", line 65, in main
    raise RuntimeError(
RuntimeError: Training instance exists at /path_to_traning_dir; either set append to True or use a different root or runname

However, when I start it with append set to true, I get the following error:

Traceback (most recent call last):
  File "/home/user/.conda/envs/nequip_stress/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/user/.conda/envs/nequip_stress/lib/python3.10/site-packages/nequip/scripts/train.py", line 74, in main
    trainer = restart(config)
  File "/home/user/.conda/envs/nequip_stress/lib/python3.10/site-packages/nequip/scripts/train.py", line 220, in restart
    raise ValueError(
ValueError: Key "append" is different in config and the result trainer.pth file. Please double check

I guess the question is whether there is a way to pass the already processed dataset along with the model state.

Thanks in advance for any advice,
Ivan

Linux-cpp-lisp commented 1 year ago

Hi @IZugec ,

> I tried both initial_model_state / initialize_from_state and load_model_state / load_model_state

This will be the easiest way forward, and it will load the cached processed dataset unless something goes wrong. I think there is a full discussion of how to do this at the link below. You want initialize_from_state and a new run name:

https://github.com/mir-group/nequip/discussions/235
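
For readers who want the suggestion above in concrete form, here is a minimal sketch in Python that derives a restart config from the original one. It makes several assumptions. The config is treated as a plain dict that already lists its model_builders explicitly. The builder initialize_from_state reads its weights path from the initial_model_state key, which is the pairing named in the question. Appending the builder at the end of the list is a guess, so check the linked discussion for the recommended position. The helper name make_restart_config and all example values are hypothetical. Dataset-related keys are deliberately left untouched, on the assumption that this is what allows the cached processed dataset to be picked up again.

```python
import copy


def make_restart_config(old_config, new_run_name, weights_path):
    """Return a copy of old_config that starts a new run from saved weights.

    Hypothetical helper, not part of nequip. Only run_name, model_builders,
    and initial_model_state are changed; every other key is copied as-is.
    """
    if new_run_name == old_config.get("run_name"):
        raise ValueError("the restart needs a run name different from the old one")
    if "model_builders" not in old_config:
        raise KeyError("this sketch expects model_builders to be listed explicitly")

    cfg = copy.deepcopy(old_config)  # never mutate the caller's config
    cfg["run_name"] = new_run_name

    # Assumption: the builder goes last, after the builders that construct the model.
    builders = list(cfg["model_builders"])
    if "initialize_from_state" not in builders:
        builders.append("initialize_from_state")
    cfg["model_builders"] = builders

    # Assumption: initialize_from_state reads the weights file from this key.
    cfg["initial_model_state"] = weights_path
    return cfg


if __name__ == "__main__":
    # Hypothetical values standing in for the original training config.
    old = {
        "root": "results",
        "run_name": "run1",
        "dataset_file_name": "data.xyz",
        "model_builders": ["EnergyModel", "ForceOutput"],
    }
    new = make_restart_config(old, "run1-restart", "results/run1/best_model.pth")
    for key, value in new.items():
        print(f"{key}: {value}")
```

The resulting dict can be written back out as a YAML config and passed to nequip-train. Because the run name is new, the "Training instance exists" check from the first traceback should no longer apply, and append does not need to be changed.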