gcassone-cnr opened 2 weeks ago
Dear Giuseppe,
Thank you for your interest in our code. Apologies for the delayed response, but the `nequip` framework and the `allegro` model (which runs on the `nequip` infrastructure) are undergoing a major overhaul. We are close to the end of the revamp, and things look very different from what you see on `main`. The code related to your problem has been deleted in the revamp. If you're just starting a project, it may be better to try the new `nequip` infrastructure and the corresponding `allegro` code, both on the `develop` branches of the respective git repositories. The `configs/tutorial.yaml` on both repos should be helpful for getting started, as should the new docs at https://nequip.readthedocs.io/en/develop/guide/workflow.html (note the `develop` in the URL).
If you really need to get things working with the current public code and want to try to debug this issue, a few comments. The error comes from
https://github.com/mir-group/nequip/blob/1e150cdc8614e640116d11e085d8e5e45b21e94d/nequip/scripts/train.py#L290,
which simply compares the original config file used for training against the config saved in `best_model.pth`. I assume this is a perplexing problem because you didn't change `model_builders` in your restart (which is what one would expect to cause the error). A reasonable approach would be to inspect `best_model.pth` and figure out why that check fails, i.e. look at what's in the dicts being compared at the line of code highlighted above.
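To make that comparison concrete, here is a small sketch of how one might diff the two configs once both are loaded as plain dictionaries. The key names and values below are purely illustrative, not real `nequip` configs:

```python
def diff_configs(train_config, saved_config):
    """Return the sorted keys whose values differ between two config dicts."""
    all_keys = set(train_config) | set(saved_config)
    sentinel = object()  # distinguishes "missing" from any real value
    return sorted(
        k for k in all_keys
        if train_config.get(k, sentinel) != saved_config.get(k, sentinel)
    )

# Illustrative example with made-up entries:
original = {"model_builders": ["EnergyModel", "PerSpeciesRescale"], "r_max": 4.0}
restart = {"model_builders": ["EnergyModel"], "r_max": 4.0}
print(diff_configs(original, restart))  # → ['model_builders']
```

Printing the differing keys (and then the two values for each) usually pinpoints which entry trips the equality check.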
I would advise trying to migrate to the new infrastructure if possible, but it's understandable if it's more favorable to continue using the current public infrastructure if you're in the middle of a project and have various models trained in that framework. Happy to give further advice.
Chuin Wei
Dear Wei,
thanks a lot for your reply! I have a couple of questions. I've installed the `develop` branches of both `nequip` and `allegro`. However, when I try to run `configs/tutorial.yaml` (e.g. `nequip-train configs/tutorial.yaml`), I systematically get the following error:
File "/home/cassone/anaconda3/lib/python3.12/site-packages/hydra/core/override_parser/overrides_parser.py", line 96, in parse_overrides raise OverrideParseException( hydra.errors.OverrideParseException: mismatched input 'tutorial.yaml' expecting ID See https://hydra.cc/docs/1.2/advanced/override_grammar/basic for details
Also, if I run `pytest tests/` to test the `nequip` installation more extensively, I get this error:
ERROR tests/integration/test_deploy.py
ERROR tests/integration/test_train.py
!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!
Could you please tell me why these errors occur with the `develop` branches of `nequip` and `allegro`, and how to fix them? Thanks a lot in advance.
One final thing related to my previous issue: how can I inspect the `best_model.pth` file?
Many thanks in advance and best wishes, Giuseppe
See the "Training" section of https://nequip.readthedocs.io/en/develop/guide/workflow.html. The command to train is
`nequip-train -cp full/path/to/config/directory -cn config_name.yaml`
with several caveats depending on where you're running it from, etc.
For loading `best_model.pth`, you can use https://pytorch.org/docs/stable/generated/torch.load.html
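As a concrete sketch of that suggestion: the checkpoint can be loaded with `torch.load` and its top-level keys listed. The toy checkpoint created below is only there to make the snippet self-contained; its keys are made up, and the real contents of `best_model.pth` depend on your `nequip` version:

```python
import os
import tempfile

import torch

# In practice you would point `path` at your own best_model.pth; here we
# save a toy checkpoint first so the snippet runs on its own.
path = os.path.join(tempfile.mkdtemp(), "best_model.pth")
toy = {"config": {"model_builders": ["EnergyModel"]}, "state_dict": {}}
torch.save(toy, path)

# weights_only=False because such checkpoints typically hold plain Python
# objects (e.g. the training config), not just tensors.
checkpoint = torch.load(path, map_location="cpu", weights_only=False)

# List the top-level keys to see what is stored (config, weights, ...).
for key, value in checkpoint.items():
    print(key, type(value).__name__)
```

Once loaded, the saved config dict can be compared entry by entry against the YAML config used for the restart.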
Dear Chuin Wei,
thank you for the prompt reply. In addition to the `tutorial.yaml` file, are there other templates for training/testing, specifically adapted to water systems?
Thank you again and best wishes, Giuseppe
Hi Giuseppe,
For learning about the training infrastructure, `nequip`'s `tutorial.yaml` is what you want to look at: https://github.com/mir-group/nequip/blob/develop/configs/tutorial.yaml. The `model` part of the config would be different for `allegro`; that is, the `nequip` GNN model and `allegro` are fundamentally different models, so you'd need different hyperparameters.
This paper should have the relevant Allegro details for water systems: https://pubs.acs.org/doi/10.1021/acs.jpclett.4c00605. The SI has an `allegro` config for the old infrastructure; you'd have to translate it to the new infrastructure, carefully separating training hyperparameters from model-architecture hyperparameters, since the old infrastructure uses a flat list of configuration arguments while the new infrastructure has them separated into sections.
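The translation step can be sketched as sorting a flat, old-style config dict into separate sections. The key names and their grouping below are hypothetical, chosen only to illustrate the idea, not the real `nequip` schema:

```python
# Hypothetical grouping -- check the new tutorial.yaml for the real sections.
MODEL_KEYS = {"r_max", "l_max", "num_layers", "num_features"}
TRAINING_KEYS = {"max_epochs", "learning_rate", "batch_size"}

def split_flat_config(flat):
    """Sort a flat config dict into model/training sections."""
    sections = {"model": {}, "training": {}, "unsorted": {}}
    for key, value in flat.items():
        if key in MODEL_KEYS:
            sections["model"][key] = value
        elif key in TRAINING_KEYS:
            sections["training"][key] = value
        else:
            # Anything unrecognized must be sorted by hand.
            sections["unsorted"][key] = value
    return sections

old_style = {"r_max": 4.0, "num_layers": 2, "learning_rate": 2e-3, "seed": 123}
print(split_flat_config(old_style))
```

The point is simply that each old argument must be assigned to exactly one section of the new config, and anything you cannot place should be checked against the new docs rather than guessed.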
Chuin Wei
Dear Chuin Wei,
thanks a lot for your important suggestions. Since the documentation for the development version of `allegro` is not yet available (is it?), could you please tell me how to use either TensorBoard or WandB with the new `develop` branches? Is there any blog on these new development versions?
Thanks a lot in advance and best wishes, Giuseppe
Hi Giuseppe,
`nequip` is the main package that handles all the training infrastructure. `allegro` is a choice of model (in contrast to `nequip` the GNN model, which is often confused with `nequip` the overall software package for training deep equivariant potentials). So everything you need would be in the `nequip` tutorials.
You can find the relevant line here: https://github.com/mir-group/nequip/blob/ece09b587ab1082c2c806a094fb5cc1dc5489b60/configs/tutorial.yaml#L125
We've migrated to using `lightning`, so what you see there are the arguments used to instantiate a `lightning.Trainer` object; how to configure it can be learned by studying the `lightning.Trainer` API (https://lightning.ai/docs/pytorch/stable/common/trainer.html#trainer-class-api). Besides `wandb`, here are the various other loggers: https://lightning.ai/docs/pytorch/stable/api_references.html#loggers
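For orientation, a logger entry might look roughly like the fragment below. This is only an illustration of the Lightning-style arguments, assuming the config instantiates the `Trainer` via `_target_` paths; check `configs/tutorial.yaml` on the `develop` branch for the section names actually in use:

```yaml
# Illustrative fragment only -- not the real tutorial.yaml schema.
trainer:
  _target_: lightning.Trainer
  max_epochs: 100
  logger:
    _target_: lightning.pytorch.loggers.TensorBoardLogger
    save_dir: logs/
    name: my_run
```

Swapping `TensorBoardLogger` for `lightning.pytorch.loggers.WandbLogger` (with its own arguments, e.g. `project`) is how you would switch to WandB.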
Last bit of warning -- I think I was over-optimistic in my suggestions to use the new developments and wish to now backtrack (sorry!). To put it bluntly, it's not stable enough for me to recommend migrating over for production use at this point in time (but definitely fine if you wanna test it, with the expectation that things will change in breaking ways in the coming months, such that you might have to reinstall everything/retrain all your models, etc). That being said, if you want to use the new developments and face problems, we can be reached at allegro-nequip@g.harvard.edu.
Dear developers,
I'm trying to restart a training run that crashed by inserting the following lines in the input:
However, it systematically gives the following error:
`ValueError: Key "model_builders" is different in config and the result trainer.pth file. Please double check`
Does anyone know why?
Thanks in advance and best wishes, Giuseppe