mir-group / allegro

Allegro is an open-source code for building highly scalable and accurate equivariant deep learning interatomic potentials
https://www.nature.com/articles/s41467-023-36329-y
MIT License

ValueError: Key "model_builders" is different in config and the result trainer.pth file. Please double check #111

Open gcassone-cnr opened 2 weeks ago

gcassone-cnr commented 2 weeks ago

Dear developers,

I'm trying to restart a training run that crashed by adding the following lines to the input:

However, it systematically fails with the following error:

    ValueError: Key "model_builders" is different in config and the result trainer.pth file. Please double check

Does anyone know why?

Thanks in advance and best wishes, Giuseppe

cw-tan commented 1 week ago

Dear Giuseppe,

Thank you for your interest in our code. Apologies for the delayed response; the nequip framework and the allegro model (which runs on the nequip infrastructure) are undergoing a major overhaul. We are close to the end of the revamp, and things look very different from what you see on main. The code related to your problem has been deleted in the process. If you're just starting a project, it may be better to try the new nequip infrastructure and the corresponding allegro code, both on the develop branches of the respective git repositories. The configs/tutorial.yaml on both repos should be helpful for getting started, as should the new docs at https://nequip.readthedocs.io/en/develop/guide/workflow.html (note the develop in the URL).

If you really need to get things working with the current public code and want to try to debug this issue, here are some comments. The error is thrown at https://github.com/mir-group/nequip/blob/1e150cdc8614e640116d11e085d8e5e45b21e94d/nequip/scripts/train.py#L290, which checks the original config file used for training against the config saved in best_model.pth. I assume this is perplexing because you didn't change model_builders in your restart (which is what one would expect to cause the error). A reasonable approach would be to inspect best_model.pth and figure out why it errors out at that part of the code, i.e. look at what's actually in the dicts being compared.
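If it helps, here is a minimal sketch of that debugging step. The checkpoint path, the location of the saved config inside the file, and the example values below are all assumptions for illustration; in a real session you'd first load the saved state with torch.load and inspect its keys.

```python
# Hypothetical debugging sketch -- the file path, the location of the saved
# config, and the example values are assumptions, not from a real nequip run.
# In practice you would first do:
#   import torch
#   saved = torch.load("results/run/trainer.pth", map_location="cpu")
#   print(saved.keys())   # find where the saved config lives

def diff_key(config: dict, saved_config: dict, key: str = "model_builders"):
    """Return (current_value, saved_value) if they differ, else None."""
    current, saved = config.get(key), saved_config.get(key)
    return None if current == saved else (current, saved)

# Illustrative configs only:
fresh_config = {"model_builders": ["model_from_config", "PerSpeciesRescale"]}
saved_config = {"model_builders": ["model_from_config"]}

print(diff_key(fresh_config, saved_config))
# prints the two differing values; None would mean the key matches
```

Comparing the two values side by side (rather than just seeing the ValueError) usually makes it obvious whether the mismatch is a real config change or something like a list-ordering or type difference.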

I would advise trying to migrate to the new infrastructure if possible, but it's understandable if it's more favorable to continue using the current public infrastructure if you're in the middle of a project and have various models trained in that framework. Happy to give further advice.

Chuin Wei

gcassone-cnr commented 5 days ago

Dear Wei,

thanks a lot for your reply! I have a couple of questions. I've installed the develop branches of both nequip and allegro. However, when I try to run configs/tutorial.yaml (e.g., nequip-train configs/tutorial.yaml) I systematically get the following error:


    File "/home/cassone/anaconda3/lib/python3.12/site-packages/hydra/core/override_parser/overrides_parser.py", line 96, in parse_overrides
      raise OverrideParseException(
    hydra.errors.OverrideParseException: mismatched input 'tutorial.yaml' expecting ID
    See https://hydra.cc/docs/1.2/advanced/override_grammar/basic for details


Also, if I run "pytest tests/" to run the full test suite for the nequip installation, I get this error:


    ERROR tests/integration/test_deploy.py
    ERROR tests/integration/test_train.py
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Interrupted: 2 errors during collection
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


Could you please tell me why these errors occur on the develop branches of nequip and allegro, and how to fix them? Thanks a lot in advance.

One final thing related to my previous issue: how could I inspect the best_model.pth file?

Many thanks in advance and best wishes, Giuseppe

cw-tan commented 5 days ago
  1. See the "Training" section of https://nequip.readthedocs.io/en/develop/guide/workflow.html. The command to train is

    nequip-train -cp full/path/to/config/directory -cn config_name.yaml

    with several caveats depending on where you're running it from, etc.

  2. For loading best_model.pth, you can use torch.load: https://pytorch.org/docs/stable/generated/torch.load.html
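As a concrete (hedged) starting point for point 2: the checkpoint is a pickled object, so torch.load gets you in, and a small pure-Python helper can summarize its nesting without dumping full tensors to the terminal. The torch.load call is shown only in comments since the exact checkpoint layout varies between versions.

```python
# Sketch for poking around in best_model.pth (assumes PyTorch is installed).
# In an interactive session you would start with:
#   import torch
#   ckpt = torch.load("best_model.pth", map_location="cpu")
#
# This helper prints the keys and value types of a nested dict, so you can
# locate the saved config without flooding the terminal with tensor data:

def summarize(obj, depth=0, max_depth=2):
    """Recursively print dict keys and value types down to max_depth."""
    if isinstance(obj, dict) and depth < max_depth:
        for key, value in obj.items():
            print("  " * depth + f"{key}: {type(value).__name__}")
            summarize(value, depth + 1, max_depth)

# Usage (hypothetical): summarize(ckpt)
```

Once you see which top-level key holds the saved config, you can print it directly and compare it against your restart config by eye.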

gcassone-cnr commented 5 days ago

Dear Chuin Wei,

thank you for the prompt reply. In addition to the tutorial.yaml file, are there other templates for training/testing, specifically adapted to water systems?

Thank you again and best wishes, Giuseppe

cw-tan commented 5 days ago

Hi Giuseppe,

For learning about the training infrastructure, nequip's tutorial.yaml is what you want to look at: https://github.com/mir-group/nequip/blob/develop/configs/tutorial.yaml. The model part of the config would be different for allegro, i.e. the nequip GNN model and allegro are fundamentally different models, so you'd need different hyperparameters.

This paper should have the relevant Allegro details for water systems: https://pubs.acs.org/doi/10.1021/acs.jpclett.4c00605. The SI has an allegro config for the old infrastructure; you'd have to translate it to the new infrastructure, carefully separating training hyperparameters from model architecture hyperparameters, since the old infrastructure used a flat list of configuration arguments while the new infrastructure separates them into sections.

Chuin Wei

gcassone-cnr commented 3 days ago

Dear Chuin Wei,

thanks a lot for your important suggestions. Since documentation for the development version of allegro is not yet available (is it?), could you please tell me how to use either TensorBoard or WandB with the new development branches? Is there any blog post on these new development versions?

Thanks a lot in advance and best wishes, Giuseppe

cw-tan commented 3 days ago

Hi Giuseppe,

nequip is the main package that handles all the training infrastructure. allegro is a choice of model (in contrast to nequip the GNN model, which is often confused with nequip the overall software package for training deep equivariant potentials). So everything you need would be in the nequip tutorials.

You can find the relevant line here. https://github.com/mir-group/nequip/blob/ece09b587ab1082c2c806a094fb5cc1dc5489b60/configs/tutorial.yaml#L125

We've migrated to using lightning, so what you see there are the arguments used to instantiate a lightning.Trainer object; how to configure it can be learned by studying the lightning.Trainer API (https://lightning.ai/docs/pytorch/stable/common/trainer.html#trainer-class-api). Besides wandb, the various other loggers are listed at https://lightning.ai/docs/pytorch/stable/api_references.html#loggers.
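To make that concrete, here is a hedged YAML sketch of what a logger entry could look like under Hydra-style `_target_` instantiation. The exact nesting and surrounding keys should be checked against configs/tutorial.yaml on the develop branch; the project and run names here are hypothetical.

```yaml
# Hedged sketch only -- verify the surrounding structure against
# configs/tutorial.yaml on the develop branch before using.
trainer:
  _target_: lightning.Trainer
  max_epochs: 100
  logger:
    _target_: lightning.pytorch.loggers.WandbLogger  # or TensorBoardLogger
    project: allegro-water   # hypothetical wandb project name
    name: tutorial-run       # hypothetical run name
```

Any keyword argument accepted by the chosen lightning logger class (per its API docs) can be supplied alongside `_target_` in the same way.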

One last warning -- I think I was over-optimistic in my suggestion to use the new developments and wish to backtrack now (sorry!). To put it bluntly, it's not stable enough for me to recommend migrating over for production use at this point in time. It's definitely fine if you want to test it, with the expectation that things will change in breaking ways in the coming months, such that you might have to reinstall everything, retrain all your models, etc. That said, if you want to use the new developments and run into problems, we can be reached at allegro-nequip@g.harvard.edu.