Open FranklinHu1 opened 7 months ago
Thanks for the thorough issue! https://github.com/torchmd/torchmd-net/commit/6694816860598ad787a74215c23af509330783f6 should not have broken old checkpoints. Could you provide one so I can investigate? For your TorchScript issue, try changing the 0 here: https://github.com/torchmd/torchmd-net/blob/6694816860598ad787a74215c23af509330783f6/torchmdnet/priors/zbl.py#L53-L55 to 0.0. I have seen similar things before with TorchScript.
Thanks as always for the quick response @RaulPPelaez!
Making that change in the ZBL prior did indeed fix the issue and I was able to generate the torchscript module and run some dynamics using it. I will continue testing that to see if I stumble on any other bugs.
As for the old checkpoint loading problem, I have attached a checkpoint file and the yaml file used to run the experiment to this issue as a zip file. This model was trained without ZBL on one A100 GPU. I get the following error if I try to load the model:
>>> from torchmdnet.models.model import load_model
>>> model = load_model("epoch=999-val_loss=0.0000-test_loss=0.0010.ckpt", derivative=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/global/cfs/cdirs/m4026/torchmd-net/torchmdnet/models/model.py", line 243, in load_model
    model.load_state_dict(state_dict)
  File "/global/cfs/cdirs/m4026/torchmd-net/.conda/envs/torchmd-net/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TorchMD_Net:
        Missing key(s) in state_dict: "output_model.output_network.layers.0.weight", "output_model.output_network.layers.0.bias", "output_model.output_network.layers.2.weight", "output_model.output_network.layers.2.bias".
        Unexpected key(s) in state_dict: "output_model.output_network.0.weight", "output_model.output_network.0.bias", "output_model.output_network.2.weight", "output_model.output_network.2.bias".
Because the most recent commit (6694816860598ad787a74215c23af509330783f6) changed some of the state dictionary keys, I suspect the error is related to it.
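For anyone hitting the same mismatch, here is a minimal sketch of one possible workaround: renaming the old-style keys so they match the names the error message expects. It relies only on the key names shown in the traceback above. The helper name `remap_output_network_keys` is hypothetical and not part of torchmd-net, and the sketch assumes that only the `output_model.output_network` keys changed and that the tensors themselves are compatible.

```python
import re

# Hypothetical helper, not part of torchmd-net.
# Matches old-style keys such as "output_model.output_network.0.weight".
_OLD_KEY = re.compile(r"^(output_model\.output_network)\.(\d+\.(?:weight|bias))$")


def remap_output_network_keys(state_dict):
    """Return a copy of state_dict with old-style MLP keys renamed.

    "output_model.output_network.0.weight" becomes
    "output_model.output_network.layers.0.weight"; all other keys,
    including keys that already use the new layout, are left unchanged.
    """
    return {
        _OLD_KEY.sub(r"\1.layers.\2", key): value
        for key, value in state_dict.items()
    }


if __name__ == "__main__":
    # Placeholder values stand in for tensors in this illustration.
    old = {
        "output_model.output_network.0.weight": "w0",
        "output_model.output_network.2.bias": "b2",
    }
    for key in remap_output_network_keys(old):
        print(key)
```

With a real checkpoint, the same function could in principle be applied to the loaded state dictionary before calling `load_state_dict`. Where that dictionary sits inside the checkpoint file is an assumption that should be checked against the file itself.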
Thanks for looking into it!
@FranklinHu1, I am able to load your model using #318.
Hello,
I am running into a problem using TorchScript to integrate a trained TensorNet model with OpenMM for dynamics. This is with the newest version of the code as of writing (hash 6694816860598ad787a74215c23af509330783f6).
The system
I am running this code on NERSC Perlmutter, which uses A100 GPUs (either 40GB or 80GB). My anaconda environment is as follows. I set this environment up following the documentation available at https://torchmd-net.readthedocs.io/en/latest/installation.html using the install from source instructions:
Setup
I trained a tensornet model with the ZBL prior using the following configuration file. I included the ZBL prior since I am working with systems containing ions. Training was done on a single A100 GPU, and was restarted from the latest checkpoint after 1000 epochs.
I then used the following script to generate the force module. Since I am using periodic boundary conditions, I use the version ForceModulePBC:
The error
I ran the following command using this script:
It seems the code to generate this torchscript module fails on the call to torch.jit.script() with the following error:
Other info
Because of the MLP change introduced in commit 6694816860598ad787a74215c23af509330783f6, I cannot load old models, since the state dictionary keys no longer match. However, I did try downgrading my copy of the repository to commit 74702dad9431dc4ea71a3f40deb59b6da9c537b0, and all of the code above worked (albeit with an older trained model).
As always, thank you so much for your time, and any help would be greatly appreciated!