Closed: FranklinHu1 closed this issue 2 years ago
You haven't followed the installation instructions (https://github.com/torchmd/torchmd-net#installation). The environment has pytorch-lightning
1.7.7, but TorchMD-NET needs 1.6.3 (https://github.com/torchmd/torchmd-net/blob/main/environment.yml).
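If it helps, something along these lines should rebuild the environment with the pinned version (a minimal sketch assuming a mamba-based setup and a local clone of the repository; adjust to your system):
% mamba env remove -n torchmd-net
% mamba env create -f environment.yml
% mamba activate torchmd-net
% python -c "import pytorch_lightning; print(pytorch_lightning.__version__)"   # should print 1.6.3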
Yes, that was the problem. After recreating the torchmd-net environment using the provided command and making sure that my pytorch-lightning version is 1.6.3, the model runs fine on Perlmutter GPUs. I have been able to run the example config files for more than 10 epochs, as well as some in-house datasets.
Thank you very much for the help!
Hello,
I have been trying to train the torchmd-net model on the DOE NERSC Perlmutter system, which is focused on GPU-accelerated applications. I am running into a strange error: training crashes when it is time to test the model, with an index-out-of-bounds error raised from one of the data loaders.
Installation
I installed torchmd-net into my Perlmutter environment following the instructions given in the README. Before doing any training with the model, I always make sure to activate the torchmd-net environment with
% mamba activate torchmd-net
My environment is as follows:
name: torchmd-net
channels:
dependencies:
Perlmutter workflow
I attempted to train the model using the example files included in torchmd-net/examples/, specifically the ET-SPICE.yaml config file. My workflow is as follows:
sbatch ET_job.sh
The contents of my job script for submitting to Perlmutter are as follows. Normally, I would stage my files from the SCRATCH directory, but since I am trying to debug the issue, I am just staging from $HOME for now:
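(A minimal sketch of the shape of that script; the account, queue, and environment-activation lines below are placeholders rather than my exact values, and the torchmd-train call follows the README examples:)

#!/bin/bash
#SBATCH -A <account>            # placeholder NERSC project
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH -t 04:00:00

# activate the torchmd-net environment (exact activation depends on the local mamba setup)
mamba activate torchmd-net

# staging from $HOME while debugging
cd $HOME/torchmd-net

srun torchmd-train --conf examples/ET-SPICE.yaml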
Using the most recent version of the torchmd-net repository
With the most recent version of the torchmd-net repository and following the above workflow, I ran into the following error:
To work around this, I rolled back my version of torchmd-net by 8 commits to the last verified version from Oct 21, 2022. The commit SHA is 35cb19acd35407f1debd914abaeb576b24102e74. I did the rollback by running the following command from within the torchmd-net directory:
% git reset --hard 35cb19acd35407f1debd914abaeb576b24102e74
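To double-check that the working tree ended up on that commit:
% git log -1 --oneline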
Using the rolled back version of torchmd-net
After rolling back and repeating the workflow, I get the following results once the model has run for 10 epochs:
I have been able to reproduce this error using the same workflow for the QM9, ANI-1, and MD17 examples. I have also been able to reproduce it with the above workflow using custom HDF5 datasets (obeying the constraints of the HDF5 class in torchmd-net/torchmdnet/datasets/hdf.py) that I created for modeling water systems.
I did experiment with increasing the test interval. It seems that this error occurs whenever the first testing stage happens, which is at epoch 10 for all the example files in torchmd-net/examples. I checked the splits.npz files to ensure that idx_train, idx_val, and idx_test were all non-empty (i.e., there are configurations assigned to each of the three sets).
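(Roughly, the check was a one-liner of this sort, pointed at the splits.npz file from the run:)
% python -c "import numpy as np; s = np.load('splits.npz'); print({k: len(s[k]) for k in ('idx_train', 'idx_val', 'idx_test')})"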
Things I have tried
I tried a few of the workarounds suggested on the NERSC website for known issues with machine learning applications: https://docs.nersc.gov/machinelearning/known_issues/
I have tried the following things:
The fact that the code works just fine on CPU-only clusters suggests that this is not something wrong with the torchmd-net code itself but rather with the way it interacts with the Perlmutter GPU environment.
Any help would be greatly appreciated. Thank you!