torchmd / torchmd-net

Training neural network potentials
MIT License

Torchmd-net training crashed on Perlmutter GPU #148

Closed by FranklinHu1 2 years ago

FranklinHu1 commented 2 years ago

Hello,

I have been trying to train the torchmd-net model on the DOE NERSC Perlmutter system, which is focused on GPU-accelerated applications. I am running into a strange error: training crashes when it is time to test the model, with an index-out-of-range error raised from one of the data loaders.

Installation

I installed torchmd-net into my Perlmutter environment following the instructions given in the README. Before doing any training with the model, I am always sure to activate the torchmd-net environment with

% mamba activate torchmd-net

My environment is as follows:

name: torchmd-net

channels:

dependencies:

Perlmutter workflow

I attempted to train the model using the example files included in torchmd-net/examples/, specifically the ET-SPICE.yaml config file. My workflow is as follows:

The contents of my job script for submitting to Perlmutter are as follows. Normally, I would stage my files from the SCRATCH directory, but since I am trying to debug the issue, I am just staging from $HOME for now:

#!/bin/bash
#SBATCH -A m2530_g
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 04:00:00
#SBATCH -n 1
#SBATCH --ntasks-per-node=4
#SBATCH -c 32
#SBATCH --gpus-per-task=1
#SBATCH -J ET_debug_SPICE

#OpenMP settings:
#export OMP_NUM_THREADS=1
#export OMP_PLACES=threads
#export OMP_PROC_BIND=true

#run the application:
#applications may perform better with --gpu-bind=none instead of --gpu-bind=single:1
mamba activate torchmd-net
cd $HOME/torchmd_examples
python $HOME/torchmd-net/scripts/train.py --conf ET-SPICE.yaml

Using the most recent version of the torchmd-net repository

With the most recent version of the torchmd-net repository and following the above workflow, I ran into the following error:

Traceback (most recent call last):
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 170, in <module>
    main()
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 113, in main
    args = get_args()
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 65, in get_args
    parser.add_argument('--prior-model', type=str, default=None, choices=priors.__all__, help='Which prior model to use')
AttributeError: module 'torchmdnet.priors' has no attribute '__all__'
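This first failure is plain Python rather than anything Perlmutter-specific: argparse evaluates `choices=priors.__all__` eagerly, so if the installed `torchmdnet.priors` module never defines `__all__`, argument parsing dies before training even starts. A minimal sketch (the `priors` module here is a hypothetical stand-in, not the real package):

```python
import types

# Hypothetical stand-in for a torchmdnet.priors build that lacks __all__.
priors = types.ModuleType("priors")

try:
    # Mirrors parser.add_argument(..., choices=priors.__all__, ...)
    choices = priors.__all__
except AttributeError as err:
    print(err)  # module 'priors' has no attribute '__all__'
```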

To work around this, I rolled back my version of torchmd-net by 8 commits to the last verified version from Oct 21, 2022 (commit SHA 35cb19acd35407f1debd914abaeb576b24102e74). I performed the rollback by running the following command inside the torchmd-net directory:

% git reset --hard 35cb19acd35407f1debd914abaeb576b24102e74

Using the rolled back version of torchmd-net

After rolling back and repeating the workflow, I get the following error after running the model for 10 epochs:

Traceback (most recent call last):
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 170, in <module>
    main()
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 161, in main
    trainer.fit(model, data)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 241, in on_advance_end
    self._run_validation()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 299, in _run_validation
    self.val_loop.run()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 156, in advance
    self.trainer._logger_connector.update_eval_step_metrics(self._dl_batch_idx[dataloader_idx])
IndexError: list index out of range
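The IndexError itself is the generic Python failure of indexing past the end of a list; in this trace it comes from pytorch-lightning's per-dataloader batch-index bookkeeping. A minimal illustration (the variable names mirror the trace, but the setup is hypothetical):

```python
# _dl_batch_idx holds one running batch counter per dataloader.
# If it was built for fewer dataloaders than are actually iterated
# (e.g. a dataloader was never registered), indexing fails.
_dl_batch_idx = [0]   # bookkeeping for a single dataloader
dataloader_idx = 1    # but a second dataloader is being evaluated

try:
    _dl_batch_idx[dataloader_idx]
except IndexError as err:
    print(err)  # list index out of range
```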

I have been able to reproduce this error with the same workflow for the QM9, ANI-1, and MD17 examples. I have also reproduced it using custom hdf5 datasets (obeying the constraints of the HDF5 class in torchmd-net/torchmdnet/datasets/hdf.py) that I created for modeling water systems.

I did experiment with increasing the test interval; the error occurs whenever the first testing stage happens, which is at epoch 10 for all the example files in torchmd-net/examples. I also checked the splits.npz files to ensure that idx_train, idx_val, and idx_test were all non-empty (i.e., there are configurations assigned to each of the three sets).

Things I have tried

I tried a few of the workarounds suggested in the NERSC documentation on known issues for machine learning applications (https://docs.nersc.gov/machinelearning/known_issues/), but none of them resolved the error.

The fact that the code works just fine on CPU-only clusters suggests that this is not something wrong with the torchmd-net code itself, but rather with the way it interacts with the Perlmutter GPU environment.

Any help would be greatly appreciated. Thank you!

raimis commented 2 years ago

You haven't followed the installation instructions (https://github.com/torchmd/torchmd-net#installation). The environment has pytorch-lightning 1.7.7, but TorchMD-NET needs 1.6.3 (https://github.com/torchmd/torchmd-net/blob/main/environment.yml).
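A quick way to catch this kind of mismatch before a 10-epoch run is to compare the installed version against the pin in environment.yml. A minimal sketch (the version strings mirror the ones reported in this thread; the comparison itself is generic):

```python
# Compare dotted version strings numerically, not lexically.
def parse_version(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))

required = "1.6.3"   # the pin in torchmd-net's environment.yml
installed = "1.7.7"  # what the broken environment actually had

if parse_version(installed) != parse_version(required):
    print(f"pytorch-lightning {installed} installed, but {required} is required")
```

Inside the activated environment, `python -c "import pytorch_lightning; print(pytorch_lightning.__version__)"` gives the installed string to feed into such a check.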

FranklinHu1 commented 2 years ago

Yes, that was the problem. After recreating the torchmd-net environment using the provided command and making sure that my pytorch-lightning version is 1.6.3, the model runs fine on Perlmutter GPU. I have been able to run the example config files for more than 10 epochs, as well as some in-house datasets.

Thank you very much for the help!