torchmd / torchmd-net

Training neural network potentials
MIT License

Torchmd-net training crashed on Perlmutter GPU #148

Closed by FranklinHu1 2 years ago

FranklinHu1 commented 2 years ago

Hello,

I have been trying to train the torchmd-net model on the DOE NERSC Perlmutter system, which is focused on GPU-accelerated applications. I am running into a strange error: training crashes when it is time to test the model, with an index-out-of-range error raised from one of the data loaders.

Installation

I installed torchmd-net into my Perlmutter environment following the instructions given in the README. Before doing any training with the model, I am always sure to activate the torchmd-net environment with

% mamba activate torchmd-net

My environment is as follows:

name: torchmd-net

channels:

dependencies:

Perlmutter workflow

I attempted to train the model using the example files included in torchmd-net/examples/, specifically the ET-SPICE.yaml config file. My workflow is as follows:

The contents of my job script for submitting to Perlmutter are as follows. Normally, I would stage my files from the SCRATCH directory, but since I am trying to debug the issue, I am just staging from $HOME for now:

#!/bin/bash
#SBATCH -A m2530_g
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 04:00:00
#SBATCH -n 1
#SBATCH --ntasks-per-node=4
#SBATCH -c 32
#SBATCH --gpus-per-task=1
#SBATCH -J ET_debug_SPICE

#OpenMP settings:
#export OMP_NUM_THREADS=1
#export OMP_PLACES=threads
#export OMP_PROC_BIND=true

#run the application:
#applications may perform better with --gpu-bind=none instead of --gpu-bind=single:1
mamba activate torchmd-net
cd $HOME/torchmd_examples
python $HOME/torchmd-net/scripts/train.py --conf ET-SPICE.yaml

Using the most recent version of the torchmd-net repository

With the most recent version of the torchmd-net repository and following the above workflow, I ran into the following error:

Traceback (most recent call last):
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 170, in <module>
    main()
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 113, in main
    args = get_args()
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 65, in get_args
    parser.add_argument('--prior-model', type=str, default=None, choices=priors.__all__, help='Which prior model to use')
AttributeError: module 'torchmdnet.priors' has no attribute '__all__'
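This first failure is plain Python rather than anything Perlmutter-specific: argparse evaluates `choices=priors.__all__` eagerly, so if the installed `torchmdnet.priors` module never defines `__all__`, argument parsing dies before training even starts. A minimal sketch (the `priors` module here is a hypothetical stand-in, not the real package):

```python
import types

# Hypothetical stand-in for a torchmdnet.priors build that lacks __all__.
priors = types.ModuleType("priors")

try:
    # Mirrors parser.add_argument(..., choices=priors.__all__, ...)
    choices = priors.__all__
except AttributeError as err:
    print(err)  # module 'priors' has no attribute '__all__'
```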

To work around this, I rolled back my version of torchmd-net by 8 commits to the last verified version from Oct 21, 2022 (commit SHA 35cb19acd35407f1debd914abaeb576b24102e74). I performed the rollback by running the following command inside the torchmd-net directory:

% git reset --hard 35cb19acd35407f1debd914abaeb576b24102e74

Using the rolled back version of torchmd-net

After rolling back and repeating the workflow, I get the following error after running the model for 10 epochs:

Traceback (most recent call last):
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 170, in <module>
    main()
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 161, in main
    trainer.fit(model, data)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 241, in on_advance_end
    self._run_validation()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 299, in _run_validation
    self.val_loop.run()
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 156, in advance
    self.trainer._logger_connector.update_eval_step_metrics(self._dl_batch_idx[dataloader_idx])
IndexError: list index out of range
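The IndexError itself is the generic Python failure of indexing past the end of a list; in this trace it comes from pytorch-lightning's per-dataloader batch-index bookkeeping. A minimal illustration (the variable names mirror the trace, but the setup is hypothetical):

```python
# _dl_batch_idx holds one running batch counter per dataloader.
# If it was built for fewer dataloaders than are actually iterated
# (e.g. a dataloader was never registered), indexing fails.
_dl_batch_idx = [0]   # bookkeeping for a single dataloader
dataloader_idx = 1    # but a second dataloader is being evaluated

try:
    _dl_batch_idx[dataloader_idx]
except IndexError as err:
    print(err)  # list index out of range
```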

I have been able to reproduce this error with the same workflow for the QM9, ANI-1, and MD17 examples. I have also reproduced it using custom hdf5 datasets (obeying the constraints of the HDF5 class in torchmd-net/torchmdnet/datasets/hdf.py) that I created for modeling water systems.

I did experiment with increasing the test interval; the error occurs whenever the first testing stage happens, which is at epoch 10 for all the example files in torchmd-net/examples. I also checked the splits.npz files to ensure that idx_train, idx_val, and idx_test were all non-empty (i.e., there are configurations assigned to each of the three sets).

Things I have tried

I tried a few of the workarounds suggested in the NERSC documentation on known issues for machine learning applications (https://docs.nersc.gov/machinelearning/known_issues/), but none of them resolved the error.

The fact that the code works just fine on CPU-only clusters suggests that this is not something wrong with the torchmd-net code itself, but rather with the way it interacts with the Perlmutter GPU environment.

Any help would be greatly appreciated. Thank you!

raimis commented 2 years ago

You haven't followed the installation instructions (https://github.com/torchmd/torchmd-net#installation). The environment has pytorch-lightning 1.7.7, but TorchMD-NET needs 1.6.3 (https://github.com/torchmd/torchmd-net/blob/main/environment.yml).
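A quick way to catch this kind of mismatch before a 10-epoch run is to compare the installed version against the pin in environment.yml. A minimal sketch (the version strings mirror the ones reported in this thread; the comparison itself is generic):

```python
# Compare dotted version strings numerically, not lexically.
def parse_version(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))

required = "1.6.3"   # the pin in torchmd-net's environment.yml
installed = "1.7.7"  # what the broken environment actually had

if parse_version(installed) != parse_version(required):
    print(f"pytorch-lightning {installed} installed, but {required} is required")
```

Inside the activated environment, `python -c "import pytorch_lightning; print(pytorch_lightning.__version__)"` gives the installed string to feed into such a check.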

FranklinHu1 commented 2 years ago

Yes, that was the problem. After recreating the torchmd-net environment using the provided command and making sure that my pytorch-lightning version is 1.6.3, the model runs fine on Perlmutter GPU. I have been able to run the example config files for more than 10 epochs, as well as some in-house datasets.

Thank you very much for the help!