michellab / Sire

Sire Molecular Simulations Framework
http://siremol.org
GNU General Public License v3.0

The requested CUDA device could not be loaded #357

Closed: kexul closed this 3 years ago

kexul commented 3 years ago

Dear Sire developers: I've recently been encountering the following error frequently on my cluster.

Traceback (most recent call last):
  File "/root/miniconda3/envs/biosimspace/share/Sire/scripts/somd-freenrg.py", line 146, in <module>
    OpenMMMD.runFreeNrg(params)
  File "/root/miniconda3/envs/biosimspace/lib/python3.7/site-packages/Sire/Tools/__init__.py", line 176, in inner
    retval = func()
  File "/root/miniconda3/envs/biosimspace/lib/python3.7/site-packages/Sire/Tools/OpenMMMD.py", line 1659, in runFreeNrg
    system = integrator.minimiseEnergy(system, minimise_tol.val, minimise_max_iter.val)
RuntimeError: The requested CUDA device could not be loaded

The simulation was run with somd-freenrg -C somd.cfg -t somd.prm7 -c somd.rst7 -m somd.pert -p CUDA (one lambda window per GPU) on nodes with the following spec:

CPU:  192  AMD EPYC 7K62 48-Core Processor

GPU: NVIDIA A100-SXM4-40GB * 8 

CUDA runtime: 11.2.152

OpenMM version: 7.4.2, build py37_cuda101_rc_1 from omnia

It's weird that only one GPU fails on my 8-GPU node.

I googled for a while and found https://github.com/openmm/openmm/issues/1728, which might help, but I don't know where I should put the 'DisablePmeStream': 'true' parameter. Do you have any suggestions? Thanks!

lohedges commented 3 years ago

Hi there,

Could you let us know how the CUDA_VISIBLE_DEVICES environment variable is set for your node? For simplicity, SOMD assumes that you have exclusive access to the node and defaults to a device index of 0 (which is actually an index into the CUDA_VISIBLE_DEVICES array). This has worked well for us on various clusters where the scheduler makes sure to set CUDA_VISIBLE_DEVICES for you. Sometimes it masks the values so that you only ever need to use index 0, i.e. each GPU job on the node has a single device visible to it and the scheduler takes care of mapping to the correct GPU index.

It's possible that you aren't running with exclusive access, so someone else might have nabbed one of the free GPUs before your job started. Alternatively, the indexing could be wrong, so you might need to manually set gpu = X in the SOMD configuration file for each GPU job on the node. (Probably 0 through 7, depending on what CUDA_VISIBLE_DEVICES reports when nothing is running.) We've not yet automated things for multi-GPU nodes, since the set-up is rather specific to the scheduler and cluster in question.
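
For what it's worth, one way to drive this from a launcher script is to mask CUDA_VISIBLE_DEVICES per job, so that SOMD's default device index of 0 always maps to the GPU reserved for that window. A minimal sketch, assuming eight windows run from per-window directories named lambda-0 to lambda-7 (the directory and file names are purely illustrative):

# Hypothetical launcher: one somd-freenrg window per GPU, with each job
# seeing only its own device via CUDA_VISIBLE_DEVICES.
import os
import subprocess

jobs = []
for gpu in range(8):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    cmd = ["somd-freenrg", "-C", "somd.cfg", "-t", "somd.prm7",
           "-c", "somd.rst7", "-m", "somd.pert", "-p", "CUDA"]
    jobs.append(subprocess.Popen(cmd, cwd=f"lambda-{gpu}", env=env))

for job in jobs:
    job.wait()

Setting gpu = N in each window's configuration file, as above, achieves the same thing without touching the environment.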

Unfortunately, the 'DisablePmeStream': 'true' option would need to be set in Sire's C++ layer, since that part of the OpenMM interface hasn't been exposed to Python.
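
For testing outside of SOMD, though, the option can be set in plain OpenMM by changing the platform's default property value before any Context is created. If I remember correctly the name is prefixed on OpenMM 7.4.x (CudaDisablePmeStream) and unprefixed on newer releases, so the sketch below looks it up rather than hard-coding it:

# Sketch (plain OpenMM, not Sire/SOMD): disable the separate PME stream by
# setting a platform-wide default before any Context is created.
from simtk import openmm   # 'import openmm' on OpenMM >= 7.6

platform = openmm.Platform.getPlatformByName("CUDA")
# 'CudaDisablePmeStream' on OpenMM 7.4.x, 'DisablePmeStream' on newer versions
name = next(n for n in platform.getPropertyNames()
            if n.endswith("DisablePmeStream"))
platform.setPropertyDefaultValue(name, "true")
# Any CUDA Context created after this point picks up the new default.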

kexul commented 3 years ago

Hi @lohedges, thanks for the quick reply, much appreciated!

Could you let us know how the CUDA_VISIBLE_DEVICES environment variable is set for your node?

I did not explicitly set CUDA_VISIBLE_DEVICES on my node; I used gpu = X in the SOMD configuration file to control which GPU to use.

It's possible that you aren't running with exclusive access

Yes, I'm checking this, though I thought I should have exclusive access. I just ran CUDA_VISIBLE_DEVICES=xxx python -m simtk.testInstallation against the specific GPU on which somd-freenrg failed, and it failed too, so it seems like an OpenMM installation problem rather than a Sire-specific problem.

# CUDA_VISIBLE_DEVICES=7 python -m simtk.testInstallation

OpenMM Version: 7.4.2
Git Revision: dc9d188939ad630d240e89806b185061f7cd661a

There are 3 Platforms available:

1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Error computing forces with CUDA platform

CUDA platform error: No compatible CUDA device is available

Median difference in forces between platforms:

Reference vs. CPU: 6.31451e-06

All differences are within tolerance.

lohedges commented 3 years ago

Is this an issue with GPU number 7 on all equivalent nodes, or just this one? I wonder if there was a bad job and the GPU has ended up in a funny state, so that it is no longer showing as available.

kexul commented 3 years ago

Is this an issue with GPU number 7 on all equivalent nodes, or just this one?

Just this one.

I wonder if there was a bad job and the GPU has ended up in a funny state, so that it is no longer showing as available.

Thanks, I'll check that. For now I've figured out a temporary fix: check the GPUs using testInstallation and remove the bad GPU from my resource queue 🤣.
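
In case it's useful to anyone else, the check is roughly the following (plain OpenMM 7.4.x with the simtk namespace; device indices 0 to 7 assumed):

# Rough sketch of the workaround: try to create a minimal CUDA Context on
# each device index and drop the devices that fail to load.
from simtk import openmm

def cuda_device_ok(index):
    try:
        system = openmm.System()
        system.addParticle(1.0)                    # one dummy particle
        integrator = openmm.VerletIntegrator(0.001)
        platform = openmm.Platform.getPlatformByName("CUDA")
        # 'CudaDeviceIndex' on OpenMM 7.4.x, 'DeviceIndex' on newer versions
        prop = next(n for n in platform.getPropertyNames()
                    if n.endswith("DeviceIndex"))
        openmm.Context(system, integrator, platform, {prop: str(index)})
        return True
    except Exception as exc:                       # e.g. "could not be loaded"
        print(f"GPU {index}: {exc}")
        return False

good_gpus = [i for i in range(8) if cuda_device_ok(i)]
print("Usable GPUs:", good_gpus)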

kexul commented 3 years ago

Closing this since it doesn't seem to be a Sire-specific problem. I'll post an update here if I find anything that would benefit the Sire community.