Closed by kexul 3 years ago
Hi there,
Could you let us know how the `CUDA_VISIBLE_DEVICES` environment variable is set for your node? For simplicity, SOMD assumes that you have exclusive access to the node and defaults to a device index of 0 (which is actually an index into the `CUDA_VISIBLE_DEVICES` array). This has worked well for us on various clusters where the scheduler makes sure to set `CUDA_VISIBLE_DEVICES` for you. Sometimes the scheduler masks the values so that you only ever need to use index 0, i.e. each GPU job on the node sees a single device and the scheduler takes care of mapping it to the correct physical GPU.
It's possible that you aren't running with exclusive access, so someone else might have nabbed one of the free GPUs before your job started. Alternatively, the indexing could be wrong, so you might need to manually set `gpu = X` in the SOMD configuration file for each GPU job on the node (probably 0 through 7, depending on what `CUDA_VISIBLE_DEVICES` reports when nothing is running). We've not yet automated things for multi-GPU nodes, since the set-up is rather specific to the scheduler and cluster in question.
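As a hedged illustration of that manual route (the per-job file names, the lambda list and the `lambda_val` keyword are assumptions; only the `gpu = X` config line and the `somd-freenrg` invocation come from this thread), one could stamp out one config per GPU along these lines:

```python
import shutil
import subprocess

# Hypothetical lambda windows, one per GPU on an 8-GPU node.
lambdas = [i / 7 for i in range(8)]

for gpu, lam in enumerate(lambdas):
    cfg = f"somd_gpu{gpu}.cfg"          # hypothetical per-job config name
    shutil.copy("somd.cfg", cfg)        # start from the shared template
    with open(cfg, "a") as fh:
        fh.write(f"\ngpu = {gpu}\n")          # pin this job to GPU index 'gpu'
        fh.write(f"lambda_val = {lam}\n")     # assumed SOMD keyword for the window

    # Launch one somd-freenrg process per GPU (same flags as in the thread).
    subprocess.Popen(["somd-freenrg", "-C", cfg,
                      "-t", "somd.prm7", "-c", "somd.rst7",
                      "-m", "somd.pert", "-p", "CUDA"])
```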
Unfortunately the `'DisablePmeStream': 'true'` option would need to be set in Sire's C++ layer, since that part of the OpenMM interface hasn't been exposed to Python.
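For context only (this is plain OpenMM, not the Sire/SOMD route being discussed), the option from the linked OpenMM issue is a CUDA platform property passed when a context is created; in OpenMM 7.4.x the prefixed name `CudaDisablePmeStream` is used:

```python
from simtk import openmm

# Minimal system so that a Context can be created.
system = openmm.System()
system.addParticle(1.0)
integrator = openmm.VerletIntegrator(0.001)

platform = openmm.Platform.getPlatformByName("CUDA")
# Disable the separate PME CUDA stream. In SOMD this would have to be done
# from Sire's C++ layer, since the property isn't exposed in its Python API.
properties = {"CudaDisablePmeStream": "true"}
context = openmm.Context(system, integrator, platform, properties)
```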
Hi @lohedges, thanks for the quick reply, I appreciate it!
> Could you let us know how the `CUDA_VISIBLE_DEVICES` environment variable is set for your node?
I did not explicitly set `CUDA_VISIBLE_DEVICES` on my node; I used `gpu = X` in the SOMD configuration file to control which GPU to use.
> It's possible that you aren't running with exclusive access
Yes, I'm checking this; I thought I did have exclusive access. I just ran `CUDA_VISIBLE_DEVICES=xxx python -m simtk.testInstallation` on the specific GPU on which `somd-freenrg` failed to run, and it failed too, so it looks like an OpenMM installation problem rather than a Sire-specific problem.
```
# CUDA_VISIBLE_DEVICES=7 python -m simtk.testInstallation
OpenMM Version: 7.4.2
Git Revision: dc9d188939ad630d240e89806b185061f7cd661a
There are 3 Platforms available:
1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Error computing forces with CUDA platform
CUDA platform error: No compatible CUDA device is available
Median difference in forces between platforms:
Reference vs. CPU: 6.31451e-06
All differences are within tolerance.
```
Is this an issue with GPU number 7 on all equivalent nodes, or just this one? I wonder if there was a bad job and the GPU has ended up in a funny state, so that it is no longer showing as available.
> Is this an issue with GPU number 7 on all equivalent nodes, or just this one?
Just this one.
> I wonder if there was a bad job and the GPU has ended up in a funny state, so that it is no longer showing as available.
Thanks, I'll check that. For now I've figured out a temporary fix: check the GPUs using `testInstallation` and remove the bad GPU from my resource queue 🤣.
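A minimal sketch of that kind of check (assuming an 8-GPU node and OpenMM's `simtk.testInstallation` module as used above; a GPU is flagged as bad when the error string from the output quoted above appears):

```python
import os
import subprocess

bad_gpus = []

# Probe each of the 8 GPUs on the node one at a time by masking it as the
# only visible device and running OpenMM's installation test against it.
for gpu in range(8):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    result = subprocess.run(
        ["python", "-m", "simtk.testInstallation"],
        env=env, capture_output=True, text=True,
    )
    # This is the error line seen in the failing output quoted above.
    if "Error computing forces with CUDA platform" in result.stdout:
        bad_gpus.append(gpu)

print("GPUs to remove from the resource queue:", bad_gpus)
```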
Closing this since it doesn't seem to be a Sire-specific problem. I'll post an update here if I find anything that would benefit the Sire community.
Dear Sire developers: recently I have been frequently encountering the following error on my cluster.
The simulation was run with `somd-freenrg -C somd.cfg -t somd.prm7 -c somd.rst7 -m somd.pert -p CUDA`, one lambda per GPU, on nodes with the following spec. It's weird that only one GPU would fail on my 8-GPU node.
I googled for a while and found that https://github.com/openmm/openmm/issues/1728 might help, but I don't know where I should put the `'DisablePmeStream': 'true'` parameter. Do you have any suggestions? Thanks!