michellab / Sire

Sire Molecular Simulations Framework
http://siremol.org
GNU General Public License v3.0

CUDA platform cannot be recognized by Sire in docker container #351

Closed kexul closed 3 years ago

kexul commented 3 years ago

Hi, I've installed Sire with:

mamba create -n uii -c conda-forge -c omnia -c michellab biosimspace

Then I created a perturbed system with BioSimSpace and ran the simulation with:

 somd-freenrg -C somd.cfg -t somd.prm7 -c somd.rst7 -m somd.pert -p CUDA

It complained:

###=======================Minimisation========================###
Running minimisation.
Energy before the minimisation: 2.77399e+10 kcal mol-1
Tolerance for minimisation: 1
Maximum number of minimisation iterations: 1000
Traceback (most recent call last):
  File "/root/miniconda3/envs/uii/share/Sire/scripts/somd-freenrg.py", line 146, in <module>
    OpenMMMD.runFreeNrg(params)
  File "/root/miniconda3/envs/uii/lib/python3.7/site-packages/Sire/Tools/__init__.py", line 176, in inner
    retval = func()
  File "/root/miniconda3/envs/uii/lib/python3.7/site-packages/Sire/Tools/OpenMMMD.py", line 1640, in runFreeNrg
    system = integrator.minimiseEnergy(system, minimise_tol.val, minimise_max_iter.val)
RuntimeError: There is no registered Platform called "CUDA"

I've looked at OpenMM's documentation and used its self-test command:

python -m simtk.testInstallation

which showed:

OpenMM Version: 7.4.2
Git Revision: dc9d188939ad630d240e89806b185061f7cd661a

There are 3 Platforms available:

1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Successfully computed forces

Median difference in forces between platforms:

Reference vs. CPU: 6.30096e-06
Reference vs. CUDA: 6.72867e-06
CPU vs. CUDA: 7.40012e-07

All differences are within tolerance.

When optimise_openmm was used, it showed:

Starting optimise_openmm: number of threads equals 10
CUDA platform is not recognised by OpenMM!
available platforms are:
['Reference', 'CPU']
Let's see if we can do something about this....

Found a CUDA toolkit release version: 11.0
Trying to update OpenMM to match your CUDA version 11.0 for your OpenMM version 7.4.2
This may take a little while. Please hold tight!
................................................

==============================================================
Sending anonymous Sire usage statistics to http://siremol.org.
For more information, see http://siremol.org/analytics
To disable, set the environment variable 'SIRE_DONT_PHONEHOME' to 1
To see the information sent, set the environment variable 
SIRE_VERBOSE_PHONEHOME equal to 1. To silence this message, set
the environment variable SIRE_SILENT_PHONEHOME to 1.
==============================================================

['Reference', 'CPU', 'CPU']
Something didn't work out with the update of OpenMM via conda, have a look at the output.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

I'm using CentOS 7.2 with CUDA 11. Here is the output of nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

lohedges commented 3 years ago

Looking at the OpenMM packages available on the Omnia channel here it doesn't look like there is a version built against CUDA 11. I think only the conda-forge version of OpenMM supports more recent CUDA versions. It is possible to build against this version, but it takes a little work. (We have a conda-forge recipe for Sire here which can be used to build a local conda package against OpenMM 7.5.1.)

I'll try to work out why the optimise_openmm script is failing to give you a sensible error message.

kexul commented 3 years ago

Hi @lohedges, I looked at the Omnia channel you linked and tried installing Sire in a CUDA 10.1 environment. Now the output of optimise_openmm is:

Starting optimise_openmm: number of threads equals 10                                                           
CUDA platform is not recognised by OpenMM!                                                                      
available platforms are:                                                                                        
['Reference', 'CPU']                                                                                            
Let's see if we can do something about this....                                                                 

Found a CUDA toolkit release version: 10.1                                                                      
Trying to update OpenMM to match your CUDA version 10.1 for your OpenMM version 7.4.2                           
This may take a little while. Please hold tight!                                                                
................................................                                                                

==============================================================                                                  
Sending anonymous Sire usage statistics to http://siremol.org.                                                  
For more information, see http://siremol.org/analytics                                                          
To disable, set the environment variable 'SIRE_DONT_PHONEHOME' to 1                                             
To see the information sent, set the environment variable                                                       
SIRE_VERBOSE_PHONEHOME equal to 1. To silence this message, set                                                 
the environment variable SIRE_SILENT_PHONEHOME to 1.                                                            
==============================================================                                                  

['Reference', 'CPU', 'CPU']                                                                                     
Something didn't work out with the update of OpenMM via conda, have a look at the output.                       
Collecting package metadata (current_repodata.json): ...working... done                                         
Solving environment: ...working... done                                                                         

## Package Plan ##                                                                                              

  environment location: /root/miniconda3/envs/biosimspace                                                       

  added / updated specs:                                                                                        
    - openmm=7.4.2                                                                                              

The following packages will be downloaded:                                                                      

    package                    |            build                                                               
    ---------------------------|-----------------                                                               
    openmm-7.4.2               |py37_cuda101_rc_1        11.9 MB  omnia/label/cuda101                           
    ------------------------------------------------------------                                                
                                           Total:        11.9 MB                                                

The following packages will be SUPERSEDED by a higher-priority channel:                                         

  openmm                                              omnia --> omnia/label/cuda101                             

Downloading and Extracting Packages
openmm-7.4.2         | 11.9 MB   | ########## | 100% 
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done

I'm still not able to run the simulation, though.

PS: I'm doing all of this in a Docker container. The host machine, which has the same version installed, runs fine.

kexul commented 3 years ago

Ah, I could not get the simulation running with the official Docker image biosimspace/biosimspace-devel:latest. Does it have CUDA support?

lohedges commented 3 years ago

Ah, I see. The docker image is built as part of our Azure CI pipeline on a VM without a GPU. This shouldn't matter though, since the available OpenMM platforms are detected at run-time.
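Because the platforms are detected at run-time, one way to see what a given environment actually picks up is to ask OpenMM directly. The following is a minimal sketch, assuming the OpenMM Python package is importable in the active environment. The helper name list_openmm_platforms is hypothetical. It only reports what the Python API sees, which may differ from what Sire's C++ code sees.

```python
# Sketch: list the OpenMM platforms visible to this Python process.
# list_openmm_platforms is a hypothetical helper, not part of Sire or OpenMM.

def list_openmm_platforms():
    """Return (platform_names, plugin_load_failures); both empty if OpenMM is missing."""
    try:
        from simtk.openmm import Platform  # namespace used by OpenMM 7.4.2 in this thread
    except ImportError:
        try:
            from openmm import Platform  # namespace used by newer OpenMM releases
        except ImportError:
            return [], []
    # Platforms are registered when the plugin libraries are loaded at import time.
    names = [Platform.getPlatform(i).getName()
             for i in range(Platform.getNumPlatforms())]
    # Messages for plugins that were found but could not be loaded.
    failures = list(Platform.getPluginLoadFailures())
    return names, failures


if __name__ == "__main__":
    names, failures = list_openmm_platforms()
    print("Platforms:", names)
    for message in failures:
        print("Plugin load failure:", message)
```

If CUDA is missing from the list, the plugin load failures (when there are any) usually name the shared library that could not be opened.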

Just to check: How did you install BioSimSpace on the host? Using the conda-package? When you say "same version installed", do you mean the host and Docker container are using the same version of OpenMM? (Same OpenMM version and same CUDA driver build.) I assume that the host has CUDA drivers installed, but the Docker container doesn't.

kexul commented 3 years ago

How did you install BioSimSpace on the host? Using the conda-package?

Yes, I installed BioSimSpace on host by conda.

When you say "same version installed", do you mean the host and Docker container are using the same version of OpenMM?

Yes, the same version of OpenMM, which is

# Name                    Version                   Build  Channel
openmm                    7.4.2           py37_cuda101_rc_1    omnia

My host has CUDA 11. I've tested CUDA 10.1 and CUDA 11 in the Docker container, and both failed.
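To compare the CUDA version a conda build of OpenMM targets with the installed toolkit, the two strings shown in this thread can be parsed. This is only a sketch: the function names are hypothetical, and it assumes that the digits after "cuda" in the build string encode major and minor version with the last digit as the minor (so "cuda101" means 10.1).

```python
import re


def cuda_from_build_string(build):
    """Extract a CUDA version such as '10.1' from a build string like 'py37_cuda101_rc_1'."""
    match = re.search(r"cuda(\d+)", build)
    if match is None or len(match.group(1)) < 2:
        return None
    digits = match.group(1)
    # Assumption: the final digit is the minor version ('101' -> 10.1, '92' -> 9.2).
    return f"{int(digits[:-1])}.{digits[-1]}"


def cuda_from_nvcc(output):
    """Extract the toolkit release (e.g. '11.0') from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None


build_cuda = cuda_from_build_string("py37_cuda101_rc_1")
toolkit_cuda = cuda_from_nvcc("Cuda compilation tools, release 11.0, V11.0.221")
print(build_cuda, toolkit_cuda, build_cuda == toolkit_cuda)  # prints: 10.1 11.0 False
```

A mismatch like this one is a hint that the installed OpenMM package was not built against the toolkit that nvcc reports.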

I assume that the host has CUDA drivers installed, but the Docker container doesn't.

I'm afraid that's not the case. I've installed other GPU-powered packages in the container, such as PyTorch and TensorFlow, and they can use the GPU fine. Besides, running the following OpenMM example code with GPU acceleration worked in the container:

from simtk.openmm.app import *
from simtk.openmm import *
from simtk.unit import *
from sys import stdout

pdb = PDBFile('input.pdb')
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME,
        nonbondedCutoff=1*nanometer, constraints=HBonds)
integrator = LangevinIntegrator(300*kelvin, 1/picosecond, 0.004*picoseconds)
platform = Platform.getPlatformByName('CUDA')
simulation = Simulation(pdb.topology, system, integrator, platform)
simulation.context.setPositions(pdb.positions)
simulation.minimizeEnergy()
simulation.reporters.append(PDBReporter('output.pdb', 1000))
simulation.reporters.append(StateDataReporter(stdout, 1000, step=True,
        potentialEnergy=True, temperature=True))
simulation.step(10000)

lohedges commented 3 years ago

Are you just seeing the following error:

RuntimeError: There is no registered Platform called "CUDA"

If so, could you try manually setting OPENMM_PLUGIN_DIR before running your SOMD command? It should set the correct path for you, but perhaps it's not working in the Docker image. I imagine that you'd need to do something like:

export OPENMM_PLUGIN_DIR=/home/sireuser/sire.app/lib/plugins
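Before exporting the variable, it can help to confirm that the directory exists and actually contains a CUDA plugin. The sketch below is stdlib-only and rests on two assumptions: that the plugin directory defaults to lib/plugins under the active environment prefix when OPENMM_PLUGIN_DIR is unset, and that the CUDA plugin file name contains "OpenMMCUDA" (for example libOpenMMCUDA.so). The helper name find_cuda_plugin is hypothetical.

```python
import os
import sys


def find_cuda_plugin(plugin_dir=None):
    """Return (directory, matching_files) for CUDA plugin libraries in a plugin directory."""
    if plugin_dir is None:
        # Assumption: fall back to <env prefix>/lib/plugins when the variable is unset.
        default_dir = os.path.join(sys.prefix, "lib", "plugins")
        plugin_dir = os.environ.get("OPENMM_PLUGIN_DIR", default_dir)
    if not os.path.isdir(plugin_dir):
        return plugin_dir, []
    # Assumption: CUDA plugin files contain "OpenMMCUDA" in their name.
    matches = sorted(name for name in os.listdir(plugin_dir) if "OpenMMCUDA" in name)
    return plugin_dir, matches


if __name__ == "__main__":
    directory, matches = find_cuda_plugin()
    print("Plugin directory:", directory)
    print("CUDA plugin files:", matches or "none found")
```

An empty result means either the path is wrong or the installed OpenMM build shipped without a CUDA plugin.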

I don't know too much about using Docker containers with CUDA, i.e. I'm not sure if you need to do something clever for the drivers on the host to be visible to the container. However, the fact that running OpenMM using CUDA outside of Sire works suggests that they are indeed being picked up.

When updating things inside of the container, have you been using the installed Sire MiniConda, i.e. sire.app? You would be using commands like:

~/sire.app/bin/conda install ...

kexul commented 3 years ago

Hi @lohedges, thanks for your continued help! I'd noticed the OPENMM_PLUGIN_DIR environment variable in the documentation, but didn't quite understand what it means. I installed Sire via mamba in Miniconda: mamba create -n biosimspace -c conda-forge -c omnia -c michellab biosimspace. Sire is located at /root/miniconda3/envs/biosimspace/lib/python3.7/site-packages/Sire in my environment. I navigated to that folder, but there are no plugins in it. Here are its contents:

Analysis  Base  CAS  Cluster  Config  Error  FF  ID  IO  MM  Maths 
Mol  Move  Qt  Squire  Stream  System  Tools  Units  Vol  __init__.py  __pycache__

lohedges commented 3 years ago

I'm confused, you are using mamba to install BioSimSpace within the BioSimSpace Docker image? The OpenMM libraries are not Python libraries, so you need to look in /root/miniconda3/envs/biosimspace/lib/plugins.

kexul commented 3 years ago

I'm confused, you are using mamba to install BioSimSpace within the BioSimSpace Docker image?

Nope, just installed biosimspace in a clean centos docker with cuda enabled.

so you need to look in /root/miniconda3/envs/biosimspace/lib/plugins

/root/miniconda3/envs/biosimspace/lib/plugins seems to be the right path, but setting OPENMM_PLUGIN_DIR did not work 😔

[Image: WeCom screenshot 16254989049939]

lohedges commented 3 years ago

And just to confirm, the following works fine when run in the active biosimspace conda environment:

# test.py
from simtk.openmm import *
platform = Platform.getPlatformByName("CUDA")
python test.py

(You've suggested this earlier, but just want to confirm that we're using the exact same environment.)

If so, the only difference is that the Python version is using the OpenMM Python API, whereas the somd-freenrg code is calling into Sire, which is using the C++ API. Assuming they are using the same version of libOpenMM, and the CUDA driver is the same, then there shouldn't be a difference, unless somehow the way the C++ API works means that it's unable to see the drivers on the host.
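One way to check whether two processes really use the same libOpenMM is to look at which shared libraries each has mapped. The following is a Linux-only sketch that parses the text of /proc/<pid>/maps. The helper name mapped_libraries is hypothetical, and reading another process's maps file (for example a running somd-freenrg) assumes you have permission to do so.

```python
def mapped_libraries(maps_text, needle="libOpenMM"):
    """Return sorted, unique file paths from /proc/<pid>/maps text that contain `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, then an optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


if __name__ == "__main__":
    try:
        from simtk.openmm import Platform  # import OpenMM so its libraries get mapped
    except ImportError:
        pass
    try:
        with open("/proc/self/maps") as handle:
            print("\n".join(mapped_libraries(handle.read())))
    except OSError:
        print("/proc/self/maps is not available on this system")
```

Running the same function on the maps file of the SOMD process, and again with a needle such as "libcuda", would show whether both code paths resolve to the same OpenMM and driver libraries.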

kexul commented 3 years ago

And just to confirm, the following works fine when run in the active biosimspace conda environment:

Yes, it works fine, no error, no warning.

it's unable to see the drivers on the host

I agree. It could be a problem with my hardware or Docker configuration. I'll dig into it and report back if I find anything further. Anyway, thanks for your help so far, much appreciated!

kexul commented 3 years ago

I managed to get it working in a clean Ubuntu Docker image, so maybe my CentOS image is broken...

lohedges commented 3 years ago

Thanks for the update. Glad to hear that you got things working. I still find it strange that OpenMM worked directly, but not via SOMD. It would be good to know if there was something different in the Docker setup (other than CentOS vs Ubuntu) so that we could document the issue for other users.

Cheers.

kexul commented 3 years ago

I'll post my findings here if there is any update.

kexul commented 3 years ago

Hi @lohedges, I've tested several images pulled from NVIDIA's official Docker Hub repository (including Ubuntu, CentOS, CUDA 10, CUDA 10.1...), and all of them run fine after optimise_openmm. This shows that somd-freenrg is quite robust, and I believe most users won't hit this problem if their CUDA is set up correctly.

In fact, I was using a Linux distribution based on CentOS but with some modifications to the kernel and Docker, which are most probably to blame.

lohedges commented 3 years ago

Thanks for the update, that's really helpful. I'm also pleased to hear that things seem to be quite reliable. I'll update the docs regarding our Docker image so that users know that it won't work with CUDA. We never really intended it to be used in this way, rather it's a minimal base environment with an old glibc which we use for our CI and to build our manylinux binary installer.