openmm / openmm-tensorflow

OpenMM plugin to define forces with neural networks

Cannot run OpenMM simulations with the plugin on the CUDA platform #12

Open yaoyic opened 5 years ago

yaoyic commented 5 years ago

I've successfully compiled and installed your plugin, and it works fine when I select the CPU platform for the simulation. In that setup the network force calculation is performed on the GPU, while the main simulation routine runs on the CPU.

The problem is that I would like to run the simulation itself on the GPU as well, to accelerate the whole procedure. But when the system (systemI below) contains the network force and the platform is 'CUDA', I get an exception on the line where the openmm.app.Simulation object is initialized. Here is the traceback:

Traceback (most recent call last):
  File "xxx.py", line 21, in <module>
    simulationI = Simulation(pdb_wet.getTopology(), systemI, integrator, platform)
  File "/srv/public/yaoyic/miniconda3/envs/simu/lib/python3.7/site-packages/simtk/openmm/app/simulation.py", line 103, in __init__
    self.context = mm.Context(self.system, self.integrator, platform)
  File "/srv/public/yaoyic/miniconda3/envs/simu/lib/python3.7/site-packages/simtk/openmm/openmm.py", line 10057, in __init__
    _openmm.Context_swiginit(self, _openmm.new_Context(*args))
Exception: Error creating TensorFlow session: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

I also tried OpenMM's 'OpenCL' platform; it failed with the same traceback. My guess is that the simulation and the plugin cannot run on the same graphics card.

I did not find any explicit mention of platform choices in the README. I would like to know whether something is wrong with my compilation/simulation setup, or whether this is simply a limitation of the current plugin.

For your information, my workstation has 8 CPU cores and one GTX 1080, and the CUDA version is 9.0. When I run the simulation on the 'CPU' platform, nvidia-smi shows GPU usage of around 15%.
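
In case it helps, here is a minimal sketch of the failing script. The NeuralNetworkForce import follows the plugin README, and the input files and force fields are placeholders; my real script differs only in the inputs.

from simtk.openmm.app import PDBFile, ForceField, Simulation, PME, HBonds
from simtk.openmm import LangevinIntegrator, Platform
from simtk import unit
from openmmnn import NeuralNetworkForce  # placeholder import path for the plugin

pdb_wet = PDBFile('input.pdb')                          # placeholder structure
forcefield = ForceField('amber99sb.xml', 'tip3p.xml')   # placeholder force fields
systemI = forcefield.createSystem(pdb_wet.getTopology(),
                                  nonbondedMethod=PME, constraints=HBonds)
systemI.addForce(NeuralNetworkForce('graph.pb'))        # placeholder TF graph file

integrator = LangevinIntegrator(300*unit.kelvin, 1/unit.picosecond,
                                0.002*unit.picoseconds)
platform = Platform.getPlatformByName('CUDA')           # 'CPU' works fine
# The exception is raised on this line:
simulationI = Simulation(pdb_wet.getTopology(), systemI, integrator, platform)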

peastman commented 5 years ago

Perhaps your GPU is set to exclusive mode? If so, it will only allow one context to be created on it at a time. You can check and set the compute mode with nvidia-smi.
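
For example (details of the output vary with driver version):

nvidia-smi -q -d COMPUTE      # report the current compute mode
sudo nvidia-smi -c DEFAULT    # set it back to the shared Default mode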

yaoyic commented 5 years ago

Good point. I will try to change the compute mode once I get sudo permission from IT. Thanks!

yaoyic commented 5 years ago

The compute mode has been changed to "Default". Running simulations with this plugin on the CPU and OpenCL platforms is now OK, but a different error occurs when I run a simulation with the plugin on the CUDA platform. The output on stderr reads:

2019-09-04 16:08:58.153624: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.9.0
2019-09-04 16:08:58.153789: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.9.0
2019-09-04 16:08:58.154645: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.9.0
2019-09-04 16:08:58.155437: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.9.0
2019-09-04 16:08:58.157792: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-04 16:08:58.158904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-09-04 16:08:58.257389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-04 16:08:58.257417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-09-04 16:08:58.257426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-09-04 16:08:58.259155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4059 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-09-04 16:08:58.449728: F tensorflow/stream_executor/cuda/cuda_driver.cc:175] Check failed: err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: incompatible driver context

The last line appears during simulation.minimizeEnergy(), while the lines above appear when the Simulation object is initialized.

If the platform is CPU or OpenCL, then the last line won't appear. Instead, there will be these two lines:

2019-09-04 16:08:08.164449: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.9.0
2019-09-04 16:08:08.739570: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7

Running normal simulations (without the plugin) on the CUDA platform does not cause any problem, so I guess something is still wrong with the combination.
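
As a sanity check that the intended platform is actually in use, it can be printed from the context with the standard OpenMM API:

print(simulationI.context.getPlatform().getName())  # e.g. 'CUDA', 'OpenCL', 'CPU'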

peastman commented 5 years ago

I haven't seen that one before. I'm trying to puzzle out what it means. Here's my best guess.

The error description it provides is "incompatible driver context". I assume that means a CUDA function returned the error code cudaErrorIncompatibleDriverContext. The documentation at https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html describes it like this:

This indicates that the current context is not compatible with the CUDA Runtime. This can only occur if you are using CUDA Runtime/Driver interoperability and have created an existing Driver context using the driver API. The Driver context may be incompatible either because the Driver context was created using an older version of the API, because the Runtime API call expects a primary driver context and the Driver context is not primary, or because the Driver context has been destroyed. Please see "Interactions with the CUDA Driver API" for more information.

CUDA has two different APIs: the Driver API and the Runtime API. OpenMM uses the Driver API; I gather that TensorFlow must use the Runtime API. The two APIs can interoperate as described at https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DRIVER.html#group__CUDART__DRIVER. According to that documentation, if you've already created a context with the Driver API (which OpenMM will have done) and then use the Runtime API, it will automatically use that existing context. But apparently it is finding that context incompatible in some way.

The error description above lists three reasons the context might be incompatible. I suspect the first one applies here: "the Driver context was created using an older version of the API." That is (perhaps), OpenMM and TensorFlow were compiled against different versions of CUDA. We provide versions of OpenMM that were compiled against various versions of CUDA. Which one are you using? And what version of CUDA do you actually have installed on your computer?
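
One way to check directly is to ask the driver and the runtime which versions they report, independently of OpenMM and TensorFlow. A quick diagnostic sketch; the ctypes library names assume a Linux install with CUDA 9.0, so adjust them as needed:

import ctypes

version = ctypes.c_int()

# Ask the display driver which CUDA version it supports.
# cuDriverGetVersion is callable without initializing a context.
libcuda = ctypes.CDLL('libcuda.so')
libcuda.cuDriverGetVersion(ctypes.byref(version))
print('driver supports CUDA', version.value)   # 9000 means CUDA 9.0

# Ask the runtime which toolkit it belongs to; load the same
# libcudart that TensorFlow loads (see the dso_loader lines above).
libcudart = ctypes.CDLL('libcudart.so.9.0')
libcudart.cudaRuntimeGetVersion(ctypes.byref(version))
print('runtime is CUDA', version.value)

If those numbers disagree with each other, or with the CUDA version your OpenMM build expects, that would fit the "older version of the API" explanation.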

yaoyic commented 5 years ago

Thanks for replying!

I built OpenMM 7.3.1, this plugin, and TensorFlow 1.14 all with CUDA 9.0.176 and gcc 4.9.2. I also tried building and running the plugin against the official OpenMM 7.3.1 package from omnia/cuda90; unfortunately, the same error occurred.

I cannot test a build against the official TF C API, as its prebuilt binaries target CUDA 10.1, which is not compatible with the current setup of my workstation.

I have to note that the plugin was built with -D_GLIBCXX_USE_CXX11_ABI=0 added to the compiler flags through CMake; otherwise it does not build, the same situation as in issue #8.
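
For reference, a sketch of how that flag can be passed at configure time (not my exact command line):

cmake .. -DCMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0"

This keeps the plugin on the old libstdc++ ABI that the prebuilt TensorFlow binaries were compiled with.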