Closed xiaowei-xie2 closed 3 months ago
That should be all you need to do. CUDA graphs won't always help performance. It depends on whether the overhead from launching kernels is a significant bottleneck for your model. It has the most benefit for calculations that involve running a lot of very short kernels. If you use Nsight Compute to profile your code, I think there are ways to tell whether it's using graphs or not.
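To make the point about launch overhead concrete, here is a deliberately crude cost model. It assumes kernel launches do not overlap with execution, and every number in it (5 µs per launch, 10 µs per graph launch, the kernel counts and durations) is an illustrative assumption, not a measurement of any real model.

```python
def step_time_us(n_kernels, kernel_us, launch_us, use_graph, graph_launch_us=10.0):
    """Estimated time for one force evaluation, in microseconds."""
    compute = n_kernels * kernel_us
    if use_graph:
        # A single graph launch replaces the per-kernel launch cost.
        return compute + graph_launch_us
    return compute + n_kernels * launch_us


def graph_speedup(n_kernels, kernel_us, launch_us=5.0):
    """Ratio of the estimated time without graphs to the time with graphs."""
    plain = step_time_us(n_kernels, kernel_us, launch_us, use_graph=False)
    graphed = step_time_us(n_kernels, kernel_us, launch_us, use_graph=True)
    return plain / graphed


# Many very short kernels: launch overhead dominates, so graphs help a lot.
many_short = graph_speedup(n_kernels=200, kernel_us=2.0)
# A few long kernels: launch overhead is negligible, so graphs barely matter.
few_long = graph_speedup(n_kernels=5, kernel_us=2000.0)

print(f"many short kernels: {many_short:.2f}x")
print(f"few long kernels:   {few_long:.2f}x")
```

Under these assumptions the first case speeds up severalfold and the second hardly at all, which is why the benefit depends so strongly on the model.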
Thank you for the reply. I did some tests with a simple example, and I think in my case CUDA graphs were not used at all, because I wrapped the force in a workaround instead of adding it directly, so that Hamiltonian REMD would work (see https://github.com/openmm/openmm-torch/issues/147).
Specifically I did
import copy
import openmm
from openmmtorch import TorchForce

# `system` is the existing openmm.System used for the simulation.
force = TorchForce('../model_simple.pt')
force.setProperty("useCUDAGraphs", "true")
cv = openmm.CustomCVForce("")
tempSystem = openmm.System()
tempSystem.addForce(force)
interactingVarNames = []
# Wrap a copy of every force as a collective variable of the CustomCVForce.
for idx, force in enumerate(tempSystem.getForces()):
    name = f"allForce{idx+1}"
    cv.addCollectiveVariable(name, copy.deepcopy(force))
    interactingVarNames.append(name)
assert len(interactingVarNames) > 0
# The energy is the sum of all wrapped forces.
interactingSum = "+".join(interactingVarNames)
cv.setEnergyFunction(
    f"({interactingSum})"
)
system.addForce(cv)
If I just do
force = TorchForce('../model_simple.pt')
force.setProperty("useCUDAGraphs", "true")
and run regular MD, it is twice as fast with CUDA graphs. But when I use the above workaround, it's not faster at all.
Any idea why my workaround is a problem?
Thank you, Xiaowei
I can't think of any reason that wrapping it in a CustomCVForce would affect this. Aside from CUDA graphs, how much does wrapping it affect the speed? CustomCVForce does add overhead and require extra synchronization.
A less expensive workaround for #147 is to also define the same global parameter in another force. For example, you could use an empty CustomBondForce with no bonds.
force = CustomBondForce('0')
force.addGlobalParameter('myparam', 0)
system.addForce(force)
The parameter should have the same name and default value as in the TorchForce. You're making a second force that uses the same parameter so openmmtools will be able to identify it.
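One way to check that two forces really agree on a parameter's name and default value is to collect the defaults from every force in the System. The sketch below is duck-typed: it only assumes each force exposes `getNumGlobalParameters`, `getGlobalParameterName` and `getGlobalParameterDefaultValue`, and it skips any force that does not. The helper names are hypothetical.

```python
from collections import defaultdict


def global_parameter_defaults(forces):
    """Map each global parameter name to {force index: default value}."""
    defaults = defaultdict(dict)
    for index, force in enumerate(forces):
        # Forces without global parameters do not expose these accessors.
        if not hasattr(force, "getNumGlobalParameters"):
            continue
        for i in range(force.getNumGlobalParameters()):
            name = force.getGlobalParameterName(i)
            defaults[name][index] = force.getGlobalParameterDefaultValue(i)
    return dict(defaults)


def conflicting_parameters(forces):
    """Return only the parameters whose default differs between forces."""
    return {
        name: by_force
        for name, by_force in global_parameter_defaults(forces).items()
        if len(set(by_force.values())) > 1
    }
```

Calling `conflicting_parameters(system.getForces())` just before creating a Context should return an empty dict if the defaults are consistent.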
I was only testing on a toy force and wrapping it with CustomCVForce didn't affect the speed much. I am not sure how much it affects the speed for the actual ML force field yet.
I tried your other workaround, but it's giving me another error:
Traceback (most recent call last):
File "/scr/xie1/test_REMD/test_cuda_graph_regularmd/torchforce_workaround.py", line 146, in <module>
simulation.run()
File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/multistate/multistatesampler.py", line 755, in run
raise e
File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/multistate/multistatesampler.py", line 745, in run
self._compute_energies()
File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/utils/utils.py", line 95, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/multistate/multistatesampler.py", line 1425, in _compute_energies
new_energies, replica_ids = mpiplus.distribute(self._compute_replica_energies, range(self.n_replicas),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xie1/miniconda3/lib/python3.12/site-packages/mpiplus/mpiplus.py", line 523, in distribute
all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/multistate/multistatesampler.py", line 1458, in _compute_replica_energies
context, integrator = self.energy_context_cache.get_context(compatible_group[0])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/cache.py", line 451, in get_context
context = thermodynamic_state.create_context(integrator, self._platform, self._platform_properties)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/states.py", line 1177, in create_context
return openmm.Context(system, integrator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmm/openmm.py", line 12171, in __init__
_openmm.Context_swiginit(self, _openmm.new_Context(*args))
^^^^^^^^^^^^^^^^^^^^^^^^^^
openmm.OpenMMException: Two Forces define different default values for the parameter 'param_a'
although I have set the default parameters to be the same through
force = TorchForce(module)
force.addGlobalParameter('param_a', 0.5)
force.addGlobalParameter('param_b', 0.5)
system.addForce(force)
bond_force = CustomBondForce('0')
bond_force.addGlobalParameter('param_a', 0.5)
bond_force.addGlobalParameter('param_b', 0.5)
system.addForce(bond_force)
Any idea what is going wrong?
The full test files are also attached.
I think this is openmmtools confusing it. You also use it to specify a different default value:
param_a = GlobalParameterState.GlobalParameter('param_a', standard_value=1.0)
param_b = GlobalParameterState.GlobalParameter('param_b', standard_value=1.0)
As far as I can tell from the code, that causes it to loop over all the forces it has identified as having a global parameter with that name and call setGlobalParameterDefaultValue() on them. It changes the default value for the CustomBondForce, but not for the TorchForce, leading to that error. You need to give them the same default value in both places.
I changed those two lines to the following but still got the same error...
param_a = GlobalParameterState.GlobalParameter('param_a', standard_value=0.5)
param_b = GlobalParameterState.GlobalParameter('param_b', standard_value=0.5)
I tried serializing the System, and I found openmmtools had changed the default values for the two parameters to 1 and 4:
<Force energy="0" forceGroup="0" name="CustomBondForce" type="CustomBondForce" usesPeriodic="0" version="3">
<PerBondParameters/>
<GlobalParameters>
<Parameter default="1" name="param_a"/>
<Parameter default="4" name="param_b"/>
</GlobalParameters>
<EnergyParameterDerivatives/>
<Bonds/>
</Force>
I think it's because those are the first values in your schedule:
lambda_schedule_a = np.array([1, 2, 3])
lambda_schedule_b = np.array([4, 5, 6])
If I change it to use those values both for the TorchForce and for the GlobalParameter objects, then it runs successfully.
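A small sketch of how to keep those values in one place, so the TorchForce, the CustomBondForce and the GlobalParameter objects cannot drift apart. The helper names are hypothetical, and the only API assumed is `addGlobalParameter(name, value)`, which both forces are called with earlier in this thread. A plain dict keeps insertion order, which matters because TorchForce passes parameters in the order they were added.

```python
def initial_defaults(schedules):
    """Use the first value of each schedule as that parameter's default."""
    return {name: float(values[0]) for name, values in schedules.items()}


def add_global_parameters(force, defaults):
    """Register every parameter on a force, in insertion order."""
    for name, value in defaults.items():
        force.addGlobalParameter(name, value)


# Schedules from the test script in this thread.
schedules = {"param_a": [1, 2, 3], "param_b": [4, 5, 6]}
defaults = initial_defaults(schedules)
print(defaults)
```

The same `defaults` dict would then be passed to `add_global_parameters` for both forces, and `defaults[name]` used as the `standard_value` of each GlobalParameter.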
Thank you so much, and you are right that this workaround is less expensive than wrapping with CustomCVForce (about 0.85 times the cost for this toy example on my desktop).
Would you mind also testing with useCUDAGraphs turned on in this script to see if there is any speed-up? I did not see any speed-up on my end. If I run regular MD with the same potential, I do see a twofold speed-up using CUDA graphs.
I am pretty sure CUDA graphs are not used in this workaround either. If I change my TorchForce to contain an offending operation (torch.inverse in this example), it does not raise an error.
import torch

class ForceModule(torch.nn.Module):
    """A central harmonic force with user-defined global parameters"""
    def forward(self, positions, boxvectors, param_a, param_b):
        """The forward method returns the energy computed from positions.
        Parameters
        ----------
        positions : torch.Tensor with shape (nparticles,3)
            positions[i,k] is the position (in nanometers) of spatial dimension k of particle i
        boxvectors : torch.Tensor with shape (3,3)
            The periodic box vectors (in nanometers)
        param_a, param_b : torch.Scalar
            Scalar tensors defined by 'TorchForce.addGlobalParameter'.
            Here, param_a scales the potential and param_b shifts it.
            Note that parameters are passed in the order defined by `TorchForce.addGlobalParameter`, not by name.
        Returns
        -------
        potential : torch.Scalar
            The potential energy (in kJ/mol)
        """
        # torch.inverse is the deliberately "offending" operation that makes CUDA graph capture fail.
        cell_inverse = torch.inverse(boxvectors)
        boxsize = cell_inverse.diag()
        periodicPositions = positions - torch.floor(positions/boxsize)*boxsize
        return param_a*torch.sum(periodicPositions**2) + param_b
Whereas if I run a regular MD, it gives the following error:
Traceback (most recent call last):
File "/scr/xie1/test_REMD/test_cuda_graph_regularmd/torchforce_workaround_regularmd.py", line 131, in <module>
simulation.context.setVelocitiesToTemperature(temperature) # This does not work (https://github.com/openmm/openm>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmm/openmm.py", line 7800, in setVelocitiesToTemperature
return _openmm.Context_setVelocitiesToTemperature(self, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
openmm.OpenMMException: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /home/conda/feedstock_root/build_artifacts/libtorch_1718580525958/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xaa (0x7fecbcf53b5a in /home/xie1/miniconda3/lib/python3.12/site-packages/../.././libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7fecbcefec90 in /home/xie1/miniconda3/lib/python3.12/site-packages/../.././libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3fe (0x7fec5d49494e in /home/xie1/miniconda3/lib/python3.12/site-packages/../../././libc10_cuda.so)
frame #3: at::cuda::CUDAGraph::capture_end() + 0xad (0x7fec661ab1fd in /home/xie1/miniconda3/lib/python3.12/site-packages/../.././libtorch_cuda.so)
frame #4: <unknown function> + 0x9cd7 (0x7feb723f4cd7 in /home/xie1/miniconda3/lib/plugins/libOpenMMTorchCUDA.so)
frame #5: OpenMM::ContextImpl::calcForcesAndEnergy(bool, bool, int) + 0xc9 (0x7fecbd2ec159 in /home/xie1/miniconda3/lib/python3.12/site-packages/../../libOpenMM.so.8.1)
frame #6: OpenMM::Context::setVelocitiesToTemperature(double, int) + 0xcc (0x7fecbd2e8c3c in /home/xie1/miniconda3/lib/python3.12/site-packages/../../libOpenMM.so.8.1)
frame #7: <unknown function> + 0x12b834 (0x7febf5655834 in /home/xie1/miniconda3/lib/python3.12/site-packages/openmm/_openmm.cpython-312-x86_64-linux-gnu.so)
<omitting python frames>
frame #19: <unknown function> + 0x29d90 (0x7fecbdba2d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: __libc_start_main + 0x80 (0x7fecbdba2e40 in /lib/x86_64-linux-gnu/libc.so.6)
I think I see the problem. CustomCVForce uses XmlSerializer.clone() to make copies of the forces to add to its inner context. The serialization proxy for TorchForce doesn't copy the properties, so the useCUDAGraphs property doesn't get included on the copy. Let me fix that!
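A quick way to test for this kind of loss is to compare a force's properties before and after copying it. The sketch below is an assumption-laden helper, not part of openmm-torch: `lost_properties` is a hypothetical name, it assumes the force exposes a `getProperties()` accessor returning a name-to-value mapping, and it assumes that `copy.deepcopy` (used in the workaround earlier in this thread) goes through the same serialization path.

```python
import copy


def lost_properties(force, copier=copy.deepcopy):
    """Return the properties that are missing or changed on a copy of `force`."""
    original = dict(force.getProperties())
    clone = copier(force)
    copied = dict(clone.getProperties())
    # Keep every original entry that the copy does not reproduce exactly.
    return {name: value for name, value in original.items() if copied.get(name) != value}
```

An empty result means the properties survived the copy. A result such as `{"useCUDAGraphs": "true"}` would point at the behaviour described above.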
The fix is in #152. Can you try it out and see if it fixes the problem for you?
Thank you so much for the fix! Sorry for replying late - I was on vacation last week.
I am trying to test out your solution, but I am having trouble compiling the package from source (I assume conda install will not incorporate your fix?). Specifically I am getting the following error:
CMake Warning at
/home/xie1/miniconda3/lib/python3.12/site-packages/torch/share/cmake/Caffe2/FindCUDAToolkit.cmake:957
(message):
Could not find librt library, needed by CUDA::cudart_static
Call Stack (most recent call first):
/home/xie1/miniconda3/lib/python3.12/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:59
(find_package)
/home/xie1/miniconda3/lib/python3.12/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:86 (include)
/home/xie1/miniconda3/lib/python3.12/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68
(find_package)
CMakeLists.txt:15 (FIND_PACKAGE)
I think I have librt already installed:
xie1@desk-lu463:~/openmm-torch$ ls /usr/lib/x86_64-linux-gnu/librt.*
/usr/lib/x86_64-linux-gnu/librt.a  /usr/lib/x86_64-linux-gnu/librt.so.1
Any idea how to get around this error?
Thank you!
I don't think librt has any connection to cudart. Do you have the CUDA toolkit installed? See http://docs.openmm.org/latest/userguide/library/02_compiling.html#cuda-or-opencl-support.
I started over and I don't see that error anymore, but I saw another error close to the end of the build.
[ 19%] Built target OpenMMTorch
[ 19%] Built target CopyTestFiles
[ 26%] Built target TestSerializeTorchForce
[ 38%] Built target OpenMMTorchReference
[ 46%] Built target TestReferenceTorchForce
[ 50%] Linking CXX shared library ../../libOpenMMTorchOpenCL.so
[ 65%] Built target OpenMMTorchOpenCL
[ 69%] Linking CXX executable ../../../TestOpenCLTorchForce
[ 73%] Built target TestOpenCLTorchForce
[ 76%] Building CXX object platforms/cuda/CMakeFiles/OpenMMTorchCUDA.dir/src/CudaTorchKernelFactory.cpp.o
In file included from /home/xie1/miniconda3/include/c10/cuda/CUDADeviceAssertionHost.h:3,
from /home/xie1/miniconda3/include/c10/cuda/CUDAException.h:3,
from /home/xie1/miniconda3/include/c10/cuda/CUDAFunctions.h:12,
from /home/xie1/miniconda3/include/c10/cuda/CUDAStream.h:10,
from /home/xie1/miniconda3/include/c10/cuda/CUDAGraphsC10Utils.h:3,
from /home/xie1/miniconda3/include/ATen/cuda/CUDAGraph.h:5,
from /home/xie1/openmm-torch/platforms/cuda/src/CudaTorchKernels.h:39,
from /home/xie1/openmm-torch/platforms/cuda/src/CudaTorchKernelFactory.cpp:35:
/home/xie1/miniconda3/include/c10/cuda/CUDAMacros.h:8:10: fatal error: c10/cuda/impl/cuda_cmake_macros.h: No such file or directory
8 | #include <c10/cuda/impl/cuda_cmake_macros.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [platforms/cuda/CMakeFiles/OpenMMTorchCUDA.dir/build.make:83: platforms/cuda/CMakeFiles/OpenMMTorchCUDA.dir/src/CudaTorchKernelFactory.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:409: platforms/cuda/CMakeFiles/OpenMMTorchCUDA.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
I installed cudatoolkit, libtorch and pytorch through conda:
conda install pytorch pytorch-cuda=12.4 cuda-toolkit=12.4 libtorch=2.3.0 -c pytorch-nightly -c nvidia -c conda-forge
And here are the configurations I used for cmake.
CMAKE_BUILD_TYPE
CMAKE_INSTALL_PREFIX /home/xie1/miniconda3
CUDA_HOST_COMPILER /usr/bin/cc
CUDA_SDK_ROOT_DIR CUDA_SDK_ROOT_DIR-NOTFOUND
CUDA_TOOLKIT_ROOT_DIR /home/xie1/miniconda3
CUDA_USE_STATIC_CUDA_RUNTIME ON
Caffe2_DIR /home/xie1/miniconda3/share/cmake/Caffe2
MKLDNN_DIR MKLDNN_DIR-NOTFOUND
NN_BUILD_CUDA_LIB ON
NN_BUILD_OPENCL_LIB ON
NN_BUILD_PYTHON_WRAPPERS ON
OPENCL_INCLUDE_DIR /home/xie1/miniconda3/include
OPENCL_LIBRARY /home/xie1/miniconda3/lib/libOpenCL.so
OPENMM_DIR /home/xie1/miniconda3
PYTHON_EXECUTABLE /home/xie1/miniconda3/bin/python
PYTORCH_DIR
Protobuf_DIR /home/xie1/miniconda3/lib/cmake/protobuf
SWIG_EXECUTABLE /home/xie1/miniconda3/bin/swig
TORCH_LIBRARY /home/xie1/miniconda3/lib/libtorch.so
Torch_DIR /home/xie1/miniconda3/share/cmake/Torch
absl_DIR /home/xie1/miniconda3/lib/cmake/absl
c10_LIBRARY /home/xie1/miniconda3/lib/libc10.so
utf8_range_DIR /home/xie1/miniconda3/lib/cmake/utf8_range
I don't see c10/cuda/impl/cuda_cmake_macros.h in the miniconda3/include directory, whereas if I download libtorch from the official PyTorch website I do see that file. But I cannot figure out how to use the downloaded libtorch (setting c10_LIBRARY and TORCH_LIBRARY doesn't seem to work).
Is conda installing cudatoolkit/libtorch a problem? Would you mind sharing some details on how to install libtorch from the downloaded zip file from the official website?
-c pytorch-nightly -c nvidia -c conda-forge
That isn't going to work correctly. Packages in conda-forge tend to be compiled differently than in other channels. It can't be mixed with other channels. It has its own builds of both PyTorch and the CUDA libraries, so you shouldn't need to mix.
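To see whether an environment mixes channels, one option is to look at the channel column of `conda list`. The sketch below parses that text. It assumes the usual four-column layout (name, version, build, channel) with `#` comment lines, and that a missing fourth column means the default channel. The function names are hypothetical, and the sample listing is made up for illustration.

```python
def channels_by_package(conda_list_output, packages):
    """Return {package name: channel} for the requested packages."""
    found = {}
    for line in conda_list_output.splitlines():
        # Skip blank lines and the '#' header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0] in packages:
            # Assumption: an absent channel column means the default channel.
            found[fields[0]] = fields[3] if len(fields) > 3 else "defaults"
    return found


def mixed_channels(conda_list_output, packages):
    """True if the requested packages come from more than one channel."""
    return len(set(channels_by_package(conda_list_output, packages).values())) > 1


# Hypothetical listing; build strings and path are invented for illustration.
sample = """\
# packages in environment at /hypothetical/env:
#
# Name          Version   Build        Channel
cuda-toolkit    12.4.0    0            nvidia
libtorch        2.3.0     cuda_h0      conda-forge
pytorch         2.3.0     py312_cuda   pytorch-nightly
"""
print(mixed_channels(sample, {"pytorch", "libtorch", "cuda-toolkit"}))
```

In a real environment the text would come from running `conda list` and reading its output.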
Thank you, I was able to compile it by using only the conda-forge channel. And I think it is using CUDA graphs correctly now. I see a slight speed-up, and it now also raises an error for the torch.inverse operation.
Thank you so much for the fix!
Great, thanks!
Hi,
I would like to use the CUDA graph option to accelerate my simulation, and I was wondering whether there is anything I need to do when converting the model to TorchScript and saving it to a file in order to use CUDA graphs.
I tried simply adding
force.setProperty("useCUDAGraphs", "true")
to the simulation script but did not see any performance improvement compared to running without CUDA graphs. Is there any way I can investigate whether it is indeed using CUDA graphs? Thank you, Xiaowei