Question about how to use CUDA graph

xiaowei-xie2 commented 3 months ago

Hi,

I would like to use the CUDA graph option to accelerate my simulation. Is there anything I need to do when converting the model to TorchScript and saving it to a file in order to use CUDA graphs?

I tried simply adding force.setProperty("useCUDAGraphs", "true") to the simulation script but did not see any performance improvement compared to without CUDA graph. Is there any way I can investigate whether it is indeed using CUDA graph?

Thank you, Xiaowei

peastman commented 3 months ago

That should be all you need to do. CUDA graphs won't always help performance. It depends on whether the overhead from launching kernels is a significant bottleneck for your model. It has the most benefit for calculations that involve running a lot of very short kernels. If you use Nsight Compute to profile your code, I think there are ways to tell whether it's using graphs or not.
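As a rough illustration of the point about launch overhead, the back-of-the-envelope model below estimates the best-case gain from replacing many individual kernel launches with a single graph launch. It is only a sketch: the kernel counts, per-launch overhead and per-kernel run times are hypothetical placeholders rather than measurements from openmm-torch, and the model assumes launches are not overlapped with kernel execution.

```python
def estimated_graph_speedup(n_kernels, launch_overhead_us, kernel_time_us,
                            graph_launch_overhead_us=None):
    """Crude best-case estimate of the speed-up from capturing n_kernels in one CUDA graph.

    Without graphs, every kernel is assumed to pay launch_overhead_us.
    With graphs, the whole sequence is assumed to pay one launch overhead.
    """
    if graph_launch_overhead_us is None:
        graph_launch_overhead_us = launch_overhead_us
    without_graph = n_kernels * (launch_overhead_us + kernel_time_us)
    with_graph = graph_launch_overhead_us + n_kernels * kernel_time_us
    return without_graph / with_graph


# Many very short kernels: launch overhead dominates, so graphs can help a lot.
many_short = estimated_graph_speedup(n_kernels=200, launch_overhead_us=5.0, kernel_time_us=2.0)

# A few long kernels: launch overhead is negligible, so graphs barely matter.
few_long = estimated_graph_speedup(n_kernels=5, launch_overhead_us=5.0, kernel_time_us=1000.0)

print(f"many short kernels: {many_short:.2f}x, few long kernels: {few_long:.3f}x")
```

Under these assumptions, a model dominated by a handful of long kernels would show essentially no gain even when graphs are active, so a missing speed-up alone does not prove that graphs are disabled.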

xiaowei-xie2 commented 3 months ago

Thank you for the reply. I did some tests with a simple example, and I think in my case CUDA graphs were not used at all. Instead of adding the force to the system directly, I create it through a workaround that is needed for Hamiltonian REMD to work (see https://github.com/openmm/openmm-torch/issues/147).

Specifically I did

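import copy
import openmm
from openmmtorch import TorchForce

force = TorchForce('../model_simple.pt')
force.setProperty("useCUDAGraphs", "true")

# Wrap the TorchForce in a CustomCVForce (workaround for issue #147).
cv = openmm.CustomCVForce("")

tempSystem = openmm.System()
tempSystem.addForce(force)
interactingVarNames = []
for idx, innerForce in enumerate(tempSystem.getForces()):
    name = f"allForce{idx+1}"
    # Add a copy of each force as a collective variable.
    cv.addCollectiveVariable(name, copy.deepcopy(innerForce))
    interactingVarNames.append(name)

assert len(interactingVarNames) > 0

# The energy is the sum of all wrapped forces.
interactingSum = "+".join(interactingVarNames)

cv.setEnergyFunction(
    f"({interactingSum})"
)
system.addForce(cv)  # system is the openmm.System being simulated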

If I just do

force = TorchForce('../model_simple.pt')
force.setProperty("useCUDAGraphs", "true")

and run regular MD, it is twice as fast with CUDA graphs. But when I use the above workaround, it is not faster at all.

Any idea why my workaround is a problem?

Thank you, Xiaowei

peastman commented 3 months ago

I can't think of any reason that wrapping it in a CustomCVForce would affect this. Aside from CUDA graphs, how much does wrapping it affect the speed? CustomCVForce does add overhead and require extra synchronization.

peastman commented 3 months ago

A less expensive workaround for #147 is to also define the same global parameter in another force. For example, you could use an empty CustomBondForce with no bonds.

force = CustomBondForce('0')
force.addGlobalParameter('myparam', 0)
system.addForce(force)

The parameter should have the same name and default value as in the TorchForce. You're making a second force that uses the same parameter so openmmtools will be able to identify it.

xiaowei-xie2 commented 3 months ago

I was only testing on a toy force and wrapping it with CustomCVForce didn't affect the speed much. I am not sure how much it affects the speed for the actual ML force field yet.

I tried your other workaround, but it's giving me another error:

Traceback (most recent call last):
  File "/scr/xie1/test_REMD/test_cuda_graph_regularmd/torchforce_workaround.py", line 146, in <module>
    simulation.run()
  File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/multistate/multistatesampler.py", line 755, in run
    raise e
  File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/multistate/multistatesampler.py", line 745, in run
    self._compute_energies()
  File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/utils/utils.py", line 95, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/multistate/multistatesampler.py", line 1425, in _compute_energies
    new_energies, replica_ids = mpiplus.distribute(self._compute_replica_energies, range(self.n_replicas),
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xie1/miniconda3/lib/python3.12/site-packages/mpiplus/mpiplus.py", line 523, in distribute
    all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/multistate/multistatesampler.py", line 1458, in _compute_replica_energies
    context, integrator = self.energy_context_cache.get_context(compatible_group[0])
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/cache.py", line 451, in get_context
    context = thermodynamic_state.create_context(integrator, self._platform, self._platform_properties)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmmtools/states.py", line 1177, in create_context
    return openmm.Context(system, integrator)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmm/openmm.py", line 12171, in __init__
    _openmm.Context_swiginit(self, _openmm.new_Context(*args))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
openmm.OpenMMException: Two Forces define different default values for the parameter 'param_a'

although I have set the default parameters to be the same through

force = TorchForce(module)
force.addGlobalParameter('param_a', 0.5)
force.addGlobalParameter('param_b', 0.5)
system.addForce(force)
bond_force = CustomBondForce('0')
bond_force.addGlobalParameter('param_a', 0.5)
bond_force.addGlobalParameter('param_b', 0.5)
system.addForce(bond_force)

Any idea what is going wrong?

The full test files are also attached.

test.tar.gz

peastman commented 3 months ago

I think this is openmmtools confusing it. You also use it to specify a different default value:

    param_a = GlobalParameterState.GlobalParameter('param_a', standard_value=1.0)
    param_b = GlobalParameterState.GlobalParameter('param_b', standard_value=1.0)

As far as I can tell from the code, I think that causes it to loop over all the forces it has identified as having a global parameter with that name and call setGlobalParameterDefaultValue() on them. It changes the default value for the CustomBondForce, but not for the TorchForce, leading to that error. You need to give them the same default value in both places.

xiaowei-xie2 commented 3 months ago

I changed those two lines to the following but still got the same error...

    param_a = GlobalParameterState.GlobalParameter('param_a', standard_value=0.5)
    param_b = GlobalParameterState.GlobalParameter('param_b', standard_value=0.5)

peastman commented 3 months ago

I tried serializing the System, and I found openmmtools had changed the default values for the two parameters to 1 and 4:

<Force energy="0" forceGroup="0" name="CustomBondForce" type="CustomBondForce" usesPeriodic="0" version="3">
  <PerBondParameters/>
  <GlobalParameters>
    <Parameter default="1" name="param_a"/>
    <Parameter default="4" name="param_b"/>
  </GlobalParameters>
  <EnergyParameterDerivatives/>
  <Bonds/>
</Force>

I think it's because those are the first values in your schedule:

lambda_schedule_a = np.array([1, 2, 3])
lambda_schedule_b = np.array([4, 5, 6])

If I change it to use those values both for the TorchForce and for the GlobalParameter objects, then it runs successfully.
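One way to spot this kind of mismatch without serializing the System is to collect the default value that each force declares for every global parameter. The helper below is a sketch; its name is made up for illustration. It relies only on the getNumGlobalParameters(), getGlobalParameterName() and getGlobalParameterDefaultValue() accessors, and skips any force that does not provide them.

```python
from collections import defaultdict


def conflicting_global_defaults(forces):
    """Return {parameter name: sorted distinct defaults} for parameters whose defaults disagree."""
    defaults = defaultdict(set)
    for force in forces:
        # Forces that define no global parameters lack these accessors.
        if not hasattr(force, "getNumGlobalParameters"):
            continue
        for i in range(force.getNumGlobalParameters()):
            name = force.getGlobalParameterName(i)
            defaults[name].add(force.getGlobalParameterDefaultValue(i))
    # Keep only the parameters declared with more than one distinct default.
    return {name: sorted(values) for name, values in defaults.items() if len(values) > 1}
```

Calling it as conflicting_global_defaults(system.getForces()) on the System that openmmtools ends up with would be expected to list param_a and param_b in the failing case, and to return an empty dict once the TorchForce defaults match the first values of the schedule.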

xiaowei-xie2 commented 3 months ago

Thank you so much. You are right that this workaround is less expensive than wrapping with CustomCVForce (~0.85 times the cost for this toy example on my desktop).

Would you mind also testing this script with useCUDAGraphs turned on to see if there is any speed-up? I did not see any speed-up on my end. If I run regular MD with the same potential, I do see a twofold speed-up with CUDA graphs.
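For comparisons like this, a small timing helper keeps the measurement the same between the run with useCUDAGraphs and the run without it. The sketch below is generic Python and not part of OpenMM or openmmtools; the function name and its arguments are hypothetical.

```python
import time


def steps_per_second(run_steps, n_steps=1000, n_repeats=3):
    """Call run_steps(n_steps) n_repeats times and return the best throughput in steps/s."""
    best = float("inf")
    for _ in range(n_repeats):
        start = time.perf_counter()
        run_steps(n_steps)
        best = min(best, time.perf_counter() - start)
    return n_steps / best
```

For a regular MD run, run_steps could be something like lambda n: (simulation.step(n), simulation.context.getState(getEnergy=True)), assuming simulation is an openmm.app.Simulation. The getState call is there so that the GPU work has finished before the timer stops.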

xiaowei-xie2 commented 3 months ago

I am pretty sure CUDA graphs are not used with this workaround either. If I change my TorchForce to contain an offending operation (one that fails during graph capture, torch.inverse in this example), it does not raise an error:

class ForceModule(torch.nn.Module):
    """A central harmonic force with a user-defined global scale parameter"""
    def forward(self, positions, boxvectors, param_a, param_b):
        """The forward method returns the energy computed from positions.

        Parameters
        ----------
        positions : torch.Tensor with shape (nparticles,3)
           positions[i,k] is the position (in nanometers) of spatial dimension k of particle i
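        boxvectors : torch.Tensor with shape (3,3)
           The periodic box vectors (in nanometers)
        param_a, param_b : torch.Scalar
           Scalar tensors defined by 'TorchForce.addGlobalParameter'.
           Here, param_a scales the contribution to the potential and param_b is added as a constant offset.
           Note that parameters are passed in the order defined by `TorchForce.addGlobalParameter`, not by name.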

        Returns
        -------
        potential : torch.Scalar
           The potential energy (in kJ/mol)
        """
        cell_inverse = torch.inverse(boxvectors)
        boxsize = cell_inverse.diag()
        periodicPositions = positions - torch.floor(positions/boxsize)*boxsize

        return param_a*torch.sum(periodicPositions**2) + param_b

Whereas if I run a regular MD, it gives the following error:

Traceback (most recent call last):
  File "/scr/xie1/test_REMD/test_cuda_graph_regularmd/torchforce_workaround_regularmd.py", line 131, in <module>
    simulation.context.setVelocitiesToTemperature(temperature) # This does not work (https://github.com/openmm/openm>
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xie1/miniconda3/lib/python3.12/site-packages/openmm/openmm.py", line 7800, in setVelocitiesToTemperature
    return _openmm.Context_setVelocitiesToTemperature(self, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
openmm.OpenMMException: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/conda/feedstock_root/build_artifacts/libtorch_1718580525958/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xaa (0x7fecbcf53b5a in /home/xie1/miniconda3/lib/python3.12/site-packages/../.././libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7fecbcefec90 in /home/xie1/miniconda3/lib/python3.12/site-packages/../.././libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3fe (0x7fec5d49494e in /home/xie1/miniconda3/lib/python3.12/site-packages/../../././libc10_cuda.so)
frame #3: at::cuda::CUDAGraph::capture_end() + 0xad (0x7fec661ab1fd in /home/xie1/miniconda3/lib/python3.12/site-packages/../.././libtorch_cuda.so)
frame #4: <unknown function> + 0x9cd7 (0x7feb723f4cd7 in /home/xie1/miniconda3/lib/plugins/libOpenMMTorchCUDA.so)
frame #5: OpenMM::ContextImpl::calcForcesAndEnergy(bool, bool, int) + 0xc9 (0x7fecbd2ec159 in /home/xie1/miniconda3/lib/python3.12/site-packages/../../libOpenMM.so.8.1)
frame #6: OpenMM::Context::setVelocitiesToTemperature(double, int) + 0xcc (0x7fecbd2e8c3c in /home/xie1/miniconda3/lib/python3.12/site-packages/../../libOpenMM.so.8.1)
frame #7: <unknown function> + 0x12b834 (0x7febf5655834 in /home/xie1/miniconda3/lib/python3.12/site-packages/openmm/_openmm.cpython-312-x86_64-linux-gnu.so)
<omitting python frames>
frame #19: <unknown function> + 0x29d90 (0x7fecbdba2d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: __libc_start_main + 0x80 (0x7fecbdba2e40 in /lib/x86_64-linux-gnu/libc.so.6)

peastman commented 3 months ago

I think I see the problem. CustomCVForce uses XmlSerializer.clone() to make copies of the forces to add to its inner context. The serialization proxy for TorchForce doesn't copy the properties, so the useCUDAGraphs property doesn't get included on the copy. Let me fix that!
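A quick way to check whether a given openmm-torch build is affected is to compare the properties of a TorchForce with those of a copy of it. The helper below is a sketch with a hypothetical name. It assumes that getProperties() returns a mapping that dict() can consume.

```python
def properties_lost_in_copy(original, copy):
    """Return the properties set on `original` that are missing or different on `copy`."""
    before = dict(original.getProperties())
    after = dict(copy.getProperties())
    return {name: value for name, value in before.items() if after.get(name) != value}
```

With OpenMM installed, the copy could be produced by a serialize/deserialize round trip through openmm.XmlSerializer, on the assumption that this goes through the same serialization proxy as the clone made by CustomCVForce. An empty result would mean that useCUDAGraphs survived the copy.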

peastman commented 3 months ago

The fix is in #152. Can you try it out and see if it fixes the problem for you?

xiaowei-xie2 commented 3 months ago

Thank you so much for the fix! Sorry for replying late - I was on vacation last week.

I am trying to test out your solution, but I am having trouble compiling the package from source (I assume conda install will not incorporate your fix?). Specifically I am getting the following error:

 CMake Warning at
 /home/xie1/miniconda3/lib/python3.12/site-packages/torch/share/cmake/Caffe2/FindCUDAToolkit.cmake:957
 (message):
   Could not find librt library, needed by CUDA::cudart_static
 Call Stack (most recent call first):
   /home/xie1/miniconda3/lib/python3.12/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:59
 (find_package)
   /home/xie1/miniconda3/lib/python3.12/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:86 (include)
   /home/xie1/miniconda3/lib/python3.12/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68
 (find_package)
   CMakeLists.txt:15 (FIND_PACKAGE)

I think I have librt already installed:

xie1@desk-lu463:~/openmm-torch$ ls /usr/lib/x86_64-linux-gnu/librt.*
/usr/lib/x86_64-linux-gnu/librt.a  /usr/lib/x86_64-linux-gnu/librt.so.1

Any idea how to get around this error?

Thank you!

peastman commented 3 months ago

I don't think librt has any connection to cudart. Do you have the CUDA toolkit installed? See http://docs.openmm.org/latest/userguide/library/02_compiling.html#cuda-or-opencl-support.

xiaowei-xie2 commented 3 months ago

I started over and I don't see that error anymore, but I saw another error close to the end of the build.

[ 19%] Built target OpenMMTorch
[ 19%] Built target CopyTestFiles
[ 26%] Built target TestSerializeTorchForce
[ 38%] Built target OpenMMTorchReference
[ 46%] Built target TestReferenceTorchForce
[ 50%] Linking CXX shared library ../../libOpenMMTorchOpenCL.so
[ 65%] Built target OpenMMTorchOpenCL
[ 69%] Linking CXX executable ../../../TestOpenCLTorchForce
[ 73%] Built target TestOpenCLTorchForce
[ 76%] Building CXX object platforms/cuda/CMakeFiles/OpenMMTorchCUDA.dir/src/CudaTorchKernelFactory.cpp.o
In file included from /home/xie1/miniconda3/include/c10/cuda/CUDADeviceAssertionHost.h:3,
                 from /home/xie1/miniconda3/include/c10/cuda/CUDAException.h:3,
                 from /home/xie1/miniconda3/include/c10/cuda/CUDAFunctions.h:12,
                 from /home/xie1/miniconda3/include/c10/cuda/CUDAStream.h:10,
                 from /home/xie1/miniconda3/include/c10/cuda/CUDAGraphsC10Utils.h:3,
                 from /home/xie1/miniconda3/include/ATen/cuda/CUDAGraph.h:5,
                 from /home/xie1/openmm-torch/platforms/cuda/src/CudaTorchKernels.h:39,
                 from /home/xie1/openmm-torch/platforms/cuda/src/CudaTorchKernelFactory.cpp:35:
/home/xie1/miniconda3/include/c10/cuda/CUDAMacros.h:8:10: fatal error: c10/cuda/impl/cuda_cmake_macros.h: No such file or directory
    8 | #include <c10/cuda/impl/cuda_cmake_macros.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [platforms/cuda/CMakeFiles/OpenMMTorchCUDA.dir/build.make:83: platforms/cuda/CMakeFiles/OpenMMTorchCUDA.dir/src/CudaTorchKernelFactory.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:409: platforms/cuda/CMakeFiles/OpenMMTorchCUDA.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

I installed cudatoolkit, libtorch and pytorch through conda:

conda install pytorch pytorch-cuda=12.4 cuda-toolkit=12.4 libtorch=2.3.0 -c pytorch-nightly -c nvidia -c conda-forge

Here are the configurations I used for cmake:

CMAKE_BUILD_TYPE                                                                                               
 CMAKE_INSTALL_PREFIX             /home/xie1/miniconda3                                                         
 CUDA_HOST_COMPILER               /usr/bin/cc                                                                   
 CUDA_SDK_ROOT_DIR                CUDA_SDK_ROOT_DIR-NOTFOUND                                                    
 CUDA_TOOLKIT_ROOT_DIR            /home/xie1/miniconda3                                                         
 CUDA_USE_STATIC_CUDA_RUNTIME     ON                                                                            
 Caffe2_DIR                       /home/xie1/miniconda3/share/cmake/Caffe2                                      
 MKLDNN_DIR                       MKLDNN_DIR-NOTFOUND                                                           
 NN_BUILD_CUDA_LIB                ON                                                                            
 NN_BUILD_OPENCL_LIB              ON                                                                            
 NN_BUILD_PYTHON_WRAPPERS         ON                                                                            
 OPENCL_INCLUDE_DIR               /home/xie1/miniconda3/include                                                 
 OPENCL_LIBRARY                   /home/xie1/miniconda3/lib/libOpenCL.so                                        
 OPENMM_DIR                       /home/xie1/miniconda3                                                         
 PYTHON_EXECUTABLE                /home/xie1/miniconda3/bin/python                                              
 PYTORCH_DIR                                                                                                    
 Protobuf_DIR                     /home/xie1/miniconda3/lib/cmake/protobuf                                      
 SWIG_EXECUTABLE                  /home/xie1/miniconda3/bin/swig                                                
 TORCH_LIBRARY                    /home/xie1/miniconda3/lib/libtorch.so                                         
 Torch_DIR                        /home/xie1/miniconda3/share/cmake/Torch                                       
 absl_DIR                         /home/xie1/miniconda3/lib/cmake/absl                                          
 c10_LIBRARY                      /home/xie1/miniconda3/lib/libc10.so                                           
 utf8_range_DIR                   /home/xie1/miniconda3/lib/cmake/utf8_range

I don't see c10/cuda/impl/cuda_cmake_macros.h in the miniconda3/include directory, whereas if I download libtorch from the official pytorch website I do see that file. But I cannot figure out how to use the downloaded libtorch (setting c10_LIBRARY and TORCH_LIBRARY doesn't seem to work).

Is conda installing cudatoolkit/libtorch a problem? Would you mind sharing some details on how to install libtorch from the downloaded zip file from the official website?

peastman commented 3 months ago

-c pytorch-nightly -c nvidia -c conda-forge

That isn't going to work correctly. Packages in conda-forge tend to be compiled differently than in other channels. It can't be mixed with other channels. It has its own builds of both PyTorch and the CUDA libraries, so you shouldn't need to mix.

xiaowei-xie2 commented 2 months ago

Thank you, I was able to compile it by using only the conda-forge channel. I think it is using CUDA graphs correctly now: I see a slight speed-up, and it now also raises an error for the torch.inverse operation.

Thank you so much for the fix!

peastman commented 2 months ago

Great, thanks!

openmm / openmm-torch

Question about how to use CUDA graph #151