Segmentation fault upon creating the Context when adding both RMSD biased force and Torch Force

JustinAiras commented 2 years ago

I've been using OpenMM 7.7.0 and OpenMM-Torch 0.8 successfully to run a PyTorch model. However, when I add an RMSD biasing force to the system in addition to the TorchForce, I get a segmentation fault upon creating the Context. The RMSD biasing force has also worked on its own without issue. My system setup is as follows:

# Import openmm libraries
from openmm.app import *
from openmm import *
from openmm.unit import *
from sys import stdout

# Import OpenMM-Torch
from openmmtorch import TorchForce

# Import torch_cluster (from PyTorch-Geometric)
from torch_cluster import radius_graph

# Import struct / force fields
pdb = PDBFile('struct.pdb')
ff = ForceField('amber14-all.xml')

# Build system
system = ff.createSystem(pdb.topology, nonbondedMethod=NoCutoff, constraints=HBonds)

# Initialize the TorchForce
ml_model = TorchForce('model.pt')
scaler = 1

# Create TorchForce as a CustomCVForce
U_ml = CustomCVForce('scaler*ml_model')

# Add parameters to the CustomCVForce
U_ml.addCollectiveVariable('ml_model', ml_model)
U_ml.addGlobalParameter('scaler', scaler)

# Add force to the system
system.addForce(U_ml)

# Loading reference positions for RMSD force
ref_coords = pdb.positions

# Get atom indices of backbone heavy atoms for RMSD calculation
atom_idx = []
idx = 0
for atom in pdb.topology.atoms():
    if atom.name == 'CA':
        atom_idx.append(idx)
    if atom.name == 'C':
        atom_idx.append(idx)
    if atom.name == 'N':
        atom_idx.append(idx)
    if atom.name == 'O':
        atom_idx.append(idx)
    idx = idx + 1

# Set RMSD calculation / initialize k_rmsd / rmsd_0
rmsd = RMSDForce(ref_coords, atom_idx)
k_rmsd = 1000  # (kJ / mol / nm^2)
rmsd_0 = 0.2   # (nm)

# Create harmonic RMSD-biasing force as CustomCVForce 
U_rmsd = CustomCVForce('0.5*k_rmsd*(rmsd - rmsd_0)^2')

# Add parameters to the CustomCVForce
U_rmsd.addCollectiveVariable('rmsd', rmsd)
U_rmsd.addGlobalParameter('k_rmsd', k_rmsd)
U_rmsd.addGlobalParameter('rmsd_0', rmsd_0)

# Add force to the system
system.addForce(U_rmsd)

# Create the integrator / platform
integrator = LangevinMiddleIntegrator(340*kelvin, 1/picosecond, 0.0025*picoseconds)
platform = Platform.getPlatformByName('Reference')

# Build simulation
sim = Simulation(pdb.topology, system, integrator, platform)

As stated above, building the Context with Simulation results in a segmentation fault. I've tried implementing this in various other ways that have led to the same result. The following lists other ways of implementing these forces that I've tried:

Using OpenMM 8.0 Beta and OpenMM-Torch 1.0 Beta
Adding the TorchForce directly without using CustomCVForce: system.addForce(ml_model)
Adding the TorchForce and RMSD force as collective variables of a single CustomCVForce: U_rmsd_ml = CustomCVForce('scaler*ml_model + 0.5*k_rmsd*(rmsd - rmsd_0)^2')
Effectively turning off the TorchForce by setting scaler = 0
Building the Context without using Simulation: context = Context(system, integrator, platform)
Using the CPU platform
Switching the order in which I add the forces

All of this results in the same segmentation fault when the Context is built. Again, the model will run without issue when added independently to the system, as will the RMSD-biasing force. Any help with this issue would be greatly appreciated!
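
When working through variants like the ones listed above, it can help to run each one in a child process so that a crash is recorded instead of killing the driving script. The following is only a sketch: the helper name run_and_classify is hypothetical, and the "crashing" child simply sends SIGSEGV to itself as a stand-in for the real simulation script. It relies on the POSIX behaviour that subprocess reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess
import sys


def run_and_classify(code):
    """Run a Python snippet in a child interpreter and report how it exited."""
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )
    if result.returncode < 0:
        # On POSIX, a negative return code -N means "killed by signal N"
        return "killed by " + signal.Signals(-result.returncode).name
    return "exit code %d" % result.returncode


# Stand-in for a crashing script: the child sends SIGSEGV to itself
crash = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

print(run_and_classify("print('ok')"))
print(run_and_classify(crash))
```

In practice the snippet passed to the child would be replaced by a call that runs one variant of the setup script.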

The files struct.pdb and model.pt can be found in the following zipped folder: struct_model.zip

raimis commented 2 years ago

Could you share struct.pdb and a script to generate model.pt, so that it is possible to reproduce the issue?

raimis commented 2 years ago

Also, could you add the imports to the script, so that it is possible to run it?

JustinAiras commented 2 years ago

I've edited my original post to include the imports and the files struct.pdb and model.pt.

peastman commented 2 years ago

Your script runs fine for me using the latest code for OpenMM and for this plugin. I notice your model uses the torch_cluster package. How did you install it? Possibly it was compiled in a way that's incompatible with this plugin. Can you post the output of conda list?

Try running your script inside gdb. Let it run until it hits the segfault, then type bt to get a stack trace for where it happened and post it here.
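
As a lighter-weight complement to gdb (not a replacement for the native backtrace that bt gives), Python's standard-library faulthandler module can print the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV. A minimal sketch, assuming these lines are placed at the very top of the script:

```python
import faulthandler
import sys

# Dump the Python traceback of every thread to stderr if the process
# receives a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL)
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... the rest of the simulation script follows unchanged ...
```

The same handler can be switched on without editing the script by launching it with python -X faulthandler. This only shows which Python line was executing at the time of the crash; the C++ frames still have to come from gdb.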

JustinAiras commented 2 years ago

I installed torch_cluster into a clean conda environment with OpenMM 8.0 beta and OpenMM-Torch 1.0 beta as follows:

conda create -n torch_omm8b openmm openmm-torch -c "conda-forge/label/openmm_rc" -c "conda-forge/label/openmm-torch_rc"

conda install scipy
conda install mdtraj -c conda-forge

pip install torch-cluster -f https://data.pyg.org/whl/torch-1.11.0+cu112.html

The following text file contains the output from conda list: conda_list_omm8b_env.txt

and the following text file contains the backtrace from running my script in gdb: gdb_bt_omm8b_env.txt

peastman commented 2 years ago

That build is likely incompatible with packages from conda-forge. Try installing it like this instead:

conda install -c conda-forge pytorch_cluster
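
To check which builds are actually present in an environment, the package metadata can be queried from Python. This is a rough sketch: describe_package is a hypothetical helper, the distribution names "torch" and "torch-cluster" are assumptions about how the packages register themselves, and the INSTALLER file it reads is written by pip and, as far as I know, by most conda packages, so a missing value should be read as "unknown" and not as proof of anything.

```python
from importlib import metadata


def describe_package(name):
    """Return the version and recorded installer of a distribution.

    Both values are None if the distribution is not installed.
    """
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return {"name": name, "version": None, "installer": None}
    installer = dist.read_text("INSTALLER")  # e.g. "pip" or "conda"
    return {
        "name": name,
        "version": dist.version,
        "installer": installer.strip() if installer else None,
    }


# Distribution names below are assumptions; adjust them to the environment
for pkg in ("torch", "torch-cluster"):
    print(describe_package(pkg))
```

A torch build from one source combined with a torch-cluster wheel built against a different torch version is the kind of mismatch this is meant to surface.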

raimis commented 2 years ago

I have created the environment:

conda env create mmh/openmm-8-beta-linux
conda activate openmm-8-beta-linux
conda install -c conda-forge pytorch_cluster

The script works without problems.

@JustinAiras, try creating a new environment as indicated, using the latest conda (22.9.0).

JustinAiras commented 2 years ago

I've run the exact set of commands you've provided using conda 22.9.0, but after from torch_cluster import radius_graph I get the following error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/airasj/anaconda3/envs/openmm-8-beta-linux/lib/python3.10/site-packages/torch_cluster/__init__.py", line 18, in <module>
    torch.ops.load_library(spec.origin)
  File "/home/airasj/anaconda3/envs/openmm-8-beta-linux/lib/python3.10/site-packages/torch/_ops.py", line 220, in load_library
    ctypes.CDLL(path)
  File "/home/airasj/anaconda3/envs/openmm-8-beta-linux/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/airasj/anaconda3/envs/openmm-8-beta-linux/lib/python3.10/site-packages/torch_cluster/_grid_cuda.so: undefined symbol: _ZN3c106detail19maybe_wrap_dim_slowEllb

raimis commented 2 years ago

@JustinAiras this might be a conda issue (https://github.com/openmm/openmm-torch/issues/88#issuecomment-1310477870). Could you try to install with mamba?

JustinAiras commented 2 years ago

Thank you, installing with mamba solved my most immediate issue, and I now can run MD with a TorchForce and RMSD-biasing force without encountering a segmentation fault.

I installed mamba into the base environment of a clean miniconda install, and created a new environment as follows:

mamba create -n torch_omm8b openmm openmm-torch pytorch_cluster -c "conda-forge/label/openmm_rc" -c "conda-forge/label/openmm-torch_rc" -c conda-forge

Note that this also worked with a mambaforge installation, but differences in cluster permissions required me to use miniconda. Also note that pytorch_cluster needs to be installed at the same time as openmm-torch as I get the following error if doing otherwise:

- nothing provides __cuda needed by pytorch-1.12.1-cuda102py310ha664643_201

For my purposes (I only need to use the CPU platform), installing with the above command resolves my issue. However, I still get issues if I try to use the CUDA platform. Upon building the simulation, I get the following error:

  File "/home/gridsan/jairas/work/small_prot_MD/chignolin/MD/torch_md/best_model/umbrella/rmsd_bias/GPU/torch_umb.py", line 79, in <module>
    sim = Simulation(pdb.topology, system, integrator, platform)
  File "/home/gridsan/jairas/miniconda3/envs/torch_omm8b/lib/python3.9/site-packages/openmm/app/simulation.py", line 101, in __init__
    self.context = mm.Context(self.system, self.integrator, platform)
  File "/home/gridsan/jairas/miniconda3/envs/torch_omm8b/lib/python3.9/site-packages/openmm/openmm.py", line 3530, in __init__
    _openmm.Context_swiginit(self, _openmm.new_Context(*args))
openmm.OpenMMException: Error loading CUDA module: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)

Given the similarities between how CUDA is installed on the cluster I use and the setup discussed in issue https://github.com/openmm/openmm-torch/issues/88#issuecomment-1310625318, I suspect the solution to this problem might lie somewhere there.

sef43 commented 2 years ago

This sounds like an issue with the CUDA toolkit version; see this issue from OpenMM: 3585. You will need to find out which driver and CUDA versions are installed on the cluster you are using, probably by running nvidia-smi on a compute node, and then tell conda to install a compatible cudatoolkit, e.g. mamba install -c conda-forge openmm cudatoolkit=10.X
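
To read the CUDA version out of the nvidia-smi banner programmatically, something like the following could work. It is a sketch under assumptions: parse_cuda_version is a hypothetical helper, and the sample header line is illustrative and was not taken from the cluster discussed here.

```python
import re


def parse_cuda_version(nvidia_smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Illustrative header line in the format nvidia-smi prints (assumed values)
sample = "| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |"
print(parse_cuda_version(sample))
```

The version shown by nvidia-smi is the highest CUDA version the installed driver supports, so the cudatoolkit requested from conda should not be newer than it.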

openmm / openmm-torch

Segmentation fault upon creating the Context when adding both RMSD biased force and Torch Force #87