Open JustinAiras opened 2 years ago
Could you share struct.pdb and a script to generate model.pt, so that it is possible to reproduce the issue?
Also, could you add the imports to the script, so that it is possible to run it?
I've edited my original post to include the imports and the files struct.pdb and model.pt.
Your script runs fine for me using the latest code for OpenMM and for this plugin. I notice your model uses the torch_cluster package. How did you install it? Possibly it was compiled in a way that's incompatible with this plugin. Can you post the output of conda list?
Try running your script inside gdb. Let it run until it hits the segfault, then type bt to get a stack trace for where it happened and post it here.
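As a complement to gdb, Python's standard-library faulthandler module can print the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV. This is not a replacement for the gdb backtrace requested above (gdb shows the native frames, which is what matters for a crash inside a compiled plugin); it is only a minimal sketch for narrowing down which Python call triggers the crash.

```python
import faulthandler

# Ask the interpreter to dump the Python traceback of every thread to
# stderr if the process receives a fatal signal such as SIGSEGV.
faulthandler.enable()

# The rest of the script (imports, system setup, Context creation)
# would follow here unchanged.
```

The same handler can be enabled without editing the script by starting the interpreter with python -X faulthandler.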
I installed torch_cluster into a clean conda environment with OpenMM 8.0 beta and OpenMM-Torch 1.0 beta as follows:
conda create -n torch_omm8b openmm openmm-torch -c "conda-forge/label/openmm_rc" -c "conda-forge/label/openmm-torch_rc"
conda install scipy
conda install mdtraj -c conda-forge
pip install torch-cluster -f https://data.pyg.org/whl/torch-1.11.0+cu112.html
The output from conda list is in the following text file: conda_list_omm8b_env.txt
The backtrace from running my script in gdb is in the following text file: gdb_bt_omm8b_env.txt
That build is likely incompatible with packages from conda-forge. Try installing it like this instead.
conda install -c conda-forge pytorch_cluster
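One way to spot this kind of mismatch before running anything is to compare the PyTorch version a pip wheel index was built for (the torch-1.11.0+cu112 tag in the URL used earlier) with the PyTorch version actually installed in the environment. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the installed distribution is registered under the name torch.

```python
import re
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def parse_wheel_tag(tag):
    """Split a tag such as 'torch-1.11.0+cu112' into (version, build)."""
    match = re.fullmatch(r"torch-(\d+\.\d+\.\d+)\+(\w+)", tag)
    if match is None:
        return None
    return match.group(1), match.group(2)


def same_major_minor(version_a, version_b):
    """True if two version strings share the same major.minor prefix."""
    def strip(version):
        # Drop any local build suffix ('+cu112') and keep major.minor.
        return version.split("+")[0].split(".")[:2]
    return strip(version_a) == strip(version_b)


if __name__ == "__main__":
    wheel_version, wheel_build = parse_wheel_tag("torch-1.11.0+cu112")
    torch_version = installed_version("torch")  # assumed distribution name
    print("wheel built for torch", wheel_version, wheel_build)
    print("installed torch:", torch_version)
    if torch_version is not None:
        print("major.minor match:", same_major_minor(torch_version, wheel_version))
```

A matching major.minor version does not guarantee binary compatibility (compiler ABI and CUDA build also matter), so a mismatch is a useful warning sign rather than a complete check.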
I have created the environment:
conda env create mmh/openmm-8-beta-linux
conda activate openmm-8-beta-linux
conda install -c conda-forge pytorch_cluster
The script works without problem.
@JustinAiras try to create a new environment as indicated with the latest (22.9.0) conda.
I've run the exact set of commands you've provided using conda 22.9.0, but after from torch_cluster import radius_graph I get the following error message:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/airasj/anaconda3/envs/openmm-8-beta-linux/lib/python3.10/site-packages/torch_cluster/__init__.py", line 18, in <module>
    torch.ops.load_library(spec.origin)
  File "/home/airasj/anaconda3/envs/openmm-8-beta-linux/lib/python3.10/site-packages/torch/_ops.py", line 220, in load_library
    ctypes.CDLL(path)
  File "/home/airasj/anaconda3/envs/openmm-8-beta-linux/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/airasj/anaconda3/envs/openmm-8-beta-linux/lib/python3.10/site-packages/torch_cluster/_grid_cuda.so: undefined symbol: _ZN3c106detail19maybe_wrap_dim_slowEllb
@JustinAiras this might be a conda issue (https://github.com/openmm/openmm-torch/issues/88#issuecomment-1310477870). Could you try to install with mamba?
Thank you, installing with mamba solved my most immediate issue, and I now can run MD with a TorchForce and RMSD-biasing force without encountering a segmentation fault.
I installed mamba into the base environment of a clean miniconda install, and created a new environment as follows:
mamba create -n torch_omm8b openmm openmm-torch pytorch_cluster -c "conda-forge/label/openmm_rc" -c "conda-forge/label/openmm-torch_rc" -c conda-forge
Note that this also worked with a mambaforge installation, but differences in cluster permissions required me to use miniconda. Also note that pytorch_cluster needs to be installed at the same time as openmm-torch, as I get the following error otherwise:
- nothing provides __cuda needed by pytorch-1.12.1-cuda102py310ha664643_201
For my purposes (I only need to use the CPU platform), installing with the above command resolves my issue. However, I still get issues if I try to use the CUDA platform. Upon building the simulation, I get the following error:
  File "/home/gridsan/jairas/work/small_prot_MD/chignolin/MD/torch_md/best_model/umbrella/rmsd_bias/GPU/torch_umb.py", line 79, in <module>
    sim = Simulation(pdb.topology, system, integrator, platform)
  File "/home/gridsan/jairas/miniconda3/envs/torch_omm8b/lib/python3.9/site-packages/openmm/app/simulation.py", line 101, in __init__
    self.context = mm.Context(self.system, self.integrator, platform)
  File "/home/gridsan/jairas/miniconda3/envs/torch_omm8b/lib/python3.9/site-packages/openmm/openmm.py", line 3530, in __init__
    _openmm.Context_swiginit(self, _openmm.new_Context(*args))
openmm.OpenMMException: Error loading CUDA module: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)
Given similarities to how CUDA is installed on the cluster I use and those discussed in issue https://github.com/openmm/openmm-torch/issues/88#issuecomment-1310625318, I suspect the solution to this problem might lie somewhere there.
This sounds like an issue with the CUDA toolkit version, see this issue from OpenMM: 3585
You will need to find out what drivers and CUDA version are installed on the cluster you are using, probably by running nvidia-smi on a compute node.
Then tell conda to install a compatible cudatoolkit, e.g. mamba install -c conda-forge openmm cudatoolkit=10.X
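The driver check can also be scripted. The sketch below runs nvidia-smi and pulls the "CUDA Version" field out of its banner, which reports the highest CUDA version the installed driver supports; the cudatoolkit chosen should not be newer than that. The function names are hypothetical, and the script returns None on machines where nvidia-smi is not on the PATH (for example a login node without a GPU).

```python
import re
import shutil
import subprocess


def parse_cuda_version(smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' banner field."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_cuda_version():
    """Run nvidia-smi and return the driver's CUDA version, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    return parse_cuda_version(result.stdout)


if __name__ == "__main__":
    print("Driver supports CUDA up to:", driver_cuda_version())
```

This has to be run on a compute node that actually has the GPU, since the login node may have a different driver or none at all.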
I've been using OpenMM 7.7.0 and OpenMM-Torch 0.8 successfully to run a PyTorch model. However, when I add an RMSD biasing force to the system alongside the TorchForce, I get a segmentation fault upon creating the Context. The RMSD biasing force on its own also works without issue. My system setup is as follows:
As stated above, building the Context with Simulation results in a segmentation fault. I've tried implementing this in various other ways that have led to the same result. The following lists other ways of implementing these forces that I've tried:
system.addForce(ml_model)
U_rmsd_ml = CustomCVForce('scaler*ml_model + 0.5*k_rmsd*(rmsd - rmsd_0)^2')
scaler = 0
context = Context(system, integrator, platform)
All of this results in the same segmentation fault when the Context is built. Again, the model will run without issue when added independently to the system, as will the RMSD-biasing force. Any help with this issue would be greatly appreciated!
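For readers unfamiliar with the CustomCVForce variant mentioned above, the following is a minimal sketch of how such a force could be assembled. It is not the original script, which is not reproduced here: the function name, argument names, and the reading of scaler as a global parameter set to 0 are assumptions, and k_rmsd and rmsd_0 are taken to be plain floats in OpenMM's default units (kJ/mol/nm^2 and nm).

```python
def make_combined_force(model_path, reference_positions, k_rmsd, rmsd_0, scaler=0.0):
    """Build one CustomCVForce mixing a TorchForce and an RMSD restraint.

    Hypothetical helper: the energy expression is the one quoted in the
    post, everything else is an assumption made for illustration.
    """
    # Imported lazily so this module loads even without OpenMM installed.
    from openmm import CustomCVForce, RMSDForce
    from openmmtorch import TorchForce

    force = CustomCVForce("scaler*ml_model + 0.5*k_rmsd*(rmsd - rmsd_0)^2")

    # Each collective variable is itself a Force whose energy is exposed
    # under the given name inside the expression above.
    force.addCollectiveVariable("ml_model", TorchForce(model_path))
    force.addCollectiveVariable("rmsd", RMSDForce(reference_positions))

    # Global parameters referenced by the expression.
    force.addGlobalParameter("scaler", scaler)
    force.addGlobalParameter("k_rmsd", k_rmsd)
    force.addGlobalParameter("rmsd_0", rmsd_0)
    return force
```

The returned force would then be added with system.addForce(...) before the Context is created, which is the step where the segmentation fault was reported.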
The files struct.pdb and model.pt can be found in the following zipped folder: struct_model.zip