Closed wiederm closed 2 years ago
Which platform are you using?
This probably means a plugin is failing to load, most likely because a dependent library can't be found. What is the value of Platform.getPluginLoadFailures()?
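For reference, a minimal sketch of how that value could be queried and summarized from Python. Platform.getPluginLoadFailures() is the call named above; the helper missing_dependencies and its regular expression are hypothetical, and assume the loader messages follow the "Error loading library <plugin>: <library>: cannot open shared object file" form.

```python
import re

# Assumed message shape: "Error loading library <plugin>: <library>: cannot open shared object file: ..."
_MISSING = re.compile(r"Error loading library (\S+): (\S+): cannot open shared object file")


def missing_dependencies(failures):
    """Map each plugin path to the shared library it could not open (hypothetical helper)."""
    result = {}
    for message in failures:
        match = _MISSING.search(message)
        if match:
            result[match.group(1)] = match.group(2)
    return result


try:
    import openmm
    failures = openmm.Platform.getPluginLoadFailures()
except ImportError:
    failures = ()  # OpenMM is not installed in this interpreter

for plugin, library in missing_dependencies(failures).items():
    print(f"{plugin} needs {library}")
```

An empty result only means that no message matched the assumed pattern; the raw strings are still worth reading in full.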
I am using the CUDA platform. The output is:
'Error loading library /data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchCUDA.so: libtorch.so: cannot open shared object file: No such file or directory', 'Error loading library /data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchOpenCL.so: libtorch.so: cannot open shared object file: No such file or directory', 'Error loading library /data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchReference.so: libtorch.so: cannot open shared object file: No such file or directory'
and
ldd libOpenMMTorchCUDA.so
shows
/data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchCUDA.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /data/shared/software/python_env/anaconda3/envs/rew/lib/libOpenMM.so.7.7)
/data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchCUDA.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /data/shared/software/python_env/anaconda3/envs/rew/lib/libOpenMM.so.7.7)
linux-vdso.so.1 (0x00007ffe07194000)
libcudart.so.10.2 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.10.2 (0x00007f2e3f3bc000)
libOpenMM.so.7.7 => not found
libOpenMMCUDA.so => not found
libOpenMMTorch.so (0x00007f2e3f1b2000)
libtorch.so => not found
libtorch_cpu.so => not found
libtorch_cuda.so => not found
libc10.so => not found
[...]
It seems a few shared objects can't be found. I will investigate!
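A small sketch of how the unresolved entries could be pulled out of ldd output like the one above. The function name unresolved_libraries is hypothetical, and the sample text is a shortened copy of the lines posted in this thread.

```python
def unresolved_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Dependency lines look like "<name> => <path or 'not found'>".
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found":
            missing.append(name)
    return missing


sample = """\
linux-vdso.so.1 (0x00007ffe07194000)
libcudart.so.10.2 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.10.2 (0x00007f2e3f3bc000)
libOpenMM.so.7.7 => not found
libtorch.so => not found
"""
print(unresolved_libraries(sample))
```

The GLIBCXX "version ... not found" lines are a different kind of message (a symbol-version mismatch in libstdc++), so this sketch deliberately skips them.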
See the last paragraph of https://github.com/openmm/openmm-torch/issues/67. You need to add the pytorch lib directory to your LD_LIBRARY_PATH. Assuming you installed it with conda, that's probably something like /data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/torch/lib.
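One way to find that directory without guessing the path is to ask Python where the torch package lives. The sketch below is an illustration, not part of OpenMM or PyTorch: torch_lib_dir and prepend_path are hypothetical helpers, and the assumption is that the shared libraries sit in a lib subdirectory of the installed package, as in the path above. It prints an export line because LD_LIBRARY_PATH has to be set before the process that loads the plugins starts.

```python
import importlib.util
import os


def torch_lib_dir():
    """Return <site-packages>/torch/lib, or None if PyTorch is not installed."""
    spec = importlib.util.find_spec("torch")
    if spec is None or not spec.submodule_search_locations:
        return None
    return os.path.join(list(spec.submodule_search_locations)[0], "lib")


def prepend_path(directory, current):
    """Prepend directory to a colon-separated search path, dropping duplicates."""
    parts = [p for p in current.split(":") if p and p != directory]
    return ":".join([directory] + parts)


lib_dir = torch_lib_dir()
if lib_dir is not None:
    new_value = prepend_path(lib_dir, os.environ.get("LD_LIBRARY_PATH", ""))
    print(f'export LD_LIBRARY_PATH="{new_value}"')
```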
That solved it! Thank you for your help!
If I change the device from Reference to CUDA, I see the following error:
File "/data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/torchani/aev.py", line 114, in compute_shifts
inv_distances = reciprocal_cell.norm(2, -1)
num_repeats = torch.ceil(cutoff * inv_distances).to(torch.long)
num_repeats = torch.where(pbc, num_repeats, num_repeats.new_zeros(()))
~~~~~~~~~~~ <--- HERE
r1 = torch.arange(1, num_repeats[0].item() + 1, device=cell.device)
r2 = torch.arange(1, num_repeats[1].item() + 1, device=cell.device)
RuntimeError: Expected condition, x and y to be on the same device, but condition is on cpu and x and y are on cuda:0 and cuda:0 respectively
----------------------------------------------------------------------
Ran 1 test in 26.320s
FAILED (errors=1)
I think this is consistent with what has been reported here, and a fix has been merged here as far as I can tell.
Is that fix included in the omnia dev build?
Yes, the fix ought to be in the latest dev build.
Just to make sure I am doing this correctly: I installed the dev build with:
conda install -c omina-dev openmm
and this is the version that is installed:
# Name        Version   Build                   Channel
openmm        7.8       py39_cuda102_debug_1    omnia-dev
openmmml      1.0       pypi_0                  pypi
openmmtorch   1.0       pypi_0                  pypi
omnia-dev, not omina-dev. But otherwise, yes. The dev builds are broken at the moment, so the most recent one is from a few weeks ago. But that should still have the fix.
I think the fix might not be in the omnia-dev build I am using.
As far as I can tell, the openmm-7.8 build for py39 and cuda102 was uploaded two months ago (March 17), while the fix was merged on March 28.
I tried to install the linux-64/openmm-7.8-py39_cuda110_1.tar.bz2 build with conda, but the usual commands fail to achieve this.
For example, conda install -c omnia-dev openmm cudatoolkit=11.0 will still try to install py39_cuda102_debug_1.
Am I missing something here?
It looks like for the last couple of months, it was only creating dev builds for CUDA 11. We really need to get them building again.
I have now compiled the openMM master branch and openmm-torch from source, but the error is still the same:
File "/data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/torchani/aev.py", line 114, in compute_shifts
inv_distances = reciprocal_cell.norm(2, -1)
num_repeats = torch.ceil(cutoff * inv_distances).to(torch.long)
num_repeats = torch.where(pbc, num_repeats, num_repeats.new_zeros(()))
~~~~~~~~~~~ <--- HERE
r1 = torch.arange(1, num_repeats[0].item() + 1, device=cell.device)
r2 = torch.arange(1, num_repeats[1].item() + 1, device=cell.device)
RuntimeError: Expected condition, x and y to be on the same device, but condition is on cpu and x and y are on cuda:0 and cuda:0 respectively
When I compile from source and install using make install and make PythonInstall, it takes the current state of the master branch, right?
It ought to have the fix. Make sure you're really using the version you compiled, and that conda hasn't installed another copy automatically.
I think it is using the compiled version. I was careful not to install anything that would bring in openMM as a dependency. Also, the package build/channel tags indicate pypi, which I guess was used in make PythonInstall.
conda list openmm returns:
# packages in environment at /data/shared/software/python_env/anaconda3/envs/rew:
#
# Name Version Build Channel
openmm 7.7.0 pypi_0 pypi
openmmml 1.0 pypi_0 pypi
openmmtorch 1.0 pypi_0 pypi
I also double-checked that the correct openMM version is loaded in the script, and it all points to the correct conda environment. Is there anything else that I can check?
I did some double-checking just to make sure that I am not using a different openMM version behind the scenes.
With the conda environment activated in which I installed openMM from source, openmm.__path__ points to /data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/openmm. That's the correct path in the environment.
The file /data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/openmm/version.py has the correct git_revision hash: fb0360604800bba836be24cd6e8adce8b22b258a (https://github.com/openmm/openmm/tree/fb0360604800bba836be24cd6e8adce8b22b258a).
I also incremented the version number to 7.7.1 in the Makefile, and after compiling and installing I got the updated version number when calling openmm.__version__.
I think this all indicates that I am using the compiled openMM version, right?
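The checks described above can be collected in one place. This is only a sketch: describe_module is a hypothetical helper, and it assumes that the package exposes __version__ and, as the version.py file mentioned above suggests for openmm, a version submodule with a git_revision attribute. Missing attributes are reported as None rather than raising.

```python
import importlib


def describe_module(name):
    """Return where a module was imported from and its version info, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Packages without a version submodule simply report git_revision as None.
    version_module = getattr(module, "version", None)
    return {
        "name": module.__name__,
        "file": getattr(module, "__file__", None),
        "version": getattr(module, "__version__", None),
        "git_revision": getattr(version_module, "git_revision", None),
    }


print(describe_module("openmm"))
```

Running this with the interpreter of the activated environment shows which installation is actually picked up, which is the question being asked here.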
That sounds like you have the right version. I'd like to see if I can reproduce it. What versions of Pytorch and CUDA are you using?
To make matters a bit simpler, I am now using the conda openmm_dev openMM package, but the error is still the same. I have confirmed that mm.__path__ points to the correct conda environment and the full_version tag is 7.7.0.dev-109f6b2.
I am using cudatoolkit=11.3 and pytorch=1.10; the exported conda environment and the pytest error report are attached. The libtorch C++ library I am using is libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcu113.zip
It's working for me. What version of the OpenMM-ML code are you using? I'm testing with the latest code from the main branch.
Can you post the complete output of running the test?
Yes, I am also testing with the latest code from the main branch.
I am installing with pip install git+https://github.com/openmm/openmm-ml.git.
And, just to make sure we are talking about the same thing: the test runs fine on the Reference or CPU platform, but changing to CUDA returns the described error.
The full output is:
(rew-test) [mwieder@a7srv5 test 💡 ](main)$ pytest TestMLPotential.py
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.9.12, pytest-7.1.2, pluggy-1.0.0
rootdir: /home/mwieder/openmm-ml
collected 1 item
TestMLPotential.py F [100%]
================================================================================================================== FAILURES ==================================================================================================================
___________________________________________________________________________________________________ TestMLPotential.testCreateMixedSystem ____________________________________________________________________________________________________
self = <TestMLPotential.TestMLPotential testMethod=testCreateMixedSystem>
def testCreateMixedSystem(self):
pdb = app.PDBFile('alanine-dipeptide-explicit.pdb')
ff = app.ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')
mmSystem = ff.createSystem(pdb.topology, nonbondedMethod=app.PME)
potential = MLPotential('ani2x')
mlAtoms = [a.index for a in next(pdb.topology.chains()).atoms()]
mixedSystem = potential.createMixedSystem(pdb.topology, mmSystem, mlAtoms, interpolate=False)
interpSystem = potential.createMixedSystem(pdb.topology, mmSystem, mlAtoms, interpolate=True)
# platform = mm.Platform.getPlatformByName('Reference')
platform = mm.Platform.getPlatformByName('CUDA')
mmContext = mm.Context(mmSystem, mm.VerletIntegrator(0.001), platform)
mixedContext = mm.Context(mixedSystem, mm.VerletIntegrator(0.001), platform)
interpContext = mm.Context(interpSystem, mm.VerletIntegrator(0.001), platform)
mmContext.setPositions(pdb.positions)
mixedContext.setPositions(pdb.positions)
interpContext.setPositions(pdb.positions)
mmEnergy = mmContext.getState(getEnergy=True).getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)
> mixedEnergy = mixedContext.getState(getEnergy=True).getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)
TestMLPotential.py:31:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <openmm.openmm.Context; proxy of <Swig Object of type 'OpenMM::Context *' at 0x7fb7a2b5cd50> >, getPositions = False, getVelocities = False, getForces = False, getEnergy = True, getParameters = False
getParameterDerivatives = False, getIntegratorParameters = False, enforcePeriodicBox = False, groups = -1
def getState(self, getPositions=False, getVelocities=False,
getForces=False, getEnergy=False, getParameters=False,
getParameterDerivatives=False, getIntegratorParameters=False,
enforcePeriodicBox=False, groups=-1):
"""Get a State object recording the current state information stored in this context.
Parameters
----------
getPositions : bool=False
whether to store particle positions in the State
getVelocities : bool=False
whether to store particle velocities in the State
getForces : bool=False
whether to store the forces acting on particles in the State
getEnergy : bool=False
whether to store potential and kinetic energy in the State
getParameters : bool=False
whether to store context parameters in the State
getParameterDerivatives : bool=False
whether to store parameter derivatives in the State
getIntegratorParameters : bool=False
whether to store integrator parameters in the State
enforcePeriodicBox : bool=False
if false, the position of each particle will be whatever position
is stored in the Context, regardless of periodic boundary conditions.
If true, particle positions will be translated so the center of
every molecule lies in the same periodic box.
groups : set={0,1,2,...,31}
a set of indices for which force groups to include when computing
forces and energies. The default value includes all groups. groups
can also be passed as an unsigned integer interpreted as a bitmask,
in which case group i will be included if (groups&(1<<i)) != 0.
"""
try:
# is the input integer-like?
groups_mask = int(groups)
except TypeError:
if isinstance(groups, set):
# nope, okay, then it should be an set
groups_mask = functools.reduce(operator.or_,
((1<<x) & 0xffffffff for x in groups))
else:
raise TypeError('%s is neither an int nor set' % groups)
if groups_mask >= 0x80000000:
groups_mask -= 0x100000000
types = 0
if getPositions:
types += State.Positions
if getVelocities:
types += State.Velocities
if getForces:
types += State.Forces
if getEnergy:
types += State.Energy
if getParameters:
types += State.Parameters
if getParameterDerivatives:
types += State.ParameterDerivatives
if getIntegratorParameters:
types += State.IntegratorParameters
> state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
E openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
E Traceback of TorchScript, serialized code (most recent call last):
E File "code/__torch__/openmmml/models/anipotential/___torch_mangle_14.py", line 34, in forward
E _6 = torch.mul(boxvectors1, 10.)
E pbc = self.pbc
E _7, energy1, = (model0).forward(_5, _6, pbc, )
E ~~~~~~~~~~~~~~~ <--- HERE
E energy = energy1
E energyScale = self.energyScale
E File "code/__torch__/torchani/models.py", line 32, in forward
E pass
E aev_computer = self.aev_computer
E species_aevs = (aev_computer).forward(species_coordinates0, cell, pbc, )
E ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
E neural_networks = self.neural_networks
E species_energies = (neural_networks).forward(species_aevs, None, None, )
E File "code/__torch__/torchani/aev.py", line 68, in forward
E ops.prim.RaiseException("AssertionError: ")
E cell3, pbc0 = _1, _1
E shifts = _0(cell3, pbc0, 5.0999999999999996, )
E ~~ <--- HERE
E triu_index0 = self.triu_index
E aev1 = __torch__.torchani.aev.compute_aev(species, coordinates, triu_index0, (self).constants(), (7, 16, 112, 32, 896), (cell3, shifts), )
E File "code/__torch__/torchani/aev.py", line 163, in compute_shifts
E num_repeats = torch.to(_34, 4)
E _35 = torch.new_zeros(num_repeats, annotate(List[int], []))
E num_repeats0 = torch.where(pbc, num_repeats, _35)
E ~~~~~~~~~~~ <--- HERE
E _36 = torch.item(torch.select(num_repeats0, 0, 0))
E r1 = torch.arange(1, torch.add(_36, 1), dtype=None, layout=None, device=ops.prim.device(cell))
E
E Traceback of TorchScript, original code (most recent call last):
E File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/openmmml/models/anipotential.py", line 111, in forward
E else:
E boxvectors = boxvectors.to(torch.float32)
E _, energy = self.model((self.species, 10.0*positions.unsqueeze(0)), cell=10.0*boxvectors, pbc=self.pbc)
E ~~~~~~~~~~ <--- HERE
E return self.energyScale*energy
E File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/models.py", line 106, in forward
E raise ValueError(f'Unknown species found in {species_coordinates[0]}')
E
E species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
E ~~~~~~~~~~~~~~~~~ <--- HERE
E species_energies = self.neural_networks(species_aevs)
E return self.energy_shifter(species_energies)
E File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/aev.py", line 532, in forward
E assert (cell is not None and pbc is not None)
E cutoff = max(self.Rcr, self.Rca)
E shifts = compute_shifts(cell, pbc, cutoff)
E ~~~~~~~~~~~~~~ <--- HERE
E aev = compute_aev(species, coordinates, self.triu_index, self.constants(), self.sizes, (cell, shifts))
E
E File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/aev.py", line 114, in compute_shifts
E inv_distances = reciprocal_cell.norm(2, -1)
E num_repeats = torch.ceil(cutoff * inv_distances).to(torch.long)
E num_repeats = torch.where(pbc, num_repeats, num_repeats.new_zeros(()))
E ~~~~~~~~~~~ <--- HERE
E r1 = torch.arange(1, num_repeats[0].item() + 1, device=cell.device)
E r2 = torch.arange(1, num_repeats[1].item() + 1, device=cell.device)
E RuntimeError: Expected condition, x and y to be on the same device, but condition is on cpu and x and y are on cuda:0 and cuda:0 respectively
/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/openmm/openmm.py:9028: OpenMMException
------------------------------------------------------------------------------------------------------------ Captured stdout call ------------------------------------------------------------------------------------------------------------
/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/resources/
/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/resources/
------------------------------------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------------------------------------
WARNING root:__init__.py:5 Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
============================================================================================================== warnings summary ==============================================================================================================
test/TestMLPotential.py::TestMLPotential::testCreateMixedSystem
/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/__init__.py:55: UserWarning: Dependency not satisfied, torchani.ase will not be available
warnings.warn("Dependency not satisfied, torchani.ase will not be available")
test/TestMLPotential.py::TestMLPotential::testCreateMixedSystem
/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torch/functional.py:1069: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1645049332358/work/aten/src/ATen/native/TensorShape.cpp:2156.)
return _VF.cartesian_prod(tensors) # type: ignore[attr-defined]
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================================== short test summary info ===========================================================================================================
FAILED TestMLPotential.py::TestMLPotential::testCreateMixedSystem - openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
======================================================================================================= 1 failed, 2 warnings in 28.61s =======================================================================================================
Found it! This actually turned out to be unrelated to the fix in https://github.com/openmm/openmm/pull/3533. The problem was that when we created the module, we didn't register species and pbc as parameters. Because of that, when we called to(device) on it to move the module to the GPU, those two didn't get moved.
The fix is in #28.
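The mechanism can be reproduced without a GPU. The sketch below is an illustration and not the code from #28: the Toy module and its attribute names are made up, register_buffer stands in for the parameter registration described above (both make the tensor visible to Module.to()), and the "meta" device is used as the target only so that the example runs on a machine without CUDA.

```python
try:
    import torch
    HAVE_TORCH = True
except ImportError:  # keep the sketch importable without PyTorch
    HAVE_TORCH = False


def attribute_devices():
    """Return the device types of a plain tensor attribute and of a registered buffer after Module.to()."""

    class Toy(torch.nn.Module):
        def __init__(self):
            super().__init__()
            # Plain attribute: Module.to() does not know about it, so it stays where it was created.
            self.pbc_plain = torch.tensor([True, True, True])
            # Registered buffer: tracked by the module and moved together with it.
            self.register_buffer("pbc_buffer", torch.tensor([True, True, True]))

    module = Toy().to("meta")
    return module.pbc_plain.device.type, module.pbc_buffer.device.type


if HAVE_TORCH:
    print(attribute_devices())
```

With a plain attribute left behind on the CPU and the rest of the module on another device, an operation such as torch.where that mixes the two fails with the "same device" error shown in the traceback.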
Thank you very much for your help and the quick fix!
Sorry for the cross-package issue. I think this might involve openMM-torch, but I get the error when executing the test script of openmm-ml, so I am posting here. Running the test script I get:
I am not super sure where the problem originates. I have built openMM-torch from source with the nightly build of openMM, and it seemed to pass all the necessary tests. But when running make PythonInstall I get a lot of warnings (it runs successfully, though):
Is this expected?