TestMLPotential.py fails

wiederm commented 2 years ago

Sorry for the cross-package issue. I think this might involve openmm-torch, but I get the error when executing the test script of openmm-ml, so I am posting here. Running the test script, I get:

======================================================================
ERROR: testCreateMixedSystem (__main__.TestMLPotential)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mwieder/openmm-ml/test/TestMLPotential.py", line 19, in testCreateMixedSystem
    mixedContext = mm.Context(mixedSystem, mm.VerletIntegrator(0.001), platform)
  File "/data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/openmm/openmm.py", line 16230, in __init__
    _openmm.Context_swiginit(self, _openmm.new_Context(*args))
openmm.OpenMMException: Specified a Platform for a Context which does not support all required kernels

I am not sure where the problem originates. I built openmm-torch from source against the nightly build of OpenMM, and it seemed to pass all the necessary tests. But when running make PythonInstall I get a lot of warnings (it completes successfully, though):

[100%] Generating TorchPluginWrapper.cpp
/data/shared/software/python_env/anaconda3/envs/rew/include/swig/OpenMMSwigHeaders.i:2242: Warning 314: 'None' is a python keyword, renaming to '_None'
/data/shared/software/python_env/anaconda3/envs/rew/include/swig/OpenMMSwigHeaders.i:496: Warning 453: Can't apply (std::vector< double > &OUTPUT). No typemaps are defined.
/data/shared/software/python_env/anaconda3/envs/rew/include/swig/OpenMMSwigHeaders.i:503: Warning 453: Can't apply (OpenMM::Context &OUTPUT). No typemaps are defined.
/data/shared/software/python_env/anaconda3/envs/rew/include/swig/OpenMMSwigHeaders.i:538: Warning 453: Can't apply (std::vector< double > &OUTPUT). No typemaps are defined.
...

Is this expected?

peastman commented 2 years ago

Which platform are you using?

This probably means a plugin is failing to load, most likely because a dependent library can't be found. What is the value of Platform.getPluginLoadFailures()?
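
A rough sketch of that check: the snippet below groups plugin load failure messages by the shared library the loader could not open. `Platform.getPluginLoadFailures()` is the OpenMM call named above; the helper names `missing_libraries` and `openmm_plugin_failures` and the regular expression are illustrative assumptions, and the pattern assumes the usual Linux dynamic-loader wording ("cannot open shared object file").

```python
import re
from collections import defaultdict

# Assumed message shape (typical of the Linux dynamic loader):
#   "Error loading library <plugin path>: <missing lib>: cannot open shared object file: ..."
_PATTERN = re.compile(
    r"Error loading library (?P<plugin>\S+): (?P<missing>\S+): cannot open shared object file"
)


def missing_libraries(failures):
    """Map each missing shared library to the plugins that failed because of it."""
    grouped = defaultdict(list)
    for message in failures:
        match = _PATTERN.search(message)
        if match:
            grouped[match.group("missing")].append(match.group("plugin"))
    return dict(grouped)


def openmm_plugin_failures():
    """Return OpenMM's plugin load failures, or an empty list if OpenMM is not importable."""
    try:
        from openmm import Platform
    except ImportError:
        return []
    return list(Platform.getPluginLoadFailures())


if __name__ == "__main__":
    for library, plugins in missing_libraries(openmm_plugin_failures()).items():
        print(f"{library} is missing; affected plugins: {', '.join(plugins)}")
```

If several plugins fail on the same library, the grouped output points at a single missing dependency rather than at each plugin separately.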

wiederm commented 2 years ago

I am using the CUDA platform. The output is:

'Error loading library /data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchCUDA.so: libtorch.so: cannot open shared object file: No such file or directory', 'Error loading library /data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchOpenCL.so: libtorch.so: cannot open shared object file: No such file or directory', 'Error loading library /data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchReference.so: libtorch.so: cannot open shared object file: No such file or directory'

and ldd libOpenMMTorchCUDA.so shows

/data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchCUDA.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /data/shared/software/python_env/anaconda3/envs/rew/lib/libOpenMM.so.7.7)
/data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchCUDA.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /data/shared/software/python_env/anaconda3/envs/rew/lib/libOpenMM.so.7.7)
    linux-vdso.so.1 (0x00007ffe07194000)
    libcudart.so.10.2 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.10.2 (0x00007f2e3f3bc000)
    libOpenMM.so.7.7 => not found
    libOpenMMCUDA.so => not found
    libOpenMMTorch.so (0x00007f2e3f1b2000)
    libtorch.so => not found
    libtorch_cpu.so => not found
    libtorch_cuda.so => not found
    libc10.so => not found
[...]

It seems a few shared objects can't be found. I will investigate!

peastman commented 2 years ago

See the last paragraph of https://github.com/openmm/openmm-torch/issues/67. You need to add the pytorch lib directory to your LD_LIBRARY_PATH. Assuming you installed it with conda, that's probably something like /data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/torch/lib.
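
One way to find that directory without guessing the path is sketched below. It assumes the PyTorch shared libraries live in a `lib` folder inside the installed `torch` package, as in the path above; the helper names `package_lib_dir` and `on_library_path` are hypothetical. The script only prints a suggested `export` line, since the dynamic loader generally reads LD_LIBRARY_PATH when the process starts.

```python
import importlib.util
import os
from pathlib import Path


def package_lib_dir(package="torch"):
    """Return <package dir>/lib without importing the package, or None if it is not installed."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent / "lib"


def on_library_path(directory):
    """Check whether a directory is already listed in LD_LIBRARY_PATH."""
    entries = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    return str(directory) in entries


if __name__ == "__main__":
    lib_dir = package_lib_dir("torch")
    if lib_dir is None:
        print("torch is not installed in this environment")
    elif not on_library_path(lib_dir):
        # Print the shell command instead of editing os.environ, because the
        # variable needs to be set before the Python process is launched.
        print(f'export LD_LIBRARY_PATH="{lib_dir}:$LD_LIBRARY_PATH"')
```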

wiederm commented 2 years ago

That solved it! Thank you for your help!

wiederm commented 2 years ago

If I change the device from Reference to CUDA I see the following error:

  File "/data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/torchani/aev.py", line 114, in compute_shifts
    inv_distances = reciprocal_cell.norm(2, -1)
    num_repeats = torch.ceil(cutoff * inv_distances).to(torch.long)
    num_repeats = torch.where(pbc, num_repeats, num_repeats.new_zeros(()))
                  ~~~~~~~~~~~ <--- HERE
    r1 = torch.arange(1, num_repeats[0].item() + 1, device=cell.device)
    r2 = torch.arange(1, num_repeats[1].item() + 1, device=cell.device)
RuntimeError: Expected condition, x and y to be on the same device, but condition is on cpu and x and y are on cuda:0 and cuda:0 respectively

----------------------------------------------------------------------
Ran 1 test in 26.320s

FAILED (errors=1)

I think this is consistent with what has been reported here, and a fix has been merged here as far as I can tell. Is that fix included in the omnia dev build?

peastman commented 2 years ago

Yes, the fix ought to be in the latest dev build.

wiederm commented 2 years ago

Just to make sure I am doing this correctly: I installed the dev build with conda install -c omina-dev openmm, and this is the version that is installed:

# Name                    Version                   Build  Channel
openmm                    7.8             py39_cuda102_debug_1    omnia-dev
openmmml                  1.0                      pypi_0    pypi
openmmtorch               1.0                      pypi_0    pypi

peastman commented 2 years ago

omnia-dev, not omina-dev. But otherwise, yes. The dev builds are broken at the moment, so the most recent one is from a few weeks ago. But that should still have the fix.

wiederm commented 2 years ago

I think the fix might not be in the omnia-dev build I am using. As far as I can tell, the openmm-7.8 build for py39 and cuda102 was uploaded 2 months ago (March 17), while the fix was merged on March 28.

wiederm commented 2 years ago

I tried to install the linux-64/openmm-7.8-py39_cuda110_1.tar.bz2 build with conda, but the usual commands do not select it. For example, conda install -c omnia-dev openmm cudatoolkit=11.0 still tries to install py39_cuda102_debug_1. Am I missing something here?

peastman commented 2 years ago

It looks like for the last couple of months, it was only creating dev builds for CUDA 11. We really need to get them building again.

wiederm commented 2 years ago

I have now compiled the openMM master branch and openmm-torch from source, but the error is still the same:

  File "/data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/torchani/aev.py", line 114, in compute_shifts
    inv_distances = reciprocal_cell.norm(2, -1)
    num_repeats = torch.ceil(cutoff * inv_distances).to(torch.long)
    num_repeats = torch.where(pbc, num_repeats, num_repeats.new_zeros(()))
                  ~~~~~~~~~~~ <--- HERE
    r1 = torch.arange(1, num_repeats[0].item() + 1, device=cell.device)
    r2 = torch.arange(1, num_repeats[1].item() + 1, device=cell.device)
RuntimeError: Expected condition, x and y to be on the same device, but condition is on cpu and x and y are on cuda:0 and cuda:0 respectively

When I compile from source and install using make install and make PythonInstall, it uses the current state of the master branch, right?

peastman commented 2 years ago

It ought to have the fix. Make sure you're really using the version you compiled, and that conda hasn't installed another copy automatically.

wiederm commented 2 years ago

I think it is using the compiled version. I was careful not to install anything that would bring in openMM as a dependency. Also, the package build/channel tags indicate pypi, which I guess was used in make PythonInstall. conda list openmm returns:

# packages in environment at /data/shared/software/python_env/anaconda3/envs/rew:
#
# Name                    Version                   Build  Channel
openmm                    7.7.0                    pypi_0    pypi
openmmml                  1.0                      pypi_0    pypi
openmmtorch               1.0                      pypi_0    pypi

I also double-checked that the correct openMM version is loaded in the script and it all points to the correct conda environment. Is there anything else that I can check?

wiederm commented 2 years ago

I did some double-checking to make sure that I am not using a different openMM version behind the scenes. With the conda environment in which I installed openMM from source activated, openmm.__path__ points to /data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/openmm, which is the correct path in the environment. The file /data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/openmm/version.py has the correct git_revision hash: fb0360604800bba836be24cd6e8adce8b22b258a (https://github.com/openmm/openmm/tree/fb0360604800bba836be24cd6e8adce8b22b258a). I also incremented the version number to 7.7.1 in the Makefile, and after compiling and installing I got the updated version number when calling openmm.__version__. I think this all indicates that I am using the compiled openMM version, right?
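
These checks can be collected into one small script, sketched here under a few assumptions: the helper name `describe_module` is made up for illustration, and it relies on the package exposing `__path__`, `__version__`, and a `version` submodule with `git_revision` and `full_version`, which is what the openmm install described in this thread appears to provide. Missing attributes are reported as None rather than raising.

```python
import importlib


def describe_module(name):
    """Collect the import location and any version metadata a module exposes."""
    module = importlib.import_module(name)
    info = {
        "path": list(getattr(module, "__path__", [])) or [getattr(module, "__file__", None)],
        "version": getattr(module, "__version__", None),
    }
    # Assumption from the thread: openmm ships a version.py with git_revision
    # and full_version; other packages may not have this submodule.
    try:
        version_module = importlib.import_module(f"{name}.version")
    except ImportError:
        version_module = None
    for attribute in ("git_revision", "full_version"):
        info[attribute] = getattr(version_module, attribute, None)
    return info


if __name__ == "__main__":
    try:
        print(describe_module("openmm"))
    except ImportError:
        print("openmm is not importable from this interpreter")
```

Running it with the same interpreter that runs the tests shows which copy of the package is actually imported.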

peastman commented 2 years ago

That sounds like you have the right version. I'd like to see if I can reproduce it. What versions of Pytorch and CUDA are you using?

wiederm commented 2 years ago

To make matters a bit simpler, I am now using the conda openmm_dev openMM package, but the error is still the same. I have confirmed that mm.__path__ points to the correct conda environment and the full_version tag is 7.7.0.dev-109f6b2. I am using cudatoolkit=11.3 and pytorch=1.10; the exported conda environment and the pytest error report are attached. The libtorch C++ library I am using is libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcu113.zip.

env_error.zip

peastman commented 2 years ago

It's working for me. What version of the OpenMM-ML code are you using? I'm testing with the latest code from the main branch.

Can you post the complete output of running the test?

wiederm commented 2 years ago

Yes, I am also testing with the latest code from the main branch. I am installing with pip install git+https://github.com/openmm/openmm-ml.git.

And, just to make sure we are talking about the same thing: the test runs fine on the Reference or CPU platform, but changing to CUDA produces the error described above.

The full output is:

(rew-test) [mwieder@a7srv5 test 💡 ](main)$ pytest TestMLPotential.py 
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.9.12, pytest-7.1.2, pluggy-1.0.0
rootdir: /home/mwieder/openmm-ml
collected 1 item                                                                                                                                                                                                                             

TestMLPotential.py F                                                                                                                                                                                                                   [100%]

================================================================================================================== FAILURES ==================================================================================================================
___________________________________________________________________________________________________ TestMLPotential.testCreateMixedSystem ____________________________________________________________________________________________________

self = <TestMLPotential.TestMLPotential testMethod=testCreateMixedSystem>

    def testCreateMixedSystem(self):
        pdb = app.PDBFile('alanine-dipeptide-explicit.pdb')
        ff = app.ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')
        mmSystem = ff.createSystem(pdb.topology, nonbondedMethod=app.PME)
        potential = MLPotential('ani2x')
        mlAtoms = [a.index for a in next(pdb.topology.chains()).atoms()]
        mixedSystem = potential.createMixedSystem(pdb.topology, mmSystem, mlAtoms, interpolate=False)
        interpSystem = potential.createMixedSystem(pdb.topology, mmSystem, mlAtoms, interpolate=True)
        # platform = mm.Platform.getPlatformByName('Reference')
        platform = mm.Platform.getPlatformByName('CUDA')
        mmContext = mm.Context(mmSystem, mm.VerletIntegrator(0.001), platform)
        mixedContext = mm.Context(mixedSystem, mm.VerletIntegrator(0.001), platform)
        interpContext = mm.Context(interpSystem, mm.VerletIntegrator(0.001), platform)
        mmContext.setPositions(pdb.positions)
        mixedContext.setPositions(pdb.positions)
        interpContext.setPositions(pdb.positions)
        mmEnergy = mmContext.getState(getEnergy=True).getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)
>       mixedEnergy = mixedContext.getState(getEnergy=True).getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)

TestMLPotential.py:31: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <openmm.openmm.Context; proxy of <Swig Object of type 'OpenMM::Context *' at 0x7fb7a2b5cd50> >, getPositions = False, getVelocities = False, getForces = False, getEnergy = True, getParameters = False
getParameterDerivatives = False, getIntegratorParameters = False, enforcePeriodicBox = False, groups = -1

    def getState(self, getPositions=False, getVelocities=False,
                 getForces=False, getEnergy=False, getParameters=False,
                 getParameterDerivatives=False, getIntegratorParameters=False,
                 enforcePeriodicBox=False, groups=-1):
        """Get a State object recording the current state information stored in this context.

        Parameters
        ----------
        getPositions : bool=False
            whether to store particle positions in the State
        getVelocities : bool=False
            whether to store particle velocities in the State
        getForces : bool=False
            whether to store the forces acting on particles in the State
        getEnergy : bool=False
            whether to store potential and kinetic energy in the State
        getParameters : bool=False
            whether to store context parameters in the State
        getParameterDerivatives : bool=False
            whether to store parameter derivatives in the State
        getIntegratorParameters : bool=False
            whether to store integrator parameters in the State
        enforcePeriodicBox : bool=False
            if false, the position of each particle will be whatever position
            is stored in the Context, regardless of periodic boundary conditions.
            If true, particle positions will be translated so the center of
            every molecule lies in the same periodic box.
        groups : set={0,1,2,...,31}
            a set of indices for which force groups to include when computing
            forces and energies. The default value includes all groups. groups
            can also be passed as an unsigned integer interpreted as a bitmask,
            in which case group i will be included if (groups&(1<<i)) != 0.
        """
        try:
    # is the input integer-like?
            groups_mask = int(groups)
        except TypeError:
            if isinstance(groups, set):
    # nope, okay, then it should be an set
                groups_mask = functools.reduce(operator.or_,
                        ((1<<x) & 0xffffffff for x in groups))
            else:
                raise TypeError('%s is neither an int nor set' % groups)
        if groups_mask >= 0x80000000:
            groups_mask -= 0x100000000
        types = 0
        if getPositions:
            types += State.Positions
        if getVelocities:
            types += State.Velocities
        if getForces:
            types += State.Forces
        if getEnergy:
            types += State.Energy
        if getParameters:
            types += State.Parameters
        if getParameterDerivatives:
            types += State.ParameterDerivatives
        if getIntegratorParameters:
            types += State.IntegratorParameters
>       state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
E       openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
E       Traceback of TorchScript, serialized code (most recent call last):
E         File "code/__torch__/openmmml/models/anipotential/___torch_mangle_14.py", line 34, in forward
E             _6 = torch.mul(boxvectors1, 10.)
E             pbc = self.pbc
E             _7, energy1, = (model0).forward(_5, _6, pbc, )
E                             ~~~~~~~~~~~~~~~ <--- HERE
E             energy = energy1
E           energyScale = self.energyScale
E         File "code/__torch__/torchani/models.py", line 32, in forward
E             pass
E           aev_computer = self.aev_computer
E           species_aevs = (aev_computer).forward(species_coordinates0, cell, pbc, )
E                           ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
E           neural_networks = self.neural_networks
E           species_energies = (neural_networks).forward(species_aevs, None, None, )
E         File "code/__torch__/torchani/aev.py", line 68, in forward
E               ops.prim.RaiseException("AssertionError: ")
E               cell3, pbc0 = _1, _1
E             shifts = _0(cell3, pbc0, 5.0999999999999996, )
E                      ~~ <--- HERE
E             triu_index0 = self.triu_index
E             aev1 = __torch__.torchani.aev.compute_aev(species, coordinates, triu_index0, (self).constants(), (7, 16, 112, 32, 896), (cell3, shifts), )
E         File "code/__torch__/torchani/aev.py", line 163, in compute_shifts
E         num_repeats = torch.to(_34, 4)
E         _35 = torch.new_zeros(num_repeats, annotate(List[int], []))
E         num_repeats0 = torch.where(pbc, num_repeats, _35)
E                        ~~~~~~~~~~~ <--- HERE
E         _36 = torch.item(torch.select(num_repeats0, 0, 0))
E         r1 = torch.arange(1, torch.add(_36, 1), dtype=None, layout=None, device=ops.prim.device(cell))
E       
E       Traceback of TorchScript, original code (most recent call last):
E         File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/openmmml/models/anipotential.py", line 111, in forward
E                       else:
E                           boxvectors = boxvectors.to(torch.float32)
E                           _, energy = self.model((self.species, 10.0*positions.unsqueeze(0)), cell=10.0*boxvectors, pbc=self.pbc)
E                                       ~~~~~~~~~~ <--- HERE
E                       return self.energyScale*energy
E         File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/models.py", line 106, in forward
E                   raise ValueError(f'Unknown species found in {species_coordinates[0]}')
E           
E               species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
E                              ~~~~~~~~~~~~~~~~~ <--- HERE
E               species_energies = self.neural_networks(species_aevs)
E               return self.energy_shifter(species_energies)
E         File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/aev.py", line 532, in forward
E                   assert (cell is not None and pbc is not None)
E                   cutoff = max(self.Rcr, self.Rca)
E                   shifts = compute_shifts(cell, pbc, cutoff)
E                            ~~~~~~~~~~~~~~ <--- HERE
E                   aev = compute_aev(species, coordinates, self.triu_index, self.constants(), self.sizes, (cell, shifts))
E           
E         File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/aev.py", line 114, in compute_shifts
E           inv_distances = reciprocal_cell.norm(2, -1)
E           num_repeats = torch.ceil(cutoff * inv_distances).to(torch.long)
E           num_repeats = torch.where(pbc, num_repeats, num_repeats.new_zeros(()))
E                         ~~~~~~~~~~~ <--- HERE
E           r1 = torch.arange(1, num_repeats[0].item() + 1, device=cell.device)
E           r2 = torch.arange(1, num_repeats[1].item() + 1, device=cell.device)
E       RuntimeError: Expected condition, x and y to be on the same device, but condition is on cpu and x and y are on cuda:0 and cuda:0 respectively

/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/openmm/openmm.py:9028: OpenMMException
------------------------------------------------------------------------------------------------------------ Captured stdout call ------------------------------------------------------------------------------------------------------------
/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/resources/
/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/resources/
------------------------------------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------------------------------------
WARNING  root:__init__.py:5 Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
============================================================================================================== warnings summary ==============================================================================================================
test/TestMLPotential.py::TestMLPotential::testCreateMixedSystem
  /data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/__init__.py:55: UserWarning: Dependency not satisfied, torchani.ase will not be available
    warnings.warn("Dependency not satisfied, torchani.ase will not be available")

test/TestMLPotential.py::TestMLPotential::testCreateMixedSystem
  /data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torch/functional.py:1069: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1645049332358/work/aten/src/ATen/native/TensorShape.cpp:2156.)
    return _VF.cartesian_prod(tensors)  # type: ignore[attr-defined]

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================================== short test summary info ===========================================================================================================
FAILED TestMLPotential.py::TestMLPotential::testCreateMixedSystem - openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
======================================================================================================= 1 failed, 2 warnings in 28.61s =======================================================================================================

peastman commented 2 years ago

Found it! This actually turned out to be unrelated to the fix in https://github.com/openmm/openmm/pull/3533. The problem was that when we created the module, we didn't register species and pbc as parameters. Because of that, when we called to(device) on it to move the module to the GPU, those two didn't get moved.

The fix is in #28.
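
A minimal sketch of the underlying PyTorch behavior, not the actual change in #28: `Module.to()` only moves tensors the module knows about, meaning parameters and buffers. The class names here are hypothetical, and the sketch uses `register_buffer` as one standard way to register a tensor; wrapping it in `torch.nn.Parameter(..., requires_grad=False)` is another option.

```python
import torch


class Unregistered(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Plain tensor attribute: the module does not track it, so
        # Module.to() leaves it on whatever device it was created on.
        self.pbc = torch.tensor([True, True, True])


class Registered(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffer: tracked by the module and moved by Module.to().
        self.register_buffer("pbc", torch.tensor([True, True, True]))


# Only registered tensors show up in named_buffers().
unregistered_names = [name for name, _ in Unregistered().named_buffers()]
registered_names = [name for name, _ in Registered().named_buffers()]
print(unregistered_names)
print(registered_names)

if torch.cuda.is_available():
    # With a GPU present, the difference becomes visible in the device.
    print(Unregistered().to("cuda").pbc.device)  # remains on the CPU
    print(Registered().to("cuda").pbc.device)    # follows the module to the GPU
```

An unregistered `pbc` left on the CPU while the other tensors sit on the GPU is the kind of mismatch that `torch.where` rejects with the "same device" error shown earlier in the thread.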

wiederm commented 2 years ago

Thank you very much for your help and the quick fix!
