openmm / openmm-ml

High-level API for using machine learning models in OpenMM simulations

`TestMLPotential.py` fails with `nnpops` implementation #25

Closed: dominicrufa closed this issue 1 year ago

dominicrufa commented 2 years ago

I'm not too familiar with Torch tracebacks, but it seems like Torch isn't robust to tensors being placed on different devices:

ERROR: testCreateMixedSystem (__main__.TestMLPotential)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/lila/home/rufad/github/openmm-ml/test/TestMLPotential.py", line 27, in testCreateMixedSystem
    mixedEnergy = mixedContext.getState(getEnergy=True).getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/openmmml/models/anipotential/___torch_mangle_14.py", line 36, in forward
      _5 = torch.mul(boxvectors1, 10.)
      pbc0 = self.pbc
      _6, energy1, = (model0).forward(_4, _5, pbc0, )
                      ~~~~~~~~~~~~~~~ <--- HERE
      energy = energy1
    energyScale = self.energyScale
  File "code/__torch__/NNPOps/OptimizedTorchANI.py", line 19, in forward
    species_aevs = (aev_computer).forward(species_coordinates0, cell, pbc, )
    neural_networks = self.neural_networks
    species_energies = (neural_networks).forward(species_aevs, )
                        ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    energy_shifter = self.energy_shifter
    species_energies0 = (energy_shifter).forward(species_energies, None, None, )
  File "code/__torch__/NNPOps/BatchedNN.py", line 10, in forward
    species_aev: Tuple[Tensor, Tensor]) -> __torch__.NNPOps.EnergyShifter.SpeciesEnergies:
    _0 = getattr(self, "0")
    return (_0).forward(species_aev, )
            ~~~~~~~~~~~ <--- HERE
  def __len__(self: __torch__.NNPOps.BatchedNN.TorchANIBatchedNN) -> int:
    return 1
  File "code/__torch__/NNPOps/BatchedNN.py", line 33, in forward
    layer0_weights = self.layer0_weights
    layer0_biases = self.layer0_biases
    vectors0 = ops.NNPOpsBatchedNN.BatchedLinear(vectors, layer0_weights, layer0_biases)
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    vectors1 = __torch__.torch.nn.functional.celu(vectors0, 0.10000000000000001, False, )
    layer2_weights = self.layer2_weights

Traceback of TorchScript, original code (most recent call last):
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/models/anipotential.py", line 135, in forward
                    self.pbc = self.pbc.to(positions.device)
                    boxvectors = boxvectors.to(torch.float32)
                    _, energy = self.model((self.species, positions), cell=10.0*boxvectors, pbc=self.pbc)
                                ~~~~~~~~~~ <--- HERE

                return energy * self.energyScale # Hartree --> kJ/mol
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/OptimizedTorchANI.py", line 53, in forward
        species_coordinates = self.species_converter(species_coordinates)
        species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
        species_energies = self.neural_networks(species_aevs)
                           ~~~~~~~~~~~~~~~~~~~~ <--- HERE
        species_energies = self.energy_shifter(species_energies)

  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/BatchedNN.py", line 122, in forward
    def forward(self, species_aev: Tuple[Tensor, Tensor]) -> SpeciesEnergies:
        return self[0].forward(species_aev)
               ~~~~~~~~~~~~~~~ <--- HERE
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/BatchedNN.py", line 99, in forward
        vectors = aev.unsqueeze(-2).unsqueeze(-1)

        vectors = batchedLinear(vectors, self.layer0_weights, self.layer0_biases) # Linear 0
                  ~~~~~~~~~~~~~ <--- HERE
        vectors = F.celu(vectors, alpha=0.1)                                      # CELU   1
        vectors = batchedLinear(vectors, self.layer2_weights, self.layer2_biases) # Linear 2
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper__bmm)

----------------------------------------------------------------------
Ran 1 test in 32.967s

FAILED (errors=1)

@peastman, any idea what is going wrong here? Or perhaps @raimis knows what is wrong.

Alternatively, if I try to run this without GPUs, it throws a runtime error:

======================================================================
ERROR: testCreateMixedSystem (__main__.TestMLPotential)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/lila/home/rufad/github/openmm-ml/test/TestMLPotential.py", line 17, in testCreateMixedSystem
    mixedSystem = potential.createMixedSystem(pdb.topology, mmSystem, mlAtoms, interpolate=False)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/mlpotential.py", line 265, in createMixedSystem
    self._impl.addForces(topology, newSystem, atomList, forceGroup, **args)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/models/anipotential.py", line 91, in addForces
    model = OptimizedTorchANI(model, species).to(device)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 616, in _apply
    self._buffers[key] = fn(buf)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

Do we generally want to make this package robust to the platform type, or only to CUDA?
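For reference, the usual PyTorch pattern for being robust to the platform is to fall back to the CPU when CUDA is unavailable. A minimal sketch of that pattern (not the actual openmm-ml code; the OptimizedTorchANI line refers to the addForces call in the traceback above):

import torch

# Choose the device defensively instead of assuming CUDA exists; this
# avoids "RuntimeError: No CUDA GPUs are available" on CPU-only hosts.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# e.g. in addForces (cf. the traceback above):
# model = OptimizedTorchANI(model, species).to(device)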

dominicrufa commented 2 years ago

Also, when I add the interpolate=True argument to createMixedSystem and attach the resulting system to a Context, it fails with

Traceback (most recent call last):
  File "/lila/home/rufad/nnpops/run.py", line 45, in <module>
    context.getState(getEnergy=True).getPotentialEnergy()
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400)

when I call context.getState(getEnergy=True).getPotentialEnergy(), which I don't know how to debug; however, when I leave interpolate=False, I can pull the state and the potential energy without issue.

peastman commented 2 years ago

It looks like a case where the model is on one device and the input tensor is on a different one.
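A minimal illustration of that failure mode outside OpenMM (a sketch; it assumes a machine with a CUDA device):

import torch

linear = torch.nn.Linear(3, 1).to('cuda')  # model weights on the GPU
x = torch.randn(5, 3)                      # input tensor left on the CPU
try:
    linear(x)
except RuntimeError as e:
    print(e)  # "Expected all tensors to be on the same device..."
linear(x.to('cuda'))  # moving the input to the model's device fixes it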

alternatively, if i try to run this without GPUs, it throws a runtime error:

What exactly does that mean? Are you running it on a computer without a GPU? Or do you mean there is a GPU, but you're specifying the CPU platform when you create your context? The PyTorch plugin is supposed to work with all platforms.

dominicrufa commented 2 years ago

What exactly does that mean? Are you running it on a computer without a GPU? Or do you mean there is a GPU, but you're specifying the CPU platform when you create your context? The PyTorch plugin is supposed to work with all platforms.

running on a computer without a GPU.

It looks like a case where the model is on one device and the input tensor is on a different one.

Right. I am just trying to figure out why this is the case and how to fix it. The platform is Reference in the test. Is this an error you see if you run it locally?

peastman commented 2 years ago

TestMLPotential.py passes when I run it locally. Perhaps the problem is that you have CUDA installed (so PyTorch tries to use it) but don't have any CUDA-compatible GPUs (so it fails when it tries)?
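A quick way to check that hypothesis, independent of OpenMM (a small diagnostic sketch):

import torch

print(torch.version.cuda)         # CUDA version PyTorch was built against, or None
print(torch.cuda.is_available())  # False if no usable GPU is visible
print(torch.cuda.device_count())  # number of visible CUDA devices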

dominicrufa commented 2 years ago

Sorry, I should clarify: I am trying to run TestMLPotential with the nnpops mixin here and am observing the aforementioned errors. I'm not sure whether this is an edge case, and I don't know how to solve the issue.

dominicrufa commented 2 years ago

but you don't have any CUDA compatible GPUs (so it fails when it tries)?

I definitely have both, and I can make the nnpops implementation run without observing this issue. It only appears if I set interpolate=True in createMixedSystem.

peastman commented 2 years ago

I think we're talking about different things, since a few different errors are described above. I was referring to the No CUDA GPUs are available error.

dominicrufa commented 2 years ago

I think we're talking about different things, since a few different errors are described above. I was referring to the No CUDA GPUs are available error.

Yes, I don't disagree that is the case. This is not the blocking issue, just a passing observation (which you correctly clarified).

I am primarily concerned with integrating the NNPOps-equipped TorchANI force with createMixedSystem and the interpolate=True argument; the above issue is not a concern, since I cannot test nnpops without a GPU anyway.

dominicrufa commented 2 years ago
#!/usr/bin/env python
import torch
import torchani
from NNPOps import OptimizedTorchANI
from openmmtools.testsystems import HostGuestExplicit
from openmmml.mlpotential import MLPotential
from simtk import openmm, unit
import time
import numpy as np
from simtk.openmm import LangevinMiddleIntegrator

temperature = 298.15 * unit.kelvin
frictionCoeff = 1. / unit.picosecond
stepSize = 1. * unit.femtoseconds
hgv = HostGuestExplicit(constraints=None)

potential = MLPotential('ani2x')
system = potential.createMixedSystem(hgv.topology, system=hgv.system, atoms=list(range(126, 156)), implementation='nnpops', interpolate=True)
print(f"done making system")
_int = LangevinMiddleIntegrator(temperature, frictionCoeff, stepSize)
context = openmm.Context(system, _int)
context.setPositions(hgv.positions)
# query and print out the global parameters:
swig_params = context.getParameters()
print(f"context parameters:")
for i in swig_params:
    print(i, swig_params[i])
context.getState(getEnergy=True).getPotentialEnergy()

@peastman, if I reduce the problem to this code snippet and pull main into this PR (so that I can use nnpops on GPU), then this snippet works with interpolate=False but not with True. If I set it to True, I see:

Traceback (most recent call last):
  File "/lila/home/rufad/nnpops/run.py", line 45, in <module>
    context.getState(getEnergy=True).getPotentialEnergy()
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400)

and I'm not sure how to debug this.
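For what it's worth, one way to narrow it down is to evaluate the energy one force group at a time (a debugging sketch continuing from the script above; with interpolate=True the TorchForce sits inside the CustomCVForce, so this at least isolates which top-level force fails):

# Query the energy of each force group separately to isolate the
# force that triggers the CUDA error.
for force in system.getForces():
    group = force.getForceGroup()
    state = context.getState(getEnergy=True, groups={group})
    print(type(force).__name__, group, state.getPotentialEnergy())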

peastman commented 2 years ago

Let me make sure I understand. This error happens when all of the following are true:

If any one of those is not true, it works. Is that correct?

How does this relate to the original problem you described up at the top? That one produced an exception about tensors being on different devices, while this one produces CUDA_ERROR_INVALID_HANDLE.

dominicrufa commented 2 years ago

@peastman

Let me make sure I understand. This error happens when all of the following are true:

correct.

If any one of those is not true, it works. Is that correct?

I don't know; I haven't tried all of the permutations. However, I need the latter two points to be True for my use cases. When the latter two points are True and the first is True, it fails; when the latter two are True and the first is False, it works.

How does this relate to the original problem you described up at the top? That one produced an exception about tensors being on different devices, while this one produces CUDA_ERROR_INVALID_HANDLE.

The precise exception is not the point; after more digging, I realized there is an edge case associated with running an nnpops-implemented System with the interpolate argument. Perhaps these should be separate issues? The main thing is that I cannot seem to use these two functionalities together, which is a prerequisite for the energy-matching assertion in the test you wrote.

peastman commented 2 years ago

Here's the error I get when running your example.

Traceback (most recent call last):
  File "test.py", line 21, in <module>
    context = openmm.Context(system, _int)
  File "/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/openmm/openmm.py", line 5125, in __init__
    this = _openmm.new_Context(*args)
openmm.OpenMMException: Unknown device: 87. If you have recently updated the caffe2.proto file to add a new device type, did you forget to update the DeviceTypeName() function to reflect such recent changes?
Exception raised from DeviceTypeName at /tmp/pip-req-build-d1tk7kuo/c10/core/DeviceType.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6a (0x7f39ccd45dba in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd8 (0x7f39ccd42338 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::DeviceTypeName[abi:cxx11](c10::DeviceType, bool) + 0x309 (0x7f39ccd22169 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: torch::jit::Unpickler::readInstruction() + 0x1d53 (0x7f3a141cb0a3 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::jit::Unpickler::run() + 0xa9 (0x7f3a141cb599 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::jit::Unpickler::parse_ivalue() + 0x2f (0x7f3a141cb7cf in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&) + 0x42c (0x7f3a1416faac in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x325cda5 (0x7f3a1416fda5 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x32603cb (0x7f3a141733cb in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::jit::load(std::shared_ptr<caffe2::serialize::ReadAdapterInterface>, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x1c0 (0x7f3a14174560 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0xc7 (0x7f3a141812f7 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: TorchPlugin::TorchForceImpl::initialize(OpenMM::ContextImpl&) + 0x65 (0x7f398e74b1e5 in /usr/local/openmm/lib/libOpenMMTorch.so)
frame #12: OpenMM::ContextImpl::initialize() + 0x422 (0x7f39909b6a52 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #13: OpenMM::Context::Context(OpenMM::System const&, OpenMM::Integrator&, OpenMM::ContextImpl&) + 0xf8 (0x7f39909b1228 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #14: OpenMM::ContextImpl::createLinkedContext(OpenMM::System const&, OpenMM::Integrator&) + 0x31 (0x7f39909b4341 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #15: OpenMM::CustomCVForceImpl::initialize(OpenMM::ContextImpl&) + 0x3b2 (0x7f39909c5482 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #16: OpenMM::ContextImpl::initialize() + 0x422 (0x7f39909b6a52 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #17: OpenMM::Context::Context(OpenMM::System const&, OpenMM::Integrator&) + 0x78 (0x7f39909b0fa8 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #18: <unknown function> + 0x159676 (0x7f3990fca676 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/openmm/_openmm.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #36: __libc_start_main + 0xe7 (0x7f3a4fcf1c87 in /lib/x86_64-linux-gnu/libc.so.6)

Notice the message "Unknown device: 87" near the top. Each time I run it, there's a different number. That makes me think it might be a problem with uninitialized memory somewhere. I'm not sure where it's getting the number from though. The error happens in the first line of TorchForceImpl::initialize():

module = torch::jit::load(owner.getFile());
peastman commented 2 years ago

The above was using the main branch, so it actually wasn't using the NNPOps optimized version. Strange...

dominicrufa commented 2 years ago

That's especially strange; I haven't encountered that. (Unintentionally closed the issue.) I can't tell if this is a version issue, but all of my packages come from conda: omm_dev.txt. I'm going to play around with this a bit more before I give up.

peastman commented 2 years ago

I think this may be an issue with incompatible versions of pytorch. Investigating...

peastman commented 2 years ago

I was compiling OpenMM-Torch against a version of libtorch downloaded from https://pytorch.org, and I think it was incompatible with the one from conda. I needed to do that because the conda version was missing the CMake files needed to compile against it. I updated to the newest conda package (PyTorch 1.10.0), and now it does include the CMake files. But when I try to compile against it, all the test cases fail to build with the errors

/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so: undefined reference to `std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream()@GLIBCXX_3.4.26'
/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so: undefined reference to `std::__cxx11::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >::basic_stringstream()@GLIBCXX_3.4.26'
collect2: error: ld returned 1 exit status
jchodera commented 2 years ago

The packages installed to build pytorch can differ from the packages installed to run it when you just conda install the package. Is it possible that you need to install some of those to build things with pytorch?

peastman commented 2 years ago

I don't think so. The link errors refer to standard C++ functions. Usually that indicates a binary incompatibility of some sort, either libraries were compiled with different ABIs or different versions of libstdc++.

jchodera commented 2 years ago

I was thinking that it might be trying to use your system libraries instead of the conda-forge built libraries installed via the packages appearing in the build: dependencies that don't appear in the run: dependencies.

dominicrufa commented 2 years ago

@peastman: I was playing around with the nnpops implementation and discovered that the error thrown here might somehow be a consequence of placing the TorchForce into a CustomCVForce, as you did here.

If I set interpolate=False and replace your ANIForce implementation with

class ANIForce(torch.nn.Module):

    def __init__(self, model, species, atoms):
        super(ANIForce, self).__init__()
        self.model = model
        self.species = species
        self.energyScale = torchani.units.hartree2kjoulemol(1)

        if atoms is None:
            self.indices = None
        else:
            self.indices = torch.tensor(atoms, dtype=torch.int64)

        self.model = model
        self.pbc = torch.tensor([True, True, True], dtype=torch.bool)

    def forward(self, positions, boxvectors: Optional[torch.Tensor] = None, scale: Optional[torch.Tensor] = None):
and add a scale GlobalParameter like this:

force = openmmtorch.TorchForce(filename)
force.setForceGroup(forceGroup)
if topology.getPeriodicBoxVectors() is not None:
    force.setUsesPeriodicBoundaryConditions(True)
force.addGlobalParameter('scale', 1.)
system.addForce(force)

I can manipulate the global parameter and make calls to state.getPotentialEnergy() without seeing the openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400).

I'm not sure how easy it would be to find the root cause of the CustomCVForce error, but I wonder if the createMixedSystem function here might be modified to not place the TorchForce into the CustomCVForce, and instead leave it as a separate force (with the scale GlobalParameter still equipped), as sketched below. It's a temporary workaround, but functionally I don't think it would be any different.
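For concreteness, the forward body could then fold the global parameter in along these lines (a hedged sketch of the idea only, not the actual patch; _ani_energy is a hypothetical helper standing in for the existing self.model call, and interpolating the corresponding MM terms would still need to be handled separately):

    def forward(self, positions, boxvectors: Optional[torch.Tensor] = None,
                scale: Optional[torch.Tensor] = None):
        # _ani_energy is a hypothetical helper wrapping the existing
        # self.model call; it returns the ML energy in Hartree.
        energy = self._ani_energy(positions, boxvectors)
        if scale is None:
            scale = torch.ones(1, dtype=energy.dtype, device=energy.device)
        # TorchForce passes global parameters in as extra tensor arguments,
        # so 'scale' smoothly turns the ML contribution on and off.
        return scale * energy * self.energyScale  # Hartree --> kJ/mol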

your thoughts?

jchodera commented 2 years ago

@peastman: Since it will take a while to establish why putting a TorchForce inside a CustomCVForce throws an OpenMMException, could you make the change @dominicrufa suggests now, so we can start using openmm-ml while this is being debugged?

peastman commented 2 years ago

@dominicrufa could you post the output of conda list in your environment? Also, what are CUDA_SDK_ROOT_DIR and CUDA_TOOLKIT_ROOT_DIR set to in CMake?

dominicrufa commented 2 years ago

@peastman, my conda list is in this comment.

Also, what are CUDA_SDK_ROOT_DIR and CUDA_TOOLKIT_ROOT_DIR set to in CMake?

If you are referring to my omm installation, I am using a nightly build from conda-forge; I'm not building from source.

peastman commented 2 years ago

I'm referring to the OpenMM-Torch plugin. Do you build it from source or install with conda?

dominicrufa commented 2 years ago

conda. Everything is installed with conda. openmmtorch pins the conda-forge release of openmm; once everything but openmm is installed, you have to force-install the omnia-dev version of openmm so it plays nicely with openmm-torch.

jchodera commented 2 years ago

Is there an issue with the build environments of openmm from omnia-dev not being fully matched with the conda-forge build infrastructure? Or do we think this issue is independent of build version incompatibilities?

dominicrufa commented 2 years ago

@jchodera: conda's openmmtorch requires openmm 7.7 (it pins the latest conda-forge version), but it is not compatible with the latest master release; we need omnia's dev version to operate it. This isn't a blocker at present.

@peastman: I can make a PR to fix the issue and make the test pass with nnpops equipped/unequipped, if you are blocked on the installation part. Let me know.

jchodera commented 2 years ago

What's the best way to proceed here? Sounds like we need another bugfix release of OpenMM?

@mikemhenry: Would it be super difficult to add recipes for these to https://github.com/omnia-md/conda-dev-recipes? This might be an alternative to additional bugfix releases as we work out the kinks here.

peastman commented 2 years ago

I finally managed to get it to compile. The lessons I learned:

  1. Don't install PyTorch from conda-forge. It's broken. Instead download libtorch from the PyTorch website.
  2. Absolutely do not ever install the conda-forge compilers package. It breaks all sorts of things, and it's impossible to uninstall. You have to delete the whole environment and recreate it from scratch.

I'll see if I can figure out what's going on with the CustomCVForce.

jchodera commented 2 years ago

Don't install PyTorch from conda-forge. It's broken.

@mikemhenry : Can you address this with the feedstock maintainers?

mikemhenry commented 2 years ago

What's the best way to proceed here? Sounds like we need another bugfix release of OpenMM?

@mikemhenry : Would it be super difficult to add recipes for these to https://github.com/omnia-md/conda-dev-recipes ? This might be an alternative to additional bugfix releases as we work out the kinks here.

I'm not exactly sure what you mean. Do you mean also getting packages like openmm-ml and nnpops built there and landed on omnia? Or something else?

Don't install PyTorch from conda-forge. It's broken.

@mikemhenry : Can you address this with the feedstock maintainers?

:upside_down_face: A few weeks ago I re-built the entire pytorch-gpu recipe https://github.com/conda-forge/pytorch-cpu-feedstock/pull/89#issuecomment-1042665168 and got it uploaded to conda-forge, so I am working on getting the pytorch ecosystem on conda-forge fixed. It's kinda complicated why it was broken, but we are slowly getting things rebuilt and working again. A handful of packages have been rebuilt, so it might be working now depending on what has been rebuilt.

peastman commented 2 years ago

Still no success at creating a working environment. Here are some things I've tried so far.

dominicrufa commented 2 years ago

@peastman

Don't install PyTorch from conda-forge. It's broken. Instead download libtorch from the PyTorch website.

What is wrong with pytorch on conda-forge?

peastman commented 2 years ago

See https://github.com/openmm/openmm-ml/issues/25#issuecomment-1069556407 and https://github.com/openmm/openmm-ml/issues/25#issuecomment-1062383986.

dominicrufa commented 2 years ago

@peastman: I didn't run into this trouble a few weeks ago conda-installing everything. Perhaps versions on conda-forge have changed since? Have you tried creating a yaml from my conda-list text file here and making an env from that?

peastman commented 2 years ago

I need to be able to build OpenMM and plugins from source. Otherwise, I can't debug and fix the problem. Just installing conda packages isn't an option.

dominicrufa commented 2 years ago

@peastman, if building all of these packages to debug the CustomCVForce issue is the blocker, might it be easier in the interim to just modify the createMixedSystem function so that the TorchForce is outside of the CustomCVForce, like I mentioned here? That only requires conda-installing everything rather than building from source.

peastman commented 2 years ago

That's possible, but I think it would be a less clean implementation. Since the compilation errors are a showstopper in any case, I'd prefer to focus on them. Once that's figured out and we understand the cause of the CUDA error, we can change the implementation if absolutely necessary. But hopefully the fix will turn out to be something much simpler.

peastman commented 2 years ago

I finally figured it out! It turned out to be the same problem as in https://github.com/openmm/openmm/pull/3520: I needed to update to a newer version of the package containing glibc. Having done that, I can compile OpenMM-Torch against the PyTorch package installed from conda-forge.

Moving on to the next problem, which is that I can't compile NNPOps.

peastman commented 2 years ago

Success! I can finally compile all the libraries and reproduce the problem above!

dominicrufa commented 2 years ago

@peastman , I'd be curious to hear what the underlying issue is re: not being able to put a TorchForce into a CustomCVForce if/when you are able to figure it out.

peastman commented 2 years ago

At the moment I'm kind of confused. The error I get is similar to yours but in a different place:

  File "/home/peastman/miniconda3/envs/cf/lib/python3.9/site-packages/torchani/aev.py", line 114, in compute_shifts
    inv_distances = reciprocal_cell.norm(2, -1)
    num_repeats = torch.ceil(cutoff * inv_distances).to(torch.long)
    num_repeats = torch.where(pbc, num_repeats, num_repeats.new_zeros(()))
                  ~~~~~~~~~~~ <--- HERE
    r1 = torch.arange(1, num_repeats[0].item() + 1, device=cell.device)
    r2 = torch.arange(1, num_repeats[1].item() + 1, device=cell.device)
RuntimeError: Expected condition, x and y to be on the same device, but condition is on cpu and x and y are on cuda:0 and cuda:0 respectively

The problem is that pbc in that line is on the CPU. That's a tensor that gets created in the constructor of the module:

https://github.com/openmm/openmm-ml/blob/28b4fadd761221fbe085570faac3e95a2d0be4e0/openmmml/models/anipotential.py#L99

That shouldn't be a problem. In the C++ code, we tell it to move the module to the GPU, which ought to move all tensors stored in fields of the module to the GPU.

const torch::Device device(torch::kCUDA, cu.getDeviceIndex()); // This implicitly initialize PyTorch
module.to(device);
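One PyTorch detail that could matter here: Module.to() only moves parameters and registered buffers; a tensor stored as a plain attribute keeps its original device. Registering pbc as a buffer would let the module.to(device) call above carry it along. A minimal sketch of the difference (assuming a CUDA device is available):

import torch

class Demo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # A registered buffer is moved by Module.to(); a plain
        # attribute tensor is not.
        self.register_buffer('pbc', torch.tensor([True, True, True], dtype=torch.bool))
        self.plain = torch.tensor([True, True, True], dtype=torch.bool)

m = Demo().to('cuda')
print(m.pbc.device)    # cuda:0
print(m.plain.device)  # cpu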

I can make this error go away by explicitly setting the device in the Python code where the tensor is created. But then I get a similar error in a different place:

  File "/home/peastman/miniconda3/envs/cf/lib/python3.9/site-packages/torchani/aev.py", line 150, in neighbor_pairs
        shifts (:class:`torch.Tensor`): tensor of shape (?, 3) storing shifts
    """
    coordinates = coordinates.detach().masked_fill(padding_mask.unsqueeze(-1), math.nan)
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    cell = cell.detach()
    num_atoms = padding_mask.shape[1]
RuntimeError: expected self and mask to be on the same device, but got mask on cpu and self on cuda:0

It appears that padding_mask is on the CPU. It gets created inside torchani when invoking neighbor_pairs():

atom_index12, shifts = neighbor_pairs(species == -1, coordinates_, cell, shifts, Rcr)

It's the first argument. So it seems that species must be on the GPU. I tried moving it to the GPU in the Python code, but that doesn't make any difference.

And of course, none of this has anything to do with whether it's inside a CustomCVForce.

dominicrufa commented 2 years ago

@peastman, to reproduce the problem exactly, you'll have to use this script to incorporate nnpops. I mentioned that in the issue, but it might have gotten lost in the thread. If that doesn't solve your problem, I suspect something else is wrong.

dominicrufa commented 2 years ago

@peastman, have you been able to reproduce the problem?

peastman commented 2 years ago

It looks to me like this may involve a bug in PyTorch. It seems to be messing up the CUDA context. Immediately before and after we call module.forward() at https://github.com/openmm/openmm-torch/blob/84f7d884ec0d9d72a57a769046bdddd1d62b8fc2/platforms/cuda/src/CudaTorchKernels.cpp#L119, I check to see what context is current with

CUcontext ctx;
cuCtxGetCurrent(&ctx);
printf("%p\n", (void*) ctx);  // print the handle of the current CUDA context

From this I can see that PyTorch is not restoring the context correctly.

@raimis do you have any ideas about how to handle this? I've tried various ways of restoring the context. They fix the CUDA error coming from OpenMM code, but then lead to CUDA errors in PyTorch code.

dominicrufa commented 2 years ago

@peastman, is this specifically what is throwing the openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400) error? Would I still be seeing this if the context were being restored correctly?

peastman commented 2 years ago

Correct. If I manually restore the context, the error goes away. But if I then follow with a second energy evaluation, we get a CUDA error inside PyTorch.

dominicrufa commented 2 years ago

@peastman, if it is indeed a pytorch bug, would it make more sense to use this hack in the meantime, since the time horizon for the pytorch bugfix is unknown? I only say this because this issue is blocking for me. If you'd prefer to avoid the hack, I'll open a PR fixing the problem with the hack (for reference's sake) as a temporary workaround that I can integrate into my downstream workflow.

jchodera commented 2 years ago

It looks to me like this may involve a bug in PyTorch. It seems to be messing up the CUDA context.

Perhaps the NVIDIA folks like @dmclark17 might be able to help us here since it involves a few community codes?

dmclark17 commented 2 years ago

Sure—I can do some investigating and try to reproduce on my end.

I've tried various ways of restoring the context. They fix the CUDA error coming from OpenMM code, but then lead to CUDA errors in PyTorch code.

I'm still getting up to speed on how contexts are being handled here—have you tried popping the current context before the PyTorch code and then pushing it afterwards?