openmm / openmm-ml

High level API for using machine learning models in OpenMM simulations

`TestMLPotential.py` fails with `nnpops` implementation #25

Closed dominicrufa closed 2 years ago

dominicrufa commented 2 years ago

I'm not too familiar with torch tracebacks, but it seems like Torch isn't robust to tensors being placed on different devices:

ERROR: testCreateMixedSystem (__main__.TestMLPotential)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/lila/home/rufad/github/openmm-ml/test/TestMLPotential.py", line 27, in testCreateMixedSystem
    mixedEnergy = mixedContext.getState(getEnergy=True).getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/openmmml/models/anipotential/___torch_mangle_14.py", line 36, in forward
      _5 = torch.mul(boxvectors1, 10.)
      pbc0 = self.pbc
      _6, energy1, = (model0).forward(_4, _5, pbc0, )
                      ~~~~~~~~~~~~~~~ <--- HERE
      energy = energy1
    energyScale = self.energyScale
  File "code/__torch__/NNPOps/OptimizedTorchANI.py", line 19, in forward
    species_aevs = (aev_computer).forward(species_coordinates0, cell, pbc, )
    neural_networks = self.neural_networks
    species_energies = (neural_networks).forward(species_aevs, )
                        ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    energy_shifter = self.energy_shifter
    species_energies0 = (energy_shifter).forward(species_energies, None, None, )
  File "code/__torch__/NNPOps/BatchedNN.py", line 10, in forward
    species_aev: Tuple[Tensor, Tensor]) -> __torch__.NNPOps.EnergyShifter.SpeciesEnergies:
    _0 = getattr(self, "0")
    return (_0).forward(species_aev, )
            ~~~~~~~~~~~ <--- HERE
  def __len__(self: __torch__.NNPOps.BatchedNN.TorchANIBatchedNN) -> int:
    return 1
  File "code/__torch__/NNPOps/BatchedNN.py", line 33, in forward
    layer0_weights = self.layer0_weights
    layer0_biases = self.layer0_biases
    vectors0 = ops.NNPOpsBatchedNN.BatchedLinear(vectors, layer0_weights, layer0_biases)
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    vectors1 = __torch__.torch.nn.functional.celu(vectors0, 0.10000000000000001, False, )
    layer2_weights = self.layer2_weights

Traceback of TorchScript, original code (most recent call last):
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/models/anipotential.py", line 135, in forward
                    self.pbc = self.pbc.to(positions.device)
                    boxvectors = boxvectors.to(torch.float32)
                    _, energy = self.model((self.species, positions), cell=10.0*boxvectors, pbc=self.pbc)
                                ~~~~~~~~~~ <--- HERE

                return energy * self.energyScale # Hartree --> kJ/mol
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/OptimizedTorchANI.py", line 53, in forward
        species_coordinates = self.species_converter(species_coordinates)
        species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
        species_energies = self.neural_networks(species_aevs)
                           ~~~~~~~~~~~~~~~~~~~~ <--- HERE
        species_energies = self.energy_shifter(species_energies)

  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/BatchedNN.py", line 122, in forward
    def forward(self, species_aev: Tuple[Tensor, Tensor]) -> SpeciesEnergies:
        return self[0].forward(species_aev)
               ~~~~~~~~~~~~~~~ <--- HERE
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/BatchedNN.py", line 99, in forward
        vectors = aev.unsqueeze(-2).unsqueeze(-1)

        vectors = batchedLinear(vectors, self.layer0_weights, self.layer0_biases) # Linear 0
                  ~~~~~~~~~~~~~ <--- HERE
        vectors = F.celu(vectors, alpha=0.1)                                      # CELU   1
        vectors = batchedLinear(vectors, self.layer2_weights, self.layer2_biases) # Linear 2
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper__bmm)

----------------------------------------------------------------------
Ran 1 test in 32.967s

FAILED (errors=1)

@peastman, any idea what is going wrong here? Or perhaps @raimis knows what is wrong.

Alternatively, if I try to run this without GPUs, it throws a runtime error:

======================================================================
ERROR: testCreateMixedSystem (__main__.TestMLPotential)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/lila/home/rufad/github/openmm-ml/test/TestMLPotential.py", line 17, in testCreateMixedSystem
    mixedSystem = potential.createMixedSystem(pdb.topology, mmSystem, mlAtoms, interpolate=False)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/mlpotential.py", line 265, in createMixedSystem
    self._impl.addForces(topology, newSystem, atomList, forceGroup, **args)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/models/anipotential.py", line 91, in addForces
    model = OptimizedTorchANI(model, species).to(device)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 616, in _apply
    self._buffers[key] = fn(buf)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

Do we generally want to make this package robust to the platform type, or only to CUDA?

peastman commented 2 years ago

Contexts are handled by the ContextSelector class. It pushes the context in its constructor and pops it in the destructor. To use it, you create an instance as a local variable. The context is current from that line to the end of the enclosing block.
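
For reference, here's a minimal sketch of that RAII pattern (illustrative only; the real ContextSelector works through OpenMM's CudaContext rather than the raw driver API, and the class name here is just a placeholder):

#include <cuda.h>

// Push-in-constructor / pop-in-destructor: the context is current from the
// point where the object is created until the end of the enclosing block.
class ScopedContextSelector {
public:
  explicit ScopedContextSelector(CUcontext context) {
    cuCtxPushCurrent(context);   // make this context current on the calling thread
  }
  ~ScopedContextSelector() {
    CUcontext popped;
    cuCtxPopCurrent(&popped);    // restore whatever was current before
  }
};

void doGpuWork(CUcontext context) {
  ScopedContextSelector selector(context);  // context is current from here...
  // ... driver API calls, kernel launches, etc. ...
}                                           // ...until the closing brace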

Here is the method where the problem occurs.

https://github.com/openmm/openmm-torch/blob/84f7d884ec0d9d72a57a769046bdddd1d62b8fc2/platforms/cuda/src/CudaTorchKernels.cpp#L80-L158

There are ContextSelectors that set the context for two short blocks, one in lines 97-101 and another in lines 145-150. The method does not set a context at the point where the PyTorch model is invoked (either line 114 or 119), and usually that works.

But it fails when the TorchForce is inside a CustomCVForce. In that case, this whole method is called from https://github.com/openmm/openmm/blob/c7af17c8ba2b6c3667e5126b494d1972b1b6d254/platforms/common/src/CommonKernels.cpp#L5389. The invoking method has already placed a context onto the stack, and PyTorch removes it.

This does suggest a workaround: possibly we could modify the implementation of CustomCVForce to not have a context set when it calls calcForcesAndEnergy(). That might work as long as nothing at an even higher level has set a context. But of course, the whole point of having a stack of contexts is so that you don't have to worry about that.
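
A hypothetical sketch of that idea, just to illustrate clearing the stack around the nested call (invokeWithoutCurrentContext and callNestedForce are placeholders, not the real CustomCVForce code):

#include <cuda.h>
#include <functional>

// Temporarily ensure no context is current while the nested call runs
// (so PyTorch can do whatever it likes with the stack), then restore ours.
void invokeWithoutCurrentContext(const std::function<void()>& callNestedForce) {
  CUcontext ourContext = NULL;
  cuCtxPopCurrent(&ourContext);    // leave the thread with no current context

  callNestedForce();               // e.g. the inner calcForcesAndEnergy() that ends up in PyTorch

  if (ourContext != NULL)
    cuCtxPushCurrent(ourContext);  // put our context back on top of the stack
}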

peastman commented 2 years ago

The workaround is in https://github.com/openmm/openmm/pull/3533.

dmclark17 commented 2 years ago

Thanks for the explanation!

I'm trying to create a standalone reproducer to make sure I understand and can communicate the issue. I am loading a simple TorchScript model that multiplies an input tensor by two. I created it using the following:

import torch

class TestModule(torch.nn.Module):
    def forward(self, input):
        return 2 * input

module = torch.jit.script(TestModule())
module.save('model.pt')

The C++ code looks like this:

#include <cuda.h>

#include <torch/torch.h>
#include <torch/script.h>

#include <stdio.h>

void printContext(const char *msg) {
  CUcontext context;
  CUresult res = cuCtxGetCurrent(&context);
  printf("Context %d. Code %d. %s\n", context, res, msg);
}

int main() {
  cuInit(0);

  CUcontext ctx, myContext;
  CUdevice dev;
  CUresult res;

  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, CU_CTX_SCHED_SPIN, dev);
  printContext("After creation");

  torch::jit::script::Module module = torch::jit::load("../model.pt");
  module.to(at::kCUDA);
  printContext("After loading torchscript");

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1,}).to(at::kCUDA));
  at::Tensor output = module.forward(inputs).toTensor();
  printContext("After run");
}

I am seeing the following output:

Context 1471272896. Code 0. After creation
Context 1471272896. Code 0. After loading torchscript
Context 1471272896. Code 0. After run

In this case, it doesn't seem like PyTorch is changing the context. On the other hand, if there isn't a current context when the JIT module is executed, PyTorch seems to create a new context and leave it on the stack. That doesn't match the behavior suggested by the error observed with OpenMM-torch. Do you have any ideas on how to make the example more realistic? Thanks!

peastman commented 2 years ago

If you move the lines that load the module up to the top of main(), you can reproduce the problem. That matches what happens in OpenMM: the module gets loaded while creating the System, and cuInit() gets called later when you create the Context. The following version also adds a call to cuCtxPushCurrent() to match what happens in the real code even more closely.

int main() {
  torch::jit::script::Module module = torch::jit::load("../model.pt");
  module.to(at::kCUDA);

  cuInit(0);

  CUcontext ctx, myContext;
  CUdevice dev;
  CUresult res;

  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, CU_CTX_SCHED_SPIN, dev);
  printContext("After creation");

  cuCtxPushCurrent(ctx);
  printContext("After push");

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1,}).to(at::kCUDA));
  at::Tensor output = module.forward(inputs).toTensor();
  printContext("After run");
}

Here is the output I get.

Context 319176160. Code 0. After creation
Context 319176160. Code 0. After push
Context 319165568. Code 0. After run

dmclark17 commented 2 years ago

Interesting—I am not able to reproduce that on my end; with that ordering, I am seeing:

Context 255. Code 3. After loading torchscript. Expected error code 3 for not initialized
Context 1472714976. Code 0. After creation
Context 1472714976. Code 0. After push
Context 1472714976. Code 0. After run

I think I'm using the CUDA toolkit that conda installed with PyTorch—I'm not sure if that could be causing the difference.

peastman commented 2 years ago

What version of PyTorch do you have? I was testing with 1.9.1.

dmclark17 commented 2 years ago

I am using 1.11.0 and linking to the libtorch that comes with the conda installation. I will try using 1.9.1!

dominicrufa commented 2 years ago

@peastman, can you merge the workaround into openmm's main, or are we anticipating a PyTorch bug fix?

peastman commented 2 years ago

Merged. We should still figure out what's going on with PyTorch, but it should fix the immediate problem.

What version of PyTorch were you using when you encountered the problem?

dominicrufa commented 2 years ago

pytorch                   1.10.0          cuda112py39h3ad47f5_1    conda-forge
pytorch-gpu               1.10.0          cuda112py39h0bbbad9_1    conda-forge

@peastman, were you able to see the problem with nnpops equipped, specifically? if so, would you be able to push your modifications and commit to main of this repo? otherwise, I can do it if you can review it afterward.

peastman commented 2 years ago

were you able to see the problem with nnpops equipped, specifically?

Yes.

would you be able to push your modifications and commit to main of this repo?

I didn't make any changes to code in this repo.

dominicrufa commented 2 years ago

@peastman, which pull request did you use to reproduce the problem?

peastman commented 2 years ago

The one you said to use, #21.

dominicrufa commented 2 years ago

Right, yes. Sorry for the confusion. I think it just needs to be rebased onto main and merged so that the TorchANI force can be equipped with nnpops. But I don't have write permissions to that PR. I can pull it into my PR and rebase/request a merge into main if you'd prefer.

dmclark17 commented 2 years ago

What version of PyTorch do you have? I was testing with 1.9.1.

I am able to reproduce the issue with 1.9.0:

Context 255. Code 3. After loading torchscript. Expected error code 3 for not initialized
Context 1470519632. Code 0. After creation
Context 1470519632. Code 0. After push
Context 1470509040. Code 0. After run

I am not seeing anything about CUDA contexts in the 1.11.0 release notes.

dmclark17 commented 2 years ago

I've been looking into the difference between PyTorch 1.9 and 1.11, and it seems like 1.9 is calling cudaSetDevice(0) when the JIT module is called—this initializes the primary context. However, this API call is absent in 1.11, which explains why the issue doesn't reproduce in the standalone example. I'll see if I can find the responsible code change.
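
For what it's worth, a small driver-API probe like the one below (my own sketch, not code from PyTorch) should make the difference visible: the primary context retained with cuDevicePrimaryCtxRetain() is a different handle from one created with cuCtxCreate(), and the printout shows which of them ends up current after the runtime call.

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
  cuInit(0);

  CUdevice dev;
  cuDeviceGet(&dev, 0);

  CUcontext created;
  cuCtxCreate(&created, CU_CTX_SCHED_SPIN, dev);  // created context is now current

  CUcontext primary;
  cuDevicePrimaryCtxRetain(&primary, dev);        // handle to the primary context (does not make it current)

  cudaSetDevice(0);                               // the runtime call 1.9 reportedly makes
  cudaFree(0);                                    // force runtime initialization

  CUcontext current;
  cuCtxGetCurrent(&current);
  printf("created %p, primary %p, current after runtime call %p\n",
         (void*)created, (void*)primary, (void*)current);

  cuDevicePrimaryCtxRelease(dev);
  cuCtxDestroy(created);
  return 0;
}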

Would it be possible to try to reproduce the original bug with PyTorch 1.11 to see if it is fixed? I need to use #21 to reproduce, correct?

jchodera commented 2 years ago

@dominicrufa: Was this fixed?

dominicrufa commented 2 years ago

Closing, as this is fixed in main.