openmm / NNPOps

High-performance operations for neural network potentials

Can't run steps of dynamics with NNPOps `TorchForce` #28

Open dominicrufa opened 3 years ago

dominicrufa commented 3 years ago

In attempting to run MD on a TorchForce-equipped System (the TorchForce has the NNPOps symmetry functions equipped as described here), I am observing strange behavior. Namely, I am able to create a Context with the System and query the State object for a potential energy, but when I run a step of dynamics, I observe

Traceback (most recent call last):
  File "/lila/home/rufad/github/qmlify/qmlify/openmm_torch/notebooks/yield_dynamics.py", line 119, in <module>
    ml_int.step(1)
  File "/home/rufad/anaconda3/envs/nnpops/lib/python3.9/site-packages/simtk/openmm/openmm.py", line 7036, in step
    return _openmm.CustomIntegrator_step(self, steps)
simtk.openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torchani/nn.py", line 95, in forward
    if torch.gt((torch.size(midx))[0], 0):
      input_ = torch.index_select(aev0, 0, midx)
      _29 = torch.flatten((_22).forward(input_, ), 0, -1)
                           ~~~~~~~~~~~~ <--- HERE
      _30 = torch.masked_scatter_(output, mask, _29)
    else:
  File "code/__torch__/torch/nn/modules/container.py", line 22, in forward
    _5 = getattr(self, "5")
    _6 = getattr(self, "6")
    input0 = (_0).forward(input, )
              ~~~~~~~~~~~ <--- HERE
    input1 = (_1).forward(input0, )
    input2 = (_2).forward(input1, )
  File "code/__torch__/torch/nn/modules/linear.py", line 13, in forward
    input: Tensor) -> Tensor:
    _0 = __torch__.torch.nn.functional.linear
    return _0(input, self.weight, self.bias, )
           ~~ <--- HERE
  File "code/__torch__/torch/nn/functional.py", line 4, in linear
    weight: Tensor,
    bias: Optional[Tensor]=None) -> Tensor:
  return torch.linear(input, weight, bias)
         ~~~~~~~~~~~~ <--- HERE
def celu(input: Tensor,
    alpha: float=1.,

Traceback of TorchScript, original code (most recent call last):
  File "/home/rufad/anaconda3/envs/nnpops/lib/python3.9/site-packages/torchani/nn.py", line 68, in forward
            if midx.shape[0] > 0:
                input_ = aev.index_select(0, midx)
                output.masked_scatter_(mask, m(input_).flatten())
                                             ~ <--- HERE
        output = output.view_as(species)
        return SpeciesEnergies(species, torch.sum(output, dim=1))
  File "/home/rufad/anaconda3/envs/nnpops/lib/python3.9/site-packages/torch/nn/modules/container.py", line 119, in forward
    def forward(self, input):
        for module in self:
            input = module(input)
                    ~~~~~~ <--- HERE
        return input
  File "/home/rufad/anaconda3/envs/nnpops/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 94, in forward
    def forward(self, input: Tensor) -> Tensor:
        return F.linear(input, self.weight, self.bias)
               ~~~~~~~~ <--- HERE
  File "/home/rufad/anaconda3/envs/nnpops/lib/python3.9/site-packages/torch/nn/functional.py", line 1753, in linear
    if has_torch_function_variadic(input, weight):
        return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
    return torch._C._nn.linear(input, weight, bias)
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

On the other hand, if I do not equip the NNPOps ANI symmetry functions, this error is not encountered. I didn't notice any examples/pytests in this repo regarding equipping a TorchForce with ANISymmetryFunctions, so I'm not sure whether this interoperability has been tested yet. If it has, would it be possible to add a pytest/example? I'm also not sure whether this should go into the openmm-torch repo instead (since the functionality I want to exercise uses NNPOps). I'd be happy to help troubleshoot if needed.
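For reference, a minimal sketch of the setup being described, following the pattern in the NNPOps README of the time (the choice of ANI2x and the file name are assumptions; this requires torchani, NNPOps, and a CUDA device, so it is a sketch rather than a tested reproducer):

```python
# Sketch: swap TorchANI's pure-PyTorch AEV computer for the NNPOps
# optimized symmetry functions, then script the model so it can be
# wrapped in an openmm-torch TorchForce.
import torch
import torchani
from NNPOps.SymmetryFunctions import TorchANISymmetryFunctions

device = torch.device('cuda')
nnp = torchani.models.ANI2x(periodic_table_index=True).to(device)

# Replace the AEV computer with the NNPOps implementation
nnp.aev_computer = TorchANISymmetryFunctions(nnp.aev_computer)

# TorchScript-compile and save for use with TorchForce
scripted = torch.jit.script(nnp)
scripted.save('ani2x_nnpops.pt')  # hypothetical file name
```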

peastman commented 3 years ago

This seems to be a common error. The issue below has lots of discussion from people who have encountered it.

https://github.com/NVIDIA/apex/issues/580

Here's one where the problem was fixed by upgrading to PyTorch 1.9.

https://github.com/allenai/allennlp/issues/5064

In this one it was fixed by upgrading to CUDA 11.2.

https://stackoverflow.com/questions/66600362/runtimeerror-cuda-error-cublas-status-execution-failed-when-calling-cublassge

There are many other pages discussing the same error. Often it seems related to inconsistencies in the shapes or dtypes of tensors.
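As an illustration of the kind of mismatch that can surface as this cuBLAS error: `F.linear` computes `input @ weight.T`, so if the feature width coming out of the AEV computer doesn't match the first linear layer's `in_features`, the failure shows up inside `cublasSgemm` rather than as a clear shape error. A quick sanity check (hypothetical helper, not part of any library) before scripting:

```python
def linear_shapes_ok(input_shape, weight_shape):
    """Check that F.linear(input, weight) is shape-consistent:
    input @ weight.T requires input_shape[-1] == weight_shape[1]
    (the layer's in_features)."""
    return len(weight_shape) == 2 and input_shape[-1] == weight_shape[1]

# A (batch, 384) activation into a Linear(in_features=384, out_features=160)
# layer is consistent...
assert linear_shapes_ok((8, 384), (160, 384))
# ...but an AEV of the wrong width would only fail deep inside cublasSgemm
assert not linear_shapes_ok((8, 1008), (160, 384))
```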

dominicrufa commented 3 years ago

I noticed these, too. I will give these solutions a try and report back. Thanks for the sleuthing.

jchodera commented 2 years ago

@dominicrufa : Is this still an active issue?