Also, when I add the interpolate=True argument to createMixedSystem and equip the resulting System to a Context, it fails with:
Traceback (most recent call last):
File "/lila/home/rufad/nnpops/run.py", line 45, in <module>
context.getState(getEnergy=True).getPotentialEnergy()
File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400)
when I call context.getState(getEnergy=True).getPotentialEnergy(), which I don't know how to debug. However, when I leave interpolate=False, I can pull the state and the potential energy without issue.
It looks like a case where the model is on one device and the input tensor is on a different one.
Alternatively, if I try to run this without GPUs, it throws a runtime error:
What exactly does that mean? Are you running it on a computer without a GPU? Or do you mean there is a GPU, but you're specifying the CPU platform when you create your context? The PyTorch plugin is supposed to work with all platforms.
running on a computer without a GPU.
It looks like a case where the model is on one device and the input tensor is on a different one.
Right. I am just trying to figure out why this is the case and how to fix it. The platform is Reference in the test. Is this an error you see if you run it locally?
TestMLPotential.py passes when I run it locally. Perhaps the problem is that you have CUDA installed (so PyTorch tries to use it), but you don't have any CUDA compatible GPUs (so it fails when it tries)?
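A quick way to check what PyTorch itself thinks it has (just a sanity-check sketch, independent of OpenMM):

import torch
print(torch.version.cuda)         # CUDA version PyTorch was built against (None for CPU-only builds)
print(torch.cuda.is_available())  # whether PyTorch can actually initialize a CUDA device
print(torch.cuda.device_count())  # number of visible CUDA devices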
Sorry, I should clarify; I am trying to run TestMLPotential with the nnpops mixin here and am observing the aforementioned errors. I'm not sure if this is an edge case, and I'm not sure how to solve the issue.
but you don't have any CUDA compatible GPUs (so it fails when it tries)?
I definitely have both, and I can make the nnpops implementation run without observing this issue. It only appears if I set interpolate=True in createMixedSystem.
I think we're talking about different things, since a few different errors are described above. I was referring to the No CUDA GPUs are available error.
Yes, I don't disagree that is the case. This is not the blocking issue, just a passing observation (which you correctly clarified). I am primarily concerned with integrating the nnpops-equipped TorchANI force with createMixedSystem and the interpolate=True argument; the above issue is not a concern since I cannot test nnpops without a GPU anyway.
#!/usr/bin/env python
import torch
import torchani
from NNPOps import OptimizedTorchANI
from openmmtools.testsystems import HostGuestExplicit
from openmmml.mlpotential import MLPotential
from simtk import openmm, unit
import time
import numpy as np
from simtk.openmm import LangevinMiddleIntegrator

temperature = 298.15 * unit.kelvin
frictionCoeff = 1. / unit.picosecond
stepSize = 1. * unit.femtoseconds
hgv = HostGuestExplicit(constraints=None)

potential = MLPotential('ani2x')
system = potential.createMixedSystem(hgv.topology, system=hgv.system, atoms=list(range(126, 156)), implementation='nnpops', interpolate=True)
print("done making system")
_int = LangevinMiddleIntegrator(temperature, frictionCoeff, stepSize)
context = openmm.Context(system, _int)
context.setPositions(hgv.positions)
# query and print out the global parameters:
swig_params = context.getParameters()
print("context parameters:")
for i in swig_params:
    print(i, swig_params[i])
context.getState(getEnergy=True).getPotentialEnergy()
@peastman, if I reduce the problem to this code snippet and pull main into this PR (so that I can use nnpops on GPU), then this snippet works with interpolate=False, but not with True. If I set it to True, I see:
Traceback (most recent call last):
File "/lila/home/rufad/nnpops/run.py", line 45, in <module>
context.getState(getEnergy=True).getPotentialEnergy()
File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400)
and I'm not sure how to debug this.
Let me make sure I understand. This error happens when all of the following are true:
- interpolate=True
If any one of those is not true, it works. Is that correct?
How does this relate to the original problem you described up at the top? That one produced an exception about tensors being on different devices, while this one produces CUDA_ERROR_INVALID_HANDLE.
@peastman
Let me make sure I understand. This error happens when all of the following are true:
correct.
If any one of those is not true, it works. Is that correct?
I don't know. I haven't tried all of the permutations; however, I need the latter two points to be True for my use cases. When the latter two points are True and the first is True, it fails. However, when the latter two are True and the first is False, it works.
How does this relate to the original problem you described up at the top? That one produced an exception about tensors being on different devices, while this one produces CUDA_ERROR_INVALID_HANDLE.
The precise issue is not the problem; after more digging, and showing this, I realized there is an edge case associated with running an nnpops-implemented System with the interpolate argument. Perhaps these should be different issues? The main thing is that I cannot seem to even use these two functionalities together, which is a prerequisite to performing the energy-matching assertion in the test you wrote.
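For concreteness, the kind of energy-matching check I mean is roughly the following. This is only an illustrative sketch, not the actual TestMLPotential code; the context variables and the global-parameter name are placeholders.

from simtk import unit

def energy(context):
    # potential energy in kJ/mol for easy comparison
    return context.getState(getEnergy=True).getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)

# At the endpoints of the interpolation parameter, the mixed system should
# reproduce the pure-MM energy and the ML/MM energy, respectively.
mixed_context.setParameter('interpolation_parameter', 0.0)
assert abs(energy(mixed_context) - energy(mm_context)) < 1e-3
mixed_context.setParameter('interpolation_parameter', 1.0)
assert abs(energy(mixed_context) - energy(ml_context)) < 1e-3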
Here's the error I get when running your example.
Traceback (most recent call last):
File "test.py", line 21, in <module>
context = openmm.Context(system, _int)
File "/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/openmm/openmm.py", line 5125, in __init__
this = _openmm.new_Context(*args)
openmm.OpenMMException: Unknown device: 87. If you have recently updated the caffe2.proto file to add a new device type, did you forget to update the DeviceTypeName() function to reflect such recent changes?
Exception raised from DeviceTypeName at /tmp/pip-req-build-d1tk7kuo/c10/core/DeviceType.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6a (0x7f39ccd45dba in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd8 (0x7f39ccd42338 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::DeviceTypeName[abi:cxx11](c10::DeviceType, bool) + 0x309 (0x7f39ccd22169 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: torch::jit::Unpickler::readInstruction() + 0x1d53 (0x7f3a141cb0a3 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::jit::Unpickler::run() + 0xa9 (0x7f3a141cb599 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::jit::Unpickler::parse_ivalue() + 0x2f (0x7f3a141cb7cf in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&) + 0x42c (0x7f3a1416faac in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x325cda5 (0x7f3a1416fda5 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x32603cb (0x7f3a141733cb in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::jit::load(std::shared_ptr<caffe2::serialize::ReadAdapterInterface>, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x1c0 (0x7f3a14174560 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0xc7 (0x7f3a141812f7 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: TorchPlugin::TorchForceImpl::initialize(OpenMM::ContextImpl&) + 0x65 (0x7f398e74b1e5 in /usr/local/openmm/lib/libOpenMMTorch.so)
frame #12: OpenMM::ContextImpl::initialize() + 0x422 (0x7f39909b6a52 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #13: OpenMM::Context::Context(OpenMM::System const&, OpenMM::Integrator&, OpenMM::ContextImpl&) + 0xf8 (0x7f39909b1228 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #14: OpenMM::ContextImpl::createLinkedContext(OpenMM::System const&, OpenMM::Integrator&) + 0x31 (0x7f39909b4341 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #15: OpenMM::CustomCVForceImpl::initialize(OpenMM::ContextImpl&) + 0x3b2 (0x7f39909c5482 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #16: OpenMM::ContextImpl::initialize() + 0x422 (0x7f39909b6a52 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #17: OpenMM::Context::Context(OpenMM::System const&, OpenMM::Integrator&) + 0x78 (0x7f39909b0fa8 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #18: <unknown function> + 0x159676 (0x7f3990fca676 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/openmm/_openmm.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #36: __libc_start_main + 0xe7 (0x7f3a4fcf1c87 in /lib/x86_64-linux-gnu/libc.so.6)
Notice the message "Unknown device: 87" near the top. Each time I run it, there's a different number. That makes me think it might be a problem with uninitialized memory somewhere. I'm not sure where it's getting the number from though. The error happens in the first line of TorchForceImpl::initialize():
module = torch::jit::load(owner.getFile());
The above was using the main branch, so it actually wasn't using the NNPOps optimized version. Strange...
That's especially strange; I haven't encountered that. (I unintentionally closed the issue.) I can't tell if this is a version issue, but all of my packages come from conda:
omm_dev.txt
I'm going to play around with this a bit more before I forfeit.
I think this may be an issue with incompatible versions of pytorch. Investigating...
I was compiling OpenMM-Torch against a version of libtorch downloaded from https://pytorch.org, and I think it was incompatible with the one from conda. I needed to do that because the conda version was missing the CMake files needed to compile against it. I updated to the newest conda package (PyTorch 1.10.0), and now it does include the CMake files. But when I try to compile against it, all the test cases fail to build with the errors
/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so: undefined reference to `std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream()@GLIBCXX_3.4.26'
/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so: undefined reference to `std::__cxx11::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >::basic_stringstream()@GLIBCXX_3.4.26'
collect2: error: ld returned 1 exit status
The packages installed to build pytorch can differ from the packages installed to run it when you just conda install the package. Is it possible that you need to install some of those to build things with pytorch?
I don't think so. The link errors refer to standard C++ functions. Usually that indicates a binary incompatibility of some sort, either libraries were compiled with different ABIs or different versions of libstdc++.
I was thinking that it might be trying to use your system libraries instead of the conda-forge built libraries installed via the packages appearing in the build: dependencies that don't appear in the run: dependencies.
@peastman: I was playing around with the nnpops implementation, and discovered that the error thrown here might somehow be a consequence of placing the TorchForce into a CustomCVForce as you did here. If I set interpolate=False and replace your ANIForce implementation with
class ANIForce(torch.nn.Module):

    def __init__(self, model, species, atoms):
        super(ANIForce, self).__init__()
        self.model = model
        self.species = species
        self.energyScale = torchani.units.hartree2kjoulemol(1)

        if atoms is None:
            self.indices = None
        else:
            self.indices = torch.tensor(atoms, dtype=torch.int64)

        self.model = model
        self.pbc = torch.tensor([True, True, True], dtype=torch.bool)

    def forward(self, positions, boxvectors: Optional[torch.Tensor] = None, scale: Optional[torch.Tensor] = None):
and add a scale GlobalParameter like this:
force = openmmtorch.TorchForce(filename)
force.setForceGroup(forceGroup)
if topology.getPeriodicBoxVectors() is not None:
    force.setUsesPeriodicBoundaryConditions(True)
force.addGlobalParameter('scale', 1.)
system.addForce(force)
I can manipulate the global parameter and make calls to state.getPotentialEnergy() without seeing the openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400).
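For example, something along these lines works (a minimal sketch, assuming the context was built from the modified system above):

# vary the 'scale' global parameter added above and re-evaluate the energy
for s in (0.0, 0.5, 1.0):
    context.setParameter('scale', s)
    print(s, context.getState(getEnergy=True).getPotentialEnergy())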
I'm not sure how easy it would be to find the root cause of the CustomCVForce error, but I wonder if the createMixedSystem function here might be modified to not place the TorchForce into the CustomCVForce, and just leave it as a separate force (with the scale GlobalParameter still equipped). It's a temporary workaround, but functionally, I don't think it would be any different.
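For reference, the interpolation pattern being discussed looks roughly like this (a generic sketch of the CustomCVForce approach, not the actual openmm-ml code; mm_force, torch_force, and the parameter name are illustrative):

from simtk import openmm

# interpolate between an MM contribution and the ML TorchForce by wrapping
# both as collective variables of a CustomCVForce
cv = openmm.CustomCVForce('(1 - scale)*mm_energy + scale*ml_energy')
cv.addGlobalParameter('scale', 1.0)
cv.addCollectiveVariable('mm_energy', mm_force)     # the MM terms being replaced
cv.addCollectiveVariable('ml_energy', torch_force)  # the TorchForce
system.addForce(cv)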
your thoughts?
@peastman: Since it will take a while to establish why putting a TorchForce inside a CustomCVForce throws an OpenMMException, could you make the change @dominicrufa suggests now so we can start using openmm-ml while this is being debugged?
@dominicrufa could you post the output of conda list in your environment? Also, what are CUDA_SDK_ROOT_DIR and CUDA_TOOLKIT_ROOT_DIR set to in CMake?
@peastman, my conda list is in this comment.
Also, what are CUDA_SDK_ROOT_DIR and CUDA_TOOLKIT_ROOT_DIR set to in CMake?
If you are referring to my omm installation, I am using a nightly build from conda-forge; I'm not building from source.
I'm referring to the OpenMM-Torch plugin. Do you build it from source or install with conda?
conda. Everything is installed with conda. openmmtorch will pin the conda-forge release of openmm. Once everything but omm is installed, you have to force install the omnia-dev version of omm so it plays nicely with openmm-torch.
Is there an issue with the build environments of openmm from omnia-dev not being fully matched with the conda-forge build infrastructure? Or do we think this issue is independent of build version incompatibilities?
@jchodera: conda's openmmtorch requires omm 7.7 (it pins the latest conda-forge version), but it is not compatible with the latest master release; we need omnia's dev version to operate it. This isn't a blocker at present.
@peastman: I can make a PR to fix the issue/make the test pass with nnpops equipped/unequipped if you are blocked on the installation part. Let me know.
What's the best way to proceed here? Sounds like we need another bugfix release of OpenMM?
@mikemhenry : Would it be super difficult to add recipes for these to https://github.com/omnia-md/conda-dev-recipes ? This might be an alternative to additional bugfix releases as we work out the kinks here.
I finally managed to get it to compile. The lessons I learned:
- Don't install the compilers package. It breaks all sorts of things, and it's impossible to uninstall; you have to delete the whole environment and recreate it from scratch.
- Don't install PyTorch from conda-forge. It's broken. Instead download libtorch from the PyTorch website.
I'll see if I can figure out what's going on with the CustomCVForce.
Don't install PyTorch from conda-forge. It's broken.
@mikemhenry : Can you address this with the feedstock maintainers?
What's the best way to proceed here? Sounds like we need another bugfix release of OpenMM?
@mikemhenry : Would it be super difficult to add recipes for these to https://github.com/omnia-md/conda-dev-recipes ? This might be an alternative to additional bugfix releases as we work out the kinks here.
I'm not exactly sure what you mean. Do you mean also getting builds of packages like openmm-ml and nnpops built there, landing on omnia? Or something else?
Don't install PyTorch from conda-forge. It's broken.
@mikemhenry : Can you address this with the feedstock maintainers?
:upside_down_face: A few weeks ago I re-built the entire pytorch-gpu recipe (https://github.com/conda-forge/pytorch-cpu-feedstock/pull/89#issuecomment-1042665168) and got it uploaded to conda-forge, so I am working on getting the pytorch ecosystem on conda-forge fixed. It's kinda complicated why it was broken, but we are slowly getting things rebuilt and working again. A handful of packages have been re-built, so it might be working now depending on what has been rebuilt.
Still no success at creating a working environment. Here are some things I've tried so far.
- … the torch package. To get that, I need to install something else.
- The version in the pytorch channel is compiled with the old pre-C++11 ABI. I can't build plugins to use it.
- … a libc10_cuda.so that isn't present in the conda-forge package.
- There are packages called pytorch, pytorch-gpu, and pytorch-cpu. What is the relationship between them? Is it harmful to have all of them installed at once? The torchani package lists pytorch-cpu as a dependency, so that gets installed whether I want it or not.
@peastman
Don't install PyTorch from conda-forge. It's broken. Instead download libtorch from the PyTorch website.
what is wrong with pytorch on conda-forge?
@peastman: I didn't run into this trouble a few weeks ago conda-installing everything. Perhaps versions on conda-forge changed since? Have you tried creating a yaml from my conda-list text file here and making an env from that?
I need to be able to build OpenMM and plugins from source. Otherwise, I can't debug and fix the problem. Just installing conda packages isn't an option.
@peastman, if building all of these packages to debug the CustomCVForce issue is the blocker, might it be easier in the interim to just modify the createMixedSystem function so that the TorchForce is outside the CustomCVForce, like I mentioned here? That only requires conda-installing everything rather than building from source.
That's possible, but I think it would be a less clean implementation. Since the compilation errors are a showstopper in any case, I'd prefer to focus on them. Once that's figured out and we understand the cause of the CUDA error, we can change the implementation if absolutely necessary. But hopefully the fix will turn out to be something much simpler.
I finally figured it out! It turned out to be the same problem as in https://github.com/openmm/openmm/pull/3520: I needed to update to a newer version of the package containing glibc. Having done that, I can compile OpenMM-Torch against the PyTorch package installed from conda-forge.
Moving on to the next problem, which is that I can't compile NNPOps.
Success! I can finally compile all the libraries and reproduce the problem above!
@peastman, I'd be curious to hear what the underlying issue is re: not being able to put a TorchForce into a CustomCVForce, if/when you are able to figure it out.
At the moment I'm kind of confused. The error I get is similar to yours but in a different place:
File "/home/peastman/miniconda3/envs/cf/lib/python3.9/site-packages/torchani/aev.py", line 114, in compute_shifts
inv_distances = reciprocal_cell.norm(2, -1)
num_repeats = torch.ceil(cutoff * inv_distances).to(torch.long)
num_repeats = torch.where(pbc, num_repeats, num_repeats.new_zeros(()))
~~~~~~~~~~~ <--- HERE
r1 = torch.arange(1, num_repeats[0].item() + 1, device=cell.device)
r2 = torch.arange(1, num_repeats[1].item() + 1, device=cell.device)
RuntimeError: Expected condition, x and y to be on the same device, but condition is on cpu and x and y are on cuda:0 and cuda:0 respectively
The problem is that pbc in that line is on the CPU. That's the tensor that gets created in the constructor of the module (the self.pbc = torch.tensor([True, True, True], dtype=torch.bool) line shown earlier).
That shouldn't be a problem. In the C++ code, we tell it to move the module to the GPU, which ought to move all tensors stored in fields of the module to the GPU.
const torch::Device device(torch::kCUDA, cu.getDeviceIndex()); // This implicitly initialize PyTorch
module.to(device);
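One thing worth noting (a small eager-PyTorch sketch; whether the TorchScript/C++ path behaves identically is part of what's in question here) is that Module.to() only moves parameters and registered buffers, not plain tensor attributes:

import torch

class Example(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # plain attribute: NOT moved by Module.to()
        self.pbc_attr = torch.tensor([True, True, True], dtype=torch.bool)
        # registered buffer: moved by Module.to() and serialized with the module
        self.register_buffer('pbc_buf', torch.tensor([True, True, True], dtype=torch.bool))

m = Example().to('cuda')
print(m.pbc_attr.device)  # cpu
print(m.pbc_buf.device)   # cuda:0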
I can make this error go away by explicitly setting the device in the Python code where the tensor is created. But then I get a similar error in a different place:
File "/home/peastman/miniconda3/envs/cf/lib/python3.9/site-packages/torchani/aev.py", line 150, in neighbor_pairs
shifts (:class:`torch.Tensor`): tensor of shape (?, 3) storing shifts
"""
coordinates = coordinates.detach().masked_fill(padding_mask.unsqueeze(-1), math.nan)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
cell = cell.detach()
num_atoms = padding_mask.shape[1]
RuntimeError: expected self and mask to be on the same device, but got mask on cpu and self on cuda:0
It appears that padding_mask is on the CPU. It gets created inside torchani when invoking neighbor_pairs():
atom_index12, shifts = neighbor_pairs(species == -1, coordinates_, cell, shifts, Rcr)
It's the first argument, so it seems that species must be on the GPU. I tried moving it to the GPU in the Python code, but that doesn't make any difference.
And of course, none of this has anything to do with whether it's inside a CustomCVForce.
@peastman, to reproduce the problem exactly, you'll have to use this script to incorporate nnpops. I mentioned that in the issue, but it might have gotten lost in the thread. If that doesn't solve your problem, I suspect something else is wrong.
@peastman, have you been able to reproduce the problem?
It looks to me like this may involve a bug in PyTorch. It seems to be messing up the CUDA context. Immediately before and after we call module.forward() at https://github.com/openmm/openmm-torch/blob/84f7d884ec0d9d72a57a769046bdddd1d62b8fc2/platforms/cuda/src/CudaTorchKernels.cpp#L119, I check to see what context is current with
cuCtxGetCurrent(&ctx);
printf("%d\n", ctx);
From this I can see that PyTorch is not restoring the context correctly.
@raimis do you have any ideas about how to handle this? I've tried various ways of restoring the context. They fix the CUDA error coming from OpenMM code, but then lead to CUDA errors in PyTorch code.
@peastman, is this specifically what is throwing the openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400) error? Would I still be seeing this if the context were being restored correctly?
Correct. If I manually restore the context, the error goes away. But if I then follow with a second energy evaluation, we get a CUDA error inside PyTorch.
@peastman, if it is indeed a PyTorch bug, would it make more sense to use this hack in the meantime, since the time horizon for the PyTorch bugfix is unknown? I only say this because this issue is blocking for me. If you'd prefer to avoid the hack, I'll open a PR fixing the problem with the hack (for reference's sake) as a temporary workaround that I can integrate into my downstream workflow.
It looks to me like this may involve a bug in PyTorch. It seems to be messing up the CUDA context.
Perhaps the NVIDIA folks like @dmclark17 might be able to help us here since it involves a few community codes?
Sure—I can do some investigating and try to reproduce on my end.
I've tried various ways of restoring the context. They fix the CUDA error coming from OpenMM code, but then lead to CUDA errors in PyTorch code.
I'm still getting up to speed on how contexts are being handled here—have you tried popping the current context before the PyTorch code and then pushing it afterwards?
I'm not too familiar with torch tracebacks, but it seems like Torch isn't robust to the placement of arrays onto different devices. @peastman, any idea what is going wrong here? Or perhaps @raimis knows what is wrong.
Alternatively, if I try to run this without GPUs, it throws a runtime error:
Do we generally want to make this package robust to the platform type, or only to CUDA?