Closed dominicrufa closed 2 years ago
Contexts are handled by the ContextSelector class. It pushes the context in its constructor and pops it in the destructor. To use it, you create an instance as a local variable. The context is current from that line to the end of the enclosing block.
Here is the method where the problem occurs.
There are ContextSelectors that set the context for two short blocks, one in lines 97-101 and another in lines 145-150. The method does not set a context at the point where the PyTorch model is invoked (line 114 or 119), and usually that works.
But it fails when the TorchForce is inside a CustomCVForce. In that case, this whole method is called from https://github.com/openmm/openmm/blob/c7af17c8ba2b6c3667e5126b494d1972b1b6d254/platforms/common/src/CommonKernels.cpp#L5389. The invoking method has already placed a context onto the stack, and PyTorch removes it.
This does suggest a workaround: we could modify the implementation of CustomCVForce so that no context is set when it calls calcForcesAndEnergy(). That might work as long as nothing at an even higher level has set a context. But of course, the whole point of having a stack of contexts is so that you don't have to worry about that.
The workaround is in https://github.com/openmm/openmm/pull/3533.
Thanks for the explanation!
I'm trying to create a standalone reproducer to make sure I understand and can communicate the issue. I am loading in a simple model that multiplies an input tensor by two. I created it using the following:
```python
import torch

class TestModule(torch.nn.Module):
    def forward(self, input):
        return 2 * input

module = torch.jit.script(TestModule())
module.save('model.pt')
```
The C++ code looks like this:
```cpp
#include <cuda.h>
#include <torch/torch.h>
#include <torch/script.h>
#include <stdio.h>

void printContext(const char *msg) {
    CUcontext context;
    CUresult res = cuCtxGetCurrent(&context);
    printf("Context %d. Code %d. %s\n", context, res, msg);
}

int main() {
    cuInit(0);
    CUcontext ctx, myContext;
    CUdevice dev;
    CUresult res;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, CU_CTX_SCHED_SPIN, dev);
    printContext("After creation");
    torch::jit::script::Module module = torch::jit::load("../model.pt");
    module.to(at::kCUDA);
    printContext("After loading torchscript");
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1,}).to(at::kCUDA));
    at::Tensor output = module.forward(inputs).toTensor();
    printContext("After run");
}
```
I am seeing the following output:
```
Context 1471272896. Code 0. After creation
Context 1471272896. Code 0. After loading torchscript
Context 1471272896. Code 0. After run
```
In this case, it doesn't seem like PyTorch is changing the context. On the other hand, if there isn't a current context when the JIT module is executed, PyTorch seems to create a new context and leave it on the stack. This doesn't match the behavior observed with OpenMM-Torch. Do you have any ideas on how to make the example more realistic? Thanks!
If you move the lines that load the module up to the top of main(), you can reproduce the problem. That matches what happens in OpenMM: the module gets loaded while creating the System, and cuInit() gets called later when you create the Context. The following version also adds a call to cuCtxPushCurrent() to match even more closely what happens in the real code.
```cpp
int main() {
    torch::jit::script::Module module = torch::jit::load("../model.pt");
    module.to(at::kCUDA);
    cuInit(0);
    CUcontext ctx, myContext;
    CUdevice dev;
    CUresult res;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, CU_CTX_SCHED_SPIN, dev);
    printContext("After creation");
    cuCtxPushCurrent(ctx);
    printContext("After push");
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1,}).to(at::kCUDA));
    at::Tensor output = module.forward(inputs).toTensor();
    printContext("After run");
}
```
Here is the output I get.
```
Context 319176160. Code 0. After creation
Context 319176160. Code 0. After push
Context 319165568. Code 0. After run
```
Interesting. I am not able to reproduce that on my end; with that ordering, I am seeing:
```
Context 255. Code 3. After loading torchscript. Expected error code 3 for not initialized
Context 1472714976. Code 0. After creation
Context 1472714976. Code 0. After push
Context 1472714976. Code 0. After run
```
I think I'm using the CUDA toolkit that conda installed with PyTorch; I'm not sure whether that could be causing the difference.
What version of PyTorch do you have? I was testing with 1.9.1.
I am using 1.11.0 and linking to the libtorch that comes with the conda installation. I will try using 1.9.1!
@peastman, can you merge the workaround into openmm's main, or are we anticipating a PyTorch bug fix?
Merged. We should still figure out what's going on with PyTorch, but it should fix the immediate problem.
What version of PyTorch were you using when you encountered the problem?
```
pytorch     1.10.0 cuda112py39h3ad47f5_1 conda-forge
pytorch-gpu 1.10.0 cuda112py39h0bbbad9_1 conda-forge
```
@peastman, were you able to see the problem with nnpops equipped, specifically? If so, would you be able to push your modifications and commit them to main of this repo? Otherwise, I can do it if you can review it afterward.
> were you able to see the problem with nnpops equipped, specifically?
Yes.
> would you be able to push your modifications and commit to main of this repo?
I didn't make any changes to code in this repo.
@peastman, which pull request did you use to reproduce the problem?
The one you said to use, #21.
Right, yes, sorry for the confusion. I think it just needs to be rebased against main and merged, so that the TorchANI force can be equipped with nnpops. But I don't have write permissions to that PR. I can pull it into my PR and rebase/request a merge into main if you'd prefer.
> What version of PyTorch do you have? I was testing with 1.9.1.
I am able to reproduce the issue with 1.9.0:
```
Context 255. Code 3. After loading torchscript. Expected error code 3 for not initialized
Context 1470519632. Code 0. After creation
Context 1470519632. Code 0. After push
Context 1470509040. Code 0. After run
```
I am not seeing anything about CUDA contexts in the 1.11.0 release notes.
I've been looking into the difference between PyTorch 1.9 and 1.11, and it seems like 1.9 calls cudaSetDevice(0) when the JIT module is invoked, which initializes the primary context. However, this API call is absent in 1.11, which explains why the issue doesn't reproduce in the standalone example. I'll see if I can find the responsible code change.
Would it be possible to try to reproduce the original bug with PyTorch 1.11 to see if it is fixed? I need to use #21 to reproduce, correct?
@dominicrufa : Was this fixed?
Closing, as this is fixed in main.
I'm not too familiar with torch tracebacks, but it seems like Torch isn't robust to the placement of arrays onto different devices:

@peastman, any idea what is going wrong here? Or perhaps @raimis knows what is wrong.
Alternatively, if I try to run this without GPUs, it throws a runtime error:
Do we generally want to make this package robust to all platform types, or only to CUDA?