pytorch / extension-cpp

C++ extensions in PyTorch

The same code gets different results on different GPU devices #29

Closed. wujiaju closed this issue 5 years ago.

wujiaju commented 5 years ago

Hello.

I wrote my test code as follows:

test.py

import torch
import test_cuda  # the compiled CUDA extension module

a = torch.zeros(1, dtype=torch.int)
a = a.cuda(0)
x = test_cuda.func(a)
print(x)

cuda.cpp

#include <torch/torch.h>

void func_wrapper(int* a);  // defined in cuda_kernel.cu

at::Tensor func(at::Tensor a)
{
    func_wrapper(a.data<int>());  // pass the raw CUDA device pointer
    return a;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("func", &func, "func");
}

cuda_kernel.cu

#include <ATen/ATen.h>
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void func_kernel(int* __restrict__ a)
{
    a[0] = 4;  // write a sentinel value into the first element
}

void func_wrapper(int* a)
{
    // Launches with no device or stream specified, i.e. on the
    // current device's default stream.
    func_kernel<<<1, 1>>>(a);
}

When I used a = a.cuda(0) in test.py, I got the expected result:

tensor([4], device='cuda:0', dtype=torch.int32)

But when I used a = a.cuda(3) (I have multiple GPUs), I got:

tensor([0], device='cuda:3', dtype=torch.int32)

The result tensor was tensor([0]). Why?

Thanks a lot.


soumith commented 5 years ago

The func_kernel<<<1,1>>> launch needs to take the current CUDA stream as an extra launch argument (the stream is the fourth parameter of the <<<...>>> configuration, after the dynamic shared-memory size, so the launch becomes <<<1, 1, 0, stream>>>). Also, you need to switch the current device to device 3 in your CUDA code. Otherwise the kernel can launch before the tensor has finished copying to GPU-3, and you'll launch on the wrong GPU.

Look at at::DeviceGuard and at::cuda::getCurrentCUDAStream();
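For reference, a minimal sketch of what the fix might look like. The two-argument func_wrapper signature here is an assumption for illustration, not code from this repo; the guard and stream calls are the ATen APIs named above.

cuda.cpp

#include <torch/torch.h>
#include <ATen/DeviceGuard.h>       // at::DeviceGuard
#include <ATen/cuda/CUDAContext.h>  // at::cuda::getCurrentCUDAStream

// Assumed extended signature: the launcher now receives the stream.
void func_wrapper(int* a, cudaStream_t stream);

at::Tensor func(at::Tensor a)
{
    // Make the tensor's device the current device for this scope, so the
    // kernel launches on the GPU that actually holds the data.
    at::DeviceGuard guard(a.device());
    func_wrapper(a.data<int>(), at::cuda::getCurrentCUDAStream());
    return a;
}

cuda_kernel.cu

void func_wrapper(int* a, cudaStream_t stream)
{
    // Third launch argument: dynamic shared-memory size (0 here).
    // Fourth launch argument: the stream to launch on.
    func_kernel<<<1, 1, 0, stream>>>(a);
}

With the guard in place, at::cuda::getCurrentCUDAStream() returns the stream for the tensor's own device, so the launch runs on the right GPU and is ordered after PyTorch's pending work on that stream (including the copy from a.cuda(3)).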