mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
https://torchsparse.mit.edu
MIT License

[BUG] Illegal Memory Access Error #269

Closed: fishbotics closed this issue 6 months ago

fishbotics commented 7 months ago


Current Behavior

I am using ResNet21D and have been getting illegal memory access errors. This started happening when I shrank my voxel size, but I'm not sure why. I'd like to share some sample data, but I've had a hard time making the problem reproducible. In this bug report, I'd like to ask two things (listed after the traceback below).

Here's the error I'm getting reliably:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 1
----> 1 self.pc_encoder_2(x)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File torchsparse/nn/modules/conv.pyx:99, in torchsparse.nn.modules.conv.Conv3d.forward()

File torchsparse/nn/functional/conv/conv.pyx:114, in torchsparse.nn.functional.conv.conv.conv3d()

File /usr/local/lib/python3.10/dist-packages/torch/autograd/function.py:506, in Function.apply(cls, *args, **kwargs)
    503 if not torch._C._are_functorch_transforms_active():
    504     # See NOTE: [functorch vjp and autograd interaction]
    505     args = _functorch.utils.unwrap_dead_wrappers(args)
--> 506     return super().apply(*args, **kwargs)  # type: ignore[misc]
    508 if cls.setup_context == _SingleLevelFunction.setup_context:
    509     raise RuntimeError(
    510         'In order to use an autograd.Function with functorch transforms '
    511         '(vmap, grad, jvp, jacrev, ...), it must override the setup_context '
    512         'staticmethod. For more details, please see '
    513         'https://pytorch.org/docs/master/notes/extending.func.html')

File torchsparse/nn/functional/conv/func/implicit_gemm.pyx:73, in torchsparse.nn.functional.conv.func.implicit_gemm.ImplicitGEMMConvolutionFuntion.forward()

RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Note that x is the output of some other TorchSparse operations. The failure happens when I pass it into a 3D convolution defined like so: pc_encoder_2 = spnn.Conv3d(16, 32, 2, stride=2, dilation=1)
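
For reference, here is a minimal sketch of the call pattern (not my actual pipeline; the SparseTensor construction, coordinate layout, and random inputs are illustrative and assume the torchsparse 2.1 API):

# Illustrative sketch only, not my real pipeline. Assumes torchsparse 2.1:
# coords are [N, 4] int32 (which column holds the batch index may differ
# between versions), feats are [N, in_channels].
import torch
import torchsparse.nn as spnn
from torchsparse import SparseTensor

coords = torch.randint(0, 64, (1000, 4), dtype=torch.int32)
coords[:, 0] = 0                      # single batch (assumed batch column)
coords = torch.unique(coords, dim=0)  # sparse tensors expect unique voxels
feats = torch.randn(coords.shape[0], 16)

x = SparseTensor(feats, coords).cuda()
pc_encoder_2 = spnn.Conv3d(16, 32, 2, stride=2, dilation=1).cuda()
out = pc_encoder_2(x)                 # the crash happens in a call like this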

I tried to make this reproducible by saving the input that's causing the crash. To do this, I used:

import pickle

with open('my_file.pkl', 'wb') as f:
    pickle.dump(x.cpu(), f)  # move to CPU so the SparseTensor can be pickled

But when I load this in another terminal, send it to the GPU, and create an instance of the model above, it doesn't crash there. However, it does crash in the original terminal (the one from which the data was saved).
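
For completeness, the load side in the fresh terminal looks roughly like this (a sketch; same file name as the dump above, same pc_encoder_2 definition as before):

import pickle
import torchsparse.nn as spnn

with open('my_file.pkl', 'rb') as f:
    x = pickle.load(f)

x = x.cuda()  # move the saved SparseTensor back to the GPU
pc_encoder_2 = spnn.Conv3d(16, 32, 2, stride=2, dilation=1).cuda()
out = pc_encoder_2(x)  # does not crash here, unlike in the original process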

So my questions here are:

1) Any suggestions for how to make this reproducible? I'd like to be able to give you all a proper repro that you could use to help debug.
2) Any ideas on how to fix this issue (without a repro)?

Expected Behavior

I don't expect this to crash.

Environment

- GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
- NVCC:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

- PyTorch: 2.0.1+cu117
- PyTorch CUDA: 11.7
- TorchSparse: 2.1.0+torch20cu117

Anything else?

I will gladly upload repro data if you can help me figure out how to save the data with the right info!

zhijian-liu commented 6 months ago

@ys-2020, could you please take a look at this issue when you have time? Thanks!

ys-2020 commented 6 months ago

Hi @fishbotics. Thank you for your interest! Could you please explain further what you mean by "it does crash in the original terminal (from which the data was saved)"? From my understanding, there shouldn't be much difference between terminals once the previous job has been stopped. (And if that is the case, the problem seems less likely to be caused by TorchSparse.)

ybc-ybc commented 6 months ago

I also ran into this problem.

If I set the kmap mode to hashmap (set_kmap_mode("hashmap")), the error disappears. Why?
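
(For reference, this is roughly the call I mean; I am assuming set_kmap_mode is exposed from torchsparse.nn.functional in 2.1.x, so adjust the import if your build differs.)

# Assumption: set_kmap_mode is importable from torchsparse.nn.functional;
# adjust the import if your TorchSparse build exposes it elsewhere.
import torchsparse.nn.functional as F

F.set_kmap_mode("hashmap")  # switch the kernel-map mode before building and running the model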

ys-2020 commented 6 months ago

Merged into issue #239.