mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
https://torchsparse.mit.edu
MIT License

[BUG] CUDA memory error when batch_size is 8 #303

Closed hua0x522 closed 3 months ago

hua0x522 commented 7 months ago

Is there an existing issue for this?

Current Behavior

I tried to run evaluate.py from the TorchSparse++ AE, which accepts a 'batch_size' flag. If I set batch_size >= 8, it reports "CUDA error: an illegal memory access was encountered". With batch_size from 1 to 6, it runs normally. The error log is:

Traceback (most recent call last):
  File "evaluate.py", line 333, in <module>
    main()
  File "evaluate.py", line 250, in main
    _ = model(inputs["pts_input"])
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangxuezhu/torchsparse/evaluation/core/models/segmentation_models/minkunet.py", line 104, in forward
    x1 = self.stage1(x0)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangxuezhu/torchsparse/evaluation/core/models/modules/layers_3d.py", line 42, in forward
    out = self.net(x)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangxuezhu/code/torchsparse/torchsparse/nn/modules/activation.py", line 11, in forward
    return fapply(input, super().forward)
  File "/home/wangxuezhu/code/torchsparse/torchsparse/nn/utils/apply.py", line 13, in fapply
    feats = fn(input.feats, *args, **kwargs)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 101, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/wangxuezhu/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/functional.py", line 1469, in relu
    result = torch.relu_(input)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
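
A note on reading this log: CUDA kernels are launched asynchronously, so the Python frame that reports an illegal memory access (here `torch.relu_`) is not necessarily the operation that caused it. A minimal sketch for re-running the script with synchronous kernel launches follows. `CUDA_LAUNCH_BLOCKING` is the standard CUDA environment variable; the helper names `blocking_env` and `run_blocking` are hypothetical, and the exact arguments for evaluate.py are not shown in this thread, so they are left out.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # With this set, each kernel launch blocks until it finishes, so an
    # error surfaces at the call that triggered it instead of a later one.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(cmd):
    """Run `cmd` (a list of arguments) in a child process and return its exit code."""
    return subprocess.run(cmd, env=blocking_env(), check=False).returncode


# Hypothetical usage; append whatever arguments evaluate.py expects:
# run_blocking([sys.executable, "evaluate.py"])
```

Synchronous launches slow the run down, so this is only meant for locating the failing call, not for benchmarking.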

Expected Behavior

No response

Environment

- GCC:11.4.0
- NVCC:12.3
- PyTorch:2.1.2
- PyTorch CUDA:12.1
- TorchSparse:2.1.0

Anything else?

No response

zhijian-liu commented 6 months ago

@ys-2020, could you please take a look at this issue when you have time? Thanks!

ybc-ybc commented 5 months ago

> @ys-2020, could you please take a look at this issue when you have time? Thanks!

Hello, the .whl file on the server is still unreachable.

francotheengineer commented 4 months ago

There's no bug here. I think your batch size is too large for your GPU VRAM. Reduce the batch size so your model can fit in memory.
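
One way to check this suggestion is to run each batch size in a fresh process and look at which error text appears: PyTorch normally reports memory exhaustion with a message containing "CUDA out of memory", which differs from the "illegal memory access" text in the log above. The sketch below is only an illustration under stated assumptions. `classify`, `sweep`, and the `make_cmd` callable are hypothetical names, and `make_cmd` must build the evaluate.py command line, whose flag spelling is not shown in this thread.

```python
import subprocess
import sys


def classify(stderr_text):
    """Label a failed run from the text it wrote to stderr."""
    text = stderr_text.lower()
    if "out of memory" in text:
        return "oom"
    if "illegal memory access" in text:
        return "illegal_access"
    return "other_error"


def sweep(make_cmd, batch_sizes):
    """Run every batch size in its own process and record the outcome.

    A separate process per run matters because a CUDA context is unusable
    after an illegal memory access, so later runs in the same process
    would fail regardless of batch size.
    """
    results = {}
    for bs in batch_sizes:
        proc = subprocess.run(make_cmd(bs), capture_output=True, text=True)
        results[bs] = "ok" if proc.returncode == 0 else classify(proc.stderr)
    return results


# Hypothetical usage; make_cmd must add the real batch-size argument:
# sweep(lambda bs: [sys.executable, "evaluate.py"], range(1, 9))
```

A sweep like this would also cover batch_size 7, which the original report does not mention.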