mit-han-lab / pvcnn

[NeurIPS 2019, Spotlight] Point-Voxel CNN for Efficient 3D Deep Learning
https://pvcnn.mit.edu/
MIT License

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR #42

Closed: cjyiiiing closed this issue 3 years ago

cjyiiiing commented 3 years ago

When I train pvcnn, everything works fine. But when I train pvcnn2, the following error occurs:

==> training epoch 0/50
train:   0% 0/3576 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 326, in <module>
    main()
  File "train.py", line 286, in main
    current_step=current_step, writer=writer)
  File "train.py", line 150, in train
    loss.backward()
  File "/home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR (findAlgorithms at /opt/conda/conda-bld/pytorch_1587428266983/work/aten/src/ATen/native/cudnn/Conv.cpp:623)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f0688842b5e in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xd73755 (0x7f06897e2755 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd662fa (0x7f06897d52fa in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xd676df (0x7f06897d66df in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6b330 (0x7f06897da330 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution_backward_weight(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) + 0x49 (0x7f06897da589 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdd1ee0 (0x7f0689840ee0 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xe16158 (0x7f0689885158 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x2fc (0x7f06897db23c in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0xdd1beb (0x7f0689840beb in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xe161b4 (0x7f06898851b4 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x29defc6 (0x7f06b6012fc6 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2a2ea54 (0x7f06b6062a54 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x378 (0x7f06b5c2af28 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2ae8215 (0x7f06b611c215 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f06b6119513 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f06b611a2f2 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f06b6112969 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f06b9459558 in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0xc819d (0x7f06bbec419d in /home/fwq3/.conda/envs/pvcnn/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #20: <unknown function> + 0x7fa3 (0x7f06d5d6bfa3 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: clone + 0x3f (0x7f06d5c9c4cf in /lib/x86_64-linux-gnu/libc.so.6)

I'm really confused by this and haven't found a solution. Can you help me?

zhijian-liu commented 3 years ago

@kentangSJTU, could you take a look at this issue? In the meantime, @cjyiiiing, could you please provide more detailed environment information (e.g., CUDA version, PyTorch version)?
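For anyone landing on this thread with the same error, the version details requested above can be gathered with a few lines of Python. This is a minimal sketch rather than anything taken from the pvcnn repository: the helper name `collect_environment` is hypothetical, and it uses only standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.backends.cudnn.version()`).

```python
import platform


def collect_environment():
    """Return a dict of version details commonly requested in bug reports.

    `collect_environment` is a hypothetical helper name, not part of pvcnn.
    """
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    # cuDNN version as an integer, or None if cuDNN is unavailable.
    info["cudnn"] = torch.backends.cudnn.version()
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

PyTorch also ships a more thorough report via `python -m torch.utils.collect_env`, whose output can be pasted into an issue directly.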

cjyiiiing commented 3 years ago

I tried training pvcnn2 again this morning and, to my surprise, the error no longer occurs, even though I changed neither the code nor the environment. It's strange.

My environment:

zhijian-liu commented 3 years ago

That's interesting. I'm closing the issue for now. Feel free to reopen it if you encounter a similar issue in the future.
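Since the cause was never identified in this thread, the following is only a diagnostic sketch for readers who see the error come back. It assumes, without confirmation from the thread, that toggling cuDNN's algorithm search or disabling cuDNN altogether might help isolate where the failure originates. The helper name `configure_cudnn` is hypothetical; `torch.backends.cudnn.enabled` and `torch.backends.cudnn.benchmark` are standard PyTorch flags.

```python
def configure_cudnn(enabled=True, benchmark=False):
    """Set cuDNN flags before training and return the values now in effect.

    Hypothetical helper for narrowing down an intermittent cuDNN error;
    it is not part of pvcnn and is not a confirmed fix for this issue.
    Returns None if PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return None

    # enabled=False makes PyTorch fall back to its non-cuDNN kernels.
    torch.backends.cudnn.enabled = enabled
    # benchmark=False turns off cuDNN's timing-based algorithm search.
    torch.backends.cudnn.benchmark = benchmark
    return {
        "enabled": torch.backends.cudnn.enabled,
        "benchmark": torch.backends.cudnn.benchmark,
    }
```

Calling `configure_cudnn(enabled=False)` at the top of a training script and rerunning is one way to check whether the error is specific to the cuDNN code path. Training will likely be slower with cuDNN disabled, so this is meant for debugging only.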