open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0
5.29k stars 1.54k forks

Error when training VoteNet on ScanNet with --gpu-id 3 #2471

Closed jumptiger66 closed 1 year ago

jumptiger66 commented 1 year ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

Training VoteNet (ICCV 2019) on ScanNet or SUN RGB-D runs without problems on the default GPU 0. When I select a different single GPU (1, 2, or 3) with --gpu-id, training fails with a device-related error.

Reproduction

  1. What command or script did you run?
     python tools/train.py '/home/sunhao/code2023/mmdetection3d/configs/votenet/votenet_16x8_sunrgbd-3d-10class.py' --gpu-id 3

  2. Did you make any modifications on the code or config? Did you understand what you have modified?
     No modifications.

  3. What dataset did you use?
     ScanNet/SunRGBD

Environment

  1. Please run python mmdet3d/utils/collect_env.py to collect necessary environment information and paste it here.
     sys.platform: linux
     Python: 3.7.16 (default, Jan 17 2023, 22:20:44) [GCC 11.2.0]
     CUDA available: True
     GPU 0,1,2,3: NVIDIA GeForce RTX 3090
     CUDA_HOME: /usr/local/cuda-11.5
     NVCC: Cuda compilation tools, release 11.5, V11.5.50
     GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
     PyTorch: 1.12.1
     PyTorch compiling details: PyTorch built with:
    • GCC 9.3
    • C++ Version: 201402
    • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
    • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
    • OpenMP 201511 (a.k.a. OpenMP 4.5)
    • LAPACK is enabled (usually provided by MKL)
    • NNPACK is enabled
    • CPU capability usage: AVX2
    • CUDA Runtime 11.3
    • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
    • CuDNN 8.3.2 (built against CUDA 11.5)
    • Magma 2.5.2
    • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

     TorchVision: 0.13.1
     OpenCV: 4.7.0
     MMCV: 1.6.0
     MMCV Compiler: GCC 9.3
     MMCV CUDA Compiler: 11.3
     MMDetection: 2.24.1
     MMSegmentation: 0.24.1
     MMDetection3D: 1.0.0rc3+120a93d
     spconv2.0: False

  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source] Conda
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

Traceback (most recent call last):
  File "/home/sunhao/code2023/mmdetection3d/tools/train.py", line 284, in <module>
    main()
  File "/home/sunhao/code2023/mmdetection3d/tools/train.py", line 280, in main
    meta=meta)
  File "/home/sunhao/code2023/mmdetection3d/mmdet3d/apis/train.py", line 351, in train_model
    meta=meta)
  File "/home/sunhao/code2023/mmdetection3d/mmdet3d/apis/train.py", line 319, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 32, in run_iter
    **kwargs)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/home/sunhao/code2023/mmdetection3d/mmdet3d/models/detectors/base.py", line 60, in forward
    return self.forward_train(**kwargs)
  File "/home/sunhao/code2023/mmdetection3d/mmdet3d/models/detectors/votenet.py", line 59, in forward_train
    x = self.extract_feat(points_cat)
  File "/home/sunhao/code2023/mmdetection3d/mmdet3d/models/detectors/single_stage.py", line 61, in extract_feat
    x = self.backbone(points)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/home/sunhao/code2023/mmdetection3d/mmdet3d/models/backbones/pointnet2_sa_ssg.py", line 119, in forward
    sa_xyz[i], sa_features[i])
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunhao/code2023/mmdetection3d/mmdet3d/ops/pointnet_modules/point_sa_module.py", line 200, in forward
    target_xyz)
  File "/home/sunhao/code2023/mmdetection3d/mmdet3d/ops/pointnet_modules/point_sa_module.py", line 138, in _sample_points
    indices = self.points_sampler(points_xyz, features)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 205, in new_func
    return old_func(*args, **kwargs)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/ops/points_sampler.py", line 127, in forward
    npoint)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/ops/points_sampler.py", line 144, in forward
    fps_idx = furthest_point_sample(points.contiguous(), npoint)
  File "/home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/ops/furthest_point_sample.py", line 39, in forward
    m=num_points,
RuntimeError: furthest_point_sampling_forward_impl: at param 1, inconsistent device: cuda:0 vs cuda:3

Exception raised from Dispatch at /tmp/mmcv/mmcv/ops/csrc/common/pytorch_device_registry.hpp:116 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f06af200497 in /home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f06af1d7c94 in /home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: auto Dispatch<DeviceRegistry<void (*)(at::Tensor, at::Tensor, at::Tensor, int, int, int), &(furthest_point_sampling_forward_impl(at::Tensor, at::Tensor, at::Tensor, int, int, int))>, at::Tensor&, at::Tensor&, at::Tensor&, int&, int&, int&>(DeviceRegistry<void (*)(at::Tensor, at::Tensor, at::Tensor, int, int, int), &(furthest_point_sampling_forward_impl(at::Tensor, at::Tensor, at::Tensor, int, int, int))> const&, char const*, at::Tensor&, at::Tensor&, at::Tensor&, int&, int&, int&) + 0x385 (0x7f05ef8c13b5 in /home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #3: furthest_point_sampling_forward_impl(at::Tensor, at::Tensor, at::Tensor, int, int, int) + 0x62 (0x7f05ef8c0c32 in /home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #4: furthest_point_sampling_forward(at::Tensor, at::Tensor, at::Tensor, int, int, int) + 0x69 (0x7f05ef8c0cd9 in /home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x2c6e84 (0x7f05ef91ce84 in /home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x2b4ba1 (0x7f05ef90aba1 in /home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/mmcv/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #7: _PyMethodDef_RawFastCallKeywords + 0x301 (0x4af061 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #8: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x4ae9f0]
frame #9: _PyEval_EvalFrameDefault + 0x15d6 (0x4a82b6 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #10: _PyFunction_FastCallDict + 0x116 (0x4c0d96 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #11: THPFunction_apply(_object*, _object*) + 0x5d6 (0x7f06ef136c96 in /home/sunhao/anaconda3/envs/tr3d/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: _PyMethodDef_RawFastCallKeywords + 0x1fb (0x4aef5b in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #13: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x4ae9f0]
frame #14: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #15: _PyFunction_FastCallDict + 0x116 (0x4c0d96 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #16: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x4c9a80]
frame #17: PyObject_Call + 0x60 (0x4c7170 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #18: _PyEval_EvalFrameDefault + 0x1ea4 (0x4a8b84 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #19: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #20: _PyFunction_FastCallDict + 0x2d7 (0x4c0f57 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #21: _PyObject_Call_Prepend + 0x6e (0x4c6abe in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #22: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x578b67]
frame #23: _PyObject_FastCallKeywords + 0x430 (0x4b77a0 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #24: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x4aea19]
frame #25: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #26: _PyFunction_FastCallDict + 0x116 (0x4c0d96 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #27: _PyEval_EvalFrameDefault + 0x1ea4 (0x4a8b84 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #28: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #29: _PyFunction_FastCallDict + 0x2d7 (0x4c0f57 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #30: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x4c9a80]
frame #31: PyObject_Call + 0x60 (0x4c7170 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #32: _PyEval_EvalFrameDefault + 0x1ea4 (0x4a8b84 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #33: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #34: _PyFunction_FastCallDict + 0x2d7 (0x4c0f57 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #35: _PyObject_Call_Prepend + 0x6e (0x4c6abe in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #36: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x578b67]
frame #37: _PyObject_FastCallKeywords + 0x430 (0x4b77a0 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #38: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x4aea19]
frame #39: _PyEval_EvalFrameDefault + 0x468a (0x4ab36a in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #40: _PyFunction_FastCallKeywords + 0x106 (0x4b9d16 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #41: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x4ae8df]
frame #42: _PyEval_EvalFrameDefault + 0x468a (0x4ab36a in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #43: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #44: _PyFunction_FastCallDict + 0x2d7 (0x4c0f57 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #45: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x4c9a80]
frame #46: PyObject_Call + 0x60 (0x4c7170 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #47: _PyEval_EvalFrameDefault + 0x1ea4 (0x4a8b84 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #48: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #49: _PyFunction_FastCallDict + 0x2d7 (0x4c0f57 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #50: _PyObject_Call_Prepend + 0x6e (0x4c6abe in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #51: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x578b67]
frame #52: _PyObject_FastCallKeywords + 0x430 (0x4b77a0 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #53: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x4aea19]
frame #54: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #55: _PyFunction_FastCallDict + 0x116 (0x4c0d96 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #56: _PyEval_EvalFrameDefault + 0x1ea4 (0x4a8b84 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #57: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #58: _PyFunction_FastCallDict + 0x2d7 (0x4c0f57 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #59: /home/sunhao/anaconda3/envs/tr3d/bin/python3.7() [0x4c9a80]
frame #60: PyObject_Call + 0x60 (0x4c7170 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #61: _PyEval_EvalFrameDefault + 0x1ea4 (0x4a8b84 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #62: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)
frame #63: _PyFunction_FastCallDict + 0x2d7 (0x4c0f57 in /home/sunhao/anaconda3/envs/tr3d/bin/python3.7)


Bug fix

jumptiger66 commented 1 year ago

I switched to the latest mmdetection3d version (1.1.0), and the bug is resolved.
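For readers who cannot upgrade, one commonly used workaround for this kind of cuda:0 vs cuda:3 mismatch is to expose only the wanted GPU to the process through the CUDA_VISIBLE_DEVICES environment variable, so that it appears as cuda:0. This was not tried or confirmed in this thread. The sketch below only prepares the environment and command; the helper names build_env and build_command are hypothetical, and it assumes tools/train.py uses GPU 0 when --gpu-id is omitted, as the report describes.

```python
import os
import sys


def build_env(physical_gpu, base_env=None):
    """Return a copy of the environment that exposes a single physical GPU.

    Hypothetical helper for illustration; not part of mmdetection3d.
    """
    env = dict(os.environ if base_env is None else base_env)
    # With only one device visible, the child process addresses it as cuda:0.
    env["CUDA_VISIBLE_DEVICES"] = str(physical_gpu)
    return env


def build_command(config_path):
    """Build the training command without --gpu-id.

    Assumption: the script then falls back to its default device 0,
    which is the only GPU the child process can see.
    """
    return [sys.executable, "tools/train.py", config_path]
```

To launch training on physical GPU 3 from the repository root, the two helpers could be combined as subprocess.run(build_command(config_path), env=build_env(3), check=True). The shell equivalent is to prefix the original command with CUDA_VISIBLE_DEVICES=3 and drop the --gpu-id argument.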