open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0
5.33k stars 1.54k forks

Potential Bug in CenterPoint #387

Closed XuyangBai closed 3 years ago

XuyangBai commented 3 years ago

Describe the bug

Training with CenterPoint raises an error at L404: https://github.com/open-mmlab/mmdetection3d/blob/391a56b6af48f5056a769c4cd18dfac2a67c6c06/mmdet3d/models/dense_heads/centerpoint_head.py#L402-L405 The error message is

*** TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I suspect this is caused by a numpy update that changed the behavior of the np.array function. I am currently using numpy 1.20.0. I also tried an earlier version, numpy 1.19.1, but that produced a different error, like the one in https://github.com/open-mmlab/mmdetection3d/issues/301.

Currently I can solve this error by

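        # Remember the original device so the result can be moved back to it.
        device = heatmaps[0][0].device
        # np.array cannot consume CUDA tensors, so copy each one to host memory.
        heatmaps = [[y.cpu() for y in x] for x in heatmaps]
        # Swap the two outer list levels via numpy, then convert back to lists.
        heatmaps = np.array(heatmaps).transpose(1, 0).tolist()
        # Stack each inner group and return it to the original device.
        heatmaps = [torch.stack(hms_).to(device) for hms_ in heatmaps]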

but this may increase training time because it copies memory between the CPU and GPU. Could you please share your numpy and mmpycocotools versions, or suggest another solution to this problem?
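One possible way to avoid the CPU round trip is to transpose the nested list in pure Python, so the tensors are never handed to numpy at all. The sketch below is only an illustration: `transpose_nested` is a hypothetical helper, not part of mmdetection3d, and it assumes `heatmaps` is a rectangular list of lists (one outer entry per sample, one inner entry per task). Plain strings stand in for the tensors so the snippet runs without a GPU.

```python
def transpose_nested(nested):
    """Swap the two outer levels of a rectangular list of lists.

    The elements themselves are never inspected or copied, so the same
    code would work for CUDA tensors without moving them to host memory.
    (Hypothetical helper, not an mmdetection3d API.)
    """
    return [list(group) for group in zip(*nested)]


# Stand-in data: strings play the role of the heatmap tensors.
# Assumed layout: outer index = sample, inner index = task.
per_sample = [
    ["s0_t0", "s0_t1"],
    ["s1_t0", "s1_t1"],
    ["s2_t0", "s2_t1"],
]

# After the transpose: outer index = task, inner index = sample.
per_task = transpose_nested(per_sample)
print(per_task)
```

With real tensors, the last two lines of the workaround could then presumably become `heatmaps = [torch.stack(hms_) for hms_ in transpose_nested(heatmaps)]`, with no `.cpu()` or `.to(device)` calls. I have not benchmarked this against the numpy version.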

Reproduction

  1. What command or script did you run?
    ./tools/dist_train.sh configs/centerpoint/centerpoint_02pillar_second_secfpn_dcn_4x8_cyclic_20e_nus.py 4
  2. Did you make any modifications on the code or config? Did you understand what you have modified? I didn't make any modifications
  3. What dataset did you use? nuScenes

Environment

  1. Please run python mmdet3d/utils/collect_env.py to collect necessary environment information and paste it here.
    
    sys.platform: linux
    Python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) [GCC 9.3.0]
    CUDA available: True
    GPU 0,1,2,3,4,5,6,7: Tesla V100-PCIE-16GB
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 10.2, V10.2.89
    GCC: gcc (Ubuntu 5.5.0-12ubuntu1) 5.5.0 20171010
    PyTorch: 1.6.0
    PyTorch compiling details: PyTorch built with:
    - GCC 7.3
    - C++ Version: 201402
    - Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
    - Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
    - OpenMP 201511 (a.k.a. OpenMP 4.5)
    - NNPACK is enabled
    - CPU capability usage: AVX2
    - CUDA Runtime 10.2
    - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
    - CuDNN 7.6.5
    - Magma 2.5.2
    - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

    TorchVision: 0.7.0
    OpenCV: 4.5.1
    MMCV: 1.2.5
    MMCV Compiler: GCC 7.3
    MMCV CUDA Compiler: 10.2
    MMDetection: 2.10.0
    MMDetection3D: 0.11.0+391a56b


  2. You may add additional information that may be helpful for locating the problem, such as
    - How you installed PyTorch [e.g., pip, conda, source]
    - Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)

tianweiy commented 3 years ago

I think that, for some reason, LiDARInstance3DBoxes is converted to CUDA in the current version? It worked well a few months ago.

XuyangBai commented 3 years ago

Thanks @tianweiy for your help. I installed numpy 1.19.4 and the problem was solved.
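For anyone who wants to check their environment before training, here is a minimal sketch of a version guard. It assumes, based only on the reports in this thread, that the behavior changes at numpy 1.20. The helpers `version_tuple` and `is_below` are hypothetical names introduced for illustration.

```python
import re


def version_tuple(version):
    """Parse the leading numeric components of a version string.

    Parsing stops at the first component that does not start with a
    digit, and suffixes such as "rc1" are dropped.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_below(version, bound="1.20"):
    """Return True if `version` is strictly older than `bound`.

    The default bound of 1.20 is an assumption taken from this thread,
    not a documented compatibility limit.
    """
    return version_tuple(version) < version_tuple(bound)


print(is_below("1.19.4"))
print(is_below("1.20.0"))
```

In practice the argument would be `numpy.__version__`, e.g. `is_below(numpy.__version__)`.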