open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0

ImvoteNet 2-stage model reproduce error #448

Closed lji72 closed 3 years ago

lji72 commented 3 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

I want to train an ImVoteNet model. I successfully trained "imvotenet_faster_rcnn_r50_fpn_2x4_sunrgbd-3d-10class.py", but training "imvotenet_stage2_16x8_sunrgbd-3d-10class.py" raises a CUDA error at the beginning of the training process.

Reproduction

  1. What command or script did you run?
    ./tools/dist_train.sh configs/imvotenet/imvotenet_stage2_16x8_sunrgbd-3d-10class.py 4 --work-dir ./train_log_imvotenet/ 
  2. Did you make any modifications on the code or config? Did you understand what you have modified? I changed the "load_from" value from the HTTP link to a local path on the server.
  3. What dataset did you use? SUNRGBD

Environment

  1. Please run python mmdet3d/utils/collect_env.py to collect necessary environment information and paste it here.

Python: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
CUDA available: True
GPU 0,1,2,3: Tesla P100-PCIE-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.2.r11.2/compiler.29373293_0
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:

TorchVision: 0.5.0
OpenCV: 4.5.1
MMCV: 1.2.7
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 11.2
MMDetection: 2.10.0
MMDetection3D: 0.11.0+

  2. You may add additional information that may be helpful for locating the problem, such as
    1. I installed the environment with conda.
    2. I used the generated SUNRGBD dataset to train VoteNet successfully.
    3. I trained the first-stage model of ImVoteNet successfully.
    4. I used Python scripts instead of the MATLAB scripts to preprocess the SUNRGBD data, but I used the generated data to train VoteNet and the first-stage model of ImVoteNet successfully.

Error traceback

If applicable, paste the error traceback here.

/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [1,0,0], thread: [32,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [1,0,0], thread: [33,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [1,0,0], thread: [34,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [1,0,0], thread: [35,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [1,0,0], thread: [36,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [1,0,0], thread: [37,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

yezhen17 commented 3 years ago

Hi @lji72 ,

I've run the code and nothing unexpected happened. Can you provide more of the error traceback, such as which function triggered this? If you currently cannot see a more meaningful error trace, you can try CUDA_LAUNCH_BLOCKING=1 ./tools/dist_train.sh configs/imvotenet/imvotenet_stage2_16x8_sunrgbd-3d-10class.py 4 --work-dir ./train_log_imvotenet/
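For anyone who prefers to set CUDA_LAUNCH_BLOCKING from Python rather than on the command line, a minimal sketch follows. It assumes the variable is set before the process first initializes CUDA, since changing it afterwards is not expected to affect an already running context. The helper name `enable_blocking_launches` is made up for illustration and is not part of mmdetection3d.

```python
import os


def enable_blocking_launches(env=None):
    """Request synchronous CUDA kernel launches so errors surface at the failing call.

    Hypothetical helper: call it before anything in the process touches CUDA.
    Pass a dict to build an environment for a subprocess instead of
    modifying the current one.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Build an environment mapping that could be handed to a child process.
child_env = enable_blocking_launches(dict(os.environ))
print(child_env["CUDA_LAUNCH_BLOCKING"])
```

With the shell form quoted in the comment, the worker processes should inherit the variable from the launching shell, so that form is usually the simpler choice for distributed training.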

lji72 commented 3 years ago

1. /scratch/workspace/xxxx/mmdet3d/mmdetection3d-master/mmdet3d/models/fusion_layers/coord_transform.py:33: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  if 'pcd_rotation' in img_meta else torch.eye(
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [64,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [65,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [66,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [67,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [68,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [69,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.

2. Traceback (most recent call last):
  File "./tools/train.py", line 212, in <module>
    main()
  File "./tools/train.py", line 208, in main
    meta=meta)
  File "/scratch/workspace/xxx/mmdet3d/mmdetection-2.10.0/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True)
  File "/scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 46, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/scratch/workspace/xxx/mmdet3d/mmdetection-2.10.0/mmdet/models/detectors/base.py", line 247, in train_step
    losses = self(**data)
  File "/scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/scratch/workspace/xxx/mmdet3d/mmdetection3d-master/mmdet3d/models/detectors/base.py", line 58, in forward
    return self.forward_train(**kwargs)
  File "/scratch/workspace/xxx/mmdet3d/mmdetection3d-master/mmdet3d/models/detectors/imvotenet.py", line 453, in forward_train
    img_metas, calib)
  File "/scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/xxx/mmdet3d/mmdetection3d-master/mmdet3d/models/fusion_layers/vote_fusion.py", line 202, in forward
    txt_cue = torch.gather(img_flatten, dim=-1, index=uv_expanded)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorScatterGather.cu:67

3. frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7ff9b3022193 in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x17f66 (0x7ff9b325ff66 in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x19cbd (0x7ff9b3261cbd in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7ff9b301263d in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x67bac2 (0x7ff9fe439ac2 in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x67bb66 (0x7ff9fe439b66 in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x183dd6 (0x56407457edd6 in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #7: + 0xe730f (0x5640744e230f in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #8: + 0xe605b (0x5640744e105b in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #9: + 0xe605b (0x5640744e105b in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #10: + 0xe5928 (0x5640744e0928 in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #11: + 0xe62c8 (0x5640744e12c8 in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #12: + 0xe62de (0x5640744e12de in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #13: + 0xe62de (0x5640744e12de in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #14: + 0xe62de (0x5640744e12de in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #15: + 0xe62de (0x5640744e12de in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #16: + 0xe62de (0x5640744e12de in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #17: + 0xe62de (0x5640744e12de in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #18: PyDict_SetItem + 0x4bf (0x56407452661f in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #19: PyDict_SetItemString + 0x66 (0x5640745268c6 in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #20: PyImport_Cleanup + 0x9c (0x56407462e4cc in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #21: Py_FinalizeEx + 0x67 (0x56407462e897 in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #22: + 0x2484db (0x5640746434db in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #23: _Py_UnixMain + 0x3c (0x56407464385c in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/bin/python)
frame #24: __libc_start_main + 0xe7 (0x7ffa050d4b97 in /lib/x86_64-linux-gnu/libc.so.6)

Is this enough? I also have a question: are the generated datasets used for VoteNet and ImVoteNet the same?

yezhen17 commented 3 years ago

I'm not sure about the problem. Please check uv_expanded and see if anything is obviously going wrong. Maybe you can also first check that the dataset preparation is correct? As for your question, the datasets should be the same, although VoteNet does not use the image data or the calibration information.
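One way to carry out the suggested check is to look for indices that violate the condition in the failing assertion, `indexValue >= 0 && indexValue < src.sizes[dim]`. The sketch below uses plain Python lists so it runs without a GPU; the function name and the toy image size are made up for illustration. On real tensors, comparing `uv_expanded.min()` and `uv_expanded.max()` against `img_flatten.shape[-1]` just before the `torch.gather` call should give the same information.

```python
def out_of_range_positions(indices, size):
    """Return (position, value) pairs for indices outside the half-open range [0, size)."""
    return [(i, idx) for i, idx in enumerate(indices) if idx < 0 or idx >= size]


# Toy example: an image of height 3 and width 4, flattened to 12 pixels.
height, width = 3, 4
flat_indices = [0, 5, 11, 12, -1]  # 12 and -1 would trip the gather assertion

bad = out_of_range_positions(flat_indices, height * width)
print(bad)  # [(3, 12), (4, -1)]
```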

Divadi commented 3 years ago

I encountered the same problem, albeit I was training w/ v1 data, though perhaps v2 has the same issue (since calibs didn't change between v1 and v2 SUNRGBD).

Here: https://github.com/open-mmlab/mmdetection3d/blob/237cebffd80a25b46d0ff76eff9925d800c997d0/mmdet3d/models/fusion_layers/vote_fusion.py#L199-L201

I added torch clamps to keep the projections within image bounds:

uv_rescaled[:, 0] = torch.clamp(uv_rescaled[:, 0].round(), 0, img_shape[1] - 1)
uv_rescaled[:, 1] = torch.clamp(uv_rescaled[:, 1].round(), 0, img_shape[0] - 1)
uv_flatten = uv_rescaled[:, 1].round() * \
    img_shape[1] + uv_rescaled[:, 0].round()
uv_expanded = uv_flatten.unsqueeze(0).expand(3, -1).long()

And this solved the issue. I was able to achieve similar numbers as the reference model with this method.
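To see why clamping before flattening matters, here is a framework-free sketch of the same arithmetic. The image size and the sample coordinates are hypothetical, and the sketch does not claim that rounding is what produced the out-of-range values in this issue; it only shows how a coordinate equal to the width or height pushes the row-major index past the end of the flattened image.

```python
def flatten_uv(u, v, width):
    """Row-major index of the pixel in column u, row v."""
    return v * width + u


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(value, high))


# Hypothetical image size, chosen only for illustration.
height, width = 480, 640
num_pixels = height * width

# Projections just inside the right and bottom edges round up to width and height.
u, v = round(639.6), round(479.7)
unclamped = flatten_uv(u, v, width)

# Clamping first keeps the index on the last valid pixel.
clamped = flatten_uv(clamp(u, 0, width - 1), clamp(v, 0, height - 1), width)

# An overshoot in u alone on an inner row stays in range but lands on the next row.
wrapped = flatten_uv(width, 10, width)

print(unclamped >= num_pixels, clamped == num_pixels - 1, wrapped == flatten_uv(0, 11, width))
```

The last case is worth noting: an overshoot in the column alone does not always raise an error, because the flattened index can silently wrap to the first pixel of the following row.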

yezhen17 commented 3 years ago

Hi @Divadi ,

Thanks for the solution! Originally I thought that something might be wrong with the dataset, so that the calculated image coordinates are out of bounds. But if clamping the coordinates achieves similar performance, then perhaps the error is caused by a small deviation in the calculation. Could you kindly check the range of uv_rescaled and see how far it goes out of bounds?

lji72 commented 3 years ago

Thanks for your replies. @Divadi @THU17cyz

lji72 commented 3 years ago

It works for me. @Divadi

Divadi commented 3 years ago

It did not seem to be by very much. Printing a number of cases, uv_rescaled[:, 0] was sometimes about 1 pixel beyond the image's maximum. I had assumed it was some issue with reversing the data augmentation (although when I wrote my own pipeline for projecting points back to the image without augmentation, the fit with the corresponding pixels was extremely close).

I was using info files generated with a February version of the repo (though I went through everything and checked that the calib files were identical).
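A measurement like the one described can be sketched as follows, again in plain Python so that it runs anywhere. The helper name `overshoot` and the sample values are invented for illustration and are not taken from the actual data.

```python
def overshoot(values, upper):
    """Return how far values fall outside [0, upper - 1] as (below, above).

    Each entry is the largest distance past the corresponding bound,
    or 0 if no value crosses it.
    """
    below = max((0 - v for v in values if v < 0), default=0)
    above = max((v - (upper - 1) for v in values if v > upper - 1), default=0)
    return below, above


# Made-up column coordinates for an image that is 640 pixels wide.
u_values = [12.0, 639.0, 640.5, -0.25]
print(overshoot(u_values, 640))  # (0.25, 1.5)
```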

Divadi commented 3 years ago

While we're on the subject of SUNRGB-D, I just wanted to bring another thing to attention: https://github.com/open-mmlab/mmdetection3d/blob/b035bc8edeef546adae77a2d0d716c0ebd32faba/data/sunrgbd/matlab/extract_rgbd_data_v2.m#L74

This line tries to force consistency between 2D and 3D bounding boxes (there is not a bijection between them). However, I believe that a number of 3D bounding boxes are often simply dropped as a result. Indeed, they don't have corresponding 2D annotations, but for 3D-only methods this is not an issue, and having the complete set of 3D boxes can likely help performance.

yezhen17 commented 3 years ago

I guess it's because reversing a rotation (by taking the transpose) is not precise enough, since we are using single-precision floats. Nevertheless, since you reproduced the results, I believe this is not a problem, and clamping the coordinates is a good solution. Are you willing to open a pull request to fix this?
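To get a rough feel for the size of such a round-trip error, the toy below rotates a 2D point and then applies the transposed rotation, rounding every intermediate result to float32. It is only a sketch under stated assumptions: the float32 arithmetic is emulated with `struct`, the point and angle are arbitrary, and the real pipeline also involves flipping, scaling and projection, so this neither confirms nor rules out the guess above.

```python
import math
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def rotate(x, y, c, s):
    """Apply the rotation [[c, -s], [s, c]] with emulated float32 rounding."""
    return f32(f32(c * x) - f32(s * y)), f32(f32(s * x) + f32(c * y))


def round_trip_error(x, y, angle):
    """Rotate by angle, undo with the transpose, and return the absolute errors."""
    c, s = f32(math.cos(angle)), f32(math.sin(angle))
    x, y = f32(x), f32(y)
    xr, yr = rotate(x, y, c, s)
    xb, yb = rotate(xr, yr, c, -s)  # the transpose is the rotation by -angle
    return abs(xb - x), abs(yb - y)


# Arbitrary point (in meters) and angle (in radians), for illustration only.
err_x, err_y = round_trip_error(2.5, 3.7, 0.3)
print(err_x, err_y)
```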

Divadi commented 3 years ago

Sure I can do it

yezhen17 commented 3 years ago

I'm not sure what you are implying. Since SUNRGBD data take the form of depth maps, the point clouds are partial, so shouldn't each 3D annotation correspond to a 2D one?

Divadi commented 3 years ago

I don't precisely remember the name, but each 2D bbox annotation struct seems to have a "has_3d_box" parameter (or perhaps the other way around), so they don't seem to have a 1-for-1 correspondence. I can look into this more when I have time in a few days

yezhen17 commented 3 years ago

I can look into this later too. We follow the data preprocessing of the original VoteNet and ImVoteNet repos, so we haven't dug deep into this.

lji72 commented 3 years ago

@THU17cyz I found that my calibration data was wrong because of a difference between the Python and MATLAB preprocessing code. After fixing that, accuracy rose to 65.3% mAP@0.25 from 59.80% mAP (the latter using the fix from @Divadi). Many thanks for your work, guys!

yezhen17 commented 3 years ago

Great to hear that. Can you explain more about why the calibration data was wrong? We would like to figure out the root cause of the out-of-bounds bug, since we did not encounter it ourselves. Thanks!

Tai-Wang commented 3 years ago

Fixed via #463

yezhen17 commented 3 years ago

Hi @lji72 , @Divadi , just curious, did you also suffer from #507 ?

Divadi commented 3 years ago

@THU17cyz I never saw the notification for this; I apologize.

I had generated my pickle files before the imvotenet commit (which introduced the #507 issue), so #448 was not caused by the Rt vs K issue. Besides, I had visualized the projections and they looked reasonable, and the Rt and K values in my generated pickle files are reasonable.