Closed lji72 closed 3 years ago
Hi @lji72 ,
I've run the code and nothing unexpected happens. Can you provide more of the error traceback, such as which function triggered this? If you currently cannot see a more meaningful error trace, you can try CUDA_LAUNCH_BLOCKING=1 ./tools/dist_train.sh configs/imvotenet/imvotenet_stage2_16x8_sunrgbd-3d-10class.py 4 --work-dir ./train_log_imvotenet/
1. /scratch/workspace/xxxx/mmdet3d/mmdetection3d-master/mmdet3d/models/fusion_layers/coord_transform.py:33: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
if 'pcd_rotation' in img_meta else torch.eye(
/pytorch/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [64,0,0] Assertion indexValue >= 0 && indexValue < src.sizes[dim] failed.
(the same assertion repeats for threads [65,0,0] through [69,0,0])
2. Traceback (most recent call last):
File "./tools/train.py", line 212, in
3. frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7ff9b3022193 in /scratch/workspace/xxx/anaconda3/envs/mmdet3D/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1:
Is that enough? I also have a question: are the generated datasets used for VoteNet and ImVoteNet the same?
I'm not sure about the problem. Please check uv_expanded
and see if there is anything obviously going wrong. Maybe you can also first check that the dataset preparation is correct. As for your question, the datasets should be the same; the image data and calibration information are simply unused by VoteNet.
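As a hedged, stand-alone sketch of the kind of check being suggested (plain NumPy with hypothetical names; the actual code operates on torch tensors): any projected coordinate outside the image makes the flattened gather index invalid, which is exactly what the CUDA assertion above reports.

```python
import numpy as np

def check_uv_in_bounds(uv, img_shape):
    """Return a boolean mask of projected (u, v) pixel coordinates that fall
    inside an image of shape (H, W). Any False entry would produce a
    flattened gather index outside [0, H*W) and trip the CUDA assertion."""
    h, w = img_shape[:2]
    u, v = uv[:, 0], uv[:, 1]
    return (u >= 0) & (u <= w - 1) & (v >= 0) & (v <= h - 1)

# Second point rounds to u = 530, one past the edge of a 530-wide image.
uv = np.array([[10.0, 5.0], [529.6, 20.0]])
mask = check_uv_in_bounds(np.round(uv), (480, 530))
```

Printing `uv[~mask]` on a failing batch would show which projections fall outside the image and by how much.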
I encountered the same problem, albeit while training with v1 data, though perhaps v2 has the same issue (since the calibs did not change between v1 and v2 SUN RGB-D). Here: https://github.com/open-mmlab/mmdetection3d/blob/237cebffd80a25b46d0ff76eff9925d800c997d0/mmdet3d/models/fusion_layers/vote_fusion.py#L199-L201 I added torch.clamp calls to keep the projections within the image bounds:
uv_rescaled[:, 0] = torch.clamp(uv_rescaled[:, 0].round(), 0, img_shape[1] - 1)
uv_rescaled[:, 1] = torch.clamp(uv_rescaled[:, 1].round(), 0, img_shape[0] - 1)
uv_flatten = uv_rescaled[:, 1].round() * \
    img_shape[1] + uv_rescaled[:, 0].round()
uv_expanded = uv_flatten.unsqueeze(0).expand(3, -1).long()
And this solved the issue. I was able to achieve similar numbers as the reference model with this method.
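A minimal, self-contained sketch of the same clamp-then-flatten idea, rewritten in NumPy purely for illustration (the helper name and shapes are hypothetical; the real fix above operates on torch tensors):

```python
import numpy as np

def flatten_uv(uv_rescaled, img_shape):
    """Clamp rounded (u, v) coordinates into an (H, W) image, then flatten
    them into 1-D gather indices (v * W + u), mirroring the clamping fix.
    Indices are guaranteed to lie in [0, H*W)."""
    h, w = img_shape[:2]
    u = np.clip(np.round(uv_rescaled[:, 0]), 0, w - 1)
    v = np.clip(np.round(uv_rescaled[:, 1]), 0, h - 1)
    return (v * w + u).astype(np.int64)

# Both points round slightly outside a 480x530 image; clamping pulls them back.
uv = np.array([[530.4, 10.0], [-0.6, 479.7]])
idx = flatten_uv(uv, (480, 530))
# idx is now a valid index into a flattened 480*530 image
```

Note that because the coordinates are clamped after rounding, the second `.round()` in the original snippet is a no-op; clamping before flattening is what prevents the out-of-bounds gather.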
Hi @Divadi ,
Thanks for the solution! Originally I thought something might be wrong with the dataset, so that the calculated image coordinates were out of bounds. But if clamping the coordinates achieves similar performance, then perhaps the error is caused by some deviation in the calculation. Can you kindly check the range of uv_rescaled
and see how far it goes out of bounds?
Thanks for your replies. @Divadi @THU17cyz
It works for me. @Divadi
It did not seem to be very much. Printing a number of cases, it looked like uv_rescaled[:, 0] was sometimes perhaps one pixel beyond the maximum image width. I had just decided that perhaps it was some issue with reversing the data augmentation (but when I wrote my own pipeline for projecting points back to the image without augmentation, there was an extremely close fit with the corresponding pixels).
I was using info files generated with a February version of the repo (though I went through everything and checked that the calib files were identical).
While we're on the subject of SUN RGB-D, I just wanted to bring another thing to attention: https://github.com/open-mmlab/mmdetection3d/blob/b035bc8edeef546adae77a2d0d716c0ebd32faba/data/sunrgbd/matlab/extract_rgbd_data_v2.m#L74
This line tries to force consistency between 2D and 3D bounding boxes (there is not a bijection between them). However, I believe it is often the case that a number of 3D bounding boxes are simply dropped. Indeed, those boxes don't have corresponding 2D annotations, but for 3D-only methods this is not an issue, and having the complete set of 3D boxes can likely help performance.
I guess it's because reversing a rotation by taking the transpose is not precise enough (and we are using float precision). Nevertheless, since you reproduced the results, I believe this is not a problem, and clamping the coordinates is a good solution. Would you be willing to open a pull request to fix this?
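A small illustration of the suspected cause, using a simple 2-D rotation for clarity (the actual pipeline applies pcd_rotation in 3-D): rotating a border pixel and then undoing the rotation via the transpose in float32 generally does not reproduce the exact original coordinate, and an error of even a fraction of a pixel is enough to round one pixel out of bounds.

```python
import numpy as np

# Build a 2-D rotation matrix in float32 precision.
theta = 0.3
c = np.float32(np.cos(theta))
s = np.float32(np.sin(theta))
R = np.array([[c, -s], [s, c]], dtype=np.float32)

# A pixel on the right border of a 530-wide image.
p = np.array([529.0, 100.0], dtype=np.float32)

# Rotate, then "invert" the rotation with the transpose, as the fusion code does.
roundtrip = R.T @ (R @ p)
err = float(np.abs(roundtrip - p).max())
# err is tiny but typically nonzero in float32, on the order of 1e-5 to 1e-4
```

That residual error is consistent with the observation above that the coordinates only overshoot by about a pixel at most.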
Sure I can do it
I'm not sure what you are implying. As SUN RGB-D data take the form of depth maps, the point clouds are partial, so shouldn't each 3D annotation correspond to a 2D one?
I don't precisely remember the name, but each 2D bbox annotation struct seems to have a "has_3d_box" parameter (or perhaps the other way around), so they don't seem to have a one-to-one correspondence. I can look into this more when I have time in a few days.
I can look into this later too. We follow the data preprocessing of the original VoteNet and ImVoteNet repos, so we haven't dug deep into this.
@THU17cyz I found that there was something wrong with my calibration data due to differences between the Python and MATLAB code. After fixing the issue, accuracy rises from 59.80% mAP@0.25 (using the fix from @Divadi) to 65.3% mAP@0.25. Many thanks for your work!
Great to hear that. Can you explain more about why the calibration data was wrong? We want to figure out the root cause of the out-of-bounds bug, since we did not encounter it ourselves. Thanks!
Fixed via #463
Hi @lji72 , @Divadi , just curious, did you also suffer from #507 ?
@THU17cyz I never saw the notification for this; I apologize.
I had generated my pickle files before the ImVoteNet commit (which introduced the #507 issue), so #448 was not caused by the Rt vs K issue. Besides, I had visualized the projections and they looked reasonable, and the Rt and K values in my generated pickle files are reasonable.
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug I want to train an ImVoteNet model. I successfully trained "imvotenet_faster_rcnn_r50_fpn_2x4_sunrgbd-3d-10class.py", then trained "imvotenet_stage2_16x8_sunrgbd-3d-10class.py" and got a CUDA error at the beginning of the training process.
Reproduction
Environment
python mmdet3d/utils/collect_env.py
to collect necessary environment information and paste it here.
Python: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
CUDA available: True
GPU 0,1,2,3: Tesla P100-PCIE-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.2.r11.2/compiler.29373293_0
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.5.0
OpenCV: 4.5.1
MMCV: 1.2.7
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 11.2
MMDetection: 2.10.0
MMDetection3D: 0.11.0+
Error traceback If applicable, paste the error traceback here.
Bug fix If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!