Closed IAMShashankk closed 3 years ago
According to the error information, it seems that union (in File "/content/mmdetection/mmdet/core/bbox/iou_calculators/iou2d_calculator.py", line 250, in bbox_overlaps) is not on the GPU. You can check whether gt_bboxes and bboxes (in File "/content/mmdetection/mmdet/models/roi_heads/htc_roi_head.py", line 269, in forward_train) are on the GPU.
A JSON file for the stuffthingmaps is not required. If the first check passes, the problem may lie in the category labels of the stuffthingmaps: check whether the thing categories start at 0 or 1, and whether that matches the officially provided stuffthingmaps.
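The label check suggested above can be sketched in plain Python. This is only an illustration under stated assumptions: find_invalid_labels is a hypothetical helper, the map is a toy nested list rather than a real stuffthingmap image, and ignore_index=255 is assumed to be the value the semantic loss skips.

```python
from collections import Counter


def find_invalid_labels(seg_map, num_classes, ignore_index=255):
    """Count label values that are neither in [0, num_classes) nor ignore_index.

    Hypothetical helper: seg_map is a 2D list of integer labels.
    """
    counts = Counter(value for row in seg_map for value in row)
    return Counter({
        value: n for value, n in counts.items()
        if value != ignore_index and not 0 <= value < num_classes
    })


# Toy 2x4 map standing in for one stuffthingmap image (made-up values).
toy_map = [
    [0, 1, 15, 255],
    [16, 16, 3, 0],
]
bad = find_invalid_labels(toy_map, num_classes=16)
print(bad)  # Counter({16: 2})
```

With 16 classes, a label of 16 would index past the end of the class dimension, which is the kind of value that can trigger an "index out of bounds" device-side assertion. Labels that start at 1 instead of 0 typically show up this way.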
@AronLin , Thanks for the inputs.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [6,0,0], thread: [61,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [6,0,0], thread: [62,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [6,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
File "tools/train.py", line 188, in <module>
main()
File "tools/train.py", line 184, in main
meta=meta)
File "/content/mmdetection/mmdet/apis/train.py", line 170, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/content/mmdetection/mmdet/models/detectors/base.py", line 237, in train_step
losses = self(**data)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/content/mmdetection/mmdet/models/detectors/base.py", line 171, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/content/mmdetection/mmdet/models/detectors/two_stage.py", line 148, in forward_train
**kwargs)
File "/content/mmdetection/mmdet/models/roi_heads/htc_roi_head.py", line 304, in forward_train
bbox_results['bbox_pred'], pos_is_gts, img_metas)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
return old_func(*args, **kwargs)
File "/content/mmdetection/mmdet/models/roi_heads/bbox_heads/bbox_head.py", line 442, in refine_bboxes
img_meta_)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
return old_func(*args, **kwargs)
File "/content/mmdetection/mmdet/models/roi_heads/bbox_heads/bbox_head.py", line 480, in regress_by_class
rois, bbox_pred, max_shape=max_shape)
File "/content/mmdetection/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py", line 92, in decode
self.add_ctr_clamp, self.ctr_clamp)
File "/usr/local/lib/python3.7/dist-packages/mmcv/utils/parrots_jit.py", line 21, in wrapper_inner
return func(*args, **kargs)
File "/content/mmdetection/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py", line 211, in delta2bbox
means = deltas.new_tensor(means).view(1, -1).repeat(1, deltas.size(-1) // 4)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f792f77f2f2 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f792f77c67b in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f792f9d71f9 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f792f7673a4 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6ea39a (0x7f79a359139a in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6ea441 (0x7f79a3591441 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #24: __libc_start_main + 0xe7 (0x7f79b56f5bf7 in /lib/x86_64-linux-gnu/libc.so.6)
This error is coming from "/content/mmdetection/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py"
when we are trying to generate a new tensor.
Here is some debug information from this file:
rois shape : torch.Size([512, 4])
deltas shape : torch.Size([512, 4])
deltas on cuda : True
rois on cuda : True
deltas.size(-1) : 1
means : [0.0, 0.0, 0.0, 0.0]
Could you please advise me on how to move forward from this error?
@AronLin, on further debugging, I found that we get this error whenever we try to use deltas (which is actually bbox_pred). Reading its shape with deltas.shape works and gives torch.Size([512, 4]), but when I try to print it or perform any operation on it, I get the error mentioned above.
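A note on why the shape can be read while any real use of the tensor fails: CUDA kernels are launched asynchronously, so a device-side assert is usually reported at a later operation that synchronizes, not at the kernel that actually failed. The deltas.new_tensor line in the traceback is therefore probably not the culprit; the "index out of bounds" messages from ScatterGatherKernel.cu point to an earlier gather/scatter with an out-of-range index. A minimal sketch of the usual first step follows; check_index_bounds is a hypothetical CPU-side helper and the example indices are made up.

```python
import os

# Make CUDA kernel launches synchronous so the Python traceback points at the
# operation that actually failed. This must be set before CUDA is initialized
# (safest: before importing torch).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def check_index_bounds(indices, size):
    """CPU-side version of the failing device assertion: 0 <= idx < size.

    Hypothetical helper; returns the offending indices.
    """
    return [i for i in indices if not 0 <= i < size]


# Made-up labels for a head built with 16 classes.
print(check_index_bounds([0, 3, 15, 16], size=16))  # [16]
```

The same effect can be had by prefixing the existing training command with CUDA_LAUNCH_BLOCKING=1. Running the failing step on CPU is another option, since CPU indexing errors name the offending index directly.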
@IAMShashankk, hi, did you solve this problem? I got the same problem when training HTC-dcn: it throws a CUDA device-side assertion from "delta_xywh_bbox_coder.py", line 205, in delta2bbox: means = deltas.new_tensor(means).....
Not yet :(
Well, I'm no longer bothered by that problem! As a matter of fact, I shouldn't have run into it at all. I resolved it by correcting my HTC configuration file. FYI, I corrected the bbox/mask head settings. To be more specific, I have 2 classes, yet earlier I had only modified num_classes in those heads and left out everything else (after inheriting from the base HTC config). That was apparently not enough, because when I later explicitly copied all the other parameter settings for those heads, leaving them unchanged, and again modified only num_classes, it worked. I'm actually a little confused by this: as far as I understand MMDetection, only when I set _delete_=True for those heads are the parameters from the base config cleared and required to be set from scratch, yet this experience contradicts that.
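One possible explanation, not confirmed in this thread: when MMCV merges a child config into its base, dict values are merged key by key, but a list value in the child replaces the base list as a whole. In HTC the cascaded bbox and mask heads are lists of dicts, so overriding only num_classes inside such a list would drop every other key. The sketch below is a simplified stand-in for that merge rule, not MMCV's actual code; merge and the toy base/child values are hypothetical.

```python
DELETE_KEY = "_delete_"


def merge(child, base):
    """Merge child into a copy of base.

    Dicts are merged recursively; every other value, lists included,
    replaces the base value wholesale.
    """
    merged = dict(base)
    for key, value in child.items():
        if isinstance(value, dict):
            value = dict(value)  # do not mutate the caller's dict
            delete = value.pop(DELETE_KEY, False)
            if key in merged and isinstance(merged[key], dict) and not delete:
                merged[key] = merge(value, merged[key])
            else:
                merged[key] = value
        else:
            merged[key] = value
    return merged


# Hypothetical base config: one head as a dict, a cascade of heads as a list.
base = {
    "semantic_head": {"type": "FusedSemanticHead", "num_classes": 183},
    "bbox_head": [{"type": "Shared2FCBBoxHead", "num_classes": 80}] * 3,
}
child = {
    "semantic_head": {"num_classes": 2},
    "bbox_head": [{"num_classes": 2}] * 3,
}
cfg = merge(child, base)
print(cfg["semantic_head"])  # {'type': 'FusedSemanticHead', 'num_classes': 2}
print(cfg["bbox_head"][0])   # {'num_classes': 2}
```

In this model the dict-valued head keeps its inherited type, while each entry of the list-valued head loses everything except num_classes. That would be consistent with needing to copy the full head settings even without _delete_=True.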
Did his solution solve your problem?
May I ask how num_classes is set in your configuration file?
Describe the bug
RuntimeError: CUDA error: device-side assert triggered, when training HTC with a custom dataset. It looks like it is related to the shapes of gt_semantic_seg (torch.Size([2, 1, 100, 168])) and mask_pred (torch.Size([2, 16, 100, 168])). I checked that the shape of gt_semantic_seg comes from formatting.py and the shape of mask_pred comes from the call self.semantic_head(x) in htc_roi_head.py.
I have 16 classes in my dataset and I updated this in the config file (pasted below) in all the required places.
I have created a stuff dataset for my custom dataset (all the train, test, Val, stuff images are in .tiff format).
I am not aware if we have to specifically generate annotation JSON for stuffthingmaps (custom); in my implementation, I don't have it. Also if this is the case how can you specify the path for this?
Any help in solving this issue would be really helpful.
Reproduction
What dataset did you use? I used a custom dataset. I have train, test, val and stuffthingmaps generated.
Environment
Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
sys.platform: linux
Python: 3.7.10 (default, May 3 2021, 02:48:31) [GCC 7.5.0]
CUDA available: True
GPU 0: Tesla K80
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.0+cu111
PyTorch compiling details: PyTorch built with:
TorchVision: 0.9.0+cu111
OpenCV: 4.1.2
MMCV: 1.3.9
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.14.0+5f61347