HTC with Custom Dataset - error in training

IAMShashankk commented 3 years ago

Describe the bug: RuntimeError: CUDA error: device-side assert triggered when training HTC with a custom dataset. It looks related to the shapes of gt_semantic_seg (torch.Size([2, 1, 100, 168])) and mask_pred (torch.Size([2, 16, 100, 168])). I checked that gt_semantic_seg gets its shape in formatting.py, and mask_pred gets its shape from the call self.semantic_head(x) in htc_roi_head.py.

I have 16 classes in my dataset and I updated this in the config file (pasted below) at all the required places.

I have created a stuff dataset for my custom dataset (all the train, test, val, and stuff images are in .tiff format).

I am not sure whether we have to generate a separate annotation JSON for the (custom) stuffthingmaps; my implementation doesn't have one. If one is needed, how do you specify its path?

Any help in solving this issue would be appreciated.

Reproduction

What command or script did you run?

!python tools/train.py config_file_name

Did you make any modifications on the code or config? Did you understand what you have modified? Changed config file:

_base_ = '/content/mmdetection/configs/htc/htc_r50_fpn_1x_coco.py'

model = dict(
    roi_head=dict(
        semantic_head=dict(
            type='FusedSemanticHead',
            num_classes=16),
        bbox_head=[
            dict(
                type='Shared2FCBBoxHead',
                num_classes=16),
            dict(
                type='Shared2FCBBoxHead',
                num_classes=16),
            dict(
                type='Shared2FCBBoxHead',
                num_classes=16)],
        mask_head=[
            dict(
                type='HTCMaskHead',
                num_classes=16),
            dict(
                type='HTCMaskHead',
                num_classes=16),
            dict(
                type='HTCMaskHead',
                num_classes=16)]))

data_root = '/content/HTC/'

dataset_type = 'COCODataset'
classes = ('armchair','bed','door1','door2','sink1','sink2','sink3','sink4','sofa1','sofa2','table1','table2','table3','tub','window1','window2')
data = dict(
    train=dict(
        seg_prefix='/content/HTC_Dataset/train_mask/',
        img_prefix='/content/HTC_Dataset/train/',
        classes=classes,
        ann_file='/content/HTC_Dataset/train/annotations_train.json'),
    val=dict(
        img_prefix='/content/HTC_Dataset/val/',
        classes=classes,
        ann_file='/content/HTC_Dataset/val/annotations_val.json'),
    test=dict(
        img_prefix='/content/HTC_Dataset/test/',
        classes=classes,
        ann_file='/content/HTC_Dataset/test/annotations_test.json'))

optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)

lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])

runner = dict(type='EpochBasedRunner', max_epochs=12)
log_config = dict(interval=100)

load_from = '/content/checkpoints/htc_r50_fpn_1x_coco_20200317-7332cf16.pth'

What dataset did you use? I used custom dataset. I have train, test, Val and stuffthingmaps generated.

Environment

Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.7.10 (default, May 3 2021, 02:48:31) [GCC 7.5.0]
CUDA available: True
GPU 0: Tesla K80
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.0+cu111
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.0.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.0+cu111
OpenCV: 4.1.2
MMCV: 1.3.9
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.14.0+5f61347

You may add additional information that may be helpful for locating the problem, such as
- How you installed PyTorch [e.g., pip, conda, source]: used pip
- Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

If applicable, paste the error traceback here.

/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [1,0,0], thread: [351,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [1,0,0], thread: [1023,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [1,0,0], thread: [256,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [1,0,0], thread: [775,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [1,0,0], thread: [776,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [1,0,0], thread: [671,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [1,0,0], thread: [943,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [1,0,0], thread: [944,0,0] Assertion `t >= 0 && t < n_classes` failed.
img shape formating -  torch.Size([3, 608, 1344])
gt_semantic_seg in formatting 1 -  (76, 168)
gt_semantic_seg in formatting 2 -  torch.Size([1, 76, 168])
img shape formating -  torch.Size([3, 800, 832])
gt_semantic_seg in formatting 1 -  (100, 104)
gt_semantic_seg in formatting 2 -  torch.Size([1, 100, 104])
Traceback (most recent call last):
  File "tools/train.py", line 188, in <module>
    main()
  File "tools/train.py", line 184, in main
    meta=meta)
  File "/content/mmdetection/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/content/mmdetection/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/content/mmdetection/mmdet/models/detectors/base.py", line 171, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/content/mmdetection/mmdet/models/detectors/two_stage.py", line 148, in forward_train
    **kwargs)
  File "/content/mmdetection/mmdet/models/roi_heads/htc_roi_head.py", line 269, in forward_train
    gt_labels[j])
  File "/content/mmdetection/mmdet/core/bbox/assigners/max_iou_assigner.py", line 105, in assign
    overlaps = self.iou_calculator(gt_bboxes, bboxes)
  File "/content/mmdetection/mmdet/core/bbox/iou_calculators/iou2d_calculator.py", line 65, in __call__
    return bbox_overlaps(bboxes1, bboxes2, mode, is_aligned)
  File "/content/mmdetection/mmdet/core/bbox/iou_calculators/iou2d_calculator.py", line 250, in bbox_overlaps
    eps = union.new_tensor([eps])
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f174283e2f2 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f174283b67b in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f1742a961f9 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f17428263a4 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6ea39a (0x7f17b665039a in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6ea441 (0x7f17b6650441 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #24: __libc_start_main + 0xe7 (0x7f17c87b4bf7 in /lib/x86_64-linux-gnu/libc.so.6)

AronLin commented 3 years ago

According to the error information, it seems that union in bbox_overlaps (File "/content/mmdetection/mmdet/core/bbox/iou_calculators/iou2d_calculator.py", line 250) is not on the GPU. You can check whether gt_bboxes and bboxes in forward_train (File "/content/mmdetection/mmdet/models/roi_heads/htc_roi_head.py", line 269) are on the GPU.
A JSON file for the stuffthingmaps is not required. If the first check is fine, the problem may lie in the categories of the stuffthingmaps: check whether the thing categories start at 0 or 1, and whether that matches the officially provided stuffthingmaps.
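
The assertion in the log above (`t >= 0 && t < n_classes`) fires when a target label lies outside the range the loss expects, so one way to act on this advice is to scan the label values in each stuffthingmap. The following is a minimal sketch in plain Python, not part of MMDetection: the function name is hypothetical, the ignore value 255 is an assumption, and the toy map stands in for a real image that you would load with an image library.

```python
from collections import Counter


def out_of_range_labels(seg_map, num_classes, ignore_index=255):
    """Count label values in a 2D map that fall outside [0, num_classes).

    seg_map is any iterable of rows of integer labels. ignore_index is an
    assumed "don't care" value that the loss is expected to skip.
    """
    counts = Counter(label for row in seg_map for label in row)
    return {
        label: n
        for label, n in sorted(counts.items())
        if label != ignore_index and not 0 <= label < num_classes
    }


# Toy 3x4 map standing in for one stuffthingmap image.
toy_map = [
    [0, 0, 3, 3],
    [15, 16, 40, 255],
    [1, 1, 40, 255],
]
bad = out_of_range_labels(toy_map, num_classes=16)
print(bad)  # {16: 1, 40: 2}
```

An empty result for every map would suggest the labels are consistent with the num_classes given to the semantic head; any entry in the result is a label value that could trigger the assertion.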

IAMShashankk commented 3 years ago

@AronLin , Thanks for the inputs.

I checked that gt_bboxes and bboxes are on the GPU.

In the stuffthingmaps, the thing categories start at 92. Following this, I also generated my stuffthingmaps starting from 92, specifically following #1779. Now when I train HTC I get further, but it still ends in an error:

/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [6,0,0], thread: [61,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [6,0,0], thread: [62,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [6,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "tools/train.py", line 188, in <module>
    main()
  File "tools/train.py", line 184, in main
    meta=meta)
  File "/content/mmdetection/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/content/mmdetection/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/content/mmdetection/mmdet/models/detectors/base.py", line 171, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/content/mmdetection/mmdet/models/detectors/two_stage.py", line 148, in forward_train
    **kwargs)
  File "/content/mmdetection/mmdet/models/roi_heads/htc_roi_head.py", line 304, in forward_train
    bbox_results['bbox_pred'], pos_is_gts, img_metas)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
    return old_func(*args, **kwargs)
  File "/content/mmdetection/mmdet/models/roi_heads/bbox_heads/bbox_head.py", line 442, in refine_bboxes
    img_meta_)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
    return old_func(*args, **kwargs)
  File "/content/mmdetection/mmdet/models/roi_heads/bbox_heads/bbox_head.py", line 480, in regress_by_class
    rois, bbox_pred, max_shape=max_shape)
  File "/content/mmdetection/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py", line 92, in decode
    self.add_ctr_clamp, self.ctr_clamp)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/utils/parrots_jit.py", line 21, in wrapper_inner
    return func(*args, **kargs)
  File "/content/mmdetection/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py", line 211, in delta2bbox
    means = deltas.new_tensor(means).view(1, -1).repeat(1, deltas.size(-1) // 4)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f792f77f2f2 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f792f77c67b in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f792f9d71f9 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f792f7673a4 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6ea39a (0x7f79a359139a in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6ea441 (0x7f79a3591441 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #24: __libc_start_main + 0xe7 (0x7f79b56f5bf7 in /lib/x86_64-linux-gnu/libc.so.6)

This error comes from "/content/mmdetection/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py" when we try to create a new tensor. I have some debug information from there:
rois shape : torch.Size([512, 4])
deltas shape : torch.Size([512, 4])
deltas on cuda : True
rois on cuda : True
deltas.size(-1) : 1
means : [0.0, 0.0, 0.0, 0.0]

Could you please advise me on how to move forward from this error?

IAMShashankk commented 3 years ago

@AronLin, on further debugging, I found that the error appears when we try to use deltas (which is actually bbox_pred). If I check its shape with deltas.shape, the output is torch.Size([512, 4]). But when I try to print it or perform an operation on it, I get the error mentioned above.
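
A tensor whose shape can be read but whose values cannot be printed is consistent with CUDA reporting errors asynchronously: an earlier kernel has already failed, and the error only surfaces at the next call that synchronizes with the device, so the Python traceback may point at an unrelated line. A minimal sketch of one common way to get a more accurate traceback is to set the CUDA_LAUNCH_BLOCKING environment variable before any CUDA work starts. The helper name below is hypothetical, and where the call goes (for example, at the very top of the training script) is an assumption.

```python
import os


def force_synchronous_cuda_errors(env=os.environ):
    """Ask CUDA to run kernel launches synchronously.

    This has to happen before the first CUDA call in the process; the
    safest place is before torch is imported. Training runs slower, so
    it is meant for debugging only.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


force_synchronous_cuda_errors()
```

The same effect can be had from the shell by prefixing the training command with CUDA_LAUNCH_BLOCKING=1. Running a single batch on the CPU is another option, since CPU kernels typically raise an ordinary Python exception at the failing operation.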

leemengwei commented 3 years ago

@IAMShashankk, hi, did you solve this problem? I got the same problem when training HTC-dcn: it throws a CUDA device-side assertion from 'delta_xywh_bbox_coder.py", line 205, in delta2bbox means = deltas.new_tensor(means).....'

IAMShashankk commented 3 years ago

Not yet :(

leemengwei commented 3 years ago

Well, I'm no longer bothered by that problem! As a matter of fact, I shouldn't have gotten into that trouble at all. I got rid of it by correcting my HTC configuration file. FYI, I corrected the bbox/mask head settings. To be more specific, I have 2 classes, yet earlier I only modified num_classes in those heads and left out all their other settings (after inheriting from the base HTC config). It seems that is not enough, because when I later explicitly copied all the other parameter settings for those heads, left them unchanged, and again modified only num_classes, it worked. I'm actually a little confused about this: as far as I know MMDetection, only when I set _delete_=True for those heads are the parameters from the base config cleared so that they need to be set from scratch, yet this experience contradicts that.
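
A possible explanation for this behaviour, as far as I understand the config inheritance in MMCV, is that only dict values are merged key by key with the base config, while a list value (such as bbox_head or mask_head in HTC) replaces the base list as a whole. The sketch below is a simplified model of that rule in plain Python, not MMCV's actual code; the function name and the base values are illustrative assumptions.

```python
import copy

DELETE_KEY = "_delete_"


def merge_into_base(child, base):
    """Simplified inheritance model: dicts merge, everything else replaces."""
    merged = copy.deepcopy(base)
    for key, value in child.items():
        if isinstance(value, dict):
            value = dict(value)
            delete = value.pop(DELETE_KEY, False)
            if not delete and isinstance(merged.get(key), dict):
                # Dict over dict: keep base keys the child did not mention.
                merged[key] = merge_into_base(value, merged[key])
                continue
        # Lists, scalars, and dicts marked _delete_=True replace the base.
        merged[key] = copy.deepcopy(value)
    return merged


# Illustrative base values, not copied from the real HTC config.
base = dict(roi_head=dict(
    semantic_head=dict(type="FusedSemanticHead", num_classes=183, loss_weight=0.2),
    bbox_head=[dict(type="Shared2FCBBoxHead", num_classes=80, reg_class_agnostic=True)],
))
child = dict(roi_head=dict(
    semantic_head=dict(num_classes=2),
    bbox_head=[dict(type="Shared2FCBBoxHead", num_classes=2)],
))
merged = merge_into_base(child, base)
print(merged["roi_head"]["semantic_head"])  # loss_weight is still present
print(merged["roi_head"]["bbox_head"][0])   # reg_class_agnostic is gone
```

Under this model, a head list that only sets type and num_classes silently drops every other base setting for those heads, which would match the observation that copying the full head definitions and changing only num_classes works.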

AronLin commented 3 years ago

Did his solution solve your problem?

Rock-L21 commented 2 years ago

May I ask how num_classes is set in your configuration file?

open-mmlab / mmdetection

HTC with Custom Dataset - error in training #5608