The default config tood_r50_fpn_1x_coco.py stops with an error in v2.20.0

Keiku commented 2 years ago

Checklist

I have searched related issues but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug

The default config tood_r50_fpn_1x_coco.py stops with an error in v2.20.0.

Reproduction

What command or script did you run?

python tools/train.py configs/tood/tood_r50_fpn_1x_coco.py

Environment

⋊> ~/c/t/mmdetection-2.20.0 on main ⨯ python mmdet/utils/collect_env.py      13:54:30
sys.platform: linux
Python: 3.8.6 (default, Feb  8 2022, 15:06:25) [GCC 5.4.0 20160609]
CUDA available: True
GPU 0: TITAN X (Pascal)
CUDA_HOME: /usr/local/cuda-10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.10.2+cu102
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.3+cu102
OpenCV: 4.5.5
MMCV: 1.3.17
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMDetection: 2.20.0+e6ca031
⋊> ~/c/t/mmdetection-2.20.0 on main ⨯

Error traceback

⋊> ~/c/t/mmdetection-2.20.0 on main ⨯
python tools/train.py configs/tood/tood_r50_fpn_1x_coco.py
2022-02-14 13:33:40,268 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.6 (default, Feb  8 2022, 15:06:25) [GCC 5.4.0 20160609]
CUDA available: True
GPU 0: TITAN X (Pascal)
CUDA_HOME: /usr/local/cuda-10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.10.2+cu102
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.3+cu102
OpenCV: 4.5.5
MMCV: 1.3.17
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMDetection: 2.20.0+e6ca031
------------------------------------------------------------

2022-02-14 13:33:40,478 - mmdet - INFO - Distributed training: False
2022-02-14 13:33:40,762 - mmdet - INFO - Config:
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_train2017.json',
        img_prefix='data/coco/train2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_val2017.json',
        img_prefix='data/coco/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_val2017.json',
        img_prefix='data/coco/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=1, metric='bbox')
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=1)
log_config = dict(
    interval=50,
    hooks=[dict(type='TextLoggerHook'),
           dict(type='TensorboardLoggerHook')])
custom_hooks = [dict(type='SetEpochInfoHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
model = dict(
    type='TOOD',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs='on_output',
        num_outs=5),
    bbox_head=dict(
        type='TOODHead',
        num_classes=80,
        in_channels=256,
        stacked_convs=6,
        feat_channels=256,
        anchor_type='anchor_free',
        anchor_generator=dict(
            type='AnchorGenerator',
            ratios=[1.0],
            octave_base_scale=8,
            scales_per_octave=1,
            strides=[8, 16, 32, 64, 128]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[0.1, 0.1, 0.2, 0.2]),
        initial_loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            activated=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_cls=dict(
            type='QualityFocalLoss',
            use_sigmoid=True,
            activated=True,
            beta=2.0,
            loss_weight=1.0),
        loss_bbox=dict(type='GIoULoss', loss_weight=2.0)),
    train_cfg=dict(
        initial_epoch=4,
        initial_assigner=dict(type='ATSSAssigner', topk=9),
        assigner=dict(type='TaskAlignedAssigner', topk=13),
        alpha=1,
        beta=6,
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.6),
        max_per_img=100))
work_dir = './work_dirs/tood_r50_fpn_1x_coco'
auto_resume = False
gpu_ids = range(0, 1)

2022-02-14 13:33:40,762 - mmdet - INFO - Set random seed to 1834907980, deterministic: False
2022-02-14 13:33:41,040 - mmdet - INFO - initialize ResNet with init_cfg {'type': 'Pretrained', 'checkpoint': 'torchvision://resnet50'}
2022-02-14 13:33:41,041 - mmcv - INFO - load model from: torchvision://resnet50
2022-02-14 13:33:41,041 - mmcv - INFO - load checkpoint from torchvision path: torchvision://resnet50
2022-02-14 13:33:41,112 - mmcv - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

2022-02-14 13:33:41,127 - mmdet - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
loading annotations into memory...
Done (t=14.69s)
creating index...
index created!
loading annotations into memory...
Done (t=0.50s)
creating index...
index created!
2022-02-14 13:34:01,233 - mmdet - INFO - Start running, host: keiichi_kuroyanagi@K-00007-LIN, work_dir: /home/keiichi_kuroyanagi/clones/tiny-coco-benchmark/mmdetection-2.20.0/work_dirs/tood_r50_fpn_1x_coco
2022-02-14 13:34:01,233 - mmdet - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) StepLrUpdaterHook
(NORMAL      ) CheckpointHook
(LOW         ) EvalHook
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) TensorboardLoggerHook
 --------------------
before_train_epoch:
(VERY_HIGH   ) StepLrUpdaterHook
(NORMAL      ) SetEpochInfoHook
(LOW         ) IterTimerHook
(LOW         ) EvalHook
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) TensorboardLoggerHook
 --------------------
before_train_iter:
(VERY_HIGH   ) StepLrUpdaterHook
(LOW         ) IterTimerHook
(LOW         ) EvalHook
 --------------------
after_train_iter:
(ABOVE_NORMAL) OptimizerHook
(NORMAL      ) CheckpointHook
(LOW         ) IterTimerHook
(LOW         ) EvalHook
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) TensorboardLoggerHook
 --------------------
after_train_epoch:
(NORMAL      ) CheckpointHook
(LOW         ) EvalHook
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) TensorboardLoggerHook
 --------------------
before_val_epoch:
(LOW         ) IterTimerHook
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) TensorboardLoggerHook
 --------------------
before_val_iter:
(LOW         ) IterTimerHook
 --------------------
after_val_iter:
(LOW         ) IterTimerHook
 --------------------
after_val_epoch:
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) TensorboardLoggerHook
 --------------------
after_run:
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) TensorboardLoggerHook
 --------------------
2022-02-14 13:34:01,234 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2022-02-14 13:34:01,234 - mmdet - INFO - Checkpoints will be saved to /home/keiichi_kuroyanagi/clones/tiny-coco-benchmark/mmdetection-2.20.0/work_dirs/tood_r50_fpn_1x_coco by HardDiskBackend.
2022-02-14 13:34:21,909 - mmdet - INFO - Epoch [1][50/58633]    lr: 9.890e-04, eta: 3 days, 8:20:58, time: 0.411, data_time: 0.053, memory: 3796, loss_cls: 1.0850, loss_bbox: 1.5065, loss: 2.5915
2022-02-14 13:34:39,847 - mmdet - INFO - Epoch [1][100/58633]   lr: 1.988e-03, eta: 3 days, 3:14:28, time: 0.359, data_time: 0.005, memory: 3796, loss_cls: 0.9613, loss_bbox: 1.3493, loss: 2.3106
2022-02-14 13:34:57,606 - mmdet - INFO - Epoch [1][150/58633]   lr: 2.987e-03, eta: 3 days, 1:17:04, time: 0.355, data_time: 0.006, memory: 3796, loss_cls: 0.9146, loss_bbox: 1.1918, loss: 2.1063
2022-02-14 13:35:15,742 - mmdet - INFO - Epoch [1][200/58633]   lr: 3.986e-03, eta: 3 days, 0:40:29, time: 0.363, data_time: 0.006, memory: 3796, loss_cls: 1.0055, loss_bbox: 1.1010, loss: 2.1065
2022-02-14 13:35:34,106 - mmdet - INFO - Epoch [1][250/58633]   lr: 4.985e-03, eta: 3 days, 0:29:04, time: 0.367, data_time: 0.006, memory: 3796, loss_cls: 0.9374, loss_bbox: 1.0659, loss: 2.0034
2022-02-14 13:35:52,111 - mmdet - INFO - Epoch [1][300/58633]   lr: 5.984e-03, eta: 3 days, 0:07:25, time: 0.360, data_time: 0.006, memory: 3796, loss_cls: 0.9610, loss_bbox: 1.1485, loss: 2.1095
2022-02-14 13:36:10,579 - mmdet - INFO - Epoch [1][350/58633]   lr: 6.983e-03, eta: 3 days, 0:07:22, time: 0.369, data_time: 0.005, memory: 3796, loss_cls: 0.9451, loss_bbox: 1.2223, loss: 2.1674
2022-02-14 13:36:28,955 - mmdet - INFO - Epoch [1][400/58633]   lr: 7.982e-03, eta: 3 days, 0:04:51, time: 0.368, data_time: 0.006, memory: 3796, loss_cls: 0.9752, loss_bbox: 1.2573, loss: 2.2325
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [134,0,0], thread: [32,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [134,0,0], thread: [33,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/

(omit)

/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [246,0,0], thread: [62,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [246,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "tools/train.py", line 196, in <module>
    main()
  File "tools/train.py", line 185, in main
    train_detector(
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmdet/apis/train.py", line 209, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 83, in forward_train
    losses = self.bbox_head.forward_train(x, img_metas, gt_bboxes,
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 335, in forward_train
    losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
    return old_func(*args, **kwargs)
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmdet/models/dense_heads/tood_head.py", line 433, in loss
    cls_avg_factors, bbox_avg_factors = multi_apply(
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmdet/core/utils/misc.py", line 30, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/mmdet/models/dense_heads/tood_head.py", line 343, in loss_single
    pos_inds = ((labels >= 0)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f57c7112d62 in /home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c4d3 (0x7f57c73754d3 in /home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f57c7375ee2 in /home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f57c70fc314 in /home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x29e239 (0x7f5823ad5239 in /home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xadf291 (0x7f5824316291 in /home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f5824316592 in /home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #25: __libc_start_main + 0xf0 (0x7f584701a840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: _start + 0x29 (0x400739 in /home/keiichi_kuroyanagi/.pyenv/versions/3.8.6-tiny-coco-benchmark/bin/python)

fish: “python tools/train.py configs/t…” terminated by signal SIGABRT (Abort)
⋊> ~/c/t/mmdetection-2.20.0 on main ⨯
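
The log above suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so that the Python traceback points at the kernel that actually failed, rather than at a later call such as `pos_inds = ((labels >= 0)`. A minimal sketch of doing that from Python, assuming the same command and config path as in the Reproduction section; the helper names `blocking_env`, `train_command` and `run_training` are hypothetical and not part of mmdetection:

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def train_command(config="configs/tood/tood_r50_fpn_1x_coco.py"):
    """Build the training command from the Reproduction section."""
    return [sys.executable, "tools/train.py", config]


def run_training():
    """Launch training with blocking kernel launches (run from the repo root)."""
    return subprocess.run(train_command(), env=blocking_env(), check=False)
```

Calling `run_training()` from the repository root is equivalent to prefixing the original command with the environment variable. Training is slower in this mode because every kernel launch is synchronous, so it is only worth using until the failing call is located.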

Keiku commented 2 years ago

I recently created my own tiny_coco dataset. I'm trying to run all the configs and benchmark them, but I get some errors.

Czm369 commented 2 years ago

Maybe the gt_bboxes of some images are invalid.
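
A quick way to test this hypothesis is to scan the annotation file for degenerate boxes. The sketch below assumes the standard COCO layout (`images` with `id`, `width`, `height`; `annotations` with `bbox` as `[x, y, width, height]`). Which conditions count as "illegal" is an assumption here, and `find_invalid_boxes` and `check_file` are hypothetical helpers, not mmdetection functions:

```python
import json


def find_invalid_boxes(coco):
    """Return (annotation id, reason) pairs for boxes that look degenerate.

    `coco` is a dict in COCO format; boxes are [x, y, width, height].
    """
    sizes = {img["id"]: (img["width"], img["height"]) for img in coco["images"]}
    problems = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        img_w, img_h = sizes[ann["image_id"]]
        if w <= 0 or h <= 0:
            problems.append((ann["id"], "non-positive width or height"))
        elif x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            problems.append((ann["id"], "extends outside the image"))
    return problems


def check_file(path="data/coco/annotations/instances_train2017.json"):
    """Load an annotation file (path taken from the config above) and scan it."""
    with open(path) as f:
        return find_invalid_boxes(json.load(f))
```

An empty result from `check_file()` would make a data problem less likely, although it does not rule out issues introduced later by the training pipeline. Boxes that exceed the image by a fraction of a pixel are also reported, so a small tolerance may be worth adding.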

Keiku commented 2 years ago

@Czm369 I am using the COCO dataset as it is. I don't think the data is wrong.

Keiku commented 2 years ago

The dataset that produced the above error is the COCO dataset itself.

Czm369 commented 2 years ago

Did you use some special data augmentation?

Keiku commented 2 years ago

@Czm369 This is the default config distributed by mmdetection v2.20.0. Doesn't it cause any problems in your environment?

Czm369 commented 2 years ago

The model trains normally for some iterations, so in my experience there may be some bad data. As you said, the model is trained on your own tiny_coco dataset. I have not encountered this problem in my environment.
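
The assertion in the log, `input_val >= zero && input_val <= one`, comes from a loss kernel that requires its inputs to lie in [0, 1]. It most likely belongs to binary cross entropy, though that is an inference from the file name and message, not something confirmed in this thread. NaN fails both comparisons, so a run that diverges after some iterations can trip the assertion just as bad data can. A small pure-Python sketch of the same check (`out_of_range` is a hypothetical helper, not part of mmdetection or PyTorch):

```python
import math


def out_of_range(values):
    """Return (index, value) pairs that would fail `0 <= v <= 1`.

    NaN is reported too, because every comparison with NaN is False.
    """
    return [(i, v) for i, v in enumerate(values) if not (0.0 <= v <= 1.0)]


# Hypothetical scores: three valid values followed by four that would assert.
scores = [0.0, 0.25, 1.0, 1.0000001, -0.1, math.nan, math.inf]
bad = out_of_range(scores)
print([i for i, _ in bad])
```

Applying a check of this kind to the values passed to the classification loss, on the iteration that fails, would show whether they are NaN (pointing at divergence) or merely out of range (pointing at the targets or the data).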

Keiku commented 2 years ago

@Czm369 I got the error with tiny_coco, so I switched back to the original COCO dataset to verify, but the error was still there. If you have time, please look at the config and log carefully. Thank you for the advice that this tends to be a data issue. I will check whether there is a problem with the dataset. (But I don't know where the problem with COCO itself would be.)
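
On reading the log against the config: the `lr` values printed during the first 400 iterations are consistent with the linear warmup in `lr_config` (`warmup_iters=500`, `warmup_ratio=0.001`) applied to the base `lr=0.01`, so the crash happened while the learning rate was still ramping up. A minimal sketch of that schedule, assuming the usual linear-warmup formula and a zero-based iteration counter (both are assumptions, chosen because they reproduce the logged values):

```python
def linear_warmup_lr(base_lr, cur_iter, warmup_iters=500, warmup_ratio=0.001):
    """Learning rate during linear warmup, rising from base_lr * warmup_ratio."""
    k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
    return base_lr * (1 - k)


# The log prints every 50 iterations; iteration 50 in the log is index 49.
for logged_iter in (50, 100, 400):
    print(logged_iter, f"{linear_warmup_lr(0.01, logged_iter - 1):.3e}")
```

The three printed values agree with the `lr` fields logged at iterations 50, 100 and 400 (9.890e-04, 1.988e-03 and 7.982e-03). This does not by itself explain the assertion, but it shows the failure occurred before the learning rate reached its full value of 0.01.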

hhaAndroid commented 2 years ago

@Keiku This is usually related to CUDA. Please update to the latest code; the problem may be solved by #7090.

ZERO-SPACE-X commented 2 years ago

@Keiku Hello, did you solve this problem? I'm using the v2.22 code and still have this problem.

Keiku commented 2 years ago

@ZERO-SPACE-X I'm sorry. I'm busy right now, so I haven't been able to investigate. It will take some time because it involves a CUDA upgrade and so on.

ZERO-SPACE-X commented 2 years ago

@Keiku Thanks for the reply. I'm looking for other ways to solve it.

open-mmlab / mmdetection, issue #7151