faster_rcnn_r50_fpn training one of my own datasets gets loss_bbox=nan problem

Thanks for your error report and we appreciate it a lot.

Checklist

I have searched related issues but could not find the expected help. I also added grad_clip, but it did not help.
I have checked the dataset, and it has no bbox targets that extend beyond the image bounds.
I have also tried 'exp' warmup, but it did not help either.
This problem does not appear when I train on another dataset of mine.
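For reference, the dataset check mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the function name find_bad_boxes and the sample annotation are hypothetical, and it assumes standard VOC XML with a size element. Besides out-of-bounds boxes, it also flags boxes with zero or negative width or height, which are another common thing to rule out when a box regression loss turns into nan.

```python
import xml.etree.ElementTree as ET


def find_bad_boxes(xml_text):
    """Return (class name, reason) pairs for suspicious boxes in one VOC annotation.

    Hypothetical helper, not part of MMDetection.
    """
    root = ET.fromstring(xml_text)
    size = root.find('size')
    img_w = int(size.find('width').text)
    img_h = int(size.find('height').text)

    problems = []
    for obj in root.findall('object'):
        name = obj.find('name').text
        box = obj.find('bndbox')
        xmin, ymin, xmax, ymax = (
            float(box.find(tag).text)
            for tag in ('xmin', 'ymin', 'xmax', 'ymax'))
        if xmax <= xmin or ymax <= ymin:
            # Degenerate box: no area to regress against.
            problems.append((name, 'zero or negative width/height'))
        elif xmin < 0 or ymin < 0 or xmax > img_w or ymax > img_h:
            problems.append((name, 'outside image bounds'))
    return problems


# Made-up annotation used only to exercise the check.
sample = """<annotation>
  <size><width>100</width><height>80</height></size>
  <object><name>normal</name>
    <bndbox><xmin>10</xmin><ymin>10</ymin><xmax>10</xmax><ymax>40</ymax></bndbox>
  </object>
  <object><name>abnormal</name>
    <bndbox><xmin>5</xmin><ymin>5</ymin><xmax>120</xmax><ymax>40</ymax></bndbox>
  </object>
</annotation>"""

for name, reason in find_bad_boxes(sample):
    print(name, reason)
```

To scan a whole dataset, the same function could be called on the contents of each file in the Annotations directory.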

Describe the bug

When I train with a config file modified from faster_rcnn_r50_fpn.py, most loss_bbox values become nan.

Reproduction

What command or script did you run?

A placeholder for the command.
model = dict(
type='FasterRCNN',
pretrained='torchvision://resnet50',
backbone=dict(
    type='ResNet',
    depth=50,
    num_stages=4,
    out_indices=(0, 1, 2, 3),
    frozen_stages=1,
    norm_cfg=dict(type='BN', requires_grad=True),
    norm_eval=True,
    style='pytorch'),
neck=dict(
    type='FPN',
    in_channels=[256, 512, 1024, 2048],
    out_channels=256,
    num_outs=5),
rpn_head=dict(
    type='RPNHead',
    in_channels=256,
    feat_channels=256,
    anchor_generator=dict(
        type='AnchorGenerator',
        scales=[8],
        ratios=[0.5, 1.0, 2.0],
        strides=[4, 8, 16, 32, 64]),
    bbox_coder=dict(
        type='DeltaXYWHBBoxCoder',
        target_means=[0.0, 0.0, 0.0, 0.0],
        target_stds=[1.0, 1.0, 1.0, 1.0]),
    loss_cls=dict(
        type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
    loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
roi_head=dict(
    type='StandardRoIHead',
    bbox_roi_extractor=dict(
        type='SingleRoIExtractor',
        roi_layer=dict(type='RoIAlign', out_size=7, sample_num=0),
        out_channels=256,
        featmap_strides=[4, 8, 16, 32]),
    bbox_head=dict(
        type='Shared2FCBBoxHead',
        in_channels=256,
        fc_out_channels=1024,
        roi_feat_size=7,
        num_classes=4,
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[0.1, 0.1, 0.2, 0.2]),
        reg_class_agnostic=False,
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0))))
train_cfg = dict(
rpn=dict(
    assigner=dict(
        type='MaxIoUAssigner',
        pos_iou_thr=0.7,
        neg_iou_thr=0.3,
        min_pos_iou=0.3,
        match_low_quality=True,
        ignore_iof_thr=-1),
    sampler=dict(
        type='RandomSampler',
        num=256,
        pos_fraction=0.5,
        neg_pos_ub=-1,
        add_gt_as_proposals=False),
    allowed_border=-1,
    pos_weight=-1,
    debug=False),
rpn_proposal=dict(
    nms_across_levels=False,
    nms_pre=2000,
    nms_post=1000,
    max_num=1000,
    nms_thr=0.7,
    min_bbox_size=0),
rcnn=dict(
    assigner=dict(
        type='MaxIoUAssigner',
        pos_iou_thr=0.5,
        neg_iou_thr=0.5,
        min_pos_iou=0.5,
        match_low_quality=False,
        ignore_iof_thr=-1),
    sampler=dict(
        type='RandomSampler',
        num=512,
        pos_fraction=0.25,
        neg_pos_ub=-1,
        add_gt_as_proposals=True),
    pos_weight=-1,
    debug=False))
test_cfg = dict(
rpn=dict(
    nms_across_levels=False,
    nms_pre=1000,
    nms_post=1000,
    max_num=1000,
    nms_thr=0.7,
    min_bbox_size=0),
rcnn=dict(
    score_thr=0.5, nms=dict(type='nms', iou_thr=0.5), max_per_img=100))
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='Resize', img_scale=(1000, 600), keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
    type='Normalize',
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
    type='MultiScaleFlipAug',
    img_scale=(1000, 600),
    flip=False,
    transforms=[
        dict(type='Resize', keep_ratio=True),
        dict(type='RandomFlip'),
        dict(
            type='Normalize',
            mean=[123.675, 116.28, 103.53],
            std=[58.395, 57.12, 57.375],
            to_rgb=True),
        dict(type='Pad', size_divisor=32),
        dict(type='ImageToTensor', keys=['img']),
        dict(type='Collect', keys=['img'])
    ])
]
classes = ('normal', 'abnormal')
data_root = '/home/ding/yz/Image_Recognize/mmdetection/data'
data = dict(
samples_per_gpu=2,
workers_per_gpu=2,
train=dict(
    type='RepeatDataset',
    classes=('normal', 'abnormal'),
    times=3,
    dataset=dict(
        type='VOCDataset',
        ann_file=[
            '/home/ding/yz/Image_Recognize/mmdetection/data/mine/ImageSets/Main/train_cv1.txt'
        ],
        img_prefix=[
            '/home/ding/yz/Image_Recognize/mmdetection/data/mine/'
        ],
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='Resize', img_scale=(1000, 600), keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ])),
val=dict(
    type='VOCDataset',
    classes=('normal', 'abnormal'),
    ann_file=
    '/home/ding/yz/Image_Recognize/mmdetection/data/mine/ImageSets/Main/val_cv1.txt',
    img_prefix='/home/ding/yz/Image_Recognize/mmdetection/data/mine/',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(
            type='MultiScaleFlipAug',
            img_scale=(1000, 600),
            flip=False,
            transforms=[
                dict(type='Resize', keep_ratio=True),
                dict(type='RandomFlip'),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(type='ImageToTensor', keys=['img']),
                dict(type='Collect', keys=['img'])
            ])
    ]),
test=dict(
    type='VOCDataset',
    classes=('normal', 'abnormal'),
    ann_file=
    '/home/ding/yz/Image_Recognize/mmdetection/data/mine/ImageSets/Main/val_cv1.txt',
    img_prefix='/home/ding/yz/Image_Recognize/mmdetection/data/mine/',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(
            type='MultiScaleFlipAug',
            img_scale=(1000, 600),
            flip=False,
            transforms=[
                dict(type='Resize', keep_ratio=True),
                dict(type='RandomFlip'),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(type='ImageToTensor', keys=['img']),
                dict(type='Collect', keys=['img'])
            ])
    ]))
evaluation = dict(interval=1, metric='mAP')
checkpoint_config = dict(interval=10)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', step=[30])
total_epochs = 30
work_dir = 'results/mine/lr/0.005/cv_0'
gpu_ids = range(0, 1)
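
Note that the config dump above shows grad_clip=None and no warmup in lr_config, so it reflects a run without the two mitigations listed in the checklist. As a sketch only, those settings are usually written along the following lines in an MMDetection 2.x config. The numeric values here are assumptions for illustration, not the ones used in the failed attempts.

```python
# Illustrative values only; they are not taken from the posted config.
# Clip the global gradient L2 norm before each optimizer step.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))

# Step schedule as in the posted config, with exponential warmup added.
lr_config = dict(
    policy='step',
    warmup='exp',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[30])
```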

Did you make any modifications on the code or config? Did you understand what you have modified?
Yes, I changed lr and lr_config. For the dataset, I modified CLASSES of VOCDataset and set num_classes=2.
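
The config dump above lists num_classes=4 in bbox_head, while this answer mentions num_classes=2 and the dataset has two classes. As far as I understand, num_classes in MMDetection 2.x counts foreground classes only, so the two are expected to agree. A minimal sketch of a consistency check follows. The helper name is hypothetical, and whether this mismatch is related to the nan is not established here.

```python
def num_classes_matches(bbox_head_cfg, class_names):
    """Return True if the head's num_classes equals the number of class names."""
    return bbox_head_cfg['num_classes'] == len(class_names)


# Values copied from the posted config dump.
classes = ('normal', 'abnormal')
posted_bbox_head = dict(type='Shared2FCBBoxHead', num_classes=4)

print(num_classes_matches(posted_bbox_head, classes))  # prints False
```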
What dataset did you use?
A cloth dataset loaded as VOCDataset.

Environment

Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 9.2, V9.2.148
GPU 0: GeForce GTX 1080 Ti
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CUDA Runtime 9.2
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.1
- Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.5.0
OpenCV: 4.2.0
MMCV: 0.6.2
MMDetection: 2.2.0+741b638
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 9.2

You may add additional information that may be helpful for locating the problem, such as:
- I installed PyTorch by conda.

Error traceback

If applicable, paste the error traceback here.

2020-07-13 23:03:03,140 - mmdet - INFO - workflow: [('train', 1)], max: 30 epochs
2020-07-13 23:03:29,667 - mmdet - INFO - Epoch [1][50/294]  lr: 5.000e-03, eta: 1:17:23, time: 0.529, data_time: 0.318, memory: 2240, loss_rpn_cls: 0.2708, loss_rpn_bbox: 0.0560, loss_cls: 0.2029, acc: 97.4023, loss_bbox: 0.0433, loss: 0.5730
2020-07-13 23:03:53,722 - mmdet - INFO - Epoch [1][100/294] lr: 5.000e-03, eta: 1:13:25, time: 0.481, data_time: 0.272, memory: 2408, loss_rpn_cls: 0.2238, loss_rpn_bbox: 0.1371, loss_cls: 0.1535, acc: 99.1562, loss_bbox: 0.0190, loss: 0.5334
2020-07-13 23:04:17,373 - mmdet - INFO - Epoch [1][150/294] lr: 5.000e-03, eta: 1:11:27, time: 0.473, data_time: 0.250, memory: 3131, loss_rpn_cls: 0.1667, loss_rpn_bbox: 0.0586, loss_cls: 0.0950, acc: 98.7441, loss_bbox: nan, loss: nan
2020-07-13 23:04:41,303 - mmdet - INFO - Epoch [1][200/294] lr: 5.000e-03, eta: 1:10:28, time: 0.479, data_time: 0.249, memory: 3131, loss_rpn_cls: 0.1298, loss_rpn_bbox: 0.0416, loss_cls: 0.0810, acc: 98.6660, loss_bbox: 0.0398, loss: 0.2921
2020-07-13 23:05:04,626 - mmdet - INFO - Epoch [1][250/294] lr: 5.000e-03, eta: 1:09:22, time: 0.467, data_time: 0.246, memory: 3131, loss_rpn_cls: 0.1163, loss_rpn_bbox: 0.0428, loss_cls: 0.0922, acc: 98.5000, loss_bbox: nan, loss: nan
2020-07-13 23:05:35,574 - mmdet - INFO - 
+-----------+-----+------+-----+-----+-----------+--------+-------+
| class     | gts | dets | tps | fps | precision | recall | ap    |
+-----------+-----+------+-----+-----+-----------+--------+-------+
| normal    | 65  | 0    | 0   | 0   | 0.000     | 0.000  | 0.000 |
| abnormal  | 50  | 0    | 0   | 0   | 0.000     | 0.000  | 0.000 |
+-----------+-----+------+-----+-----+-----------+--------+-------+
| mAP       |     |      |     |     |           |        | 0.000 |
+-----------+-----+------+-----+-----+-----------+--------+-------+
2020-07-13 23:10:42,334 - mmdet - INFO - Epoch [3][294/294] lr: 5.000e-03, mAP: 0.2230
2020-07-13 23:11:08,469 - mmdet - INFO - Epoch [4][50/294]  lr: 5.000e-03, eta: 0:55:22, time: 0.522, data_time: 0.320, memory: 3131, loss_rpn_cls: 0.0248, loss_rpn_bbox: 0.0330, loss_cls: 0.0857, acc: 97.6621, loss_bbox: nan, loss: nan
2020-07-13 23:11:31,720 - mmdet - INFO - Epoch [4][100/294] lr: 5.000e-03, eta: 0:55:19, time: 0.465, data_time: 0.260, memory: 3131, loss_rpn_cls: 0.0370, loss_rpn_bbox: 0.0394, loss_cls: 0.0905, acc: 97.6387, loss_bbox: nan, loss: nan
2020-07-13 23:11:58,028 - mmdet - INFO - Epoch [4][150/294] lr: 5.000e-03, eta: 0:55:36, time: 0.526, data_time: 0.319, memory: 3131, loss_rpn_cls: 0.0193, loss_rpn_bbox: 0.0327, loss_cls: 0.0780, acc: 97.3633, loss_bbox: 0.0803, loss: 0.2102
2020-07-13 23:12:21,799 - mmdet - INFO - Epoch [4][200/294] lr: 5.000e-03, eta: 0:55:32, time: 0.475, data_time: 0.266, memory: 3131, loss_rpn_cls: 0.0242, loss_rpn_bbox: 0.0421, loss_cls: 0.1020, acc: 97.4258, loss_bbox: nan, loss: nan
2020-07-13 23:12:45,930 - mmdet - INFO - Epoch [4][250/294] lr: 5.000e-03, eta: 0:55:28, time: 0.483, data_time: 0.272, memory: 3131, loss_rpn_cls: 0.0260, loss_rpn_bbox: 0.0414, loss_cls: 0.0880, acc: 97.0234, loss_bbox: 0.0877, loss: 0.2432
2020-07-13 23:13:17,105 - mmdet - INFO - 
+-----------+-----+------+-----+-----+-----------+--------+-------+
| class     | gts | dets | tps | fps | precision | recall | ap    |
+-----------+-----+------+-----+-----+-----------+--------+-------+
| normal    | 65  | 36   | 16  | 20  | 0.444     | 0.246  | 0.162 |
| abnormal  | 50  | 52   | 34  | 18  | 0.654     | 0.680  | 0.513 |
+-----------+-----+------+-----+-----+-----------+--------+-------+
| mAP       |     |      |     |     |           |        | 0.337 |
+-----------+-----+------+-----+-----+-----------+--------+-------+
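
In the excerpt above, 5 of the 10 logged training intervals report loss_bbox: nan. To see how often this happens over a full run, the text log could be scanned with something like the sketch below. The function name nan_intervals is hypothetical, and the regular expression assumes the TextLoggerHook line format shown in the log above.

```python
import re

# Matches lines such as "Epoch [1][150/294] ... loss_bbox: nan, loss: nan".
# "loss_rpn_bbox" does not contain the substring "loss_bbox", so it is skipped.
LOG_LINE = re.compile(
    r'Epoch \[(?P<epoch>\d+)\]\[(?P<iter>\d+)/\d+\].*?'
    r'loss_bbox: (?P<loss_bbox>nan|[-+0-9.eE]+)')


def nan_intervals(log_text):
    """Return (epoch, iter) for every logged interval whose loss_bbox is nan."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match and match.group('loss_bbox') == 'nan':
            hits.append((int(match.group('epoch')), int(match.group('iter'))))
    return hits


# Two lines shortened from the log above, used only to exercise the parser.
sample_log = (
    "2020-07-13 23:03:53,722 - mmdet - INFO - Epoch [1][100/294] lr: 5.000e-03, "
    "loss_rpn_bbox: 0.1371, loss_cls: 0.1535, acc: 99.1562, "
    "loss_bbox: 0.0190, loss: 0.5334\n"
    "2020-07-13 23:04:17,373 - mmdet - INFO - Epoch [1][150/294] lr: 5.000e-03, "
    "loss_rpn_bbox: 0.0586, loss_cls: 0.0950, acc: 98.7441, "
    "loss_bbox: nan, loss: nan\n")

print(nan_intervals(sample_log))
```

Because each logged value is an average over the logging interval (50 iterations here), a single nan iteration is enough to make the whole interval show nan.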

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

open-mmlab / mmdetection

faster_rcnn_r50_fpn training one of my own datasets gets loss_bbox=nan problem #3315