open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

faster_rcnn_r50_fpn training one of my own datasets gets loss_bbox=nan problem #3315

Closed breAchyz closed 3 years ago

breAchyz commented 4 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but could not get the expected help. I added grad_clip, but it did not work.
  2. I have checked the dataset, and no bbox target extends beyond the image bounds.
  3. I also applied 'exp' warmup, but that did not work either.
  4. The problem does not appear when I train on another dataset of mine.
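For readers following along, the two mitigations named in the checklist are usually expressed as config overrides like the sketch below. The numeric values (`max_norm=35`, `warmup_iters=500`, `warmup_ratio=0.001`) are assumptions borrowed from common MMDetection configs, not the values actually tried in this report.

```python
# Hypothetical override fragment; values are illustrative assumptions.

# Clip the global gradient norm before each optimizer step.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))

# Step schedule with an exponential warmup phase at the start of training.
lr_config = dict(
    policy='step',
    warmup='exp',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[30])
```

Neither setting can help if the regression targets themselves are already nan before the loss is computed, which may be why they had no effect here.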

Describe the bug

When I train with a config file modified from faster_rcnn_r50_fpn.py, loss_bbox becomes nan in most of the logged iterations.

Reproduction

  1. What command or script did you run?

    model = dict(
    type='FasterRCNN',
    pretrained='torchvision://resnet50',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch'),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    roi_head=dict(
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', out_size=7, sample_num=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=4,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0))))
    train_cfg = dict(
    rpn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.7,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            match_low_quality=True,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=256,
            pos_fraction=0.5,
            neg_pos_ub=-1,
            add_gt_as_proposals=False),
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    rpn_proposal=dict(
        nms_across_levels=False,
        nms_pre=2000,
        nms_post=1000,
        max_num=1000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.5,
            match_low_quality=False,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=512,
            pos_fraction=0.25,
            neg_pos_ub=-1,
            add_gt_as_proposals=True),
        pos_weight=-1,
        debug=False))
    test_cfg = dict(
    rpn=dict(
        nms_across_levels=False,
        nms_pre=1000,
        nms_post=1000,
        max_num=1000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.5, nms=dict(type='nms', iou_thr=0.5), max_per_img=100))
    img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
    train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1000, 600), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
    ]
    test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1000, 600),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
    ]
    classes = ('normal', 'abnormal')
    data_root = '/home/ding/yz/Image_Recognize/mmdetection/data'
    data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='RepeatDataset',
        classes=('normal', 'abnormal'),
        times=3,
        dataset=dict(
            type='VOCDataset',
            ann_file=[
                '/home/ding/yz/Image_Recognize/mmdetection/data/mine/ImageSets/Main/train_cv1.txt'
            ],
            img_prefix=[
                '/home/ding/yz/Image_Recognize/mmdetection/data/mine/'
            ],
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True),
                dict(type='Resize', img_scale=(1000, 600), keep_ratio=True),
                dict(type='RandomFlip', flip_ratio=0.5),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(type='DefaultFormatBundle'),
                dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
            ])),
    val=dict(
        type='VOCDataset',
        classes=('normal', 'abnormal'),
        ann_file=
        '/home/ding/yz/Image_Recognize/mmdetection/data/mine/ImageSets/Main/val_cv1.txt',
        img_prefix='/home/ding/yz/Image_Recognize/mmdetection/data/mine/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1000, 600),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='VOCDataset',
        classes=('normal', 'abnormal'),
        ann_file=
        '/home/ding/yz/Image_Recognize/mmdetection/data/mine/ImageSets/Main/val_cv1.txt',
        img_prefix='/home/ding/yz/Image_Recognize/mmdetection/data/mine/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1000, 600),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
    evaluation = dict(interval=1, metric='mAP')
    checkpoint_config = dict(interval=10)
    log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
    dist_params = dict(backend='nccl')
    log_level = 'INFO'
    load_from = None
    resume_from = None
    workflow = [('train', 1)]
    optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001)
    optimizer_config = dict(grad_clip=None)
    lr_config = dict(policy='step', step=[30])
    total_epochs = 30
    work_dir = 'results/mine/lr/0.005/cv_0'
    gpu_ids = range(0, 1)
  2. Did you make any modifications on the code or config? Did you understand what you have modified? Yes, I changed lr and lr_config. For the dataset, I modified CLASSES of VOCDataset and set num_classes=2.

  3. What dataset did you use? VOCDatasets cloth

Environment

  4. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.

    sys.platform: linux
    Python: 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0]
    CUDA available: True
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 9.2, V9.2.148
    GPU 0: GeForce GTX 1080 Ti
    GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
    PyTorch: 1.4.0
    PyTorch compiling details: PyTorch built with:

    • GCC 7.3
    • Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
    • Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
    • OpenMP 201511 (a.k.a. OpenMP 4.5)
    • NNPACK is enabled
    • CUDA Runtime 9.2
    • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_37,code=compute_37
    • CuDNN 7.6.3
    • Magma 2.5.1
    • Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

    TorchVision: 0.5.0
    OpenCV: 4.2.0
    MMCV: 0.6.2
    MMDetection: 2.2.0+741b638
    MMDetection Compiler: GCC 5.4
    MMDetection CUDA Compiler: 9.2

  1. Additional information that may be helpful for locating the problem: I installed PyTorch by conda.

Error traceback

If applicable, paste the error traceback here.

2020-07-13 23:03:03,140 - mmdet - INFO - workflow: [('train', 1)], max: 30 epochs
2020-07-13 23:03:29,667 - mmdet - INFO - Epoch [1][50/294]  lr: 5.000e-03, eta: 1:17:23, time: 0.529, data_time: 0.318, memory: 2240, loss_rpn_cls: 0.2708, loss_rpn_bbox: 0.0560, loss_cls: 0.2029, acc: 97.4023, loss_bbox: 0.0433, loss: 0.5730
2020-07-13 23:03:53,722 - mmdet - INFO - Epoch [1][100/294] lr: 5.000e-03, eta: 1:13:25, time: 0.481, data_time: 0.272, memory: 2408, loss_rpn_cls: 0.2238, loss_rpn_bbox: 0.1371, loss_cls: 0.1535, acc: 99.1562, loss_bbox: 0.0190, loss: 0.5334
2020-07-13 23:04:17,373 - mmdet - INFO - Epoch [1][150/294] lr: 5.000e-03, eta: 1:11:27, time: 0.473, data_time: 0.250, memory: 3131, loss_rpn_cls: 0.1667, loss_rpn_bbox: 0.0586, loss_cls: 0.0950, acc: 98.7441, loss_bbox: nan, loss: nan
2020-07-13 23:04:41,303 - mmdet - INFO - Epoch [1][200/294] lr: 5.000e-03, eta: 1:10:28, time: 0.479, data_time: 0.249, memory: 3131, loss_rpn_cls: 0.1298, loss_rpn_bbox: 0.0416, loss_cls: 0.0810, acc: 98.6660, loss_bbox: 0.0398, loss: 0.2921
2020-07-13 23:05:04,626 - mmdet - INFO - Epoch [1][250/294] lr: 5.000e-03, eta: 1:09:22, time: 0.467, data_time: 0.246, memory: 3131, loss_rpn_cls: 0.1163, loss_rpn_bbox: 0.0428, loss_cls: 0.0922, acc: 98.5000, loss_bbox: nan, loss: nan
2020-07-13 23:05:35,574 - mmdet - INFO - 
+-----------+-----+------+-----+-----+-----------+--------+-------+
| class     | gts | dets | tps | fps | precision | recall | ap    |
+-----------+-----+------+-----+-----+-----------+--------+-------+
| normal    | 65  | 0    | 0   | 0   | 0.000     | 0.000  | 0.000 |
| abnormal  | 50  | 0    | 0   | 0   | 0.000     | 0.000  | 0.000 |
+-----------+-----+------+-----+-----+-----------+--------+-------+
| mAP       |     |      |     |     |           |        | 0.000 |
+-----------+-----+------+-----+-----+-----------+--------+-------+
2020-07-13 23:10:42,334 - mmdet - INFO - Epoch [3][294/294] lr: 5.000e-03, mAP: 0.2230
2020-07-13 23:11:08,469 - mmdet - INFO - Epoch [4][50/294]  lr: 5.000e-03, eta: 0:55:22, time: 0.522, data_time: 0.320, memory: 3131, loss_rpn_cls: 0.0248, loss_rpn_bbox: 0.0330, loss_cls: 0.0857, acc: 97.6621, loss_bbox: nan, loss: nan
2020-07-13 23:11:31,720 - mmdet - INFO - Epoch [4][100/294] lr: 5.000e-03, eta: 0:55:19, time: 0.465, data_time: 0.260, memory: 3131, loss_rpn_cls: 0.0370, loss_rpn_bbox: 0.0394, loss_cls: 0.0905, acc: 97.6387, loss_bbox: nan, loss: nan
2020-07-13 23:11:58,028 - mmdet - INFO - Epoch [4][150/294] lr: 5.000e-03, eta: 0:55:36, time: 0.526, data_time: 0.319, memory: 3131, loss_rpn_cls: 0.0193, loss_rpn_bbox: 0.0327, loss_cls: 0.0780, acc: 97.3633, loss_bbox: 0.0803, loss: 0.2102
2020-07-13 23:12:21,799 - mmdet - INFO - Epoch [4][200/294] lr: 5.000e-03, eta: 0:55:32, time: 0.475, data_time: 0.266, memory: 3131, loss_rpn_cls: 0.0242, loss_rpn_bbox: 0.0421, loss_cls: 0.1020, acc: 97.4258, loss_bbox: nan, loss: nan
2020-07-13 23:12:45,930 - mmdet - INFO - Epoch [4][250/294] lr: 5.000e-03, eta: 0:55:28, time: 0.483, data_time: 0.272, memory: 3131, loss_rpn_cls: 0.0260, loss_rpn_bbox: 0.0414, loss_cls: 0.0880, acc: 97.0234, loss_bbox: 0.0877, loss: 0.2432
2020-07-13 23:13:17,105 - mmdet - INFO - 
+-----------+-----+------+-----+-----+-----------+--------+-------+
| class     | gts | dets | tps | fps | precision | recall | ap    |
+-----------+-----+------+-----+-----+-----------+--------+-------+
| normal    | 65  | 36   | 16  | 20  | 0.444     | 0.246  | 0.162 |
| abnormal  | 50  | 52   | 34  | 18  | 0.654     | 0.680  | 0.513 |
+-----------+-----+------+-----+-----+-----------+--------+-------+
| mAP       |     |      |     |     |           |        | 0.337 |
+-----------+-----+------+-----+-----+-----------+--------+-------+

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

breAchyz commented 4 years ago

I changed loss_bbox to GIoULoss, and that solved my problem. However, the reason loss_bbox becomes nan when it is set to L1Loss is still unclear. If anyone has any ideas, please leave a comment. Thanks.
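A minimal sketch of what such a change might look like as a config override follows. The comment does not say which head was changed or which loss weight was used, so the choice of the RoI bbox head, `loss_weight=10.0`, and the `reg_decoded_bbox=True` flag are all assumptions modelled on the GIoU Faster R-CNN config shipped with later MMDetection 2.x releases. Whether version 2.2.0 accepts `reg_decoded_bbox` on the bbox head has not been verified here.

```python
# Hypothetical override fragment, not the poster's actual config.
model = dict(
    roi_head=dict(
        bbox_head=dict(
            # Assumption: IoU-based losses compare decoded corner boxes,
            # so the head is asked to decode predictions before the loss.
            reg_decoded_bbox=True,
            # Assumption: weight taken from the later GIoU example config.
            loss_bbox=dict(type='GIoULoss', loss_weight=10.0))))
```

Note that a change like this may only hide a data problem rather than fix it, so checking the annotations is still worthwhile.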

ZwwWayne commented 3 years ago

I have only encountered this problem when some of the ground-truth boxes have zero width or height. You may also want to check your data for zero-sized boxes.
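The check suggested above could be sketched as follows for VOC-style XML annotations. The function names and the `Annotations` directory in the usage comment are hypothetical, and the script assumes each `object` element carries a `bndbox` with `xmin`, `ymin`, `xmax`, `ymax`, as in the standard VOC layout. As general background, delta-style box encodings take ratios and logarithms of box widths and heights, which are undefined for a zero size, so this is a plausible (but here unconfirmed) source of nan targets.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_degenerate_boxes(xml_path):
    """Return (name, xmin, ymin, xmax, ymax) for boxes with non-positive width or height."""
    root = ET.parse(str(xml_path)).getroot()
    bad = []
    for obj in root.iter('object'):
        box = obj.find('bndbox')
        if box is None:
            continue
        xmin, ymin, xmax, ymax = (
            float(box.findtext(tag)) for tag in ('xmin', 'ymin', 'xmax', 'ymax'))
        # Zero-sized and inverted boxes are both reported.
        if xmax - xmin <= 0 or ymax - ymin <= 0:
            bad.append((obj.findtext('name'), xmin, ymin, xmax, ymax))
    return bad


def scan_annotations(ann_dir):
    """Map each annotation file name to its list of degenerate boxes."""
    report = {}
    for xml_file in sorted(Path(ann_dir).glob('*.xml')):
        bad = find_degenerate_boxes(xml_file)
        if bad:
            report[xml_file.name] = bad
    return report


# Example usage (the directory name is an assumption):
# for name, boxes in scan_annotations('data/mine/Annotations').items():
#     print(name, boxes)
```

Any file this reports can then be fixed or dropped from the training split before retraining with the original L1Loss setting.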