open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

NoneType loss_cls and/or loss_reg #1460

Closed AloshkaD closed 4 years ago

AloshkaD commented 5 years ago

**Describe the bug**
I'm training mmdetection on data with large features that sometimes occupy 80% of the image. At some point during training the network throws the error below (see Error traceback).

This error is likely caused by the anchor_strides values in the config: when I increase the anchor_strides values the error happens more often, which suggests that more images/masks become problematic. I've tested many stride values, but that didn't help. Changing the learning rate didn't help either. I'm using the learning rate recommended for my number of GPUs and images per GPU.
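One way to probe the anchor_strides hypothesis is to count, per pyramid level, how many anchors fit entirely inside an image when `allowed_border=0`. The sketch below is a simplified geometry check and not mmdetection's own anchor generator: the function name `count_inside_anchors`, the centre placement at `stride // 2`, and the 400x800 image size are assumptions chosen for illustration (the size is taken from the lower bound of `img_scale` in the config).

```python
import math


def count_inside_anchors(img_h, img_w, stride, scale, ratios, allowed_border=0):
    """Count anchors that lie fully inside the image.

    Simplified illustration (hypothetical helper): one anchor per ratio at
    every grid cell centre, with side length stride * scale.
    """
    size = stride * scale
    count = 0
    for ratio in ratios:  # ratio is interpreted as height / width
        h = size * math.sqrt(ratio)
        w = size / math.sqrt(ratio)
        for cy in range(stride // 2, img_h, stride):
            for cx in range(stride // 2, img_w, stride):
                x1, y1 = cx - w / 2, cy - h / 2
                x2, y2 = cx + w / 2, cy + h / 2
                inside = (x1 >= -allowed_border and y1 >= -allowed_border
                          and x2 < img_w + allowed_border
                          and y2 < img_h + allowed_border)
                if inside:
                    count += 1
    return count


# Values from the config: anchor_scales=[8], anchor_ratios=[0.5, 1.0, 2.0].
for stride in [4, 8, 16, 32, 64]:
    n = count_inside_anchors(400, 800, stride, scale=8, ratios=[0.5, 1.0, 2.0])
    print(f"stride={stride:2d} anchors fully inside: {n}")
```

Under these assumptions the coarsest level (stride 64, anchor size 512) contributes no anchors to a 400x800 image, so raising the strides shrinks the pool of usable anchors. Whether that alone leaves an image with no valid anchors depends on how the levels are combined in the actual code.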
**Reproduction**

  1. What command or script did you run?
    !CUDA_VISIBLE_DEVICES=0,1,2,3  python .../mmdetection/tools/train.py {config_fname}
  2. Did you make any modifications on the code or config? Did you understand what you have modified?
     Yes, I made some minor changes to the config, and I understand what I changed. Here is the config file:
    
    model = dict(
    type='HybridTaskCascade',
    num_stages=3,
    pretrained=None,
    interleaved=True,
    mask_info_flow=True,
    backbone=dict(
        type='ResNeXt',
        depth=101,
        groups=64,
        base_width=4,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        style='pytorch',
        dcn=dict(
            modulated=False,
            groups=64,
            deformable_groups=1,
            fallback_on_stride=False),
        stage_with_dcn=(False, True, True, True)),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_scales=[8],
        anchor_ratios=[0.5, 1.0, 2.0],
        anchor_strides=[4, 8, 16, 32, 64],
        target_means=[.0, .0, .0, .0],
        target_stds=[1.0, 1.0, 1.0, 1.0],
        use_sigmoid_cls=True),
    bbox_roi_extractor=dict(
        type='SingleRoIExtractor',
        roi_layer=dict(type='RoIAlign', out_size=7, sample_num=2),
        out_channels=256,
        featmap_strides=[4,8,16,32]),
    bbox_head=[
        dict(
            type='SharedFCBBoxHead',
            num_fcs=2,
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=185,
            target_means=[0., 0., 0., 0.],
            target_stds=[0.1, 0.1, 0.2, 0.2],
            reg_class_agnostic=True),
        dict(
            type='SharedFCBBoxHead',
            num_fcs=2,
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=185,
            target_means=[0., 0., 0., 0.],
            target_stds=[0.05, 0.05, 0.1, 0.1],
            reg_class_agnostic=True),
        dict(
            type='SharedFCBBoxHead',
            num_fcs=2,
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=185,
            target_means=[0., 0., 0., 0.],
            target_stds=[0.033, 0.033, 0.067, 0.067],
            reg_class_agnostic=True)
    ],
    mask_roi_extractor=dict(
        type='SingleRoIExtractor',
        roi_layer=dict(type='RoIAlign', out_size=14, sample_num=2),
        out_channels=256,
        featmap_strides=[4, 8, 16, 32]),
    mask_head=dict(
        type='HTCMaskHead',
        num_convs=4,
        in_channels=256,
        conv_out_channels=256,
        num_classes=185))
    # model training and testing settings
    train_cfg = dict(
    rpn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.7,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=256,
            pos_fraction=0.5,
            neg_pos_ub=-1,
            add_gt_as_proposals=False),
        allowed_border=0,
        pos_weight=-1,
        smoothl1_beta=1 / 9.0,
        debug=False),
    rpn_proposal=dict(
        nms_across_levels=False,
        nms_pre=2000,
        nms_post=2000,
        max_num=2000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=[
        dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            mask_size=28,
            pos_weight=-1,
            debug=False),
        dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.6,
                neg_iou_thr=0.6,
                min_pos_iou=0.6,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            mask_size=28,
            pos_weight=-1,
            debug=False),
        dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.7,
                min_pos_iou=0.7,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            mask_size=28,
            pos_weight=-1,
            debug=False)
    ],
    stage_loss_weights=[1, 0.5, 0.25])
    test_cfg = dict(
    rpn=dict(
        nms_across_levels=False,
        nms_pre=1000,
        nms_post=1000,
        max_num=1000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.5,
        nms=dict(type='nms', iou_thr=0.3),
        max_per_img=100,
        mask_thr_binary=0.45),
    keep_all_stages=False)
    # dataset settings
    dataset_type = 'CustomDataset'
    data_root = '.../images/'
    annotation_root = '..../data_a/'
    img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
    data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=annotation_root + 'train_mmdetection_with_parcel.pkl',
        img_prefix=data_root,
        img_scale=[(400, 800), (700, 900)],
        multiscale_mode='range',
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0.5,
        with_mask=True,
        with_crowd=False,
        with_label=True,
        img_zoom=True,
        extra_aug=dict(
            type='Compose',
            transforms=[
                dict(
                    p=0.5,
                    max_h_size=64,
                    type='Cutout'
                ),
                dict(
                    brightness_limit=0.3,
                    contrast_limit=0.3,
                    p=0.5,
                    type='RandomBrightnessContrast'
                ),
                dict(
                    p=0.5,
                    quality_lower=80,
                    quality_upper=99,
                    type='JpegCompression'
                ),
            ],
            p=1.0
        )
    ),
    val=dict(
        type=dataset_type,
        ann_file=annotation_root + 'val_mmdetection_with_parcel.pkl',
        img_prefix=data_root,
        img_scale=(700, 900),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=True,
        with_crowd=False,
        with_label=True,
        img_zoom=True
        ),
    test=dict(
        type=dataset_type,
        ann_file=annotation_root + 'test_mmdetection_with_parcel.pkl',
        img_prefix=data_root,
        img_scale=(700, 900),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=1.0,
        with_mask=True,
        with_label=False,
        test_mode=True,
        img_zoom=True
        ))
    # optimizer
    optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
    optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
    # learning policy
    lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    step=[10, 18])
    checkpoint_config = dict(interval=1)
    log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
    ])
    # runtime settings
    total_epochs = 20
    dist_params = dict(backend='nccl')
    log_level = 'INFO'

    work_dir = '.../work_dir/rooftop/zoom'
    load_from = '.../epoch_1.pth'
    resume_from = None
    workflow = [('train', 1)]

  3. What dataset did you use?
     School research dataset.

**Environment**
 - OS: [Ubuntu 18.03]
 - GCC: [5.4.0]
 - PyTorch version: [1.1.0]
 - How you installed PyTorch: [conda]
 - GPU model: [1080Ti and 2080Ti]
 - CUDA: 10.01

**Error traceback**

    File ".../mmdetection/tools/train.py", line 95, in <module>
      main()
    File ".../mmdetection/tools/train.py", line 91, in main
      logger=logger)
    File ".../mmdetection/mmdet/apis/train.py", line 61, in train_detector
      _non_dist_train(model, dataset, cfg, validate=validate)
    File ".../mmdetection/mmdet/apis/train.py", line 197, in _non_dist_train
      runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
    File "/home/a/anaconda3/envs/torch/lib/python3.7/site-packages/mmcv/runner/runner.py", line 358, in run
      epoch_runner(data_loaders[i], **kwargs)
    File "/home/a/anaconda3/envs/torch/lib/python3.7/site-packages/mmcv/runner/runner.py", line 264, in train
      self.model, data_batch, train_mode=True, **kwargs)
    File ".../mmdetection/mmdet/apis/train.py", line 39, in batch_processor
      losses = model(**data)
    File "/home/a/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/a/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
      outputs = self.parallel_apply(replicas, inputs, kwargs)
    File "/home/a/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
      return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    File "/home/a/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
      raise output
    File "/home/a/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
      output = module(*input, **kwargs)
    File "/home/a/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
      result = self.forward(*input, **kwargs)
    File ".../mmdetection/mmdet/models/detectors/base.py", line 84, in forward
      return self.forward_train(img, img_meta, **kwargs)
    File ".../mmdetection/mmdet/models/detectors/htc.py", line 177, in forward_train
      *rpn_loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
    File ".../mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 58, in loss
      loss_rpn_cls=losses['loss_cls'], loss_rpn_reg=losses['loss_reg'])
    TypeError: 'NoneType' object is not subscriptable
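The last frame of the traceback indexes `losses` with `'loss_cls'`, so the message implies that `losses` itself was `None` at that point. The mechanism can be reproduced in isolation with a minimal sketch; `base_loss` and `rpn_loss` are hypothetical stand-ins for illustration, not the mmdetection functions.

```python
def base_loss(targets):
    # Hypothetical stand-in for a parent loss() that gives up and
    # returns None when it has no targets to work with.
    if targets is None:
        return None
    return {'loss_cls': 0.0, 'loss_reg': 0.0}


def rpn_loss(targets):
    # Mirrors the shape of the failing line: the result is subscripted
    # without first checking whether it is None.
    losses = base_loss(targets)
    return dict(loss_rpn_cls=losses['loss_cls'],
                loss_rpn_reg=losses['loss_reg'])


try:
    rpn_loss(None)
except TypeError as exc:
    message = str(exc)
    print(message)
```

This suggests the thing to investigate is why the upstream loss computation returned nothing for a particular batch, rather than the subscript itself.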



**Bug fix**
Not fixed yet. 

Thanks for your help!

hellock commented 4 years ago

Related to https://github.com/open-mmlab/mmdetection/blob/master/mmdet/models/anchor_heads/anchor_head.py#L184-L185
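The linked lines on master may have moved since this comment was written. Assuming they refer to an early `return None` in the anchor head's loss when no anchor targets can be computed for a batch, a caller could guard against that case and report which images were involved. The following is only a sketch under that assumption: `checked_rpn_losses` is a hypothetical helper, and the `'filename'` key in the image metadata is assumed rather than taken from the issue.

```python
def checked_rpn_losses(losses, img_metas):
    """Hypothetical guard around the RPN loss dictionary.

    Raises a readable error instead of subscripting None, so the
    offending batch can be identified and inspected.
    """
    if losses is None:
        # 'filename' is an assumed metadata key; fall back if it is absent.
        names = [meta.get('filename', '<unknown>') for meta in img_metas]
        raise ValueError(
            f'No anchor targets could be computed for batch: {names}')
    return dict(loss_rpn_cls=losses['loss_cls'],
                loss_rpn_reg=losses['loss_reg'])


# Usage: a normal loss dict passes through, a None result is reported.
ok = checked_rpn_losses({'loss_cls': 0.3, 'loss_reg': 0.1}, [{}])
try:
    checked_rpn_losses(None, [{'filename': 'example.jpg'}])
except ValueError as exc:
    guard_message = str(exc)
    print(guard_message)
```

Failing loudly like this does not fix the underlying data or anchor problem, but it turns the opaque `TypeError` into a pointer at the images worth checking.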

hellock commented 4 years ago

Feel free to reopen it if you have any further questions.

RitchieHuang11 commented 4 years ago

> Related to https://github.com/open-mmlab/mmdetection/blob/master/mmdet/models/anchor_heads/anchor_head.py#L184-L185

I encountered this error too, but I don't understand your reply.

yuyijie1995 commented 4 years ago

@AloshkaD I want to add the Cutout augmentation method to mmdetection, but it doesn't work well. Could you show me your code?