open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Why is the performance of mask-rcnn-r50-fpn with SyncBN in mmdetection worse than in detectron2? #4874

Open UcanSee opened 3 years ago

UcanSee commented 3 years ago

Recently I have been running transfer experiments on self-supervised learning, using a self-supervised model as the pre-trained backbone of mask-rcnn-r50. Most papers conduct these transfer experiments in detectron2; when I conducted them in mmdetection, I found the performance is worse than in detectron2. Here is my environment: python 3.7, pytorch 1.6, torchvision 0.7, mmdetection 2.10.0, mmcv 1.2.7. The config file I used is here:

    # model settings
    model = dict(
        type='MaskRCNN',
        pretrained='torchvision://resnet50',
        backbone=dict(
            type='ResNet',
            depth=50,
            num_stages=4,
            out_indices=(0, 1, 2, 3),
            frozen_stages=-1,
            norm_cfg=dict(type='SyncBN', requires_grad=True),
            norm_eval=False,
            style='pytorch'),
        neck=dict(
            type='FPN',
            in_channels=[256, 512, 1024, 2048],
            out_channels=256,
            norm_cfg=dict(type='SyncBN', requires_grad=True),
            num_outs=5),
        rpn_head=dict(
            type='RPNHead',
            in_channels=256,
            feat_channels=256,
            anchor_generator=dict(
                type='AnchorGenerator',
                scales=[8],
                ratios=[0.5, 1.0, 2.0],
                strides=[4, 8, 16, 32, 64]),
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[.0, .0, .0, .0],
                target_stds=[1.0, 1.0, 1.0, 1.0]),
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
        roi_head=dict(
            type='StandardRoIHead',
            bbox_roi_extractor=dict(
                type='SingleRoIExtractor',
                roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
                out_channels=256,
                featmap_strides=[4, 8, 16, 32]),
            bbox_head=dict(
                type='ConvFCBBoxHead',
                num_shared_convs=4,
                num_shared_fcs=1,
                in_channels=256,
                fc_out_channels=1024,
                roi_feat_size=7,
                num_classes=80,
                norm_cfg=dict(type='SyncBN', requires_grad=True),
                bbox_coder=dict(
                    type='DeltaXYWHBBoxCoder',
                    target_means=[0., 0., 0., 0.],
                    target_stds=[0.1, 0.1, 0.2, 0.2]),
                reg_class_agnostic=False,
                loss_cls=dict(
                    type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
                loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
            mask_roi_extractor=dict(
                type='SingleRoIExtractor',
                roi_layer=dict(type='RoIAlign', output_size=14, sampling_ratio=0),
                out_channels=256,
                featmap_strides=[4, 8, 16, 32]),
            mask_head=dict(
                type='FCNMaskHead',
                num_convs=4,
                in_channels=256,
                conv_out_channels=256,
                num_classes=80,
                norm_cfg=dict(type='SyncBN', requires_grad=True),
                loss_mask=dict(
                    type='CrossEntropyLoss', use_mask=True, loss_weight=1.0))),
        # model training and testing settings
        train_cfg=dict(
            rpn=dict(
                assigner=dict(
                    type='MaxIoUAssigner',
                    pos_iou_thr=0.7,
                    neg_iou_thr=0.3,
                    min_pos_iou=0.3,
                    match_low_quality=True,
                    ignore_iof_thr=-1),
                sampler=dict(
                    type='RandomSampler',
                    num=256,
                    pos_fraction=0.5,
                    neg_pos_ub=-1,
                    add_gt_as_proposals=False),
                allowed_border=-1,
                pos_weight=-1,
                debug=False),
            rpn_proposal=dict(
                nms_pre=2000,
                max_per_img=1000,
                nms=dict(type='nms', iou_threshold=0.7),
                min_bbox_size=0),
            rcnn=dict(
                assigner=dict(
                    type='MaxIoUAssigner',
                    pos_iou_thr=0.5,
                    neg_iou_thr=0.5,
                    min_pos_iou=0.5,
                    match_low_quality=True,
                    ignore_iof_thr=-1),
                sampler=dict(
                    type='RandomSampler',
                    num=512,
                    pos_fraction=0.25,
                    neg_pos_ub=-1,
                    add_gt_as_proposals=True),
                mask_size=28,
                pos_weight=-1,
                debug=False)),
        test_cfg=dict(
            rpn=dict(
                nms_pre=1000,
                max_per_img=1000,
                nms=dict(type='nms', iou_threshold=0.7),
                min_bbox_size=0),
            rcnn=dict(
                score_thr=0.05,
                nms=dict(type='nms', iou_threshold=0.5),
                max_per_img=100,
                mask_thr_binary=0.5))
    )
    dataset_type = 'CocoDataset'
    data_root = 'data/coco/'
    img_norm_cfg = dict(
        mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
    train_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
        dict(type='Resize',
             img_scale=[(1333, 640), (1333, 672), (1333, 704), (1333, 736), (1333, 768), (1333, 800)],
             multiscale_mode='value',
             keep_ratio=True),
        dict(type='RandomFlip', flip_ratio=0.5),
        dict(type='Normalize', **img_norm_cfg),
        dict(type='Pad', size_divisor=32),
        dict(type='DefaultFormatBundle'),
        dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
    ]
    test_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(
            type='MultiScaleFlipAug',
            img_scale=(1333, 800),
            flip=False,
            transforms=[
                dict(type='Resize', keep_ratio=True),
                dict(type='RandomFlip'),
                dict(type='Normalize', **img_norm_cfg),
                dict(type='Pad', size_divisor=32),
                dict(type='ImageToTensor', keys=['img']),
                dict(type='Collect', keys=['img']),
            ])
    ]
    data = dict(
        samples_per_gpu=2,
        workers_per_gpu=2,
        train=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_train2017.json',
            img_prefix=data_root + 'train2017/',
            pipeline=train_pipeline),
        val=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_val2017.json',
            img_prefix=data_root + 'val2017/',
            pipeline=test_pipeline),
        test=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_val2017.json',
            img_prefix=data_root + 'val2017/',
            pipeline=test_pipeline))
    evaluation = dict(metric=['bbox', 'segm'])
    # optimizer
    optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
    optimizer_config = dict(grad_clip=None)
    # learning policy
    lr_config = dict(
        policy='step',
        warmup='linear',
        warmup_iters=1000,
        warmup_ratio=0.001,
        step=[8, 11])
    runner = dict(type='EpochBasedRunner', max_epochs=12)
    checkpoint_config = dict(interval=1)
    # yapf:disable
    log_config = dict(
        interval=50,
        hooks=[
            dict(type='TextLoggerHook'),
            # dict(type='TensorboardLoggerHook')
        ])
    # yapf:enable
    custom_hooks = [dict(type='NumClassCheckHook')]
    dist_params = dict(backend='nccl')
    log_level = 'INFO'
    load_from = None
    resume_from = None
    workflow = [('train', 1)]
I have conducted some experiments; here are the results:

1. Using the ImageNet pre-trained model without SyncBN:

| framework | Backbone | box AP | mask AP |
| --- | --- | --- | --- |
| mmdetection | R-50-FPN | 38.80 | 35.02 |
| detectron2 | R-50-FPN | 38.90 | 34.90 |

2. Using the ImageNet pre-trained model with SyncBN:

| framework | Backbone | box AP | mask AP |
| --- | --- | --- | --- |
| mmdetection | R-50-FPN | 38.90 | 35.40 |
| detectron2 | R-50-FPN | 39.80 | 36.02 |

3. Using the SwAV self-supervised pre-trained model with SyncBN (SwAV weights: https://dl.fbaipublicfiles.com/vissl/model_zoo/swav_in1k_rn50_800ep_swav_8node_resnet_27_07_20.a0a6b676/model_final_checkpoint_phase799.torch):

| framework | Backbone | box AP | mask AP |
| --- | --- | --- | --- |
| mmdetection | R-50-FPN | 40.60 | 37.00 |
| detectron2 | R-50-FPN | 41.84 | 37.88 |

4. Using the SwAV self-supervised pre-trained model with SyncBN, varying the number of GPUs:

| framework | num_gpus | Backbone | box AP | mask AP |
| --- | --- | --- | --- | --- |
| mmdetection | 8 | R-50-FPN | 40.60 | 37.00 |
| mmdetection | 2 | R-50-FPN | 37.10 | 34.50 |

We can see that without SyncBN the performance is consistent across the two frameworks, but with SyncBN the performance in mmdetection is always worse than in detectron2. In addition, when the number of GPUs changes, the performance in mmdetection seems unstable. I don't know how to resolve this; can you give me some suggestions?
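One detail that may matter for the GPU-count instability: with SyncBN the batch statistics are computed over the global batch, so changing the GPU count while keeping samples_per_gpu=2 also changes the batch each BN layer normalizes over. A minimal sketch of the arithmetic, assuming the samples_per_gpu=2 setting from the config above:

    # SyncBN reduces mean/var across GPUs, so its statistics are computed
    # over the global batch, not the per-GPU batch (samples_per_gpu=2 is
    # taken from the config above).
    samples_per_gpu = 2
    for num_gpus in (2, 8):
        stats_batch = samples_per_gpu * num_gpus
        print(f'{num_gpus} GPUs -> SyncBN statistics over {stats_batch} images')
    # 2 GPUs -> SyncBN statistics over 4 images  (noisier statistics)
    # 8 GPUs -> SyncBN statistics over 16 images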

ZwwWayne commented 3 years ago

These two repos use different SyncBN implementations. You can try to 1. use MMSyncBN rather than SyncBN, which is implemented in MMCV, or 2. migrate the NaiveSyncBatchNorm implemented in Detectron2.
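For reference, the first option is a config-level change; a minimal sketch, assuming the 'MMSyncBN' type that MMCV registers, shown as a partial override of the config above:

    # Use MMCV's MMSyncBN instead of torch.nn.SyncBatchNorm everywhere
    # the config above sets a SyncBN norm_cfg.
    norm_cfg = dict(type='MMSyncBN', requires_grad=True)
    model = dict(
        backbone=dict(norm_cfg=norm_cfg),
        neck=dict(norm_cfg=norm_cfg),
        roi_head=dict(
            bbox_head=dict(norm_cfg=norm_cfg),
            mask_head=dict(norm_cfg=norm_cfg)))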

UcanSee commented 3 years ago
I have already tried both MMSyncBN and NaiveSyncBatchNorm in mmdetection, but it did not help. In addition, I found that when I train mask-rcnn-r50 with 16 GPUs rather than 8, setting samples_per_gpu=1 so the total batch size is unchanged, the performance of the ImageNet pre-trained and SwAV pre-trained models in mmdetection (experiments 2 and 3 above) catches up to detectron2. Does that mean there is something wrong with DDP in mmdetection? Here are the results going from 8 to 16 GPUs:

1. Using the ImageNet pre-trained model with SyncBN:

| framework | Backbone | num-gpus | box AP | mask AP |
| --- | --- | --- | --- | --- |
| mmdetection | R-50-FPN | 8 | 38.90 | 35.40 |
| mmdetection | R-50-FPN | 16 | 40.20 | 36.00 |
| detectron2 | R-50-FPN | 8 | 39.80 | 36.02 |

2. Using the SwAV self-supervised pre-trained model with SyncBN:

| framework | Backbone | num-gpus | box AP | mask AP |
| --- | --- | --- | --- | --- |
| mmdetection | R-50-FPN | 8 | 40.60 | 37.00 |
| mmdetection | R-50-FPN | 16 | 41.70 | 37.60 |
| detectron2 | R-50-FPN | 8 | 41.84 | 37.88 |
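For readers who want to try the second of ZwwWayne's suggestions, "migrating NaiveSyncBatchNorm" amounts to porting something like the following; this is a condensed sketch of detectron2's implementation (the real version also handles extra stats modes and empty batches), and it assumes torch.distributed has been initialized:

    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.autograd.function import Function

    class AllReduce(Function):
        """All-reduce with a backward pass, so gradients flow through the
        cross-GPU statistics (plain dist.all_reduce is not differentiable)."""

        @staticmethod
        def forward(ctx, input):
            input_list = [torch.zeros_like(input) for _ in range(dist.get_world_size())]
            dist.all_gather(input_list, input, async_op=False)
            return torch.sum(torch.stack(input_list, dim=0), dim=0)

        @staticmethod
        def backward(ctx, grad_output):
            dist.all_reduce(grad_output, async_op=False)
            return grad_output

    class NaiveSyncBatchNorm(nn.BatchNorm2d):
        def forward(self, input):
            # Fall back to plain BN at test time or on a single GPU.
            if not self.training or dist.get_world_size() == 1:
                return super().forward(input)

            C = input.shape[1]
            mean = torch.mean(input, dim=[0, 2, 3])
            meansqr = torch.mean(input * input, dim=[0, 2, 3])

            # Average the per-GPU statistics. This assumes every GPU sees
            # the same number of images per iteration.
            vec = torch.cat([mean, meansqr], dim=0)
            vec = AllReduce.apply(vec) * (1.0 / dist.get_world_size())
            mean, meansqr = torch.split(vec, C)

            var = meansqr - mean * mean
            invstd = torch.rsqrt(var + self.eps)
            scale = self.weight * invstd
            bias = self.bias - mean * scale

            # Update running statistics the way nn.BatchNorm2d does.
            self.running_mean += self.momentum * (mean.detach() - self.running_mean)
            self.running_var += self.momentum * (var.detach() - self.running_var)

            return input * scale.reshape(1, -1, 1, 1) + bias.reshape(1, -1, 1, 1)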
JrPeng commented 3 years ago

+1, I have met the same issue. I strongly recommend that mmdetection provide detailed benchmarks on SyncBN/MMSyncBN, since SyncBN is widely used in many situations, especially in large-scale object detection such as OpenImages.

JrPeng commented 3 years ago

@UcanSee

Hi, would you mind showing your results of NaiveSyncBatchNorm in mmdetection?

JrPeng commented 3 years ago

@ZwwWayne Hi, would you mind explaining the difference between MMSyncBN and PyTorch's SyncBN?
Besides, what is the recommended group size for synchronization? 8/16/32/64? TY.
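For context on the group-size question: PyTorch's SyncBatchNorm can synchronize statistics within a sub-group of ranks rather than across the whole world. A minimal sketch using the standard torch.distributed API (the group size of 8 and the toy model are only illustrative choices, not recommendations from this thread):

    import torch.distributed as dist
    from torch import nn

    # Partition the ranks into groups of 8; note every process must create
    # all groups, even the ones it does not belong to.
    world_size = dist.get_world_size()
    group_size = 8
    groups = [
        dist.new_group(list(range(start, start + group_size)))
        for start in range(0, world_size, group_size)
    ]
    my_group = groups[dist.get_rank() // group_size]

    # Convert BN layers so statistics are synchronized only inside my_group.
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))  # toy model
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model, process_group=my_group)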

UcanSee commented 3 years ago

> @UcanSee
>
> Hi, would you mind showing your results of NaiveSyncBatchNorm in mmdetection?

There was no difference in my results between NaiveSyncBatchNorm and nn.SyncBN.

JrPeng commented 3 years ago

Hi, @ZwwWayne, any response? TY.