open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Why is the performance of mask-rcnn-r50-fpn with SyncBN in mmdetection worse than in detectron2? #4874

Open UcanSee opened 3 years ago

UcanSee commented 3 years ago

Recently I have been running transfer experiments on self-supervised learning, using a self-supervised model as the pre-trained backbone of mask-rcnn-r50. Most papers conduct these transfer experiments in detectron2; when I conducted them in mmdetection, I found the performance is worse than in detectron2. Here is my environment: python 3.7, pytorch 1.6, torchvision 0.7, mmdetection 2.10.0, mmcv 1.2.7. The config file I used is here:

    # model settings
    model = dict(
        type='MaskRCNN',
        pretrained='torchvision://resnet50',
        backbone=dict(
            type='ResNet',
            depth=50,
            num_stages=4,
            out_indices=(0, 1, 2, 3),
            frozen_stages=-1,
            norm_cfg=dict(type='SyncBN', requires_grad=True),
            norm_eval=False,
            style='pytorch'),
        neck=dict(
            type='FPN',
            in_channels=[256, 512, 1024, 2048],
            out_channels=256,
            norm_cfg=dict(type='SyncBN', requires_grad=True),
            num_outs=5),
        rpn_head=dict(
            type='RPNHead',
            in_channels=256,
            feat_channels=256,
            anchor_generator=dict(
                type='AnchorGenerator',
                scales=[8],
                ratios=[0.5, 1.0, 2.0],
                strides=[4, 8, 16, 32, 64]),
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[.0, .0, .0, .0],
                target_stds=[1.0, 1.0, 1.0, 1.0]),
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
        roi_head=dict(
            type='StandardRoIHead',
            bbox_roi_extractor=dict(
                type='SingleRoIExtractor',
                roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
                out_channels=256,
                featmap_strides=[4, 8, 16, 32]),
            bbox_head=dict(
                type='ConvFCBBoxHead',
                num_shared_convs=4,
                num_shared_fcs=1,
                in_channels=256,
                fc_out_channels=1024,
                roi_feat_size=7,
                num_classes=80,
                norm_cfg=dict(type='SyncBN', requires_grad=True),
                bbox_coder=dict(
                    type='DeltaXYWHBBoxCoder',
                    target_means=[0., 0., 0., 0.],
                    target_stds=[0.1, 0.1, 0.2, 0.2]),
                reg_class_agnostic=False,
                loss_cls=dict(
                    type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
                loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
            mask_roi_extractor=dict(
                type='SingleRoIExtractor',
                roi_layer=dict(type='RoIAlign', output_size=14, sampling_ratio=0),
                out_channels=256,
                featmap_strides=[4, 8, 16, 32]),
            mask_head=dict(
                type='FCNMaskHead',
                num_convs=4,
                in_channels=256,
                conv_out_channels=256,
                num_classes=80,
                norm_cfg=dict(type='SyncBN', requires_grad=True),
                loss_mask=dict(
                    type='CrossEntropyLoss', use_mask=True, loss_weight=1.0))),
        # model training and testing settings
        train_cfg=dict(
            rpn=dict(
                assigner=dict(
                    type='MaxIoUAssigner',
                    pos_iou_thr=0.7,
                    neg_iou_thr=0.3,
                    min_pos_iou=0.3,
                    match_low_quality=True,
                    ignore_iof_thr=-1),
                sampler=dict(
                    type='RandomSampler',
                    num=256,
                    pos_fraction=0.5,
                    neg_pos_ub=-1,
                    add_gt_as_proposals=False),
                allowed_border=-1,
                pos_weight=-1,
                debug=False),
            rpn_proposal=dict(
                nms_pre=2000,
                max_per_img=1000,
                nms=dict(type='nms', iou_threshold=0.7),
                min_bbox_size=0),
            rcnn=dict(
                assigner=dict(
                    type='MaxIoUAssigner',
                    pos_iou_thr=0.5,
                    neg_iou_thr=0.5,
                    min_pos_iou=0.5,
                    match_low_quality=True,
                    ignore_iof_thr=-1),
                sampler=dict(
                    type='RandomSampler',
                    num=512,
                    pos_fraction=0.25,
                    neg_pos_ub=-1,
                    add_gt_as_proposals=True),
                mask_size=28,
                pos_weight=-1,
                debug=False)),
        test_cfg=dict(
            rpn=dict(
                nms_pre=1000,
                max_per_img=1000,
                nms=dict(type='nms', iou_threshold=0.7),
                min_bbox_size=0),
            rcnn=dict(
                score_thr=0.05,
                nms=dict(type='nms', iou_threshold=0.5),
                max_per_img=100,
                mask_thr_binary=0.5))
    )
    dataset_type = 'CocoDataset'
    data_root = 'data/coco/'
    img_norm_cfg = dict(
        mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
    train_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
        dict(type='Resize',
             img_scale=[(1333, 640), (1333, 672), (1333, 704), (1333, 736), (1333, 768), (1333, 800)],
             multiscale_mode='value',
             keep_ratio=True),
        dict(type='RandomFlip', flip_ratio=0.5),
        dict(type='Normalize', **img_norm_cfg),
        dict(type='Pad', size_divisor=32),
        dict(type='DefaultFormatBundle'),
        dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
    ]
    test_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(
            type='MultiScaleFlipAug',
            img_scale=(1333, 800),
            flip=False,
            transforms=[
                dict(type='Resize', keep_ratio=True),
                dict(type='RandomFlip'),
                dict(type='Normalize', **img_norm_cfg),
                dict(type='Pad', size_divisor=32),
                dict(type='ImageToTensor', keys=['img']),
                dict(type='Collect', keys=['img']),
            ])
    ]
    data = dict(
        samples_per_gpu=2,
        workers_per_gpu=2,
        train=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_train2017.json',
            img_prefix=data_root + 'train2017/',
            pipeline=train_pipeline),
        val=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_val2017.json',
            img_prefix=data_root + 'val2017/',
            pipeline=test_pipeline),
        test=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_val2017.json',
            img_prefix=data_root + 'val2017/',
            pipeline=test_pipeline))
    evaluation = dict(metric=['bbox', 'segm'])
    # optimizer
    optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
    optimizer_config = dict(grad_clip=None)
    # learning policy
    lr_config = dict(
        policy='step',
        warmup='linear',
        warmup_iters=1000,
        warmup_ratio=0.001,
        step=[8, 11])
    runner = dict(type='EpochBasedRunner', max_epochs=12)
    checkpoint_config = dict(interval=1)
    # yapf:disable
    log_config = dict(
        interval=50,
        hooks=[
            dict(type='TextLoggerHook'),
            # dict(type='TensorboardLoggerHook')
        ])
    # yapf:enable
    custom_hooks = [dict(type='NumClassCheckHook')]
    dist_params = dict(backend='nccl')
    log_level = 'INFO'
    load_from = None
    resume_from = None
    workflow = [('train', 1)]
I have conducted some experiments; here are the results:

1. Using the ImageNet pre-trained model without SyncBN:

| framework | Backbone | box AP | mask AP |
| --- | --- | --- | --- |
| mmdetection | R-50-FPN | 38.80 | 35.02 |
| detectron2 | R-50-FPN | 38.90 | 34.90 |

2. Using the ImageNet pre-trained model with SyncBN:

| framework | Backbone | box AP | mask AP |
| --- | --- | --- | --- |
| mmdetection | R-50-FPN | 38.90 | 35.40 |
| detectron2 | R-50-FPN | 39.80 | 36.02 |

3. Using the SwAV self-supervised pre-trained model with SyncBN (SwAV weights: https://dl.fbaipublicfiles.com/vissl/model_zoo/swav_in1k_rn50_800ep_swav_8node_resnet_27_07_20.a0a6b676/model_final_checkpoint_phase799.torch):

| framework | Backbone | box AP | mask AP |
| --- | --- | --- | --- |
| mmdetection | R-50-FPN | 40.60 | 37.00 |
| detectron2 | R-50-FPN | 41.84 | 37.88 |

4. Using the SwAV self-supervised pre-trained model with SyncBN, varying the number of GPUs:

| framework | num_gpus | Backbone | box AP | mask AP |
| --- | --- | --- | --- | --- |
| mmdetection | 8 | R-50-FPN | 40.60 | 37.00 |
| mmdetection | 2 | R-50-FPN | 37.10 | 34.50 |

We can see that without SyncBN the performance is consistent across the two frameworks, but with SyncBN the performance in mmdetection is always worse than in detectron2. In addition, when the number of GPUs changes, the performance in mmdetection seems unstable. I don't know how to resolve this; can you give me some suggestions?
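One detail that may matter for the GPU-count instability: with SyncBN the batch statistics are computed over the global batch, so changing the GPU count while keeping samples_per_gpu=2 also changes the batch each BN layer normalizes over. A minimal sketch of the arithmetic, assuming the samples_per_gpu=2 setting from the config above:

    # SyncBN reduces mean/var across GPUs, so its statistics are computed
    # over the global batch, not the per-GPU batch (samples_per_gpu=2 is
    # taken from the config above).
    samples_per_gpu = 2
    for num_gpus in (2, 8):
        stats_batch = samples_per_gpu * num_gpus
        print(f'{num_gpus} GPUs -> SyncBN statistics over {stats_batch} images')
    # 2 GPUs -> SyncBN statistics over 4 images  (noisier statistics)
    # 8 GPUs -> SyncBN statistics over 16 images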

ZwwWayne commented 3 years ago

These two repos use different SyncBN implementations. You can try to 1. use MMSyncBN rather than SyncBN, which is implemented in MMCV, or 2. migrate the NaiveSyncBatchNorm implemented in Detectron2.
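For reference, the first option is a config-level change; a minimal sketch, assuming the 'MMSyncBN' type that MMCV registers, shown as a partial override of the config above:

    # Use MMCV's MMSyncBN instead of torch.nn.SyncBatchNorm everywhere
    # the config above sets a SyncBN norm_cfg.
    norm_cfg = dict(type='MMSyncBN', requires_grad=True)
    model = dict(
        backbone=dict(norm_cfg=norm_cfg),
        neck=dict(norm_cfg=norm_cfg),
        roi_head=dict(
            bbox_head=dict(norm_cfg=norm_cfg),
            mask_head=dict(norm_cfg=norm_cfg)))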

UcanSee commented 3 years ago
I have already tried both MMSyncBN and NaiveSyncBatchNorm in mmdetection, but it did not help. In addition, I found that when I train mask-rcnn-r50 with 16 GPUs rather than 8, setting samples_per_gpu=1 so the total batch size is unchanged, the performance of the ImageNet pre-trained and SwAV pre-trained models in mmdetection (experiments 2 and 3 above) catches up to detectron2. Does that mean there is something wrong with DDP in mmdetection? Here are the results going from 8 to 16 GPUs:

1. Using the ImageNet pre-trained model with SyncBN:

| framework | Backbone | num-gpus | box AP | mask AP |
| --- | --- | --- | --- | --- |
| mmdetection | R-50-FPN | 8 | 38.90 | 35.40 |
| mmdetection | R-50-FPN | 16 | 40.20 | 36.00 |
| detectron2 | R-50-FPN | 8 | 39.80 | 36.02 |

2. Using the SwAV self-supervised pre-trained model with SyncBN:

| framework | Backbone | num-gpus | box AP | mask AP |
| --- | --- | --- | --- | --- |
| mmdetection | R-50-FPN | 8 | 40.60 | 37.00 |
| mmdetection | R-50-FPN | 16 | 41.70 | 37.60 |
| detectron2 | R-50-FPN | 8 | 41.84 | 37.88 |
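For readers who want to try the second of ZwwWayne's suggestions, "migrating NaiveSyncBatchNorm" amounts to porting something like the following; this is a condensed sketch of detectron2's implementation (the real version also handles extra stats modes and empty batches), and it assumes torch.distributed has been initialized:

    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.autograd.function import Function

    class AllReduce(Function):
        """All-reduce with a backward pass, so gradients flow through the
        cross-GPU statistics (plain dist.all_reduce is not differentiable)."""

        @staticmethod
        def forward(ctx, input):
            input_list = [torch.zeros_like(input) for _ in range(dist.get_world_size())]
            dist.all_gather(input_list, input, async_op=False)
            return torch.sum(torch.stack(input_list, dim=0), dim=0)

        @staticmethod
        def backward(ctx, grad_output):
            dist.all_reduce(grad_output, async_op=False)
            return grad_output

    class NaiveSyncBatchNorm(nn.BatchNorm2d):
        def forward(self, input):
            # Fall back to plain BN at test time or on a single GPU.
            if not self.training or dist.get_world_size() == 1:
                return super().forward(input)

            C = input.shape[1]
            mean = torch.mean(input, dim=[0, 2, 3])
            meansqr = torch.mean(input * input, dim=[0, 2, 3])

            # Average the per-GPU statistics. This assumes every GPU sees
            # the same number of images per iteration.
            vec = torch.cat([mean, meansqr], dim=0)
            vec = AllReduce.apply(vec) * (1.0 / dist.get_world_size())
            mean, meansqr = torch.split(vec, C)

            var = meansqr - mean * mean
            invstd = torch.rsqrt(var + self.eps)
            scale = self.weight * invstd
            bias = self.bias - mean * scale

            # Update running statistics the way nn.BatchNorm2d does.
            self.running_mean += self.momentum * (mean.detach() - self.running_mean)
            self.running_var += self.momentum * (var.detach() - self.running_var)

            return input * scale.reshape(1, -1, 1, 1) + bias.reshape(1, -1, 1, 1)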
JrPeng commented 3 years ago

+1, I have met the same issue. I strongly recommend that mmdetection provide detailed benchmarks on SyncBN/MMSyncBN, since SyncBN is widely used in many situations, especially in large-scale object detection such as OpenImages.

JrPeng commented 3 years ago

@UcanSee

Hi, would you mind showing your results of NaiveSyncBatchNorm in mmdetection?

JrPeng commented 3 years ago

@ZwwWayne Hi, would you mind explaining the difference between MMSyncBN and PyTorch's SyncBN?
Besides, what is the recommended group size for synchronization? 8/16/32/64? TY.
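For context on the group-size question: PyTorch's SyncBatchNorm can synchronize statistics within a sub-group of ranks rather than across the whole world. A minimal sketch using the standard torch.distributed API (the group size of 8 and the toy model are only illustrative choices, not recommendations from this thread):

    import torch.distributed as dist
    from torch import nn

    # Partition the ranks into groups of 8; note every process must create
    # all groups, even the ones it does not belong to.
    world_size = dist.get_world_size()
    group_size = 8
    groups = [
        dist.new_group(list(range(start, start + group_size)))
        for start in range(0, world_size, group_size)
    ]
    my_group = groups[dist.get_rank() // group_size]

    # Convert BN layers so statistics are synchronized only inside my_group.
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))  # toy model
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model, process_group=my_group)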

UcanSee commented 3 years ago

> @UcanSee
>
> Hi, would you mind showing your results of NaiveSyncBatchNorm in mmdetection?

There was no difference in my results between NaiveSyncBatchNorm and nn.SyncBN.

JrPeng commented 3 years ago

Hi, @ZwwWayne, any response? TY.