Very poor performance on custom dataset with partly overlapping instances

Ma1hias commented 2 years ago

Hi,

I trained different default model configurations (mask_rcnn, cascade) with different backbones (swin, r50, x101) for a custom COCO formatted dataset. The dataset contains only one class with up to 100 instances per image, but most of the instances have a very small overlapping region with one or more other instances.

The results are very strange for all tested models because only a fraction of the instances got recognized but the accuracy on the few detected instances is actually very good.

Using Detectron2 I got much better results on the very same dataset, but I would like to use mmdetection.

Do you have any suggestions what could be the reason for this behavior?

Thank you for your time, Mathias

RangiLyu commented 2 years ago

Can you upload your config and provide more details such as how many GPUS you use and the hyperparams you use?

Ma1hias commented 2 years ago

I am training on 1 GPU.

Th model config is almost the same as the default MaskRCNN with the swin-s Backbone. I only tried to increase the number of results by increasing the numbers of proposals in the train_cfg and test_cfg. Using the default values has the same bad performance.

model = dict(
    type="MaskRCNN",
    backbone=dict(
        type="SwinTransformer",
        embed_dims=96,
        depths=[2, 2, 18, 2],
        num_heads=[3, 6, 12, 24],
        window_size=7,
        mlp_ratio=4,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.2,
        patch_norm=True,
        out_indices=(0, 1, 2, 3),
        with_cp=False,
        convert_weights=True,
        init_cfg=dict(
            type="Pretrained",
            checkpoint="https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_small_patch4_window7_224.pth",
        ),
    ),
    neck=dict(type="FPN", in_channels=[96, 192, 384, 768], out_channels=256, num_outs=5),
    rpn_head=dict(
        type="RPNHead",
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(type="AnchorGenerator", scales=[8], ratios=[0.5, 1.0, 2.0], strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(type="DeltaXYWHBBoxCoder", target_means=[0.0, 0.0, 0.0, 0.0], target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(type="CrossEntropyLoss", use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type="L1Loss", loss_weight=1.0),
    ),
    roi_head=dict(
        type="StandardRoIHead",
        bbox_roi_extractor=dict(
            type="SingleRoIExtractor",
            roi_layer=dict(type="RoIAlign", output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32],
        ),
        bbox_head=dict(
            type="Shared2FCBBoxHead",
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=1,
            bbox_coder=dict(
                type="DeltaXYWHBBoxCoder", target_means=[0.0, 0.0, 0.0, 0.0], target_stds=[0.1, 0.1, 0.2, 0.2]
            ),
            reg_class_agnostic=False,
            loss_cls=dict(type="CrossEntropyLoss", use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type="L1Loss", loss_weight=1.0),
        ),
        mask_roi_extractor=dict(
            type="SingleRoIExtractor",
            roi_layer=dict(type="RoIAlign", output_size=14, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32],
        ),
        mask_head=dict(
            type="FCNMaskHead",
            num_convs=4,
            in_channels=256,
            conv_out_channels=256,
            num_classes=1,
            loss_mask=dict(type="CrossEntropyLoss", use_mask=True, loss_weight=1.0),
        ),
    ),
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type="MaxIoUAssigner",
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                min_pos_iou=0.3,
                match_low_quality=True,
                ignore_iof_thr=-1,
            ),
            sampler=dict(type="RandomSampler", num=512, pos_fraction=0.5, neg_pos_ub=-1, add_gt_as_proposals=False),
            allowed_border=-1,
            pos_weight=-1,
            debug=False,
        ),
        rpn_proposal=dict(nms_pre=12000, max_per_img=2000, nms=dict(type="nms", iou_threshold=0.7), min_bbox_size=0),
        rcnn=dict(
            assigner=dict(
                type="MaxIoUAssigner",
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                match_low_quality=True,
                ignore_iof_thr=-1,
            ),
            sampler=dict(type="RandomSampler", num=512, pos_fraction=0.25, neg_pos_ub=-1, add_gt_as_proposals=True),
            mask_size=28,
            pos_weight=-1,
            debug=False,
        ),
    ),
    test_cfg=dict(
        rpn=dict(nms_pre=6000, max_per_img=1000, nms=dict(type="nms", iou_threshold=0.7), min_bbox_size=0),
        rcnn=dict(score_thr=0.05, nms=dict(type="nms", iou_threshold=0.5), max_per_img=250, mask_thr_binary=0.5),
    ),
)
dataset_type = "COCODataset"
data_root = "data/coco/"
img_norm_cfg = dict(mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type="LoadImageFromFile"),
    dict(type="LoadAnnotations", with_bbox=True, with_mask=True),
    dict(type="RandomFlip", flip_ratio=0.5),
    dict(
        type="AutoAugment",
        policies=[
            [
                {
                    "type": "Resize",
                    "img_scale": [
                        (480, 1333),
                        (512, 1333),
                        (544, 1333),
                        (576, 1333),
                        (608, 1333),
                        (640, 1333),
                        (672, 1333),
                        (704, 1333),
                        (736, 1333),
                        (768, 1333),
                        (800, 1333),
                    ],
                    "multiscale_mode": "value",
                    "keep_ratio": True,
                }
            ],
            [
                {
                    "type": "Resize",
                    "img_scale": [(400, 1333), (500, 1333), (600, 1333)],
                    "multiscale_mode": "value",
                    "keep_ratio": True,
                },
                {
                    "type": "RandomCrop",
                    "crop_type": "absolute_range",
                    "crop_size": (384, 600),
                    "allow_negative_crop": True,
                },
                {
                    "type": "Resize",
                    "img_scale": [
                        (480, 1333),
                        (512, 1333),
                        (544, 1333),
                        (576, 1333),
                        (608, 1333),
                        (640, 1333),
                        (672, 1333),
                        (704, 1333),
                        (736, 1333),
                        (768, 1333),
                        (800, 1333),
                    ],
                    "multiscale_mode": "value",
                    "override": True,
                    "keep_ratio": True,
                },
            ],
        ],
    ),
    dict(type="Normalize", mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True),
    dict(type="Pad", size_divisor=32),
    dict(type="DefaultFormatBundle"),
    dict(type="Collect", keys=["img", "gt_bboxes", "gt_labels", "gt_masks"]),
]
test_pipeline = [
    dict(type="LoadImageFromFile"),
    dict(
        type="MultiScaleFlipAug",
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type="Resize", keep_ratio=True),
            dict(type="RandomFlip"),
            dict(type="Normalize", mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True),
            dict(type="Pad", size_divisor=32),
            dict(type="ImageToTensor", keys=["img"]),
            dict(type="Collect", keys=["img"]),
        ],
    ),
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type="CocoDataset",
        ann_file="data/processed/coco/train.json",
        img_prefix="",
        pipeline=[
            dict(type="LoadImageFromFile"),
            dict(type="LoadAnnotations", with_bbox=True, with_mask=True),
            dict(type="RandomFlip", flip_ratio=0.5),
            dict(
                type="AutoAugment",
                policies=[
                    [
                        {
                            "type": "Resize",
                            "img_scale": [
                                (480, 1333),
                                (512, 1333),
                                (544, 1333),
                                (576, 1333),
                                (608, 1333),
                                (640, 1333),
                                (672, 1333),
                                (704, 1333),
                                (736, 1333),
                                (768, 1333),
                                (800, 1333),
                            ],
                            "multiscale_mode": "value",
                            "keep_ratio": True,
                        }
                    ],
                    [
                        {
                            "type": "Resize",
                            "img_scale": [(400, 1333), (500, 1333), (600, 1333)],
                            "multiscale_mode": "value",
                            "keep_ratio": True,
                        },
                        {
                            "type": "RandomCrop",
                            "crop_type": "absolute_range",
                            "crop_size": (384, 600),
                            "allow_negative_crop": True,
                        },
                        {
                            "type": "Resize",
                            "img_scale": [
                                (480, 1333),
                                (512, 1333),
                                (544, 1333),
                                (576, 1333),
                                (608, 1333),
                                (640, 1333),
                                (672, 1333),
                                (704, 1333),
                                (736, 1333),
                                (768, 1333),
                                (800, 1333),
                            ],
                            "multiscale_mode": "value",
                            "override": True,
                            "keep_ratio": True,
                        },
                    ],
                ],
            ),
            dict(type="Normalize", mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True),
            dict(type="Pad", size_divisor=32),
            dict(type="DefaultFormatBundle"),
            dict(type="Collect", keys=["img", "gt_bboxes", "gt_labels", "gt_masks"]),
        ],
        classes=("WALL",),
    ),
    val=dict(
        type="CocoDataset",
        ann_file="data/processed/coco/val.json",
        img_prefix="",
        pipeline=[
            dict(type="LoadImageFromFile"),
            dict(
                type="MultiScaleFlipAug",
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type="Resize", keep_ratio=True),
                    dict(type="RandomFlip"),
                    dict(type="Normalize", mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True),
                    dict(type="Pad", size_divisor=32),
                    dict(type="ImageToTensor", keys=["img"]),
                    dict(type="Collect", keys=["img"]),
                ],
            ),
        ],
        classes=("WALL",),
    ),
    test=dict(
        type="CocoDataset",
        ann_file="data/processed/coco/test.json",
        img_prefix="",
        pipeline=[
            dict(type="LoadImageFromFile"),
            dict(
                type="MultiScaleFlipAug",
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type="Resize", keep_ratio=True),
                    dict(type="RandomFlip"),
                    dict(type="Normalize", mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True),
                    dict(type="Pad", size_divisor=32),
                    dict(type="ImageToTensor", keys=["img"]),
                    dict(type="Collect", keys=["img"]),
                ],
            ),
        ],
        classes=("WALL",),
    ),
)
evaluation = dict(classwise=True, metric=["bbox", "segm"])
optimizer = dict(
    type="AdamW",
    lr=0.0001,
    betas=(0.9, 0.999),
    weight_decay=0.05,
    paramwise_cfg=dict(
        custom_keys=dict(
            absolute_pos_embed=dict(decay_mult=0.0),
            relative_position_bias_table=dict(decay_mult=0.0),
            norm=dict(decay_mult=0.0),
        )
    ),
)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy="step", warmup="linear", warmup_iters=1000, warmup_ratio=0.001, step=[27, 33])
runner = dict(type="EpochBasedRunner", max_epochs=50)
checkpoint_config = dict(interval=10)
log_config = dict(interval=10, hooks=[dict(type="TextLoggerHook"), dict(type="TensorboardLoggerHook")])
custom_hooks = [dict(type="NumClassCheckHook")]
dist_params = dict(backend="nccl")
log_level = "INFO"
load_from = "models/pretrained/mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco_20210903_104808-b92c91f1.pth"
resume_from = None
workflow = [("train", 1)]
pretrained = "https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_small_patch4_window7_224.pth"
fp16 = dict(loss_scale=dict(init_scale=512))
classes = ("WALL",)
work_dir = "./models/training/mask_swin-s"
auto_resume = False
gpu_ids = range(0, 1)

RangiLyu commented 2 years ago

The learning rate in this config is for 8 GPUs (total batch size = 16). Please adjust the hyperparameters.

Ma1hias commented 2 years ago

You are right, in this case I did not adjust the learning rate.

I also trained the model with the 1/8 the learning rate ans 2 images per gpu and got similar results. The precision is high but the recall is very low.

Do you have any other idea what could be the reason for the poor performance?

muneebelahimalik commented 11 months ago

You are right, in this case I did not adjust the learning rate.

I also trained the model with the 1/8 the learning rate ans 2 images per gpu and got similar results. The precision is high but the recall is very low.

Do you have any other idea what could be the reason for the poor performance?

Hey! Did you find any solution for this? @Ma1hias

open-mmlab / mmdetection

Very poor performance on custom dataset with partly overlapping instances #7445