jwwangchn / AI-TOD

Official code for "Tiny Object Detection in Aerial Images".
MIT License

Unexpected results from DotD paper #10

Open mhjabreel opened 2 years ago

mhjabreel commented 2 years ago

Hi,

I came across your interesting paper "Dot Distance for Tiny Object Detection in Aerial Images".

I tried to implement it in the mmdetection framework and tested it on our in-house dataset; however, I got unexpected results. Replacing the IoU metric used for assignment with DotD led to a degradation of 0.8 mAP.
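
For reference, the definition I implemented (as I understand it from the paper) is

DotD(A, B) = exp( -sqrt((cx_A - cx_B)^2 + (cy_A - cy_B)^2) / S ),   S = sqrt(average object area of the dataset)

where (cx, cy) are the box centers.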

Here is the code I wrote:

import math

import torch
# registry location in mmdet 2.x
from mmdet.core.bbox.iou_calculators.builder import IOU_CALCULATORS


@IOU_CALCULATORS.register_module()
class DotDistOverlaps:
    """Dot Distance (DotD) calculator, used in place of IoU for assignment."""

    def __init__(self, average_size, scale=1., dtype=None):
        # ``average_size`` is the average object area of the dataset; the
        # DotD normalizer S is its square root (an average side length).
        self.average_size = math.sqrt(average_size)
        self.scale = scale
        self.dtype = dtype

    def __call__(self, bboxes1, bboxes2):
        """Calculate DotD between 2D bboxes.

        Args:
            bboxes1 (Tensor): bboxes with shape (m, 4) in <x1, y1, x2, y2>
                format, or shape (m, 5) in <x1, y1, x2, y2, score> format.
            bboxes2 (Tensor): bboxes with shape (n, 4) in <x1, y1, x2, y2>
                format, shape (n, 5) in <x1, y1, x2, y2, score> format, or
                empty.

        Returns:
            Tensor: DotD values with shape (m, n).
        """
        assert bboxes1.size(-1) in [0, 4, 5]
        assert bboxes2.size(-1) in [0, 4, 5]
        if bboxes2.size(-1) == 5:
            bboxes2 = bboxes2[..., :4]
        if bboxes1.size(-1) == 5:
            bboxes1 = bboxes1[..., :4]

        centroids1_cx = (bboxes1[..., 0] + bboxes1[..., 2]) / 2.0
        centroids1_cy = (bboxes1[..., 1] + bboxes1[..., 3]) / 2.0

        # box centers, [M, 2]
        centroids1 = torch.stack((centroids1_cx, centroids1_cy), dim=1)

        centroids2_cx = (bboxes2[..., 0] + bboxes2[..., 2]) / 2.0
        centroids2_cy = (bboxes2[..., 1] + bboxes2[..., 3]) / 2.0

        # box centers, [N, 2]
        centroids2 = torch.stack((centroids2_cx, centroids2_cy), dim=1)

        # pairwise Euclidean center distances, [M, N]
        distances = (centroids1[:, None, :] -
                     centroids2[None, :, :]).pow(2).sum(-1).sqrt()
        # DotD = exp(-distance / S)
        dotd = torch.exp(-distances / self.average_size)
        return dotd
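
As a quick sanity check (toy numbers of my own), identical boxes give a DotD of 1.0 and the score decays as the centers move apart:

import torch

calc = DotDistOverlaps(average_size=900.0)  # S = sqrt(900) = 30 pixels
boxes_a = torch.tensor([[10., 10., 40., 40.]])  # center (25, 25)
boxes_b = torch.tensor([[10., 10., 40., 40.],   # same center -> DotD = 1.0
                        [40., 40., 70., 70.]])  # center (55, 55), ~42.4 px away
print(calc(boxes_a, boxes_b))
# tensor([[1.0000, 0.2431]])  since exp(-42.43 / 30) ≈ 0.243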

I am not sure if I missed something there. Could you please check that?

Moreover, I tried to download the AI-TOD dataset to test my implementation on it, but I could not. Could you please advise me on how to get the data? Thanks,

Mohammed Jabreel

Chasel-Tsui commented 2 years ago

Hi, thanks for your attention.

It seems that there is no problem with the DotD calculation. Could you please provide the config file so that I can help you figure out the problem? Here is a link to the official implementation for your reference: https://github.com/Chasel-Tsui/mmdet-aitod

As for the mAP drop, could you please provide more details about the in-house dataset, such as the average absolute object size and the largest/smallest object sizes? The performance degradation may result from the large size variation of the dataset; we found that DotD may generate sub-optimal results when the dataset contains many medium and large objects (>32*32 pixels). As a substitute, we recommend our newly released NWD-RKA (also in the above link), which may better handle tiny object detection when there is large size variation.
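
Roughly speaking, NWD models each box as a 2D Gaussian and replaces the plain center distance with a normalized Wasserstein distance between the two Gaussians, so the box sizes enter the metric rather than only the centers:

NWD(N_a, N_b) = exp( -sqrt(W_2^2(N_a, N_b)) / C )
W_2^2(N_a, N_b) = (cx_a - cx_b)^2 + (cy_a - cy_b)^2 + ((w_a - w_b)/2)^2 + ((h_a - h_b)/2)^2

where C is a constant related to the dataset.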

As for the AI-TOD dataset, we have attached a download link (BaiduPan) in this repo. You need to download the xview training set and the remaining part of AI-TOD (AI-TOD_wo_xview), then generate the whole AI-TOD dataset with the end2end tools. We will add other download links (Google Drive or OneDrive) for AI-TOD_wo_xview within a week.

Please feel free to contact me if you have further issues.

mhjabreel commented 2 years ago

Thanks @Chasel-Tsui,

Here is the config:

model = dict(
    type='FasterRCNN',
    backbone=dict(
        type='mmcls.ConvNeXt',
        arch='base',
        out_indices=[0, 1, 2, 3],
        drop_path_rate=0.4,
        layer_scale_init_value=1.0,
        gap_before_final_norm=False,
        init_cfg=dict(
            type='Pretrained',
            checkpoint=
            'https://download.openmmlab.com/mmclassification/v0/convnext/downstream/convnext-base_3rdparty_in21k_20220301-262fd037.pth',
            prefix='backbone.')),
    neck=dict(
        type='FPN',
        in_channels=[128, 256, 512, 1024],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    roi_head=dict(
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=1,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0))),
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                min_pos_iou=0.3,
                match_low_quality=True,
                ignore_iof_thr=-1,
                iou_calculator=dict(
                    type='DotDistOverlaps', average_size=900.0)),
            sampler=dict(
                type='RandomSampler',
                num=256,
                pos_fraction=0.5,
                neg_pos_ub=-1,
                add_gt_as_proposals=False),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        rpn_proposal=dict(
            nms_pre=2000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                match_low_quality=False,
                ignore_iof_thr=-1,
                iou_calculator=dict(
                    type='DotDistOverlaps', average_size=900.0)),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        rpn=dict(
            nms_pre=1000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            score_thr=0.05,
            nms=dict(type='nms', iou_threshold=0.5),
            max_per_img=100)))

The stats of the dataset are as follows:

[image: dataset statistics screenshot, including avg_annotation_area = 16761 pixels]

Thanks again.

Chasel-Tsui commented 2 years ago

Hi, my pleasure. I have checked the config and the stats. Here are three possible solutions. First, you could try disabling DotD in the RCNN stage and switching back to the original IoU to test the performance.
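
For the rcnn assigner in train_cfg, that would be something like the following sketch (BboxOverlaps2D is the stock mmdetection IoU calculator; everything else stays as in your config):

rcnn=dict(
    assigner=dict(
        type='MaxIoUAssigner',
        pos_iou_thr=0.5,
        neg_iou_thr=0.5,
        min_pos_iou=0.5,
        match_low_quality=False,
        ignore_iof_thr=-1,
        # revert to mmdetection's default IoU calculator
        iou_calculator=dict(type='BboxOverlaps2D')))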

Second, I notice that the "avg_annotation_area" of this dataset is 16761 pixels, i.e. roughly 129x129 pixels per object, well above the 32x32 tiny-object threshold. This indicates that DotD is not a good candidate for this dataset. Although it contains some tiny objects, from a global perspective it might not be very effective to apply tiny-object detection strategies to a dataset of (on average) large objects to improve the mAP. To the best of my knowledge, the DotD strategy may not be effective for a dataset containing many large objects; AI-TOD, on which we tested, mainly contains objects on a tiny scale (areas smaller than 1024 pixels). If you still want to use DotD, I am not sure, but it might be helpful to ensemble two models, one trained with IoU and another trained with DotD.

Chasel-Tsui commented 2 years ago

The last solution might be to try our newly released NWD-RKA (https://github.com/Chasel-Tsui/mmdet-aitod), which might be helpful for tiny object detection when object sizes vary widely. But I cannot guarantee an improvement, since on average the dataset seems to be a large-object detection dataset, as mentioned above.

mhjabreel commented 2 years ago

Thanks @Chasel-Tsui - I have created a new dataset in which the maximum bbox area is 2024 pixels, and DotD performed better than IoU on it.