openvinotoolkit / training_extensions

Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
https://openvinotoolkit.github.io/training_extensions/
Apache License 2.0

Bad Performance of Finetuning Model on COCO 2017 Dataset #542

Closed: anujdutt9 closed this issue 3 years ago

anujdutt9 commented 3 years ago

Hi. I am trying to fine-tune a ShuffleNetV2 SSD model on the COCO 2017 dataset. The ShuffleNetV2 backbone is pre-trained on ImageNet, and no layers are frozen. However, the model performs poorly: I get an mAP@IoU=0.5 of 0.100 after 30 epochs. Can you please help me understand where I am going wrong? Thanks

This is the configuration I am using:

# Variables
image_width = 512
image_height = 512
keepRatio = False

model = dict(
    type='SingleStageDetector',
    backbone=dict(
        type='shufflenetv2_w3d2',
        out_indices=(2,3),
        frozen_stages=-1,
        norm_eval=False,
        pretrained=True),
    neck=None,
    bbox_head=dict(
        type='SSDHead',
        num_classes=80,
        in_channels=(352, 704),
        # Update: COCO Anchors calculated using k-means
        anchor_generator=dict(
            type='SSDAnchorGeneratorClustered',
            strides=(16, 32),
            widths=([56.13, 15.34, 32.97, 18.43, 32.17], 
                    [56.15, 243.5, 95.18, 131.6]),
            heights=([58.23, 23.75, 77.06, 47.96, 33.5],
                    [143.1, 302.14, 98.01, 190.71])),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=(0.0, 0.0, 0.0, 0.0),
            target_stds=(0.1, 0.1, 0.2, 0.2)),
        depthwise_heads=True,
        depthwise_heads_activations='relu',
        loss_balancing=True))
cudnn_benchmark = True
train_cfg = dict(
    assigner=dict(
        type='MaxIoUAssigner',
        pos_iou_thr=0.4,
        neg_iou_thr=0.4,
        min_pos_iou=0.0,
        ignore_iof_thr=-1,
        gt_max_assign_all=False),
    smoothl1_beta=1.0,
    use_giou=False,
    use_focal=False,
    allowed_border=-1,
    pos_weight=-1,
    neg_pos_ratio=3,
    debug=False)
val_cfg = dict(
    nms=dict(type='nms', iou_thr=0.45),
    min_bbox_size=0,
    score_thr=0.02,
    max_per_img=200)
test_cfg = dict(
    nms=dict(type='nms', iou_thr=0.45),
    min_bbox_size=0,
    score_thr=0.02,
    max_per_img=200)

dataset_type = 'CocoDataset'
data_root = '/home/user/training_extensions/data/coco/'

# Normalization values (these are the standard ImageNet mean/std, in RGB order)
img_norm_cfg = dict(mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

# Train Data Pipeline
train_pipeline = [
    dict(type='LoadImageFromFile', to_float32=True),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='PhotoMetricDistortion',
        brightness_delta=32,
        contrast_range=(0.5, 1.5),
        saturation_range=(0.5, 1.5),
        hue_delta=18),
    dict(
        type='Expand',
        mean=img_norm_cfg['mean'],
        to_rgb=img_norm_cfg['to_rgb'],
        ratio_range=(1, 4)),
    dict(
        type='MinIoURandomCrop',
        min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
        min_crop_size=0.3),
    dict(type='Resize', img_scale=(image_width, image_height), keep_ratio=keepRatio),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]

# Val Data Pipeline
val_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(image_width, image_height),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=keepRatio),
            dict(
                type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(image_width, image_height),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=keepRatio),
            dict(
                type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]

data = dict(
    samples_per_gpu=96,
    workers_per_gpu=4,
    train=dict(
        type='CocoDataset',
        classes=('person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
               'train', 'truck', 'boat', 'traffic light', 'fire hydrant',
               'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog',
               'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe',
               'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
               'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat',
               'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
               'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
               'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
               'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
               'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop',
               'mouse', 'remote', 'keyboard', 'cell phone', 'microwave',
               'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock',
               'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'),
        ann_file=
        '/home/user/training_extensions/data/coco/annotations/instances_train2017.json',
        min_size=17,
        img_prefix=
        '/home/user/training_extensions/data/coco/train2017/',
        pipeline=train_pipeline,
        ),
    val=dict(
        type='CocoDataset',
        classes=('person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
               'train', 'truck', 'boat', 'traffic light', 'fire hydrant',
               'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog',
               'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe',
               'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
               'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat',
               'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
               'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
               'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
               'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
               'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop',
               'mouse', 'remote', 'keyboard', 'cell phone', 'microwave',
               'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock',
               'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'),
        ann_file=
        '/home/user/training_extensions/data/coco/annotations/instances_val2017.json',
        img_prefix=
        '/home/user/training_extensions/data/coco/val2017/',
        test_mode=True,
        pipeline=val_pipeline,
        ),
    test=dict(
        type='CocoDataset',
        classes=('person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
               'train', 'truck', 'boat', 'traffic light', 'fire hydrant',
               'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog',
               'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe',
               'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
               'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat',
               'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
               'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
               'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
               'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
               'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop',
               'mouse', 'remote', 'keyboard', 'cell phone', 'microwave',
               'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock',
               'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'),
        ann_file=
        '/home/user/training_extensions/data/coco/annotations/instances_val2017.json',
        img_prefix=
        '/home/user/training_extensions/data/coco/val2017/',
        test_mode=True,
        pipeline=test_pipeline
        ))

evaluation = dict(interval=1, metric=['bbox'])
optimizer = dict(type='SGD', lr=2e-3, momentum=0.9, weight_decay=5e-4)
optimizer_config = dict()
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=1200,
    warmup_ratio=0.3333333333333333,
    step=[5, 10, 15, 20])
checkpoint_config = dict(interval=1)
log_config = dict(
    interval=10,
    hooks=[dict(type='TextLoggerHook'),
           dict(type='TensorboardLoggerHook')])
total_epochs = 30
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = 'output/coco-object-detection-06'
load_from = None
resume_from = None
workflow = [('train', 1)]
gpu_ids = range(0, 1)
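
As a side check on the schedule in this config, the step policy can be worked out by hand. The sketch below is only an illustration: it assumes a decay factor `gamma` of 0.1 per step (the config does not set one, so that value is an assumption) and it ignores the linear warmup. The helper name `step_lr` is hypothetical.

```python
import bisect


def step_lr(epoch, base_lr=2e-3, steps=(5, 10, 15, 20), gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed), warmup ignored.

    gamma=0.1 is an assumed decay factor; base_lr and steps are taken
    from the optimizer and lr_config entries in the config above.
    """
    # Number of step boundaries that have already been reached.
    num_decays = bisect.bisect_right(steps, epoch)
    return base_lr * gamma ** num_decays


if __name__ == "__main__":
    for epoch in (0, 5, 10, 15, 20, 29):
        print(f"epoch {epoch:2d}: lr = {step_lr(epoch):.1e}")
```

Under that assumption the learning rate is already four orders of magnitude below its starting value from epoch 20 onward, so the last ten of the 30 epochs would change the weights very little.
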

Ilya-Krylov commented 3 years ago

@anujdutt9 Hi, you are probably not doing anything wrong. This model is expected to perform poorly on a general object detection problem. Its detection head is a heavily trimmed SSD head with only two scales remaining. Moreover, the default anchor boxes are clustered specifically for faces.
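
To make the "only two scales" point concrete, the sketch below counts the anchors implied by the config posted above. It assumes each feature map has side `image_size // stride`, which holds exactly for a 512x512 input with strides 16 and 32; the helper name `anchor_summary` is hypothetical and not part of the library.

```python
# Values copied from the anchor_generator section of the config above.
image_size = 512
strides = (16, 32)
widths = ([56.13, 15.34, 32.97, 18.43, 32.17],
          [56.15, 243.5, 95.18, 131.6])
heights = ([58.23, 23.75, 77.06, 47.96, 33.5],
           [143.1, 302.14, 98.01, 190.71])


def anchor_summary(image_size, strides, widths, heights):
    """Per-level grid size and anchor count for a square input."""
    rows = []
    for stride, ws, hs in zip(strides, widths, heights):
        assert len(ws) == len(hs), "each anchor needs a width and a height"
        cells = image_size // stride  # feature-map side length (assumption)
        rows.append({
            "stride": stride,
            "grid": (cells, cells),
            "anchors_per_cell": len(ws),
            "total": cells * cells * len(ws),
        })
    return rows


summary = anchor_summary(image_size, strides, widths, heights)
total_anchors = sum(row["total"] for row in summary)

if __name__ == "__main__":
    for row in summary:
        print(row)
    print("total anchors:", total_anchors)
```

With these numbers the head predicts on a 32x32 grid and a 16x16 grid only, so small objects have to be matched on a stride-16 feature map at best.
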

anujdutt9 commented 3 years ago

@Ilya-Krylov Hi. Thanks for your input. What would you suggest to make it work with the COCO dataset? I can try adding more layers to the SSD head, and I have already recalculated the anchor boxes for COCO using k-means. Is there anything else I am missing? Thanks
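
For readers unfamiliar with the k-means step mentioned here, a minimal sketch follows. It clusters box (width, height) pairs using 1 - IoU as the distance, a common choice for anchor clustering; the thread does not say which variant was actually used for the anchors above, so this is an illustration rather than the author's method. The function names and the demo data are hypothetical.

```python
import numpy as np


def wh_iou(boxes, centers):
    """IoU between (w, h) pairs, with all boxes aligned at one corner."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0])
             * np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    area_boxes = boxes[:, 0] * boxes[:, 1]
    area_centers = centers[:, 0] * centers[:, 1]
    return inter / (area_boxes[:, None] + area_centers[None, :] - inter)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors, sorted by area."""
    rng = np.random.default_rng(seed)
    boxes = np.asarray(boxes, dtype=np.float64)
    # Initialise the centers from k distinct boxes in the data.
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Highest IoU is the same as smallest (1 - IoU) distance.
        assign = wh_iou(boxes, centers).argmax(axis=1)
        new_centers = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j)
            else centers[j]  # keep an empty cluster's center unchanged
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]


# Toy data: ten small boxes and ten large boxes, in pixels.
demo_boxes = [[20, 30]] * 10 + [[200, 300]] * 10
demo_anchors = kmeans_anchors(demo_boxes, k=2)

if __name__ == "__main__":
    print(demo_anchors)
```

In practice the input would be the ground-truth box sizes after rescaling to the training resolution (512x512 in this thread), and the resulting anchors would then be split between the two head levels, for example by area.
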

Ilya-Krylov commented 3 years ago

If you would like to train a ShuffleNet-based SSD, then yes, you can approach it the way you suggested.

Or you can try more modern models from https://github.com/openvinotoolkit/mmdetection/blob/ote/docs/model_zoo.md that are already trained as MS-COCO detectors. Most of them can be exported and run for inference through OpenVINO (see the tests at https://github.com/openvinotoolkit/mmdetection/blob/ote/tests/test_models.py#L313). Not all models are covered by the tests, but you can try exporting them yourself; in many cases models can be exported and run with OpenVINO out of the box.

anujdutt9 commented 3 years ago

Thanks for the suggestions. I have tried using the other models, but unfortunately, they are way above my model size requirements. I'll try adding more layers to the SSD head and see if that helps. Thanks