open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io

Training RTMDet using MMDetection on higher number of classes #11466

Open Rishav-hub opened 9 months ago

Rishav-hub commented 9 months ago

So I have a dataset of 1900 images in total across 55 classes. I am experimenting with the RTMDet tiny, medium, and large models, but when I start training, the ETA shows as 4 days for 700 epochs. The issue is that the model does not converge well if I train for fewer epochs.

This is what my config file looks like:

MEDIUM_CONFIG_FILE = """
_base_ = './rtmdet_l_8xb32-300e_coco.py'

data_root = '/content/'

train_ann_file = 'train.json'
train_data_prefix = 'train_v1/'

valid_ann_file = 'valid.json'
valid_data_prefix = 'valid_v1/'

class_name = (classes tuple)

num_classes = 55

train_batch_size_per_gpu = 8
train_num_workers = 2

max_epochs = 700
stage2_num_epochs = 20
base_lr = 0.004
lr_start_factor = 1.0e-5
weight_decay = 0.05

metainfo = {
    'classes': (classes tuple),
    'palette': [
        (220, 20, 60),
    ]
}

train_dataloader = dict(
    batch_size=train_batch_size_per_gpu,
    num_workers=train_num_workers,
    dataset=dict(
        data_root=data_root,
        metainfo=metainfo,
        data_prefix=dict(img=train_data_prefix),
        ann_file=train_ann_file))

val_dataloader = dict(
    dataset=dict(
        data_root=data_root,
        metainfo=metainfo,
        data_prefix=dict(img=valid_data_prefix),
        ann_file=valid_ann_file))

test_dataloader = val_dataloader

val_evaluator = dict(ann_file=data_root + valid_ann_file)

test_evaluator = val_evaluator

model = dict(
    backbone=dict(deepen_factor=0.67, widen_factor=0.75),
    neck=dict(in_channels=[192, 384, 768], out_channels=192, num_csp_blocks=2),
    bbox_head=dict(in_channels=192,
               feat_channels=192,
               num_classes=num_classes))

# optimizer
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=base_lr, weight_decay=weight_decay),
    paramwise_cfg=dict(
        norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))

# learning rate
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=lr_start_factor,
        by_epoch=False,
        begin=0,
        end=1000), # This can be changed
    dict(
        # use cosine lr over the second half of training (epochs 350 to 700 here)
        type='CosineAnnealingLR',
        eta_min=base_lr * 0.05,
        begin=max_epochs // 2,
        end=max_epochs,
        T_max=max_epochs // 2,
        by_epoch=True,
        convert_to_iter_based=True),
]

train_pipeline_stage2 = [
    dict(type='LoadImageFromFile', backend_args=None),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='RandomResize',
        scale=(640, 640),
        ratio_range=(0.1, 2.0),
        keep_ratio=True),
    dict(type='RandomCrop', crop_size=(640, 640)),
    dict(type='YOLOXHSVRandomAug'),
    dict(type='RandomFlip', prob=0.5),
    dict(type='Pad', size=(640, 640), pad_val=dict(img=(114, 114, 114))),
    dict(type='PackDetInputs')
]

default_hooks = dict(
    checkpoint=dict(
        interval=5,
        max_keep_ckpts=2,  # only keep latest 2 checkpoints
        save_best='auto'
    ),
    logger=dict(type='LoggerHook', interval=5))

custom_hooks = [
    dict(
        type='PipelineSwitchHook',
        switch_epoch=max_epochs - stage2_num_epochs,
        switch_pipeline=train_pipeline_stage2)
]

print("Model ---->>>>", model)

train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=max_epochs, val_interval=5)
"""

So I need some suggestions from the experts in the community, either for changes to my config file or any other ideas for improvement.
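
For reference, since data_root is '/content/' (a Colab-style path), a config string like this is usually written out to a .py file and then passed to MMDetection's tools/train.py. A minimal sketch of that step, with the output file name being an assumption rather than something from the original post:

# Write the config string to disk so the training tool can load it.
# The file name below is just an example, not from the original post.
config_path = '/content/rtmdet_m_custom.py'
with open(config_path, 'w') as f:
    f.write(MEDIUM_CONFIG_FILE)
# Training is then launched with MMDetection's standard entry point, e.g.:
#   python mmdetection/tools/train.py /content/rtmdet_m_custom.py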

micheldom commented 5 months ago

Hi Rishav, sorry, I'm not an expert but a first-time commenter XD. Have you tried running it with AMP? Adding the --amp parameter to train with automatic mixed precision can speed up training severalfold.
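
For what it's worth, here is a sketch of the config-level equivalent of --amp, assuming MMEngine's AmpOptimWrapper and keeping the optimizer settings from the config above (verify against your MMDetection/MMEngine versions):

# Mixed-precision training via the optimizer wrapper instead of the --amp flag.
optim_wrapper = dict(
    type='AmpOptimWrapper',      # switches forward/backward to mixed precision
    loss_scale='dynamic',        # dynamic gradient scaling for fp16 stability
    optimizer=dict(type='AdamW', lr=base_lr, weight_decay=weight_decay),
    paramwise_cfg=dict(
        norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))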

Additionally, try experimenting with different batch sizes and see what the ETA is after a couple of epochs. If the batch size is too high and your GPU doesn't have enough memory, it leads to memory overflow, which significantly increases training time. I find that a batch size of 6 works on my laptop's RTX 4060, as 8 is too much for it to handle.
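
As a rough sketch of that trade-off: the base_lr of 0.004 in the config above was tuned for the 8xb32 setup (8 GPUs x 32 images = 256 per step), so scaling the learning rate linearly with the actual batch size is a common rule of thumb. The numbers below are illustrative, and auto_scale_lr only applies if the base config defines it:

# Batch size that fits the GPU, with the LR scaled down accordingly (illustrative).
train_batch_size_per_gpu = 6
base_lr = 0.004 * train_batch_size_per_gpu / 256   # linear scaling rule of thumb

# Alternatively, if the base config defines auto_scale_lr, MMEngine can rescale it:
auto_scale_lr = dict(enable=True, base_batch_size=256)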