open-mmlab / mmpretrain

OpenMMLab Pre-training Toolbox and Benchmark
https://mmpretrain.readthedocs.io/en/latest/
Apache License 2.0

The lr value increased with epoch when training Swin Transformer (small architecture) #431

Closed: fangxu622 closed this issue 3 years ago

fangxu622 commented 3 years ago

(screenshots of the learning-rate and loss curves omitted)

the config file:

...
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='SwinTransformer', arch='small', img_size=224, drop_path_rate=0.5),
    neck=dict(type='GlobalAveragePooling', dim=1),
    head=dict(
        type='MultiLabelLinearClsHead',
        num_classes=17,
        in_channels=768,
        init_cfg=None,  # suppress the default init_cfg of LinearClsHead.
        loss=dict(type='CrossEntropyLossWeight', pos_weight=[21.7, 4.641, 62.992, 53.503, 2.663, 164.863, 44.948, 42.736, 12.895, 14.641, 206.611, 18.267, 42.915, 153.185, 223.893, 194.329, 5.727] , use_sigmoid=True)
        #[21.7, 4.641, 62.992, 53.503, 2.663, 164.863, 44.948, 42.736, 12.895, 14.641, 206.611, 18.267, 42.915, 153.185, 223.893, 194.329, 5.727]
        #cal_acc=False
        ),
    init_cfg=[
        dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
        dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
    ],
    # train_cfg=dict(augments=[
    #     dict(type='BatchMixup', alpha=0.8, num_classes=17, prob=0.5),
    #     dict(type='BatchCutMix', alpha=1.0, num_classes=17, prob=0.5)
    # ])
    )

...

# schedules 
paramwise_cfg = dict(
    norm_decay_mult=0.0,
    bias_decay_mult=0.0,
    custom_keys={
        '.absolute_pos_embed': dict(decay_mult=0.0),
        '.relative_position_bias_table': dict(decay_mult=0.0)
    })

# linear lr scaling rule: lr = 5e-4 * batch_per_gpu * num_gpus / 512
# (the original config used 128 per GPU on 8 GPUs; here 196 * 7 is used)
optimizer = dict(
    type='AdamW',
    lr=5e-4 * 196 * 7 / 512,
    weight_decay=0.05,
    eps=1e-8,
    betas=(0.9, 0.999),
    paramwise_cfg=paramwise_cfg)
optimizer_config = dict(grad_clip=dict(max_norm=5.0))

# learning policy
lr_config = dict(
    policy='CosineAnnealing',
    by_epoch=False,
    min_lr_ratio=1e-2,
    warmup='linear',
    warmup_ratio=1e-3,
    warmup_iters=20 * 1252,
    warmup_by_epoch=False)

runner = dict(type='EpochBasedRunner', max_epochs=400)

mzr1996 commented 3 years ago

Hello, we set the warm-up iteration number here; it means 20 epochs of warm-up, where each epoch has 1252 iterations in the original config.

lr_config = dict(
    policy='CosineAnnealing',
    by_epoch=False,
    min_lr_ratio=1e-2,
    warmup='linear',
    warmup_ratio=1e-3,
    warmup_iters=20 * 1252,
    warmup_by_epoch=False)

You can modify it to

lr_config = dict(
    policy='CosineAnnealing',
    by_epoch=False,
    min_lr_ratio=1e-2,
    warmup='linear',
    warmup_ratio=1e-3,
    warmup_iters=20,
    warmup_by_epoch=True)

or

lr_config = dict(
    policy='CosineAnnealing',
    by_epoch=False,
    min_lr_ratio=1e-2,
    warmup='linear',
    warmup_ratio=1e-3,
    warmup_iters=20 * 759,
    warmup_by_epoch=False)

Refer to the mmcv docs.
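
To make the conversion explicit, here is a minimal sketch (not mmcv code; the dataset size and batch settings below are placeholders, not values from this issue):

import math

# Placeholder values -- substitute your own dataset and batch settings.
dataset_size = 100_000      # number of training samples
samples_per_gpu = 128       # batch size per GPU
num_gpus = 8

iters_per_epoch = math.ceil(dataset_size / (samples_per_gpu * num_gpus))

# With warmup_by_epoch=False, warmup_iters is counted in iterations,
# so 20 warm-up epochs become:
warmup_iters = 20 * iters_per_epoch

# With warmup_by_epoch=True, you pass the epoch count directly:
# warmup_iters = 20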

fangxu622 commented 3 years ago

Yes. But the loss is weird. When I set warmup_iters=5 (or 10, 20) with warmup_by_epoch=True, the loss declined, but it rose again after epoch 5 (10, 20).

But the base architecture of Swin Transformer is OK; its settings are the same except for the architecture setting (base vs. small).

(loss-curve screenshots omitted)

mzr1996 commented 3 years ago

Maybe you can try to modify the learning rate and learning rate scheduler to fit your dataset.
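
For instance (the values below are only illustrative, not tuned for your dataset), you could lower the base learning rate and shorten the warm-up:

# Illustrative only: smaller base lr and shorter warm-up.
optimizer = dict(
    type='AdamW',
    lr=1e-4,
    weight_decay=0.05,
    eps=1e-8,
    betas=(0.9, 0.999),
    paramwise_cfg=paramwise_cfg)

lr_config = dict(
    policy='CosineAnnealing',
    by_epoch=False,
    min_lr_ratio=1e-2,
    warmup='linear',
    warmup_ratio=1e-3,
    warmup_iters=5,
    warmup_by_epoch=True)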

fangxu622 commented 3 years ago

Could you please tell me how to calculate the lr according to the decay and warm-up strategies? The two strategies confuse me.

For example, the full dumped config is:

paramwise_cfg = dict(
    norm_decay_mult=0.0,
    bias_decay_mult=0.0,
    custom_keys=dict({
        '.absolute_pos_embed': dict(decay_mult=0.0),
        '.relative_position_bias_table': dict(decay_mult=0.0)
    }))
optimizer = dict(
    type='AdamW',
    lr=0.0011484375000000002,
    weight_decay=0.05,
    eps=1e-08,
    betas=(0.9, 0.999),
    paramwise_cfg=dict(
        norm_decay_mult=0.0,
        bias_decay_mult=0.0,
        custom_keys=dict({
            '.absolute_pos_embed': dict(decay_mult=0.0),
            '.relative_position_bias_table': dict(decay_mult=0.0)
        })))
optimizer_config = dict(grad_clip=dict(max_norm=5.0))
lr_config = dict(
    policy='CosineAnnealing',
    by_epoch=False,
    min_lr_ratio=0.01,
    warmup='linear',
    warmup_ratio=0.001,
    warmup_iters=8800,
    warmup_by_epoch=False)
runner = dict(type='EpochBasedRunner', max_epochs=400)

I can find the register functions in the optimizer and lr_updater files in mmcv, but I don't know how the warm-up and lr-update strategies are executed in the whole pipeline. Could you give me the formula that calculates the lr for reference? I didn't manage to calculate it correctly from the source code.

fangxu622 commented 3 years ago

The weird thing is that the same strategy works on the Swin Transformer base architecture but not on small; the loss declines fast on the base architecture.

mzr1996 commented 3 years ago

The learning rate scheduler implementation of CosineAnnealing is in mmcv. As for the detailed formula, you can refer to the PyTorch docs.
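
As a rough sketch of what those schedules compute (based on my reading of mmcv's cosine-annealing hook with linear warm-up; an approximation for reference, not the exact implementation):

import math

def cosine_annealed_lr(base_lr, min_lr_ratio, cur_iter, max_iters):
    # Cosine annealing from base_lr down to base_lr * min_lr_ratio.
    target_lr = base_lr * min_lr_ratio
    factor = cur_iter / max_iters
    return target_lr + 0.5 * (base_lr - target_lr) * (math.cos(math.pi * factor) + 1)

def lr_at_iter(base_lr, min_lr_ratio, warmup_ratio, warmup_iters, cur_iter, max_iters):
    # Approximate lr: linear warm-up applied on top of the annealed value.
    regular_lr = cosine_annealed_lr(base_lr, min_lr_ratio, cur_iter, max_iters)
    if cur_iter < warmup_iters:
        # Linear warm-up ramps from warmup_ratio * regular_lr up to regular_lr.
        k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
        return regular_lr * (1 - k)
    return regular_lr

# Example with the dumped config above (759 iters/epoch is an assumption
# taken from the earlier suggestion):
# lr_at_iter(0.0011484375, 0.01, 1e-3, 8800, cur_iter=1000, max_iters=400 * 759)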

mzr1996 commented 3 years ago

Has this question been solved?

fangxu622 commented 3 years ago

> Has this question been solved?

Yeah! Solved! Thank you for the reminder.