keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

Why does pretraining on 384x384 need a larger learning rate? #36

Closed junwuzhang19 closed 1 year ago

junwuzhang19 commented 1 year ago

Hi, thanks for your work. In general, a smaller batch size calls for a smaller learning rate. Why does SparK need a larger learning rate with a smaller batch size when pretraining on 384x384?

keyu-tian commented 1 year ago

Thanks. First, note that --base_lr is the base learning rate: the actual lr is base_lr * bs / 256, as computed in /pretrain/utils/arg_util.py (line 131). This linear lr scaling rule is commonly used in prior work such as MAE.

So basically, when we change the batch size, we don't need to touch --base_lr; the actual lr adjusts itself.
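
To make the scaling rule concrete, here is a minimal sketch (the function name and the example numbers are illustrative, not taken from the repo's code):

```python
def scale_lr(base_lr: float, batch_size: int, reference_bs: int = 256) -> float:
    """Linear scaling rule: the actual lr grows proportionally with the global batch size."""
    return base_lr * batch_size / reference_bs

# The same --base_lr yields different actual lrs for different batch sizes.
print(scale_lr(2e-4, 4096))  # 3.2e-3
print(scale_lr(2e-4, 1024))  # 8e-4
```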

But yes, we do double the base_lr in the 384 config, following BEiT. The likely reason is that the pretraining task at 384 is more difficult, so a larger base_lr is needed.
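
For concreteness, a sketch of what doubling the base_lr means on top of the scaling rule above (the numbers are placeholders, not the actual values from the SparK configs):

```python
batch_size = 2048                # placeholder global batch size
base_lr_224 = 2e-4               # hypothetical base lr used at 224x224
base_lr_384 = 2 * base_lr_224    # doubled at 384x384, following BEiT

# The same linear scaling rule is applied afterwards in both cases.
actual_lr_384 = base_lr_384 * batch_size / 256   # -> 3.2e-3
```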