keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

Why does pretraining on 384x384 need a larger learning rate? #36

Closed junwuzhang19 closed 1 year ago

junwuzhang19 commented 1 year ago

Hi, thanks for your work. In general, a smaller batch size calls for a smaller learning rate. Why does SparK need a larger learning rate with a smaller batch size when pretraining on 384x384?

keyu-tian commented 1 year ago

Thanks. First, note that --base_lr is the base learning rate: the actual lr is base_lr * bs / 256, as computed in /pretrain/utils/arg_util.py (line 131). This linear lr scaling rule is commonly used in prior work such as MAE.

So basically, when we change the batch size, we don't need to touch --base_lr; the actual lr adjusts itself.
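
To make the scaling rule concrete, here is a minimal sketch (the function name and the example numbers are illustrative, not taken from the repo's code):

```python
def scale_lr(base_lr: float, batch_size: int, reference_bs: int = 256) -> float:
    """Linear scaling rule: the actual lr grows proportionally with the global batch size."""
    return base_lr * batch_size / reference_bs

# The same --base_lr yields different actual lrs for different batch sizes.
print(scale_lr(2e-4, 4096))  # 3.2e-3
print(scale_lr(2e-4, 1024))  # 8e-4
```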

But yes, we do double the base_lr in the 384 config, following BEiT. The likely reason is that the pretraining task at 384 is more difficult, so a larger base_lr is needed.
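
For concreteness, a sketch of what doubling the base_lr means on top of the scaling rule above (the numbers are placeholders, not the actual values from the SparK configs):

```python
batch_size = 2048                # placeholder global batch size
base_lr_224 = 2e-4               # hypothetical base lr used at 224x224
base_lr_384 = 2 * base_lr_224    # doubled at 384x384, following BEiT

# The same linear scaling rule is applied afterwards in both cases.
actual_lr_384 = base_lr_384 * batch_size / 256   # -> 3.2e-3
```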