microsoft/mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Does mup support fine-tuning pretrained models? #46

Closed · jhj0411jhj closed 1 year ago

jhj0411jhj commented 1 year ago

Hi, I'm trying to tune the hyperparameters of a pretrained model (e.g. a ResNet or Swin Transformer) during the fine-tuning stage. If I scale the model using mup, the pretrained weights can no longer be used. I also think the best HPs for training a model from scratch may differ from the best HPs for fine-tuning it. Can mup be applied to this scenario? Thanks.
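For concreteness, here is a minimal sketch of what I mean (the toy MLP, the widths, and the checkpoint path are just placeholders):

```python
import torch
import torch.nn as nn
from mup import MuReadout, set_base_shapes

# Toy width-parametrized model; only `width` is scaled.
class MLP(nn.Module):
    def __init__(self, width, d_in=32, d_out=10):
        super().__init__()
        self.fc = nn.Linear(d_in, width)
        self.readout = MuReadout(width, d_out)  # µP output layer

    def forward(self, x):
        return self.readout(torch.relu(self.fc(x)))

model = MLP(width=1024)
# base/delta models only supply shape information for µP scaling.
set_base_shapes(model, MLP(width=64), delta=MLP(width=128))

# A checkpoint pretrained *without* µP doesn't carry over: at a different
# width the state_dict shapes don't match, and even at the same width the
# weights were trained under a different (standard) parametrization.
state = torch.load('pretrained_width768.pt')  # placeholder path
model.load_state_dict(state)                  # size-mismatch error here
```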

edwardjhu commented 1 year ago

Hi Huaijun,

Thanks for the question. You are right that the best finetuning HPs usually differ from those used for pretraining because of differences in datasets and batch sizes. Finding the best way to transfer hyperparameters for finetuning is ongoing work, because regularization matters so much in that regime. You might be able to transfer finetuning HPs by using two pretrained µP models of different sizes, but in our experience this doesn't work as well as it does for pretraining, for the reason mentioned above.
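Roughly, that two-model recipe would look like the sketch below, reusing a toy width-parametrized MLP like the one in your snippet. The widths, checkpoint paths, and learning rate are placeholders, and I'm assuming set_base_shapes's rescale_params=False to avoid rescaling weights that were already trained in µP:

```python
import torch
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

# Same toy width-parametrized MLP as in the question above.
class MLP(nn.Module):
    def __init__(self, width, d_in=32, d_out=10):
        super().__init__()
        self.fc = nn.Linear(d_in, width)
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(torch.relu(self.fc(x)))

def load_mup_pretrained(width, ckpt_path):
    model = MLP(width)
    # base/delta models only supply shape info; rescale_params=False so
    # the already-trained µP weights are not rescaled a second time.
    set_base_shapes(model, MLP(64), delta=MLP(128), rescale_params=False)
    # The checkpoint must itself have been pretrained in µP at this width.
    model.load_state_dict(torch.load(ckpt_path))
    return model

# 1) Sweep finetuning HPs on the small pretrained model...
small = load_mup_pretrained(width=256, ckpt_path='small_mup_pretrained.pt')
opt = MuAdam(small.parameters(), lr=1e-4)  # candidate finetuning lr

# 2) ...then reuse the best settings on the large pretrained model.
large = load_mup_pretrained(width=4096, ckpt_path='large_mup_pretrained.pt')
opt = MuAdam(large.parameters(), lr=1e-4)  # transferred across width
```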

Hope this helps!

jhj0411jhj commented 1 year ago

Many thanks for your explanation!