Closed jhj0411jhj closed 1 year ago
Hi Huaijun,
Thanks for the question. You are right that the best finetuning HPs are usually different from the ones used for pretraining, because of differences in datasets and batch sizes. Finding the best way to transfer hyperparameters during finetuning is ongoing work, because regularization is much more important in that regime. You might be able to transfer finetuning HPs by using two pretrained mup models of different sizes, but in our experience it doesn't work as well as it does for pretraining, for the reason mentioned above.
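To make the transfer idea concrete, here is a minimal sketch of muP-style learning-rate transfer between two widths. This is an illustration, not the mup library's API: under muP with an Adam-like optimizer, the learning rate of hidden (matrix-like) weights scales roughly as 1/width, so an LR tuned on a narrow proxy model can be mapped to a wider target. The function name and widths below are hypothetical.

```python
# Hypothetical sketch of muP-style LR transfer (not the mup package API).
# Assumption: Adam-like optimizer, where hidden-weight LRs scale as 1/width.

def transfer_lr(lr_proxy: float, width_proxy: int, width_target: int) -> float:
    """Scale a hidden-layer LR tuned at width_proxy to width_target."""
    return lr_proxy * width_proxy / width_target

# LR tuned on a width-256 proxy model, transferred to a width-1024 target:
lr_target = transfer_lr(1e-3, 256, 1024)
print(lr_target)  # 0.00025
```

In practice the mup package handles this per-parameter-group via its optimizers, but the scaling rule above is the essence of why a small proxy sweep can stand in for a sweep on the full model.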
Hope this helps!
Many thanks for your explanation!
Hi, I'm trying to tune the hyperparameters of a pretrained model (e.g. ResNet or Swin Transformer) during the finetuning stage. If I scale the model using mup, the pretrained weights cannot be used anymore. And I think the best HPs for fully training a model might be different from the best HPs for finetuning it. Can mup be applied to this scenario? Thanks.
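The loading problem described above can be seen with a minimal sketch (layer names and shapes are illustrative, not from any real checkpoint): rescaling a model's width changes its parameter shapes, so the original checkpoint no longer matches the rescaled model's state dict.

```python
# Illustrative sketch (not the mup API): changing a model's width changes
# its parameter shapes, so a pretrained checkpoint no longer loads directly.

def compatible(checkpoint_shapes: dict, model_shapes: dict) -> bool:
    """Return True if every checkpoint tensor matches the model's shape."""
    return checkpoint_shapes.keys() == model_shapes.keys() and all(
        checkpoint_shapes[k] == model_shapes[k] for k in model_shapes
    )

pretrained = {"layer1.weight": (512, 256)}   # checkpoint at the original width
rescaled   = {"layer1.weight": (1024, 512)}  # same architecture, scaled width

print(compatible(pretrained, pretrained))  # True
print(compatible(pretrained, rescaled))    # False
```

This is why HP transfer for finetuning needs two pretrained mup models of different sizes, as suggested in the answer above, rather than simply rescaling a single checkpoint.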