microsoft / UniVL

An official implementation for " UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
https://arxiv.org/abs/2002.06353
MIT License

What's the role of the parameter coef_lr? #14

Closed forence closed 3 years ago

forence commented 3 years ago

Hi Arrow,
I observed that after pre-training stage 1, the BERT parameters have changed very little from their initialization. Is that the effect of the parameter coef_lr? It is set to 0.1 in stage 1 and to 1 in stage 2, so I guess its purpose is to prevent BERT from being damaged at the beginning of training. https://github.com/microsoft/UniVL/blob/0a7c07f566a3b220731f4abcaa6e1ee59a686596/main_pretrain.py#L383-L385

By the way, the no_decay_xxx variables are assigned the weight-decay coefficient, while the decay_xxx variables get no decay. Are these names typos? https://github.com/microsoft/UniVL/blob/0a7c07f566a3b220731f4abcaa6e1ee59a686596/main_pretrain.py#L191-L194

ArrowLuo commented 3 years ago

Hi @forence, you are right. We use a smaller learning rate for BERT to prevent the pretrained weights from being damaged. And yes, no_decay_xxx and decay_xxx are typos. Thanks.
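The mechanism discussed above can be sketched in plain Python. This is an illustrative sketch, not the repository's exact code: it assumes pretrained parameters are identified by a "bert." name prefix (as in main_pretrain.py) and shows how a coefficient like coef_lr scales their learning rate while newly initialized modules train at the full base rate. The helper name build_param_groups is hypothetical.

```python
def build_param_groups(param_names, base_lr, coef_lr):
    """Assign base_lr * coef_lr to pretrained (BERT) parameters and the
    full base_lr to everything else. In the real code these dicts would
    also carry the parameter tensors and a weight_decay entry, and would
    be passed to the optimizer as parameter groups."""
    groups = []
    for name in param_names:
        is_pretrained = name.startswith("bert.")  # illustrative prefix check
        lr = base_lr * coef_lr if is_pretrained else base_lr
        groups.append({"name": name, "lr": lr})
    return groups

# Stage 1: coef_lr = 0.1, so BERT trains 10x slower than the new modules.
groups = build_param_groups(
    ["bert.encoder.layer.0.attention.self.query.weight", "decoder.weight"],
    base_lr=1e-4,
    coef_lr=0.1,
)
for g in groups:
    print(g["name"], g["lr"])
```

With coef_lr = 1 (stage 2), both groups receive the same learning rate, which matches the behavior described in the thread: a gentle warm-up of the pretrained backbone first, then full-rate fine-tuning.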