Hi Arrow,
I observed that after pre-training stage 1, the parameters of BERT changed very little from their initialized values. Is this because the parameter `coef_lr` is working? It was set to 0.1 in the 1st stage and to 1 in the 2nd stage. I guess this is to prevent BERT from being damaged at the beginning of training.
https://github.com/microsoft/UniVL/blob/0a7c07f566a3b220731f4abcaa6e1ee59a686596/main_pretrain.py#L383-L385
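For context, here is a minimal sketch of what such a coefficient typically does: it scales the learning rate of the pretrained backbone's parameter group relative to the newly initialized modules. The group names and values below are illustrative, not the repo's exact code (see the linked `main_pretrain.py` for the real grouping).

```python
# Sketch of per-group learning-rate scaling (framework-free; hypothetical names).
base_lr = 1e-4
coef_lr = 0.1  # stage-1 value from the question; 1 in stage 2

# Mimic optimizer parameter groups: the pretrained BERT group gets a scaled
# learning rate, so its weights move ~10x more slowly than the new modules'.
param_groups = [
    {"name": "bert", "lr": base_lr * coef_lr},
    {"name": "new_modules", "lr": base_lr},
]

for group in param_groups:
    print(group["name"], group["lr"])
```

With `coef_lr = 0.1`, the BERT group's effective learning rate is one tenth of the base rate, which would explain why its parameters barely move in stage 1.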
By the way, you named `no_decay_xxx` with the decay coefficient, and named `decay_xxx` without the decay coefficient. Are these typos?
https://github.com/microsoft/UniVL/blob/0a7c07f566a3b220731f4abcaa6e1ee59a686596/main_pretrain.py#L191-L194
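For reference, the conventional BERT-style grouping excludes bias and LayerNorm parameters from weight decay. The sketch below (hypothetical parameter names, not the repo's code) shows that convention; if the repo's variable names are swapped, it may just be a naming typo while the `weight_decay` values are still correct.

```python
# Conventional BERT-style weight-decay grouping (a sketch, not UniVL's exact
# code): bias and LayerNorm parameters get no weight decay.
no_decay = ["bias", "LayerNorm.weight"]

# Hypothetical parameter names standing in for model.named_parameters().
named_params = [
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.0.attention.self.query.bias",
    "encoder.layer.0.LayerNorm.weight",
]

decay_group = [n for n in named_params if not any(nd in n for nd in no_decay)]
no_decay_group = [n for n in named_params if any(nd in n for nd in no_decay)]

grouped = [
    {"params": decay_group, "weight_decay": 0.01},  # regular weights decay
    {"params": no_decay_group, "weight_decay": 0.0},  # bias/LayerNorm do not
]
```

Under this convention, the group built from the `no_decay` name list is the one with `weight_decay=0.0`, which is why the naming in the linked lines looks inverted.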