The current implementation already starts decaying during the warmup. I don't think this is a big problem, since `warmup_steps` is usually a very small fraction of the total training steps.

At some later point, we should probably unify the learning rate scheduling between `train` and `pretrain`.
Add linear learning rate warmup to pretrain.
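To illustrate the behavior described above, here is a minimal sketch (all names and values are hypothetical, not taken from the actual code) of a schedule where the linear warmup factor is multiplied onto a decay factor that runs from step 0, so the learning rate is already decaying during warmup:

```python
def lr_with_warmup(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Hypothetical sketch of linear warmup combined with linear decay.

    The decay factor is applied from step 0, so the effective learning
    rate already decays during the warmup phase, mirroring the current
    behavior described above.
    """
    warmup = min(1.0, step / warmup_steps)          # ramps 0 -> 1 over warmup_steps
    decay = max(0.0, 1.0 - step / total_steps)      # decays 1 -> 0 over total_steps
    return base_lr * warmup * decay
```

At `step == warmup_steps` the warmup factor has reached 1.0, but the decay factor has already dropped below 1.0, which is the slight inaccuracy the note considers acceptable when `warmup_steps` is small relative to `total_steps`.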