The current implementation already starts decaying during the warmup. I don't think this is a big problem, since `warmup_steps` is usually a very small fraction of the total training steps.

At some later point, we should probably unify the learning rate scheduling between `train` and `pretrain`.
Add linear learning rate warmup to pretrain.
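To illustrate the behavior described above, here is a minimal sketch (all names and values are hypothetical, not taken from the actual code) of a schedule where the linear warmup factor is multiplied onto a decay factor that runs from step 0, so the learning rate is already decaying during warmup:

```python
def lr_with_warmup(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Hypothetical sketch of linear warmup combined with linear decay.

    The decay factor is applied from step 0, so the effective learning
    rate already decays during the warmup phase, mirroring the current
    behavior described above.
    """
    warmup = min(1.0, step / warmup_steps)          # ramps 0 -> 1 over warmup_steps
    decay = max(0.0, 1.0 - step / total_steps)      # decays 1 -> 0 over total_steps
    return base_lr * warmup * decay
```

At `step == warmup_steps` the warmup factor has reached 1.0, but the decay factor has already dropped below 1.0, which is the slight inaccuracy the note considers acceptable when `warmup_steps` is small relative to `total_steps`.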