layer6ai-labs / T-Fixup

Code for the ICML'20 paper "Improving Transformer Optimization Through Better Initialization"
MIT License

Is it necessary to make the number of encoder layers equal to the number of decoder layers? #4

Closed SefaZeng closed 1 year ago

SefaZeng commented 3 years ago

I tried to train a base model with a 40-layer encoder and a 6-layer decoder on a 20M-scale dataset, with a learning rate of 5e-4, 4000 decay steps, dropout=0.3, and batch_size=8k. The training runs on 8 GPUs, so the effective batch size is 64k tokens per update. But the gradients quickly become very large and training diverges. If I lower the learning rate, this happens later but training still diverges, and if the learning rate is too low, the loss stays high and the BLEU is very low. Are there any tricks for your method? Any help is appreciated.

risingdhxs commented 3 years ago

Answer to the question: no, you don't need the number of encoder layers to equal the number of decoder layers.
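
To illustrate that the two depths are independent, here is a minimal sketch using the stock `torch.nn.Transformer` (a generic PyTorch module, not the T-Fixup model from this repo; all dimensions are placeholders):

```python
import torch
import torch.nn as nn

# Generic PyTorch Transformer: encoder and decoder depths are independent arguments.
# (Illustration only; the T-Fixup model in this repo has its own construction and init.)
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=40,   # deep encoder
    num_decoder_layers=6,    # shallow decoder
    dim_feedforward=2048,
    dropout=0.3,
)

src = torch.rand(10, 2, 512)   # (src_len, batch, d_model)
tgt = torch.rand(7, 2, 512)    # (tgt_len, batch, d_model)
out = model(src, tgt)          # works fine with 40 encoder layers and 6 decoder layers
print(out.shape)               # torch.Size([7, 2, 512])
```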

For the divergence, given such a deep model and the large dataset, I would suggest combining a lower learning rate (as you have already tried) with a larger dropout. The effective batch size is also much larger than what we tested, and I'm not sure what effect that has on training (Transformers typically benefit from larger batches, but in our case it might make training more fragile). If you're using fp16 during training, consider training in fp32 instead, as it is much less likely to blow up.
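
As a rough sketch of those changes in plain PyTorch (the model, loss, and hyperparameter values below are stand-ins for illustration, not this repo's training script or the paper's settings):

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters only:
peak_lr = 3e-4   # lower than the 5e-4 that diverged
dropout = 0.4    # larger dropout for a very deep encoder on a large dataset

# Stand-in model; in practice this would be the repo's T-Fixup Transformer.
model = nn.Transformer(num_encoder_layers=40, num_decoder_layers=6, dropout=dropout)
model = model.float()  # keep everything in fp32: no fp16 / AMP autocast

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr, betas=(0.9, 0.98), eps=1e-9)

criterion = nn.MSELoss()        # dummy loss just to keep the snippet self-contained
src = torch.rand(10, 2, 512)
tgt = torch.rand(7, 2, 512)

for step in range(3):           # stand-in for the real training loop
    optimizer.zero_grad()
    out = model(src, tgt)
    loss = criterion(out, torch.zeros_like(out))
    loss.backward()
    # Watching the gradient norm makes an impending blow-up visible early;
    # max_norm is set huge here so this call only measures, it does not really clip.
    grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e9)
    optimizer.step()
```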

Regarding the problem of low LR -> high loss, please consider combining a moderate learning rate with a slightly longer decay period, so the learning rate lingers in the higher-value region a bit longer.
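
One way to keep the learning rate near its peak for longer is to stretch the decay schedule. Here is a minimal sketch with `torch.optim.lr_scheduler.LambdaLR`, assuming an inverse-sqrt style decay; the schedule shape, the hold length, and the peak LR are illustrative assumptions, not the exact schedule used for the paper's experiments:

```python
import torch

def make_lr_lambda(hold_steps=8000):
    """Inverse-sqrt decay that holds the peak LR for `hold_steps` updates first.

    Lengthening `hold_steps` keeps the LR in the higher-value region longer.
    Illustrative schedule only.
    """
    def lr_lambda(step):
        if step < hold_steps:
            return 1.0                                 # stay at the peak LR
        return (hold_steps / max(step, 1)) ** 0.5      # then decay as 1/sqrt(step)
    return lr_lambda

params = [torch.nn.Parameter(torch.zeros(1))]          # dummy parameter
optimizer = torch.optim.Adam(params, lr=3e-4)          # moderate peak LR (illustrative)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=make_lr_lambda())

for step in range(20000):
    optimizer.step()      # in real training, forward/backward happen before this
    scheduler.step()      # advance the LR schedule once per update
```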

Please let me know if the above changes helped.