layer6ai-labs / T-Fixup

Code for the ICML'20 paper "Improving Transformer Optimization Through Better Initialization"
MIT License

Is it necessary to make the number of encoder layers equal to the number of decoder layers? #4

Closed SefaZeng closed 1 year ago

SefaZeng commented 3 years ago

I tried to train a base model with a 40-layer encoder and a 6-layer decoder on a 20M-scale dataset, with a learning rate of 5e-4, 4000 decay steps, dropout=0.3, and batch_size=8k. The training runs on 8 GPUs, so the effective batch size is 64k tokens per update. But the gradients quickly become very large and training diverges. If I lower the learning rate, this happens later but training still diverges, and if the learning rate is too low, the loss stays high and the BLEU is very low. Are there any tricks for your method? Any help is appreciated.

risingdhxs commented 3 years ago

Answer to the question: no, you don't need the number of encoder layers to equal the number of decoder layers.
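
To illustrate that the two depths are independent, here is a minimal sketch using the stock `torch.nn.Transformer` (a generic PyTorch module, not the T-Fixup model from this repo; all dimensions are placeholders):

```python
import torch
import torch.nn as nn

# Generic PyTorch Transformer: encoder and decoder depths are independent arguments.
# (Illustration only; the T-Fixup model in this repo has its own construction and init.)
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=40,   # deep encoder
    num_decoder_layers=6,    # shallow decoder
    dim_feedforward=2048,
    dropout=0.3,
)

src = torch.rand(10, 2, 512)   # (src_len, batch, d_model)
tgt = torch.rand(7, 2, 512)    # (tgt_len, batch, d_model)
out = model(src, tgt)          # works fine with 40 encoder layers and 6 decoder layers
print(out.shape)               # torch.Size([7, 2, 512])
```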

For the divergence, given such a deep model and the large dataset, I would suggest combining a lower learning rate (as you have already tried) with a larger dropout. The effective batch size is also much larger than what we tested, and I'm not sure what effect that has on training (Transformers typically benefit from larger batches, but in our case it might make training more fragile). If you're using fp16 during training, consider training in fp32 instead, as it is much less likely to blow up.
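
As a rough sketch of those changes in plain PyTorch (the model, loss, and hyperparameter values below are stand-ins for illustration, not this repo's training script or the paper's settings):

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters only:
peak_lr = 3e-4   # lower than the 5e-4 that diverged
dropout = 0.4    # larger dropout for a very deep encoder on a large dataset

# Stand-in model; in practice this would be the repo's T-Fixup Transformer.
model = nn.Transformer(num_encoder_layers=40, num_decoder_layers=6, dropout=dropout)
model = model.float()  # keep everything in fp32: no fp16 / AMP autocast

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr, betas=(0.9, 0.98), eps=1e-9)

criterion = nn.MSELoss()        # dummy loss just to keep the snippet self-contained
src = torch.rand(10, 2, 512)
tgt = torch.rand(7, 2, 512)

for step in range(3):           # stand-in for the real training loop
    optimizer.zero_grad()
    out = model(src, tgt)
    loss = criterion(out, torch.zeros_like(out))
    loss.backward()
    # Watching the gradient norm makes an impending blow-up visible early;
    # max_norm is set huge here so this call only measures, it does not really clip.
    grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e9)
    optimizer.step()
```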

Regarding the problem of low LR -> high loss, please consider combining a moderate learning rate with a slightly longer decay period, so the learning rate lingers in the higher-value region a bit longer.
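
One way to keep the learning rate near its peak for longer is to stretch the decay schedule. Here is a minimal sketch with `torch.optim.lr_scheduler.LambdaLR`, assuming an inverse-sqrt style decay; the schedule shape, the hold length, and the peak LR are illustrative assumptions, not the exact schedule used for the paper's experiments:

```python
import torch

def make_lr_lambda(hold_steps=8000):
    """Inverse-sqrt decay that holds the peak LR for `hold_steps` updates first.

    Lengthening `hold_steps` keeps the LR in the higher-value region longer.
    Illustrative schedule only.
    """
    def lr_lambda(step):
        if step < hold_steps:
            return 1.0                                 # stay at the peak LR
        return (hold_steps / max(step, 1)) ** 0.5      # then decay as 1/sqrt(step)
    return lr_lambda

params = [torch.nn.Parameter(torch.zeros(1))]          # dummy parameter
optimizer = torch.optim.Adam(params, lr=3e-4)          # moderate peak LR (illustrative)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=make_lr_lambda())

for step in range(20000):
    optimizer.step()      # in real training, forward/backward happen before this
    scheduler.step()      # advance the LR schedule once per update
```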

Please let me know if the above changes helped.