layer6ai-labs / T-Fixup

Code for the ICML'20 paper "Improving Transformer Optimization Through Better Initialization"
MIT License

FP16 Training #1

Closed libeineu closed 4 years ago

libeineu commented 4 years ago

Hi! Very cool work, with thorough theoretical proofs and exciting results! My name is Bei Li, and I am the author of Learning Deep Transformer Models for Machine Translation. Recently there have been many works focusing on improving deep Transformers through better initialization strategies, but a serious problem is that FP16 training becomes unstable when using those strategies. Have you tried FP16 training in this work? It would be very interesting to know. Looking forward to your next work!
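(For context, the instability comes from fp16's limited dynamic range: the largest representable value is roughly 65504, so unusually large activations or gradients overflow to inf and trip the dynamic loss scaler. A minimal PyTorch illustration, with arbitrary values chosen only to trigger the overflow:)

```python
import torch

# fp16 overflows beyond ~65504, whereas fp32 represents the same value easily.
x32 = torch.tensor([70000.0], dtype=torch.float32)
x16 = x32.to(torch.float16)
print(x32)  # tensor([70000.])
print(x16)  # tensor([inf], dtype=torch.float16)

# Gradients behave the same way: once any element overflows, a dynamic loss
# scaler (such as the one fairseq uses in fp16 mode) skips the update and
# shrinks the loss scale.
g16 = torch.full((4,), 3.0e4, dtype=torch.float16)
print((g16 * 4).isinf().any())  # tensor(True)
```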

libeineu commented 4 years ago

Hello, I have run the script on the WMT14 En-De task, but I cannot reproduce the 29.1 BLEU score yet. Could you please provide the training details?

risingdhxs commented 4 years ago

Hi Bei,

Thank you for your interest in our work. Both the deep models and the big model on WMT'17 En-De were trained with fp16, while the WMT'17 BASE and IWSLT'14 models were trained with fp32, mainly due to training time concerns. Our T-Fixup models train well with fp16 precision, but we did notice an occasional loss scale overflow problem, especially with very deep models.

These errors are not caused by T-Fixup; they seem to be associated with fairseq's fp16 mode (see e.g. https://github.com/pytorch/fairseq/issues/512). We don't yet know how to solve them, but we did notice that reducing the learning rate helps in such situations.
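(A minimal sketch of how one might react to repeated loss-scale overflows outside of fairseq, using torch.cuda.amp with a lower initial loss scale and a reduced learning rate; it assumes a CUDA device. fairseq's --fp16 mode uses its own FP16Optimizer with analogous options such as --fp16-init-scale, so this only illustrates the idea and is not the recipe used for the paper's models:)

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Toy model standing in for a Transformer; the shapes are arbitrary.
model = torch.nn.Linear(512, 512).cuda()

# A lower learning rate, as suggested above, reduces the chance of overflow.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# init_scale below the default (2**16) plays the same role as lowering the
# initial loss scale in fairseq when the scaler keeps overflowing.
scaler = GradScaler(init_scale=2.0 ** 8)

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with autocast():                  # run the forward pass in fp16 where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()     # scale the loss to keep fp16 grads finite
    scaler.step(optimizer)            # skips the update if gradients overflowed
    scaler.update()                   # shrinks the scale after an overflow
```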

For the WMT'14 En-De model reproduction, please refer to our email exchanges, or to the model parameter section in the supplementary file.