Hi,
previously, I trained a T-Fixup base model with 6 layers in both the encoder and decoder. The model achieves competitive performance against a vanilla pre-norm Transformer baseline and shows fairly stable gradients (i.e., norm around 1.0) throughout training.
Then I tried to train a base model with a 40-layer encoder and a 6-layer decoder on a 20M-sentence dataset, with a learning rate of 5e-4, 4000 decay steps, dropout=0.3, and batch_size=8k. The training runs on 8 GPUs, so the effective batch size is 64k tokens per update.
FP16 training is used in both of the above cases. However, the deep model updates only once and then overflows endlessly. Moreover, the single successful update yields a large gradient norm (9.84). Did you use any gradient clipping technique to avoid this problem? Neither the default value of 25.0 nor explicitly setting it to 1.0 works for me. Or is it related to the imbalance between encoder and decoder layers? Could you please share the exact hyperparameters for the deep and big models?
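For reference, this is the global-norm clipping rule I'm assuming the clip threshold applies to (a minimal self-contained sketch, not your actual implementation; the function name and epsilon are mine):

```python
import math

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient vectors so that their combined L2
    norm does not exceed max_norm (the usual global-norm clipping rule,
    e.g. as in torch.nn.utils.clip_grad_norm_).

    grads: list of lists of floats (one list per parameter tensor).
    Returns the (possibly rescaled) gradients and the pre-clip norm.
    """
    # Global L2 norm over all parameters, not per-tensor.
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + eps)
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total_norm
```

With a pre-clip norm of 9.84, a threshold of 25.0 would leave the gradients untouched, which may be why it made no difference for me; a threshold of 1.0 does rescale them, yet the FP16 overflows still occur on the very next step.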
The following is part of the training log: