Hi,
previously, I trained a T-Fixup base model with 6 layers in both the encoder and decoder. The model achieves competitive performance against a vanilla pre-norm Transformer baseline and shows fairly stable gradients (i.e., norm around 1.0) throughout training.
Then I tried to train a base model with a 40-layer encoder and a 6-layer decoder on a 20M-sentence dataset, with a learning rate of 5e-4, 4000 decay steps, dropout=0.3, and batch_size=8k. The training runs on 8 GPUs, so the effective batch size is 64k tokens per update.
FP16 training is used in both of the above cases. However, the deep model updates only once and then overflows endlessly. Moreover, the single successful update yields a large gradient norm (9.84). Did you use any gradient clipping technique to avoid this problem? Neither the default value of 25.0 nor explicitly setting it to 1.0 works for me. Or is it related to the imbalance between encoder and decoder layers? Could you please share the exact hyperparameters for the deep and big models?
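For reference, this is the global-norm clipping rule I'm assuming the clip threshold applies to (a minimal self-contained sketch, not your actual implementation; the function name and epsilon are mine):

```python
import math

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient vectors so that their combined L2
    norm does not exceed max_norm (the usual global-norm clipping rule,
    e.g. as in torch.nn.utils.clip_grad_norm_).

    grads: list of lists of floats (one list per parameter tensor).
    Returns the (possibly rescaled) gradients and the pre-clip norm.
    """
    # Global L2 norm over all parameters, not per-tensor.
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + eps)
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total_norm
```

With a pre-clip norm of 9.84, a threshold of 25.0 would leave the gradients untouched, which may be why it made no difference for me; a threshold of 1.0 does rescale them, yet the FP16 overflows still occur on the very next step.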
The following is part of the training log: