Closed by KennyShang 1 year ago
I ran into a similar problem: the BLEU score drops to 0 after about 10K training steps.
Hi @KennyShang @maydaygmail, could you provide more details about your implementation (e.g. the alpha and beta you actually used, learning rate, batch size, warmup, Adam's betas)?
BTW: "Up scale the residual x with beta;" the beta here should be alpha.
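To make the correction above concrete, here is a minimal sketch of the residual step with alpha as the up-scaling factor, written in plain Python on a single vector. It assumes the DeepNorm form LayerNorm(alpha * x + sublayer(x)); the function names `layer_norm` and `deepnorm_residual` are hypothetical, the learnable gain and bias of LayerNorm are omitted, and this is not the torchscale implementation. Beta does not appear in the residual at all; as far as the DeepNet paper describes it, beta only scales the initialization of some sublayer weights.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def deepnorm_residual(x, sublayer_out, alpha):
    """Post-LN residual with the skip connection up-scaled by alpha (assumed DeepNorm form)."""
    return layer_norm([alpha * xi + si for xi, si in zip(x, sublayer_out)])

# alpha = 1.0 recovers the ordinary Post-LN residual: LayerNorm(x + sublayer(x)).
x = [1.0, 2.0, 3.0]
sublayer_out = [0.5, -0.5, 0.0]  # stand-in for attention or FFN output
print(deepnorm_residual(x, sublayer_out, alpha=1.0))
print(deepnorm_residual(x, sublayer_out, alpha=2.0))
```

A larger alpha makes the skip path dominate the sublayer output before normalization, so scaling by beta here instead of alpha would change how strongly each sublayer update is weighted.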
Thanks @shumingma, I will check the hyperparameters. Does training the 1000-layer DeepNet model require model parallelism?
@KennyShang @maydaygmail https://github.com/microsoft/torchscale
Need code of DeepNet for reproduction
Failed to reproduce the DeepNet paper with the TL;DR section
Modifications for Post-LN Transformers
Expected behavior: After 30k training steps, the BLEU score drops from 24-25 to 0 in a model with a 35L-35L configuration.
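When comparing hyperparameters for a 35L-35L run, it may help to compute the alpha and beta values numerically. The sketch below is an assumption-laden aid, not the authors' code: the encoder-decoder formulas are quoted from memory of the DeepNet paper and should be verified against the paper or torchscale before relying on them, and the function name `deepnorm_constants` is hypothetical.

```python
def deepnorm_constants(num_encoder_layers, num_decoder_layers):
    """Return DeepNorm alpha/beta for an encoder-decoder model.

    Formulas are ASSUMED from the DeepNet paper's encoder-decoder setting
    (N encoder layers, M decoder layers); verify before use.
    """
    n, m = num_encoder_layers, num_decoder_layers
    return {
        "encoder_alpha": 0.81 * (n ** 4 * m) ** (1 / 16),
        "encoder_beta": 0.87 * (n ** 4 * m) ** (-1 / 16),
        "decoder_alpha": (3 * m) ** (1 / 4),
        "decoder_beta": (12 * m) ** (-1 / 4),
    }

# Values for the 35L-35L configuration reported in this issue.
for name, value in deepnorm_constants(35, 35).items():
    print(f"{name}: {value:.4f}")
```

If the values used in a failing run differ noticeably from these, or if alpha and beta were swapped between the residual scaling and the initialization, that is a plausible first place to look.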