microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

code for deepnet & reproduction #663

Closed KennyShang closed 1 year ago

KennyShang commented 2 years ago

Need the DeepNet code for reproduction. I failed to reproduce the DeepNet paper following the TL;DR section.

Modifications for post-LN transformers:

  1. Calculate the alpha and beta for the encoder and decoder;
  2. Up scale the residual x with beta;
  3. Reinitialize the weights of the q, k, v, output, and FFN projections according to the paper;
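For reference, the encoder-decoder alpha/beta constants can be computed as below. This is a minimal sketch; the function name is mine, but the formulas are the ones given in the DeepNet paper for an N-layer encoder and M-layer decoder:

```python
def deepnet_params(N: int, M: int):
    """DeepNet alpha/beta for an N-layer encoder, M-layer decoder
    (encoder-decoder formulas from the DeepNet paper)."""
    enc_alpha = 0.81 * (N ** 4 * M) ** (1 / 16)   # scales encoder residuals
    enc_beta = 0.87 * (N ** 4 * M) ** (-1 / 16)   # init gain for encoder weights
    dec_alpha = (3 * M) ** 0.25                   # scales decoder residuals
    dec_beta = (12 * M) ** (-0.25)                # init gain for decoder weights
    return enc_alpha, enc_beta, dec_alpha, dec_beta

# 35L-35L config from the report above
print(deepnet_params(35, 35))
```

One more caveat on step 3: if I read the paper correctly, beta is applied as an Xavier gain only to the FFN and the attention value/output projections; the q and k projections keep a gain of 1, which differs from reinitializing "q k v output and ffn" uniformly.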

Expected behavior: after 30k training steps, the BLEU score drops to 0 from 24-25 in a model with a 35L-35L config.

maydaygmail commented 2 years ago

I met a similar problem: the BLEU score drops to 0 after about 10k training steps.

shumingma commented 2 years ago

Hi @KennyShang @maydaygmail Could you provide more details about your implementation (e.g., the alpha and beta you actually used, learning rate, batch size, warmup, Adam's betas)?

BTW: in "Up scale the residual x with beta;", the beta here should be alpha.
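To make that correction concrete, here is a minimal sketch (plain Python, illustrative names) of the DeepNorm residual: alpha multiplies the residual branch at every forward pass, while beta never appears in the forward pass at all, only as an initialization gain on the weights.

```python
def deepnorm_forward(x, sublayer, layer_norm, alpha):
    """Post-LN sublayer with DeepNorm:
        x_{l+1} = LN(alpha * x_l + G_l(x_l))
    alpha scales the residual x (not the sublayer output).
    beta is deliberately absent here: it is only an init-time gain."""
    return layer_norm(alpha * x + sublayer(x))

# Toy scalar example: identity "LayerNorm", sublayer G(x) = 0.5 * x
y = deepnorm_forward(2.0, lambda v: 0.5 * v, lambda v: v, 3.0)
print(y)  # 3.0 * 2.0 + 0.5 * 2.0 = 7.0
```

Scaling by beta instead would shrink the residual path below 1 (beta < 1), which is consistent with the diverging-then-collapsing BLEU curves reported above.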

maydaygmail commented 2 years ago

Thanks @shumingma, I will check the hyperparameters. Does training the 1000-layer DeepNet model require model parallelism?

donglixp commented 1 year ago

@KennyShang @maydaygmail https://github.com/microsoft/torchscale