microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

retnet training config #64

Open hanlinxuy opened 10 months ago

hanlinxuy commented 10 months ago

Hello,

I have followed the training configuration introduced in https://github.com/microsoft/torchscale/issues/52 with the retnet_medium architecture, and I have some questions that I would appreciate anyone answering.

The first is about the initialization. From the RetNet paper (https://arxiv.org/abs/2307.08621) I saw that the parameters were initialized following DeepNet, so I am wondering why deepnorm is set to False in RetNetConfig, and where I should set it to True (https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/config.py#L239).

If I simply add --deepnorm on the command line, it is activated together with subln (https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/config.py#L240), and then the output of each layer grows larger and larger as the layer id increases.
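For intuition on why the activations can blow up in that configuration (a toy sketch in plain Python, not torchscale code): deepnorm scales the residual connection by alpha = (2N)^(1/4) and relies on a post-sublayer LayerNorm to renormalize the stream, while subln places its LayerNorm inside the sublayer instead. Without that post-normalization, the alpha-scaled residual compounds and the magnitude grows with the layer id:

```python
# Toy illustration (not torchscale code): DeepNet uses the residual update
# x = alpha * x + sublayer(x) with alpha = (2N)^(1/4) for an N-layer decoder,
# and depends on a post-sublayer LayerNorm to shrink the stream back down.
# If that normalization is missing, the magnitude compounds layer by layer.

def residual_magnitude(num_layers, alpha, sublayer_gain=1.0):
    """Track a rough upper bound on the residual-stream magnitude per layer."""
    magnitude = 1.0
    history = []
    for _ in range(num_layers):
        # |alpha * x + f(x)| is roughly (alpha + gain) * |x| in the worst case
        magnitude = alpha * magnitude + sublayer_gain * magnitude
        history.append(magnitude)
    return history

N = 24
alpha = (2 * N) ** 0.25       # DeepNet decoder alpha, about 2.63 for N = 24
growth = residual_magnitude(N, alpha)
print(growth[0], growth[-1])  # magnitude keeps growing with depth
```

This is only a worst-case caricature, but it matches the symptom reported above: outputs getting larger as the layer id increases.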

The second is about the vocabulary. I am new to fairseq, so I am not sure how to handle a large dataset with fairseq-preprocess. I am trying to use MiniPile, but the resulting dict.txt has 32309612 lines. That seems too large, so I am wondering whether there is an official recommendation for this part.
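Not an official recommendation, but a sketch of what is happening (plain Python, hypothetical toy corpus): given raw text, fairseq-preprocess builds dict.txt with one "token count" line per distinct whitespace-separated token, so the vocabulary keeps growing with corpus size. The usual remedy is to tokenize the corpus into subwords first (e.g. with a BPE tokenizer) and pass that fixed vocabulary via --srcdict:

```python
from collections import Counter

# Mimic how a dict.txt-style dictionary is built from raw text:
# one "<token> <count>" entry per distinct whitespace token.
# (Toy corpus; MiniPile stands in for any large raw-text dataset.)

corpus = ["the cat sat", "the dog sat", "a cat ran"]

def build_dict(lines):
    """Return dict.txt-style (token, count) pairs, most frequent first."""
    counts = Counter(tok for line in lines for tok in line.split())
    return counts.most_common()

vocab = build_dict(corpus)
print(len(vocab))  # 6 distinct word-level tokens from 3 short lines already
print(vocab[0])    # ('the', 2)
```

With word-level tokens the dictionary scales with the corpus, which is how a 32M-line dict.txt arises; a subword vocabulary stays at a fixed size regardless of corpus growth.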

The third is about --share-decoder-input-output-embed. Is it recommended? I am sorry if I missed it in the paper.
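For reference, --share-decoder-input-output-embed ties the input embedding matrix to the output projection. A minimal sketch in plain Python (not fairseq code) of why that saves vocab_size x d_model parameters:

```python
vocab_size, d_model = 1000, 64

# Input embedding: one d_model-dimensional row per vocabulary entry.
embedding = [[0.01 * (i + j) for j in range(d_model)] for i in range(vocab_size)]

def logits(hidden, weight):
    """Output layer: score every token as a dot product with its row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

# Tied: the same matrix serves as input embedding and output projection,
# so the vocab_size x d_model output layer adds zero extra parameters.
output_weight = embedding
saved_params = vocab_size * d_model

scores = logits(embedding[0], output_weight)  # usable as a normal output layer
print(output_weight is embedding)  # True: a shared reference, not a copy
print(saved_params)                # 64000 parameters saved by tying
```

The saving matters most when the embedding table dominates the parameter count, i.e. for small models with large vocabularies.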

Thank you guys in advance:)

simran-arora commented 9 months ago

Hi, is there any resolution to this question about the initialization and the recommended training configs for reproducing the paper's results? I am also seeing some instability with the default configs. Thanks so much!

sunyt32 commented 9 months ago
  1. --share-decoder-input-output-embed saves model parameters, especially when the model is small, and the performance is almost the same; we enable it in our experiments.
  2. Don't enable --subln or --deepnorm; the current initialization is good enough.
  3. The training instability comes from the Linear bias and the eps in LayerNorm. In our experiments we set bias=False and eps=1e-5. Besides, RMSNorm helps stability, so we made that modification.

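For anyone following along, a minimal sketch of that last modification in plain Python (assuming the standard RMSNorm formulation): RMSNorm drops LayerNorm's mean subtraction and, with bias=False, its bias term, leaving only a root-mean-square rescale with eps=1e-5.

```python
import math

# A minimal RMSNorm sketch (standard formulation, not the torchscale class):
# y_i = x_i / sqrt(mean(x^2) + eps) * weight_i
# No mean subtraction and no bias term, unlike LayerNorm.

def rms_norm(x, weight, eps=1e-5):
    """Rescale x by its root mean square, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

x = [3.0, -4.0]              # mean(x^2) = 12.5, so rms is about 3.5355
y = rms_norm(x, [1.0, 1.0])
print(y)                     # roughly [0.8485, -1.1314]
```
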
donglixp commented 9 months ago

> Hi, Is there any resolution to this question for the initialization and recommended training configs to reproduce the paper results? I am also seeing some instability with the default configs. Thanks so much!

@simran-arora @hanlinxuy

The latest released code takes the above points into account.

simran-arora commented 9 months ago

Thanks so much! I had used LayerNorm and did not set bias=False. I will try switching these.

Adding the explicit deepnorm initialization also improved stability in my downstream runs, but I will try the recommended techniques instead.

sunyt32 commented 9 months ago

@simran-arora It's better to set bias=False in both the layer norm and nn.Linear.

Besides, would you mind sharing your training details with us, e.g. corpus, model size, and hyper-parameters? We'd like to look into the unstable setting.

hanlinxuy commented 9 months ago

> > Hi, Is there any resolution to this question for the initialization and recommended training configs to reproduce the paper results? I am also seeing some instability with the default configs. Thanks so much!
>
> @simran-arora @hanlinxuy
>
>   • The LN eps was modified from 1e-6 to 1e-5 in commit d1fefe9
>   • RMSNorm is also used as of commit 5c89ffb, so that the effect of the LN eps is eliminated
>   • For the RetNet implementation, the initialization principle proposed in DeepNet has been integrated, so the arguments --subln or --deepnorm should not be added
>   • Removing bias also improves training stability
>
> The latest released code takes the above points into account.

Thank you very much! I will try again with this new information!