google-research / tuning_playbook

A playbook for systematically maximizing the performance of deep learning models.

Typo in the order of applying Norm and function #4

Closed by madaan 5 months ago

madaan commented 1 year ago

Discussed in https://github.com/google-research/tuning_playbook/discussions/3

Originally posted by **madaan** January 19, 2023

Thanks, the playbook looks pretty cool! I am curious about:

> Normalization should be the last operation before the residual. E.g. x + Norm(f(x)).

Is this advice for specific settings/norms? For modern LMs, the order is typically `x + f(Norm(x))`. For example, transformer blocks in language models usually have the following design:

```py
def block(x):
    # x is the input, ln{1, 2} are layer norms, attn is self-attention,
    # and mlp is a feed-forward network
    return x + mlp(ln2(x + attn(ln1(x))))
```

Some examples are [T5](https://github.com/huggingface/transformers/blob/862888a35834527fed61beaf42373423ffdbd216/src/transformers/models/t5/modeling_t5.py#L580) and [GPT-2](https://github.com/huggingface/transformers/blob/862888a35834527fed61beaf42373423ffdbd216/src/transformers/models/gpt2/modeling_gpt2.py#L388), and I think [PaLM](https://arxiv.org/pdf/2204.02311.pdf) also applies LayerNorm before MLP/Attention.
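For readers skimming the thread, the two orderings under discussion can be contrasted in a few lines. The sketch below is illustrative only; `layer_norm`, the stand-in sub-layer `f`, and the shapes are placeholder assumptions, not code from the playbook or the linked models:

```py
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last dimension (no learned scale/bias, for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def f(x):
    # Stand-in for a sub-layer such as attention or an MLP.
    return np.tanh(x)

def post_ln_block(x):
    # Playbook wording: normalization is the last operation before the residual add.
    return x + layer_norm(f(x))

def pre_ln_block(x):
    # Common in modern LMs (e.g. GPT-2, T5): normalize the input to the sub-layer.
    return x + f(layer_norm(x))

x = np.random.randn(2, 8)
print(post_ln_block(x).shape, pre_ln_block(x).shape)  # (2, 8) (2, 8)
```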
znado commented 5 months ago

You can see the papers mentioned in https://github.com/google-research/tuning_playbook/issues/31.

In general, I would start from the configs listed there and then experiment as needed to see how your training stability changes.