google-research / tuning_playbook

A playbook for systematically maximizing the performance of deep learning models.

Typo in the order of applying Norm and function #4

Closed by madaan 5 months ago

madaan commented 1 year ago

Discussed in https://github.com/google-research/tuning_playbook/discussions/3

Originally posted by **madaan** January 19, 2023

Thanks, the playbook looks pretty cool! I am curious about:

> Normalization should be the last operation before the residual. E.g. x + Norm(f(x)).

Is this advice for specific settings/norms? For modern LMs, the order is typically `x + f(Norm(x))`. For example, transformer blocks in language models usually have the following design:

```py
def block(x):
    # x is the input, ln{1, 2} are layer norms, attn is self-attention,
    # and mlp is a feed-forward network
    return x + mlp(ln2(x + attn(ln1(x))))
```

Some examples are [T5](https://github.com/huggingface/transformers/blob/862888a35834527fed61beaf42373423ffdbd216/src/transformers/models/t5/modeling_t5.py#L580) and [GPT-2](https://github.com/huggingface/transformers/blob/862888a35834527fed61beaf42373423ffdbd216/src/transformers/models/gpt2/modeling_gpt2.py#L388), and I think [PaLM](https://arxiv.org/pdf/2204.02311.pdf) also applies LayerNorm before MLP/Attention.
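For readers skimming the thread, the two orderings under discussion can be contrasted in a few lines. The sketch below is illustrative only; `layer_norm`, the stand-in sub-layer `f`, and the shapes are placeholder assumptions, not code from the playbook or the linked models:

```py
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last dimension (no learned scale/bias, for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def f(x):
    # Stand-in for a sub-layer such as attention or an MLP.
    return np.tanh(x)

def post_ln_block(x):
    # Playbook wording: normalization is the last operation before the residual add.
    return x + layer_norm(f(x))

def pre_ln_block(x):
    # Common in modern LMs (e.g. GPT-2, T5): normalize the input to the sub-layer.
    return x + f(layer_norm(x))

x = np.random.randn(2, 8)
print(post_ln_block(x).shape, pre_ln_block(x).shape)  # (2, 8) (2, 8)
```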
znado commented 5 months ago

You can see the papers mentioned in https://github.com/google-research/tuning_playbook/issues/31.

In general, I would start from the configs listed there and then experiment as needed to see how your training stability changes.