Why "Normalization should be the last operation before the residual"?

google-research / tuning_playbook

A playbook for systematically maximizing the performance of deep learning models.


Why "Normalization should be the last operation before the residual"? #50

Closed Yura52 closed 1 year ago

Yura52 commented 1 year ago

Hi! Thanks for the great repository!

I have a question about this line, which says: "Normalization should be the last operation before the residual. E.g. x + Norm(f(x))."

What is the intuition behind this advice? Are there mainstream architectures that follow this guideline, or papers that explain it?
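To make the question concrete, here is a minimal sketch of the quoted placement, x + Norm(f(x)), next to two other placements that are commonly discussed for residual blocks. Everything in it is an assumption made for illustration: `layer_norm` is a plain layer normalization without learned scale or bias, `f` is a hypothetical stand-in for the residual branch, and the function names are made up here. It is not taken from the playbook and does not say which placement the playbook intends.

```python
import math

def layer_norm(v, eps=1e-5):
    # Normalize a vector to zero mean and (approximately) unit variance.
    # Assumption: no learned scale/bias, for brevity.
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def f(v):
    # Hypothetical residual branch; a real block would be e.g. an MLP or attention.
    return [math.tanh(2.0 * a) for a in v]

def add(u, v):
    # Elementwise sum, i.e. the residual (skip) connection.
    return [a + b for a, b in zip(u, v)]

def norm_last_in_branch(x):
    # The form quoted in the question: x + Norm(f(x)).
    return add(x, layer_norm(f(x)))

def pre_norm(x):
    # Normalization applied to the branch input: x + f(Norm(x)).
    return add(x, f(layer_norm(x)))

def post_norm(x):
    # Normalization applied after the residual sum: Norm(x + f(x)).
    return layer_norm(add(x, f(x)))

x = [0.5, -1.0, 2.0, 0.25]
for block in (norm_last_in_branch, pre_norm, post_norm):
    print(block.__name__, [round(a, 3) for a in block(x)])
```

In the first two variants the skip path carries x through unchanged and only the branch is normalized (at its output or its input); in the third, the normalization also rescales the skip path.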

hennels commented 1 year ago

I think this was stated to be a typo in the discussions: https://github.com/google-research/tuning_playbook/discussions/3#discussioncomment-4732988

Yura52 commented 1 year ago

Thank you for the reply!