Evidence for Norm(x + f(x)) causing issues?

google-research / tuning_playbook

A playbook for systematically maximizing the performance of deep learning models.


Evidence for Norm(x + f(x)) causing issues? #31

Closed hennels closed 5 months ago

hennels commented 1 year ago

In the section on Potential fixes for common instability patterns there is a line stating:

Norm(x + f(x)) known to cause issues.

Text is here.

Are there any experiments or papers that you could reference to support this? I just found it very surprising considering that many recent Transformer-based architectures use exactly this pattern.

whxxiv commented 1 year ago

Here it may mean that no further Norm is required after the residual connection.

fzyzcjy commented 5 months ago

+1. Did you find the answer? Thanks!

znado commented 5 months ago

https://arxiv.org/abs/2110.04369 discusses pre vs post LN in the context of stability.

https://arxiv.org/abs/2206.00330 has another example (although I am not as familiar with the paper and cannot vouch either way for their new method).
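
For readers unfamiliar with the pre vs post LN terminology, here is a minimal sketch of the two residual layouts being compared. It is illustrative only: `f` is a hypothetical stand-in for an attention or MLP sublayer, `layer_norm` omits the learned scale and shift, and none of this is taken from the playbook or the linked papers.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def f(x):
    """Hypothetical sublayer; a real model would use attention or an MLP here."""
    return [0.5 * v + 1.0 for v in x]


def post_ln_block(x):
    """Post-LN, i.e. Norm(x + f(x)): the skip path also goes through the norm."""
    return layer_norm([a + b for a, b in zip(x, f(x))])


def pre_ln_block(x):
    """Pre-LN, i.e. x + f(Norm(x)): the skip path is left as a plain identity."""
    return [a + b for a, b in zip(x, f(layer_norm(x)))]


x = [1.0, 2.0, 3.0, 4.0]
post = post_ln_block(x)  # renormalized output; the scale of x is not preserved
pre = pre_ln_block(x)    # x plus a bounded update; x itself passes through unchanged
```

In the post-LN layout every block renormalizes the residual stream, so the input `x` never reaches the output untouched. In the pre-LN layout the output is `x` plus an update computed from the normalized input, which keeps an identity path through the block.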