lucidrains / perceiver-pytorch

Implementation of Perceiver, General Perception with Iterative Attention, in Pytorch
MIT License
1.1k stars 134 forks source link

Normalization happens before layers instead of afterwards #42

Closed SurrealEverything closed 3 years ago

SurrealEverything commented 3 years ago

Why is PreNorm used? Shouldn't normalization happen after the residual connections on each layer, like this:

PostNorm(_, Residual(FeedForward(...

?

lucidrains commented 3 years ago

@SurrealEverything where in the paper do you see that it is post-normalization?

SurrealEverything commented 3 years ago

You are right. The paper uses PreNorm. Sorry about this. Didn't know it's a thing.

lucidrains commented 3 years ago

no problem!