SurrealEverything closed this issue 3 years ago
Why is PreNorm used? Shouldn't normalization happen after the residual connections on each layer, like this:
PostNorm(_, Residual(FeedForward(...
?
@SurrealEverything where in the paper do you see that it is post-normalization?
You are right. The paper uses PreNorm. Sorry about this. Didn't know it's a thing.
no problem!
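For anyone landing here with the same question, the difference between the two orderings can be sketched in a few lines. This is only an illustration in plain Python, not the repository's code: `layer_norm`, `pre_norm_block`, `post_norm_block`, and `toy_feed_forward` are hypothetical names, and the norm here has no learned scale or shift.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (roughly) unit variance.

    Simplified: no learned scale/shift parameters.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x: Vector, fn: Callable[[Vector], Vector]) -> Vector:
    """Pre-normalization: x + fn(norm(x)).

    The residual path carries x through untouched; only the
    sublayer input is normalized.
    """
    return [a + b for a, b in zip(x, fn(layer_norm(x)))]


def post_norm_block(x: Vector, fn: Callable[[Vector], Vector]) -> Vector:
    """Post-normalization: norm(x + fn(x)).

    The norm is applied after the residual sum, so the block
    output itself is normalized.
    """
    return layer_norm([a + b for a, b in zip(x, fn(x))])


def toy_feed_forward(x: Vector) -> Vector:
    """Stand-in for a real feed-forward sublayer (assumption)."""
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 6.0]
pre_out = pre_norm_block(x, toy_feed_forward)
post_out = post_norm_block(x, toy_feed_forward)
```

With pre-normalization the input's mean survives on the residual path, while post-normalization re-centers the whole output. The thread above settles that the paper uses the first ordering.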