Closed: chaoyanghe closed this issue 4 years ago
As described in the paper "On Layer Normalization in the Transformer Architecture", Layer Normalization can be placed in the Transformer in two positions: pre-LN and post-LN. For example, the Transformer-encoder-based BERT uses post-LN, while Vision Transformer uses pre-LN. In conclusion, the implementation is correct.
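For reference, here is a minimal PyTorch sketch contrasting the two placements for the attention sublayer. It is written only for illustration and is not taken from the ViT-pytorch repository; the class names and the use of `nn.MultiheadAttention` are my own choices.

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN (Attention Is All You Need, BERT): LayerNorm(x + Attention(x))."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        # Residual connection first, then normalization.
        return self.norm(x + attn_out)


class PreLNBlock(nn.Module):
    """Pre-LN (Vision Transformer): x + Attention(LayerNorm(x))."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Normalize before the sublayer; the residual path stays un-normalized.
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out
```

The MLP sublayer follows the same pattern in both variants. The cited paper argues that the pre-LN placement yields better-behaved gradients at initialization and can be trained stably without the learning-rate warm-up that post-LN typically requires.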
An additional comment: the original Transformer in "Attention Is All You Need" uses post-LN.
I believe the question has been answered, so I am closing the issue.
Thank you!
Hi, I checked your code at https://github.com/jeonsworld/ViT-pytorch/blob/878ebc5bd12255d2fffd6c0257b83ee075607a79/models/modeling.py#L154.
Your implementation is Attention(LayerNorm(x)) + x, but the original Transformer uses LayerNorm(x + Attention(x)). Is this an error, or was it implemented this way deliberately?