Closed: chaoyanghe closed this issue 4 years ago
As described in the paper "On Layer Normalization in the Transformer Architecture", Layer Normalization can be placed in the Transformer in two positions: pre-LN and post-LN. For example, the Transformer-encoder-based BERT uses post-LN, while Vision Transformer uses pre-LN. In conclusion, the implementation is correct.
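For reference, here is a minimal PyTorch sketch contrasting the two placements for the attention sublayer. It is written only for illustration and is not taken from the ViT-pytorch repository; the class names and the use of `nn.MultiheadAttention` are my own choices.

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN (Attention Is All You Need, BERT): LayerNorm(x + Attention(x))."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        # Residual connection first, then normalization.
        return self.norm(x + attn_out)


class PreLNBlock(nn.Module):
    """Pre-LN (Vision Transformer): x + Attention(LayerNorm(x))."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Normalize before the sublayer; the residual path stays un-normalized.
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out
```

The MLP sublayer follows the same pattern in both variants. The cited paper argues that the pre-LN placement yields better-behaved gradients at initialization and can be trained stably without the learning-rate warm-up that post-LN typically requires.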
An additional comment: the original Transformer in "Attention Is All You Need" uses post-LN.
I believe the question has been answered, so I am closing the issue.
Thank you!
Hi, I checked your code at https://github.com/jeonsworld/ViT-pytorch/blob/878ebc5bd12255d2fffd6c0257b83ee075607a79/models/modeling.py#L154.
Your implementation is Attention(LayerNorm(x)) + x, but the original Transformer uses LayerNorm(x + Attention(x)). Is this an error, or was it implemented this way deliberately?