jadore801120 / attention-is-all-you-need-pytorch

A PyTorch implementation of the Transformer model in "Attention is All You Need".

shared embedding factor bug #138

Closed kaituoxu closed 4 years ago

kaituoxu commented 4 years ago

This line may not be correct: https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py#L160

It should be self.x_logit_scale = (d_model ** 0.5).
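For context, here is a minimal sketch (illustrative only, with made-up names like d_model, vocab_size, and emb; this is not the repository's code) of the paper's convention of putting the √d_model factor on the embedding side, which is the reading behind this suggestion:

```python
# Hypothetical minimal example; names are illustrative, not the repository's code.
import torch
import torch.nn as nn

d_model, vocab_size = 512, 1000
emb = nn.Embedding(vocab_size, d_model)

tokens = torch.randint(0, vocab_size, (2, 7))
# Section 3.4 of the paper: the shared weights are multiplied by sqrt(d_model)
# in the embedding layers, i.e. the d_model ** 0.5 factor sits on the embedding output.
x = emb(tokens) * d_model ** 0.5
```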

jadore801120 commented 4 years ago

Hello @kaituoxu ,

According to the last sentence of section 3.4:

In the embedding layers, we multiply those weights by √d_model.

To my understanding, it says that the ratio between the embedding weight emb and the logit projection weight word_proj is √d_model, i.e. emb : word_proj = d_model^0.5 : 1, which is the same as emb : word_proj = 1 : d_model^(-0.5).

Since I implemented the scaling on the word_proj side, I multiply the output of the final layer by d_model^(-0.5) instead of d_model^0.5.
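As a rough sketch (illustrative only, not the exact code in Models.py), the placement I describe looks like this: the embedding and the output projection share one weight matrix, and the d_model^(-0.5) factor is applied to the projection output, which preserves the emb : word_proj ratio above.

```python
# Hypothetical minimal example; not the exact code in Models.py.
import torch
import torch.nn as nn

d_model, vocab_size = 512, 1000

emb = nn.Embedding(vocab_size, d_model)
word_proj = nn.Linear(d_model, vocab_size, bias=False)
word_proj.weight = emb.weight              # share one weight matrix

x_logit_scale = d_model ** -0.5            # factor applied on the word_proj side
dec_output = torch.randn(2, 7, d_model)    # stand-in for the decoder output
logits = word_proj(dec_output) * x_logit_scale
# Keeping the projection at 1 and scaling the embedding by d_model ** 0.5, or keeping
# the embedding at 1 and scaling the projection by d_model ** -0.5, gives the same
# emb : word_proj ratio of d_model ** 0.5 : 1.
```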

If there is anything I missed, please let me know. Thanks.

Hope this helps.

Yu-Hsiang

kaituoxu commented 4 years ago

Sorry, you're right. I multiply this on the emb side...

Closing it.