Hello @kaituoxu ,
According to the last sentence in section 3.4:

> In the embedding layers, we multiply those weights by √d_model.
To my understanding, it says that the ratio between the embedding weight `emb` and the logit projection weight `word_proj` is √d_model, i.e. `emb : word_proj = d_model^0.5 : 1`, or equivalently `emb : word_proj = 1 : d_model^(-0.5)`.
Since I implemented the scaling on the `word_proj` side, I multiply the output of the final layer by `d_model^(-0.5)` instead of multiplying the embedding output by `d_model^0.5`.
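To make the two placements concrete, here is a minimal PyTorch sketch of the shared-weight setup; the sizes and variable names are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000  # illustrative sizes

# Share one weight matrix between the input embedding and the
# output (logit) projection, as described in section 3.4.
emb = nn.Embedding(vocab_size, d_model)
word_proj = nn.Linear(d_model, vocab_size, bias=False)
word_proj.weight = emb.weight  # weight sharing

tokens = torch.randint(0, vocab_size, (1, 5))

# Placement A (the paper's wording): scale the embedding output up.
h_a = emb(tokens) * d_model ** 0.5

# Placement B (the word_proj side): keep the embedding as-is and
# scale the final logits down instead.
h_b = emb(tokens)
logits_b = word_proj(h_b) * d_model ** -0.5
```

Both placements keep the same relative scale between `emb` and `word_proj`, though they are not numerically identical end to end, since the nonlinear encoder/decoder stack sits between the two scaling points.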
If there is anything I missed, please let me know. Thanks.
Hope this helps.
Yu-Hsiang
Sorry, you're right. I multiply this on the `emb` side... Closing it.
This may not be correct: https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py#L160

It should be:

`self.x_logit_scale = (d_model ** 0.5)`
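For context, here is a runnable sketch of how that scale factor fits into the model; the names `trg_word_prj` and `x_logit_scale` follow the linked file, but the class below is an illustration, not the repo's exact code:

```python
import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    """Minimal sketch of the weight sharing around the linked line.

    Names (trg_word_emb, trg_word_prj, x_logit_scale) follow
    transformer/Models.py; the decoder stack is omitted.
    """

    def __init__(self, vocab_size=1000, d_model=512,
                 trg_emb_prj_weight_sharing=True):
        super().__init__()
        self.trg_word_emb = nn.Embedding(vocab_size, d_model)
        self.trg_word_prj = nn.Linear(d_model, vocab_size, bias=False)
        if trg_emb_prj_weight_sharing:
            # Share the weight between the target word embedding and
            # the final projection layer.
            self.trg_word_prj.weight = self.trg_word_emb.weight
            self.x_logit_scale = d_model ** -0.5  # the line under debate
        else:
            self.x_logit_scale = 1.0

    def forward(self, trg_seq):
        dec_output = self.trg_word_emb(trg_seq)  # decoder stack omitted
        # The scale is applied to the shared projection's output.
        return self.trg_word_prj(dec_output) * self.x_logit_scale

logits = TinyTransformerLM()(torch.randint(0, 1000, (1, 5)))
```

Whether the factor should be `d_model ** -0.5` here or `d_model ** 0.5` on the embedding output depends on which side of the shared weight the scaling is applied, which is exactly the point discussed above.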