jadore801120 / attention-is-all-you-need-pytorch

A PyTorch implementation of the Transformer model in "Attention is All You Need".
MIT License

About Position Embedding and mask #134

Closed Zessay closed 4 years ago

Zessay commented 4 years ago

Hi, Huang! As far as I know, we want the pad embedding to be a zero vector, even after the position embedding is added. However, in your new code, the pad embedding is no longer a zero vector once the word embedding and the position embedding are added together. Does this matter? Also, the encoder output is not multiplied by the non-pad mask; will this affect the final result? Thanks for your code! Looking forward to your reply.
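A minimal sketch of the point in question, not the repository's exact code: `pad_idx`, `d_model`, and the toy sinusoidal table below are made-up values, but they show how a pad embedding that starts as a zero vector (via `padding_idx` in `nn.Embedding`) stops being zero once positional encodings are added.

```python
import torch
import torch.nn as nn

pad_idx, vocab_size, d_model, seq_len = 0, 10, 8, 5

# nn.Embedding with padding_idx keeps the pad token's embedding at zero.
emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_idx)

# A toy sinusoidal positional table, analogous to a PositionalEncoding module.
pos = torch.arange(seq_len).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float()
                * (-torch.log(torch.tensor(10000.0)) / d_model))
pos_table = torch.zeros(seq_len, d_model)
pos_table[:, 0::2] = torch.sin(pos * div)
pos_table[:, 1::2] = torch.cos(pos * div)

seq = torch.tensor([[3, 7, 2, pad_idx, pad_idx]])  # last two positions are padding
x = emb(seq)                                       # pad positions are zero vectors here
print(x[0, 3].abs().sum().item())                  # 0.0
x = x + pos_table.unsqueeze(0)                     # after adding positional encodings...
print(x[0, 3].abs().sum().item())                  # ...the pad positions are no longer zero
```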

Zessay commented 4 years ago

Q2: In the file Translator.py, lines 51-55, best_k_probs and best_k_idx are 2-dim tensors. Obviously, we want scores to be a 2-dim tensor of size [1, beam_size], so I think it is unnecessary to use best_k_probs.unsqueeze(0), which makes scores 3-dimensional. If I have misunderstood, please help me out! Thanks!
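A shape-only sketch of the point above (the variable names follow the issue; the vocabulary size and beam size are made-up): `torch.topk` on a `[1, n_vocab]` tensor already returns `[1, beam_size]` tensors, so a further `unsqueeze(0)` would produce a 3-dim `[1, 1, beam_size]` score tensor.

```python
import torch

n_vocab, beam_size = 100, 5
word_prob = torch.randn(1, n_vocab)                   # log-probs for the first decoding step

best_k_probs, best_k_idx = word_prob.topk(beam_size)  # both are [1, beam_size]
print(best_k_probs.shape)                             # torch.Size([1, 5])
print(best_k_probs.unsqueeze(0).shape)                # torch.Size([1, 1, 5]) -- one dim too many
```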


jadore801120 commented 4 years ago

Hello @Zessay,

As you mentioned, the vectors at the padding positions are currently not zero vectors. To check whether this affects training, we have to inspect two aspects that could mess up the training:

1. Will the padding position vectors affect the loss calculation?
2. Will the padding position vectors affect other, non-padding position vectors during the forward pass?

For the first issue, since we set ignore_index to the padding index in the loss function torch.nn.functional.cross_entropy, even if the padding-position logits do not point to the padding token, those positions are not included in the loss calculation and hence do not affect the weight update.
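A small sketch of this behaviour with made-up logits and targets: positions whose target equals `ignore_index` are dropped from the loss, so their (non-zero) hidden vectors never influence the gradient through the loss term.

```python
import torch
import torch.nn.functional as F

pad_idx = 0
logits = torch.randn(4, 10, requires_grad=True)  # 4 positions, vocabulary of 10
target = torch.tensor([5, 2, pad_idx, pad_idx])  # last two positions are padding

loss = F.cross_entropy(logits, target, ignore_index=pad_idx)
loss.backward()
print(logits.grad[2:].abs().sum())               # tensor(0.) -- padded rows get no gradient
```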

For the second issue, we want to make sure that the non-padding position vectors never use the padding position vectors, so we have to inspect every layer's forward pass. In the attention sublayers, the padding mask prevents non-padding queries from attending to padding keys, and the position-wise feed-forward network and layer normalization operate on each position independently, so the padding vectors never leak into the non-padding positions.
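A hedged sketch of the attention part of that check (shapes and values are illustrative, not the repository's exact code): with the usual key-padding mask applied via masked_fill before the softmax, non-padding queries assign zero attention weight to padding keys, so the padded value vectors never enter the non-padding outputs.

```python
import torch
import torch.nn.functional as F

len_q = len_k = 4
d_k = 8
q = torch.randn(1, len_q, d_k)
k = torch.randn(1, len_k, d_k)
v = torch.randn(1, len_k, d_k)

mask = torch.tensor([[[1, 1, 0, 0]]]).bool()         # last two keys are padding

attn = torch.matmul(q, k.transpose(1, 2)) / d_k ** 0.5
attn = attn.masked_fill(~mask, float('-inf'))        # block attention to padding keys
attn = F.softmax(attn, dim=-1)
print(attn[0, 0, 2:])                                # tensor([0., 0.]) -- zero weight on padded keys
out = torch.matmul(attn, v)                          # outputs never mix in padded values
```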

For the reasons mentioned above, I believe we do not need to zero out the padding position vectors between the sublayers. If there is anything I did not consider carefully or explained unclearly, feel free to point it out! I could be wrong.

Hope this helps.

Best, Yu-Hsiang

jadore801120 commented 4 years ago

Hello @Zessay , Thanks for your Q2! You are right; it was my mistake, and we actually only need two dimensions. Sincerely appreciate it! I have fixed the bug, please check the commit 3ab11cc. Best, Yu-Hsiang

Zessay commented 4 years ago

@jadore801120 Thanks for your detailed reply!