jadore801120 / attention-is-all-you-need-pytorch

A PyTorch implementation of the Transformer model in "Attention is All You Need".
MIT License

About Position Embedding and mask #134

Closed Zessay closed 4 years ago

Zessay commented 4 years ago

Hi, Huang! As far as I know, we want the pad embedding to be a zero vector, even after the position embedding is added. However, in your new code, the pad embedding is no longer a zero vector once the word embedding and the position embedding are added together. Does this matter? Also, the encoder output is not multiplied by the non-pad mask; will this affect the final result? Thanks for your code! Looking forward to your reply.
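A minimal sketch of the point in question, not the repository's exact code: `pad_idx`, `d_model`, and the toy sinusoidal table below are made-up values, but they show how a pad embedding that starts as a zero vector (via `padding_idx` in `nn.Embedding`) stops being zero once positional encodings are added.

```python
import torch
import torch.nn as nn

pad_idx, vocab_size, d_model, seq_len = 0, 10, 8, 5

# nn.Embedding with padding_idx keeps the pad token's embedding at zero.
emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_idx)

# A toy sinusoidal positional table, analogous to a PositionalEncoding module.
pos = torch.arange(seq_len).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float()
                * (-torch.log(torch.tensor(10000.0)) / d_model))
pos_table = torch.zeros(seq_len, d_model)
pos_table[:, 0::2] = torch.sin(pos * div)
pos_table[:, 1::2] = torch.cos(pos * div)

seq = torch.tensor([[3, 7, 2, pad_idx, pad_idx]])  # last two positions are padding
x = emb(seq)                                       # pad positions are zero vectors here
print(x[0, 3].abs().sum().item())                  # 0.0
x = x + pos_table.unsqueeze(0)                     # after adding positional encodings...
print(x[0, 3].abs().sum().item())                  # ...the pad positions are no longer zero
```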

Zessay commented 4 years ago

Q2: In the file Translator.py, lines 51-55, best_k_probs and best_k_idx are 2-dim tensors. Obviously, we want scores to be a 2-dim tensor of size [1, beam_size], so I think it is unnecessary to use best_k_probs.unsqueeze(0), which makes scores 3-dimensional. If I have misunderstood, please help me out! Thanks!
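A shape-only sketch of the point above (the variable names follow the issue; the vocabulary size and beam size are made-up): `torch.topk` on a `[1, n_vocab]` tensor already returns `[1, beam_size]` tensors, so a further `unsqueeze(0)` would produce a 3-dim `[1, 1, beam_size]` score tensor.

```python
import torch

n_vocab, beam_size = 100, 5
word_prob = torch.randn(1, n_vocab)                   # log-probs for the first decoding step

best_k_probs, best_k_idx = word_prob.topk(beam_size)  # both are [1, beam_size]
print(best_k_probs.shape)                             # torch.Size([1, 5])
print(best_k_probs.unsqueeze(0).shape)                # torch.Size([1, 1, 5]) -- one dim too many
```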


jadore801120 commented 4 years ago

Hello @Zessay,

As you mentioned, the vectors at the padding positions are currently not zero vectors. To check whether this affects training, we have to inspect two aspects that could mess up the training:

1. Will the padding position vectors affect the loss calculation?
2. Will the padding position vectors affect other, non-padding position vectors during the forward pass?

For the first issue, since we set ignore_index to the padding index in the loss function torch.nn.functional.cross_entropy, even if the padding-position logits do not point to the padding token, those positions are not included in the loss calculation and hence do not affect the weight update.
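A small sketch of this behaviour with made-up logits and targets: positions whose target equals `ignore_index` are dropped from the loss, so their (non-zero) hidden vectors never influence the gradient through the loss term.

```python
import torch
import torch.nn.functional as F

pad_idx = 0
logits = torch.randn(4, 10, requires_grad=True)  # 4 positions, vocabulary of 10
target = torch.tensor([5, 2, pad_idx, pad_idx])  # last two positions are padding

loss = F.cross_entropy(logits, target, ignore_index=pad_idx)
loss.backward()
print(logits.grad[2:].abs().sum())               # tensor(0.) -- padded rows get no gradient
```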

For the second issue, we want to make sure that the non-padding position vectors never use the padding position vectors, so we have to inspect every layer's forward pass. In the attention sublayers, the padding mask prevents non-padding queries from attending to padding keys, and the position-wise feed-forward network and layer normalization operate on each position independently, so the padding vectors never leak into the non-padding positions.
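A hedged sketch of the attention part of that check (shapes and values are illustrative, not the repository's exact code): with the usual key-padding mask applied via masked_fill before the softmax, non-padding queries assign zero attention weight to padding keys, so the padded value vectors never enter the non-padding outputs.

```python
import torch
import torch.nn.functional as F

len_q = len_k = 4
d_k = 8
q = torch.randn(1, len_q, d_k)
k = torch.randn(1, len_k, d_k)
v = torch.randn(1, len_k, d_k)

mask = torch.tensor([[[1, 1, 0, 0]]]).bool()         # last two keys are padding

attn = torch.matmul(q, k.transpose(1, 2)) / d_k ** 0.5
attn = attn.masked_fill(~mask, float('-inf'))        # block attention to padding keys
attn = F.softmax(attn, dim=-1)
print(attn[0, 0, 2:])                                # tensor([0., 0.]) -- zero weight on padded keys
out = torch.matmul(attn, v)                          # outputs never mix in padded values
```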

For the reasons mentioned above, I believe we do not need to zero out the padding position vectors between the sublayers. If there is anything I did not consider carefully or explained unclearly, feel free to point it out! I could be wrong.

Hope this helps.

Best, Yu-Hsiang

jadore801120 commented 4 years ago

Hello @Zessay , Thanks for your Q2! You are right; it was my mistake, and we actually only need two dimensions. Sincerely appreciate it! I have fixed the bug, please check the commit 3ab11cc. Best, Yu-Hsiang

Zessay commented 4 years ago

@jadore801120 Thanks for your detailed reply!