Closed: Zessay closed this issue 4 years ago
Q2:
In the file Translator.py, lines 51-55, best_k_probs and best_k_idx are 2-dim tensors. We expect scores to be a 2-dim tensor of size [1, beam_size], so I think it is unnecessary to call best_k_probs.unsqueeze(0), which makes scores 3-dimensional. If I have misunderstood, please let me know. Thanks!
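A minimal sketch of the shape issue described above (the variable names follow the question; the actual Translator.py code may differ in detail):

```python
import torch

beam_size = 5
vocab_size = 100

# Log-probabilities from the decoder for a single sequence: shape [1, vocab_size]
log_probs = torch.randn(1, vocab_size).log_softmax(dim=-1)

# topk over the vocab dimension already yields 2-dim tensors of shape [1, beam_size]
best_k_probs, best_k_idx = log_probs.topk(beam_size, dim=-1)
print(best_k_probs.shape)  # torch.Size([1, 5])

# An extra unsqueeze(0) would make the scores 3-dim: [1, 1, beam_size]
scores_3d = best_k_probs.unsqueeze(0)
print(scores_3d.shape)  # torch.Size([1, 1, 5])
```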
Hello @Zessay,
As you mentioned, currently, the vectors on the padding position will not be zero vectors. To check if it will affect the training processing, we have to inspect two aspects that could mess up the training:
1. Will the padding position vectors affect the loss calculation? 2. Will the padding position vectors affect other non-padding position vectors during forward process?
For the first issue: since we set ignore_index to the padding index in the loss function torch.nn.functional.cross_entropy, even if the padding-position logits do not point to the padding token, those positions are not included in the loss calculation, and hence do not affect the weight updates.
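A small demonstration of this behavior (the padding index 0 here is an assumption for illustration, not necessarily the one used in the repository):

```python
import torch
import torch.nn.functional as F

pad_idx = 0  # hypothetical padding index

# Logits for 4 positions over a 6-token vocab; the last target is padding
logits = torch.randn(4, 6, requires_grad=True)
target = torch.tensor([2, 5, 1, pad_idx])

loss = F.cross_entropy(logits, target, ignore_index=pad_idx)
loss.backward()

# The gradient at the padded position is exactly zero, so its logits
# never influence the weight update:
print(logits.grad[3])  # tensor([0., 0., 0., 0., 0., 0.])
```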
For the second issue, we want to make sure that the non-padding-position vectors never use the padding-position vectors, so we have to inspect the forward pass of every layer. In the self-attention sublayers, the padding mask drives the attention weights on padding positions to zero, so padding vectors do not leak into other positions; the position-wise feed-forward sublayers operate on each position independently.
For the reasons mentioned above, I believe we do not need to zero out the padding-position vectors between the sublayers. If there is anything I did not consider carefully, or if the explanation is unclear, feel free to point it out! I could be wrong.
Hope this helps.
Best, Yu-Hsiang
Hello @Zessay, Thanks for your Q2! You are right: only two dimensions are needed there, so the unsqueeze(0) was my mistake. Sincerely appreciate it! I have fixed the bug; please check the commit 3ab11cc. Best, Yu-Hsiang
@jadore801120 Thanks for your detailed reply!
Hi, Huang! As far as I know, we want the pad embedding to be a zero vector, even after the position embedding is added. However, in your new code, the pad embedding is no longer a zero vector once the word embedding is added to the position embedding. Does this matter? Furthermore, the encoder output is not multiplied by the non-pad mask; will this affect the final result? Thanks for your code! I look forward to your reply.
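The concern above can be reproduced in a few lines (the padding index 0 and the random stand-in for the sinusoidal position encoding are assumptions for illustration):

```python
import torch
import torch.nn as nn

pad_idx = 0  # hypothetical padding index
vocab_size, d_model, seq_len = 10, 4, 3

# padding_idx keeps the pad row of the embedding table fixed at zero
emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_idx)
pos_enc = torch.randn(seq_len, d_model)  # stand-in for sinusoidal encodings

tokens = torch.tensor([[5, 7, pad_idx]])  # last position is padding
x = emb(tokens) + pos_enc

# The pad position is zero before, but not after, adding the position encoding:
print(emb(tokens)[0, -1])  # zeros
print(x[0, -1])            # equals pos_enc[-1], generally non-zero
```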