jadore801120 / attention-is-all-you-need-pytorch

A PyTorch implementation of the Transformer model in "Attention is All You Need".
MIT License
8.82k stars 1.98k forks

bugs in the masking code #13

Closed eriche2016 closed 7 years ago

eriche2016 commented 7 years ago

Hi, I found that the decoder uses a subsequent mask to mask out future information here. However, in line 123 you feed in dec_input (the target embedding) at the first layer. Now check this line and then the MultiHeadAttention module's forward function: it has a residual connection that lets dec_input reach the output directly, see here. So it bypasses the subsequent mask, which means the model knows the future. Am I correct?
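For reference, a minimal sketch of the subsequent (causal) mask being discussed, using NumPy and an illustrative function name rather than the repo's actual code: position i may attend only to positions ≤ i, and entries marked 1 are hidden.

```python
import numpy as np

def get_subsequent_mask(seq_len):
    # Strictly upper-triangular mask: entry (i, j) is 1 when j > i,
    # i.e. position i is forbidden from attending to future position j.
    # Illustrative sketch, not the repository's exact implementation.
    return np.triu(np.ones((seq_len, seq_len), dtype=np.uint8), k=1)

mask = get_subsequent_mask(4)
# row 0 can only see position 0; row 3 can see positions 0..3
```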

eriche2016 commented 7 years ago

Maybe that can explain why I get 99.9% accuracy on the Multi30k validation set using your code.

eriche2016 commented 7 years ago

Sorry, I made a mistake; it actually does not know the future, because at a particular timestep it only uses the current target word and the attention over the past words.
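This conclusion can be checked numerically. Below is a toy single-head masked self-attention with a residual connection (illustrative NumPy sketch, not the repo's code): perturbing a future position leaves the outputs at earlier positions unchanged, so the residual path does not leak future information.

```python
import numpy as np

def masked_self_attention(x):
    # Dot-product self-attention with a subsequent mask plus a residual
    # connection. Illustrative sketch under simplified assumptions
    # (no projections, single head), not the repository's exact module.
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[future] = -1e9                    # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ x                   # residual adds position i's OWN input

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out1 = masked_self_attention(x)

x2 = x.copy()
x2[4] += 10.0                                # perturb the last (future) position
out2 = masked_self_attention(x2)
# positions 0..3 are unaffected by the change at position 4
```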

eriche2016 commented 7 years ago

Sorry to re-open this issue. Still, during inference I think your code is wrong, because it feeds in the current ground-truth words. Check this line.

jadore801120 commented 7 years ago

Hi @eriche2016 ,

I think the residual connection issue is a misunderstanding? The residual part does not let a word access the future words.

That is the validation part, not the inference part, and it runs in a teacher-forcing fashion, so I think feeding in the ground-truth words is not inappropriate? The validation part in OpenNMT-py may be worth checking.

Besides, I do not do this during inference; please check the code in Translator.py.
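The distinction between the two modes can be sketched with a toy decoder (all names here are hypothetical, chosen for illustration; this is not the repo's Decoder or Translator code). Under teacher forcing the gold prefix is fed at every step; under autoregressive inference the model consumes its own previous prediction, so errors can compound.

```python
import numpy as np

VOCAB = 6
W = np.eye(VOCAB)  # degenerate toy model: predicts whatever token it was fed

def decoder_step(prev_token):
    # Returns logits over the vocabulary given the previous token.
    return W[prev_token]

gold = [1, 2, 3, 4]

# Teacher forcing (as in the validation loop discussed above):
# the input at each step is the GOLD previous token.
tf_preds = [int(np.argmax(decoder_step(t))) for t in gold[:-1]]

# Autoregressive decoding (inference): the input at each step is the
# model's OWN previous prediction.
ar_preds, prev = [], gold[0]
for _ in range(len(gold) - 1):
    prev = int(np.argmax(decoder_step(prev)))
    ar_preds.append(prev)
```

With this toy model, teacher forcing reproduces the gold continuation while the free-running loop keeps repeating its first prediction, which is exactly why the two modes can report very different accuracy.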

eriche2016 commented 7 years ago

No, in evaluation mode you cannot feed in the current ground-truth words. You can see here; I guess they use the previously predicted word and shift the ground-truth sentence.

eriche2016 commented 7 years ago

Hi @jadore801120, what is the point of using eval in train.py if it feeds the current ground-truth words to evaluate the model here? I am wondering whether it would be better to evaluate using the predicted word instead of the current ground-truth word. What is your opinion?

eriche2016 commented 7 years ago

Teacher forcing is used to train the model; I doubt it can be used for validating or evaluating the model.

eriche2016 commented 7 years ago

Check here; I am new to the machine translation field. If I am correct, is using the current ground-truth words to validate the model acceptable, or is it a common practice? I think this validation performance of the MT model is surely unreliable and tells us nothing about the model's generalization ability, because it is conditioned on the current ground-truth label.

eriche2016 commented 7 years ago

I think we need to add a teacher-forcing ratio and set it to 0 when we want to validate the model.
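A teacher-forcing ratio is usually implemented as a per-step coin flip when choosing the decoder's next input. A minimal sketch (the helper name is hypothetical, not part of this repository): ratio 1.0 gives pure teacher forcing for training, ratio 0.0 gives pure free-running decoding for validation, as proposed above.

```python
import random

def choose_next_input(gold_token, predicted_token, teacher_forcing_ratio):
    # With probability teacher_forcing_ratio feed the gold token,
    # otherwise feed the model's own prediction.
    #   ratio = 1.0 -> always teacher forcing (training)
    #   ratio = 0.0 -> always free running (validation/inference)
    # Illustrative helper, not the repository's code.
    if random.random() < teacher_forcing_ratio:
        return gold_token
    return predicted_token
```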