Maybe that can explain why I get 99.9% accuracy on the Multi30k validation set using your code.
Sorry, I made a mistake; it actually does not know the future, because at a particular timestep it only uses the current target word and attention over the past words.
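(For reference, a minimal sketch of how such a subsequent mask is commonly built in PyTorch; the helper name here is illustrative, not necessarily this repo's.)

```python
import torch

def get_subsequent_mask(seq_len):
    # True entries are blocked: position i may attend only to positions <= i.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(get_subsequent_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```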
Sorry to re-open this issue. Still, during inference I think your code is wrong, because it feeds in the current ground-truth words. Check this line.
Hi @eriche2016 ,
I think the residual connection issue is a misunderstanding: the residual only adds each position's own input back to that same position's output, so it will not let a word access future words (see the sketch after this comment).
It is the validation part, not the inference part, and it runs in a teacher-forcing fashion, so I think feeding in the ground-truth words is not inappropriate. The validation part in OpenNMT-py may be worth checking.
Besides, I do not do this during inference; please check the code in Translator.py.
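(A quick way to verify the residual point: with a causal mask, perturbing a future token leaves all earlier outputs untouched, residual included. This is an illustrative check, not the repo's code.)

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 4, 8)  # (batch, seq_len, d_model)
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)

def causal_self_attn(x):
    # Masked scaled dot-product attention plus the residual connection.
    scores = (x @ x.transpose(-2, -1)) / 8 ** 0.5
    scores = scores.masked_fill(mask, float('-inf'))
    return F.softmax(scores, dim=-1) @ x + x  # residual adds x[t] to output[t]

y1 = causal_self_attn(x)
x2 = x.clone()
x2[0, 3] = 0.0  # perturb the last ("future") position
y2 = causal_self_attn(x2)
print(torch.allclose(y1[0, :3], y2[0, :3]))  # True: earlier positions unaffected
```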
No, in evaluation mode you cannot feed in the current ground-truth words. You can see here: I guess they use the previously predicted word, and shift the ground-truth sentence.
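(The shift referred to usually looks like this; a sketch with hypothetical token ids and my own variable names, not the repo's code.)

```python
import torch

# Hypothetical batch: <bos>=1, <eos>=2, real words >= 3.
gold = torch.tensor([[1, 5, 7, 9, 2]])
dec_input = gold[:, :-1]   # decoder sees:     <bos> w1 w2 w3
dec_target = gold[:, 1:]   # decoder predicts:  w1  w2 w3 <eos>
# At step t the model conditions on words before t and is scored on word t,
# so even under teacher forcing it never sees the word it must predict.
```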
Hi @jadore801120, what is the point of the eval in train.py feeding the current ground-truth words to evaluate your model, here? I am wondering whether it would be better to evaluate using the predicted words instead of the current ground-truth words. What is your opinion?
Teacher forcing is used to train the model; I doubt it can be used for validating or evaluating the model.
Check here. I am new to the machine translation field. Am I right that using the current ground-truth words to validate the model is acceptable, or is it common practice? I think this validation performance of the MT model is unreliable and tells us nothing about the model's generalization ability, because it uses the current ground-truth labels.
I think we need to add a teacher forcing ratio and set it to 0 when we want to validate the model (a sketch of this is below).
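(With the ratio at 0, validation becomes free-running greedy decoding. The helper below is hypothetical and assumes `model(src, dec_input)` returns logits of shape (batch, len, vocab); it is a sketch of the idea, not this repo's API.)

```python
import torch

@torch.no_grad()
def greedy_validate(model, src, max_len, bos_idx):
    # Free-running evaluation (teacher forcing ratio = 0): the decoder only
    # ever sees its own previous predictions, never the ground truth.
    dec_input = torch.full((src.size(0), 1), bos_idx,
                           dtype=torch.long, device=src.device)
    for _ in range(max_len):
        logits = model(src, dec_input)
        next_word = logits[:, -1].argmax(dim=-1, keepdim=True)
        dec_input = torch.cat([dec_input, next_word], dim=1)
    return dec_input[:, 1:]  # predictions to score (e.g. BLEU) against references
```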
Hi, I found that in the decoder there is a subsequent mask which masks out the future information, here. However, in line 123, you feed in the dec_input (which is the target embedding) at the first layer. Now check this line and then the MultiHeadAttention module's forward function: it has a residual connection that lets dec_input reach the output directly, see here. So it does not use the subsequent mask, which means the model knows the future. Am I correct?
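(For readers landing here: the pattern being asked about is roughly the sketch below; it is a minimal stand-in, not the repo's exact module. Note where the mask is applied relative to the residual add.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttnWithResidual(nn.Module):
    # Minimal stand-in for the structure in question, not the repo's code.
    def __init__(self, d_model):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.scale = d_model ** 0.5

    def forward(self, x, mask):
        residual = x
        scores = (x @ x.transpose(-2, -1)) / self.scale
        scores = scores.masked_fill(mask, float('-inf'))  # mask blocks future
        out = F.softmax(scores, dim=-1) @ x
        # Position-wise add: output[t] gets x[t] back, never x[t+1], so the
        # residual cannot carry information across time steps.
        return self.layer_norm(out + residual)
```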