tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

First word in translated sentence is always <unk> #334

Closed ashimajain2595 closed 6 years ago

ashimajain2595 commented 6 years ago

I have around 0.8 million sentence pairs in a Hindi-English parallel corpus, which I have divided into training, dev, and test sets. I am generating the vocab from the training samples. I have achieved a BLEU score of 18.91 on the dev set and 18.84 on the test set, but the first word in the translated sentence is always <unk>. A few examples from the test set:

src: इस क्लब में पचास सदस्य हैं ।
ref: There are fifty members in this club .
nmt: <unk> club consists of fifty members .

src: तुम उस ड्रेस में अच्छी लगती हो ।
ref: You look nice in that dress .
nmt: <unk> look nice in that dress .

Any help would be appreciated. Thanks.

mcjoshi commented 6 years ago

I am using the nmt model for a different task and I get the same problem: in most cases the first and last words are <unk> tokens. I thought of replacing any <unk> token in the output with the source word that receives the highest attention weight at that timestep, but I couldn't find the attention matrix.

Any help locating the attention matrix would be appreciated.
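Once you can get the attention weights out of the decoder, the replacement idea above is straightforward. Below is a minimal sketch of that post-processing step; `replace_unk` is a hypothetical helper, and the attention matrix here is hand-made toy data rather than anything exported by nmt:

```python
import numpy as np

def replace_unk(output_tokens, src_tokens, attention, unk="<unk>"):
    """Replace each <unk> in the translation with the source token that
    received the highest attention weight at that decoding timestep.
    attention: array of shape (tgt_len, src_len)."""
    result = []
    for t, tok in enumerate(output_tokens):
        if tok == unk:
            # argmax over the source positions for this target timestep
            result.append(src_tokens[int(np.argmax(attention[t]))])
        else:
            result.append(tok)
    return result

# Toy example: the first target token attends mostly to "You".
src = ["You", "look", "nice"]
out = ["<unk>", "look", "nice"]
att = np.array([[0.9, 0.05, 0.05],
                [0.1, 0.80, 0.10],
                [0.1, 0.10, 0.80]])
print(replace_unk(out, src, att))  # → ['You', 'look', 'nice']
```

Note this only helps when the <unk> genuinely corresponds to a copyable source word; it won't fix the systematic first-word problem described below, which turned out to be a vocab/casing mismatch.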

ashimajain2595 commented 6 years ago

The problem in my case was that the vocab file contained only lowercase words, whereas the train and test data had the first word of each sentence capitalized, so the embeddings for those capitalized forms were never learnt. The issue was fixed when I lowercased all words in the train and test data.
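If you hit the same mismatch, the fix is just to lowercase the corpus files before building the vocab and training. A minimal sketch (the file names and `lowercase_corpus` helper are made up for illustration):

```python
import os
import tempfile

def lowercase_corpus(in_path, out_path):
    # Write a lowercased copy of the corpus so the casing in
    # train/dev/test matches an all-lowercase vocab file.
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line.lower())

# Demo on a throwaway file.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "train.en")
dst = os.path.join(tmp, "train.lc.en")
with open(src, "w", encoding="utf-8") as f:
    f.write("You look nice in that dress .\n")
lowercase_corpus(src, dst)
print(open(dst, encoding="utf-8").read())  # → you look nice in that dress .
```

The alternative is truecasing or keeping case and adding the capitalized forms to the vocab, but lowercasing everything is the simplest way to make train/test data consistent with the vocab.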

@mcjoshi apart from the above, you may also check whether you are including the sos and eos tokens. Missing sos and eos tokens can also cause <unk> outputs.
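A quick sanity check along those lines is to verify the vocab file actually contains the special tokens (nmt's defaults are `<unk>`, `<s>`, and `</s>`). This sketch uses a hypothetical `missing_special_tokens` helper and a throwaway vocab file:

```python
import os
import tempfile

def missing_special_tokens(vocab_path, specials=("<unk>", "<s>", "</s>")):
    # Return the special tokens that are absent from the vocab file
    # (one token per line, as nmt expects).
    with open(vocab_path, encoding="utf-8") as f:
        vocab = {line.strip() for line in f}
    return [tok for tok in specials if tok not in vocab]

# Demo: a vocab file that is missing the sos/eos tokens.
path = os.path.join(tempfile.mkdtemp(), "vocab.en")
with open(path, "w", encoding="utf-8") as f:
    f.write("<unk>\nthe\nclub\n")
print(missing_special_tokens(path))  # → ['<s>', '</s>']
```

If your sos/eos markers differ from the defaults, remember to pass them via the corresponding hyperparameters when training.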