tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial

Getting only unk symbols and/or the same word as output, and BLEU test/dev score 0.0 #413

Open kuab2686 opened 5 years ago

kuab2686 commented 5 years ago

Hi all, I have used the same model architecture and could reproduce the results for the IWSLT Evaluation Campaign. But when I used the above model with attention for English-to-English text summarization, I am getting only `<unk>` as the nmt output. I have used the vocab.en from the IWSLT Evaluation Campaign, but it's still the same. I used nltk.word_tokenize to generate the vocab tokens, saved them to Excel, and then manually copy-pasted the word tokens into the vocab.en file. vocab.en has the following format and is used for both source and target:

`<unk>`
`<s>`
`</s>`
has
gun
my

(The greater-than and less-than symbols around unk, s, and /s were being stripped by the issue formatting.) Let me know what I am doing wrong?
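
For what it's worth, here is a minimal sketch of writing vocab.en directly from the training text with nltk.word_tokenize, skipping the Excel round-trip; the file names and frequency cutoff are assumptions, not something from this thread:

```python
# Sketch: build vocab.en straight from the tokenizer output instead of Excel.
# Assumes the nltk "punkt" tokenizer data is already downloaded.
from collections import Counter

import nltk

counts = Counter()
with open("train.en", encoding="utf-8") as f:   # assumed training text, one sentence per line
    for line in f:
        counts.update(nltk.word_tokenize(line.strip()))

with open("vocab.en", "w", encoding="utf-8") as out:
    # tensorflow/nmt expects these three special tokens at the top of the vocab file
    for tok in ("<unk>", "<s>", "</s>"):
        out.write(tok + "\n")
    # one token per line, most frequent first; the min count of 5 is an assumed threshold
    for tok, c in counts.most_common():
        if c >= 5 and tok not in ("<unk>", "<s>", "</s>"):
            out.write(tok + "\n")
```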

kuab2686 commented 5 years ago

Update: I cleaned all the train, dev, and test sets, generated tokens from the entire train set, and changed the learning rate to 0.01. `<unk>` no longer appears in the output,

but now I am getting the same word, 'great', as output in all cases.
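
One rough way to see whether the vocabulary actually covers the training data (which is what regenerating tokens from the entire train set addresses) is to measure the out-of-vocabulary rate. This is only an illustrative sketch, and the file names are assumptions:

```python
# Sketch: estimate what fraction of training tokens fall outside vocab.en.
# A high OOV rate is one reason the decoder emits <unk> or collapses onto a
# single frequent word. File names are assumptions.
with open("vocab.en", encoding="utf-8") as f:
    vocab = {line.strip() for line in f}

total = oov = 0
with open("train.en", encoding="utf-8") as f:
    for line in f:
        for tok in line.split():        # assumes the file is already tokenized
            total += 1
            if tok not in vocab:
                oov += 1

print(f"OOV rate: {oov / max(total, 1):.2%} ({oov} of {total} tokens)")
```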

s4d3 commented 5 years ago

I am doing an experiment like the Vi-En example (https://github.com/tensorflow/nmt) for my Ja-Id pair, but I got the same result as you: output_dev and output_test were full of `<unk>`, with a BLEU score of 0.0. So I tried the steps from https://github.com/google/seq2seq/blob/master/bin/data/wmt16_en_de.sh:

  1. Tokenize the Id dataset using Moses (not for Ja)
  2. Truecase using Moses
  3. Clean the corpus to sentence lengths 1 to 80 using Moses
  4. Create the vocabulary using $BASE_DIR/bin/tools/generate_vocab.py and cut the first column using awk. Don't try to copy-paste via Excel. I tried that too, but the model always errored on the hash index, complaining about duplicate vocabulary words that had to be deleted, even though there were none when I checked. It also seems that copy-pasting leaves a trailing space after each vocabulary entry, which appears to cause errors as well (see the sketch after this list).
  5. Run the NMT model as in the Vi-En example
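
As a rough complement to the awk pass in step 4, a sketch that strips the trailing spaces and drops the duplicate entries that copy-pasting from Excel tends to introduce; the file names here are assumptions:

```python
# Sketch: normalize a vocabulary file by stripping whitespace and removing
# duplicates while preserving order, the two problems described in step 4.
# Input/output file names are assumptions.
seen = set()
with open("vocab_raw.txt", encoding="utf-8") as src, \
        open("vocab.id", "w", encoding="utf-8") as dst:
    for line in src:
        tok = line.strip()              # drops the trailing space Excel leaves behind
        if tok and tok not in seen:     # skip empty lines and exact duplicates
            seen.add(tok)
            dst.write(tok + "\n")
```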

The BLEU score will increase and output_dev/output_test will also contain the translation results.

I haven't tried changing the learning rate, though others also suggest changing it.

I hope it helps!