tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

English to Spanish: zero BLEU score during training and <unk> output during inference #396

Open MukundKhandelwal opened 6 years ago

MukundKhandelwal commented 6 years ago

Hi,

I am trying to train an English-to-Spanish translation model. However, the BLEU score stays at zero even after 3000 training steps, and when I run inference the output consists only of <unk> tokens.

Here is how I am creating the vocabulary:

    from nltk.corpus import stopwords
    stoplist = stopwords.words('english')
    file = open('/home/ubuntu/europarldata/europarl-v7.es-en.en', encoding='utf-8')  # English corpus
    text = file.read()
    clean = [word for word in text.split() if word not in stoplist]
    from collections import Counter
    count = Counter(clean)
    frequency = count.most_common(17188)
    l1, l2 = zip(*frequency)
    with open('/home/ubuntu/mukund_nmt/spanishdata/vocab.en', 'w') as f:
        for item in l1:
            f.write("%s\n" % item)  # writing the vocab file as a string
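For comparison, here is a minimal sketch of a vocabulary builder that keeps stop words and writes the special tokens first. It rests on two assumptions, not confirmed facts about this setup: that the tutorial's vocabulary loader expects `<unk>`, `<s>` and `</s>` as the first three entries, and that dropping stop words from the vocabulary (but not from the training text) leaves very frequent words such as "the" to be mapped to `<unk>`. The names `build_vocab` and `write_vocab` are hypothetical helpers, not part of nmt.

```python
from collections import Counter

# Assumed special tokens (believed to be the nmt tutorial defaults).
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]


def build_vocab(lines, max_size):
    """Return the special tokens followed by the most frequent tokens.

    Tokens are split on whitespace only and stop words are kept,
    so every frequent word in the corpus can receive its own id.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    # Avoid duplicating a special token that happens to occur in the text.
    for tok in SPECIAL_TOKENS:
        counts.pop(tok, None)
    n_words = max_size - len(SPECIAL_TOKENS)
    words = [tok for tok, _ in counts.most_common(n_words)]
    return SPECIAL_TOKENS + words


def write_vocab(vocab, path):
    """Write one token per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for tok in vocab:
            f.write(tok + "\n")


# Tiny in-memory corpus, for illustration only.
sample = ["the cat sat", "the dog sat", "the end"]
vocab = build_vocab(sample, max_size=5)
```

The same helper would be run once per language, so the target-side vocabulary file is built from the Spanish corpus in the same way.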

Once the vocabulary is created, I run the training as follows:

    python -m nmt.nmt \
        --src=en --tgt=es \
        --vocab_prefix=/home/ubuntu/mukund_nmt/spanish_data/vocab \
        --train_prefix=/home/ubuntu/mukund_nmt/spanish_data/new_train \
        --dev_prefix=/home/ubuntu/mukund_nmt/spanish_data/new_dev \
        --test_prefix=/home/ubuntu/mukund_nmt/spanish_data/new_testing \
        --out_dir=/home/ubuntu/mukund_nmt/spanish_data/model1 \
        --num_train_steps=3000 \
        --steps_per_stats=100 \
        --num_layers=2 \
        --num_units=128 \
        --dropout=0.2 \
        --metrics=bleu

It does give some output at the start, which looks normal to me since it's just the beginning: [image]

However, subsequent training outputs are filled with <unk> tokens, and the BLEU score remains 0 until the end of training. As a result, the inference output is also garbage (shown below): [image]
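One check that might help narrow this down is measuring how much of the data falls outside the vocabulary. The sketch below computes the fraction of whitespace-separated tokens that are missing from a vocabulary list; `unk_rate` is a hypothetical helper, and the assumption that a high rate on either the source or the target side would explain output dominated by `<unk>` is only a guess, not a confirmed diagnosis.

```python
def unk_rate(lines, vocab):
    """Return the fraction of tokens in `lines` that are not in `vocab`.

    Tokens are split on whitespace, mirroring the vocabulary script.
    Returns 0.0 for empty input.
    """
    known = set(vocab)
    total = 0
    unknown = 0
    for line in lines:
        for tok in line.split():
            total += 1
            if tok not in known:
                unknown += 1
    return unknown / total if total else 0.0


# Tiny in-memory example: "sat" and "dog" are out of vocabulary.
rate = unk_rate(["the cat sat", "the dog"], ["the", "cat"])
```

In practice the lines would be read from the training files and the matching vocabulary file for each language, and the two rates compared.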

Can someone please help me with this? Thanks.