tensorflow / models

Models and examples built with TensorFlow

Seq-to-Seq tutorial model predicts nothing except _UNK #771

Closed cshapeshifter closed 7 years ago

cshapeshifter commented 7 years ago

Model directory: tutorials/rnn/translate

I followed this tutorial with the suggested default parameters and trained the network for 3 days (486,800 steps, just over 3 epochs). To train the model, I ran:

python translate.py --data_dir /home/user/nn/seq-to-seq/data/en-to-fr --train_dir checkpoints/

But when I want to try out the resulting model by running

python translate.py --decode --data_dir /home/user/nn/seq-to-seq/data/en-to-fr --train_dir checkpoints/

then I get nothing but "_UNK"s as translations (corresponding to the bucket sizes), e.g.:

Reading model parameters from checkpoints/translate.ckpt-486800
> cat
_UNK _UNK
> these are four words
_UNK _UNK _UNK _UNK _UNK _UNK
> There are three cats on the table
_UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK

I can see nothing wrong with the data files, vocab files, or anything else created by the script. I made no changes to the code except for the following two lines in translate.py and seq2seq_model.py, so that the script uses the local modules. Without this change, the script fails while looking for giga-fren.release2.fr.gz, which was renamed to giga-fren.release2.fixed.fr.gz as correctly referenced by the current version of the tutorial.

#from tensorflow.models.rnn.translate import data_utils
import data_utils
#from tensorflow.models.rnn.translate import seq2seq_model
import seq2seq_model
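
The same effect can be had without commenting lines out. A minimal sketch (the helper name load_first and the try/except pattern are illustrative, not upstream code) that prefers the in-repo module path and falls back to the local copies shipped alongside translate.py:

```python
import importlib


def load_first(*names):
    """Return the first importable module from a list of candidate paths."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of %r could be imported" % (names,))


# e.g. at the top of translate.py:
# data_utils = load_first("tensorflow.models.rnn.translate.data_utils",
#                         "data_utils")
# seq2seq_model = load_first("tensorflow.models.rnn.translate.seq2seq_model",
#                            "seq2seq_model")
```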

Can someone confirm the problem? To reproduce: with TF 0.12 installed, change the imports of data_utils and seq2seq_model to be local and run the commands above to train and test the model. I don't understand what's wrong.

PS: The perplexity during training is suspiciously low, too:

global step 400 learning rate 0.5000 step-time 0.51 perplexity 3.12
  eval: bucket 0 perplexity 7.41
  eval: bucket 1 perplexity 2.93
  eval: bucket 2 perplexity 1.63
  eval: bucket 3 perplexity 1.28
global step 600 learning rate 0.5000 step-time 0.45 perplexity 1.25
  eval: bucket 0 perplexity 1.51
  eval: bucket 1 perplexity 1.26
  eval: bucket 2 perplexity 1.19
  eval: bucket 3 perplexity 1.12
global step 800 learning rate 0.5000 step-time 0.48 perplexity 1.18
  eval: bucket 0 perplexity 1.51
  eval: bucket 1 perplexity 1.27
  eval: bucket 2 perplexity 1.19
  eval: bucket 3 perplexity 1.11
global step 1000 learning rate 0.5000 step-time 0.47 perplexity 1.17
  eval: bucket 0 perplexity 1.47
  eval: bucket 1 perplexity 1.28
  eval: bucket 2 perplexity 1.18
  eval: bucket 3 perplexity 1.11

...

global step 486400 learning rate 0.0011 step-time 0.48 perplexity 1.15
  eval: bucket 0 perplexity 1.47
  eval: bucket 1 perplexity 1.28
  eval: bucket 2 perplexity 1.15
  eval: bucket 3 perplexity 1.10
global step 486600 learning rate 0.0011 step-time 0.48 perplexity 1.15
  eval: bucket 0 perplexity 1.44
  eval: bucket 1 perplexity 1.25
  eval: bucket 2 perplexity 1.16
  eval: bucket 3 perplexity 1.11

I don't think it's learning anything.
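
(In hindsight, the low numbers are consistent with the bug described below: perplexity is exp of the average per-token cross-entropy, so if the target files contain almost nothing but one token, a model that learns "always predict _UNK" drives perplexity toward 1. A quick sanity check, with an illustrative 90% probability mass on _UNK:)

```python
import math


def perplexity(avg_cross_entropy_nats):
    """Perplexity is exp of the average per-token cross-entropy (in nats)."""
    return math.exp(avg_cross_entropy_nats)


# If the targets are all the same token and the model assigns it ~90%
# probability, the per-token cross-entropy is -ln(0.9):
loss = -math.log(0.9)
print(round(perplexity(loss), 3))  # ~1.111, in the range of the eval numbers above
```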

cshapeshifter commented 7 years ago

I may have found the problem:

In tutorials/rnn/translate/data_utils.py, the words loaded from the vocabulary are stored as strings instead of bytes:

rev_vocab = [line.strip() for line in rev_vocab]

should probably be

rev_vocab = [tf.compat.as_bytes(line.strip()) for line in rev_vocab]

The issue was probably introduced in March, when a switch was made to using bytes in translate.py but not in data_utils.py. The vocabulary is loaded as strings while the words being looked up are bytes, so vocabulary.get in sentence_to_token_ids never finds a match and always returns the default, UNK_ID. As a result, the *.ids40000.* files are filled with nothing but 3s.

Changing the line above in data_utils.py solves the problem.
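
The mismatch is easy to reproduce in isolation (the toy vocab below is illustrative; UNK_ID = 3 matches data_utils.py):

```python
# Minimal reproduction: vocabulary keys loaded as str, lookups done with
# bytes. dict.get never matches, so every token maps to UNK_ID (3),
# exactly as seen in the *.ids40000.* files.
UNK_ID = 3

vocab_str = {"cat": 5, "table": 17}      # keys as str (the bug)
vocab_bytes = {b"cat": 5, b"table": 17}  # keys as bytes (the fix)

word = b"cat"  # translate.py hands tokens around as bytes under Python 3
print(vocab_str.get(word, UNK_ID))    # 3  -> always falls back to _UNK
print(vocab_bytes.get(word, UNK_ID))  # 5  -> correct lookup
```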

aselle commented 7 years ago

@nealwu, could you update this example to be Python 3 compatible as @cshapeshifter suggests?

nealwu commented 7 years ago

See https://github.com/tensorflow/models/pull/823

Rpersie commented 7 years ago

Hello @cshapeshifter, I updated the scripts, but I don't see your problem. However, I can't reach a perplexity as low as yours. After applying your change, is your perplexity still that low? And how good are your translation results? Could you paste them? By the way, I checked the newest code, and it doesn't change the rev_vocab = [line.strip() for line in rev_vocab] line. Thank you!

cshapeshifter commented 7 years ago

With the aforementioned fix and --max_train_data_size 16000000 (about 71% of all available data, I think; the full dataset doesn't fit in my 32 GB of RAM), I get the following perplexity after training for 1,058,200 steps (~4.2 epochs over 1.6M sentences):

global step 1058200 learning rate 0.0000 step-time 0.48 perplexity 2.13
  eval: bucket 0 perplexity 1.96
  eval: bucket 1 perplexity 2.21
  eval: bucket 2 perplexity 2.51
  eval: bucket 3 perplexity 3.03

Global perplexity hovers around 2.12. For the buckets it fluctuates between ~1.8 and ~3.10.

The translations aren't particularly good...

Reading model parameters from checkpoints/translate.ckpt-1058400
> Who is the president of the United States?
Qui est le président de la United States ?
> There are several people at the train station.
Il y a plusieurs personnes à la station .
> Yesterday, I went to the mall to buy new clothes.
_UNK , j’ai été _UNK à la maison de _UNK pour acheter des vêtements .
> Tom had a dog named Jerry.
Le capitaine a eu un nom de _UNK .

Could be worse, could be better...

nealwu commented 7 years ago

FYI, I believe there should be an update to this model soon. See https://github.com/tensorflow/models/issues/814.

Going to close this now since it should be solved. If you are still having issues, feel free to reopen it. Thanks!