I may have found the problem:
In tutorials/rnn/translate/data_utils.py, the words loaded from the vocabulary are stored as strings instead of bytes.
```python
rev_vocab = [line.strip() for line in rev_vocab]
```
should probably be
```python
rev_vocab = [tf.compat.as_bytes(line.strip()) for line in rev_vocab]
```
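For context, here is roughly what `data_utils.initialize_vocabulary` looks like with the fix applied (reconstructed from memory of the tutorial, so treat the surrounding details as approximate):
```python
import tensorflow as tf
from tensorflow.python.platform import gfile

def initialize_vocabulary(vocabulary_path):
    """Load vocabulary from file; return (vocab dict, reversed vocab list)."""
    if gfile.Exists(vocabulary_path):
        rev_vocab = []
        with gfile.GFile(vocabulary_path, mode="rb") as f:
            rev_vocab.extend(f.readlines())
        # The fix: keep each entry as bytes so lookups with byte strings
        # (as produced by the tokenizer) actually hit the dictionary.
        rev_vocab = [tf.compat.as_bytes(line.strip()) for line in rev_vocab]
        vocab = dict([(x, y) for (y, x) in enumerate(rev_vocab)])
        return vocab, rev_vocab
    else:
        raise ValueError("Vocabulary file %s not found." % vocabulary_path)
```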
The issue was probably introduced in March, when a switch was made to using bytes in translate.py but not in data_utils.py. As a result, the vocabulary is loaded as strings while the data words being looked up are bytes, so `vocabulary.get` in `sentence_to_token_ids` always returns the default, UNK_ID, because none of the bytes keys are actually in the vocab. This means the `*.ids40000.*` files are filled with nothing but 3s.
Changing the line above in data_utils.py solves the problem.
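To see why the lookup fails, here is a minimal, self-contained demonstration (the vocabulary contents are made up; UNK_ID is 3 in the tutorial):
```python
UNK_ID = 3

vocab_str = {"hello": 4, "world": 5}   # keys as str: the buggy state
words = [b"hello", b"world"]           # the tokenizer produces bytes

print([vocab_str.get(w, UNK_ID) for w in words])
# -> [3, 3]: in Python 3, b"hello" != "hello", so every lookup misses

vocab_bytes = {k.encode("utf-8"): v for k, v in vocab_str.items()}
print([vocab_bytes.get(w, UNK_ID) for w in words])
# -> [4, 5]: with bytes keys, the lookups succeed
```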
@nealwu, could you update this example to be Python 3 compatible as @cshapeshifter suggests?
Hello @cshapeshifter, I updated the scripts, but I don't run into your problem. However, I can't reach perplexity as low as yours. Now that you've changed the code, is your perplexity still that low? And how good are your translation results? Can you paste them? By the way, I checked the newest code, and it doesn't change the `rev_vocab = [line.strip() for line in rev_vocab]` line. Thank you!
With the aforementioned fix and using `--max_train_data_size 16000000` (about 71% of all available data, I think, since the full dataset doesn't fit in my 32 GB of memory), I get the following perplexity after training for 1058200 steps (~4.2 epochs over 1.6M sentences):
```
global step 1058200 learning rate 0.0000 step-time 0.48 perplexity 2.13
  eval: bucket 0 perplexity 1.96
  eval: bucket 1 perplexity 2.21
  eval: bucket 2 perplexity 2.51
  eval: bucket 3 perplexity 3.03
```
Global perplexity hovers around 2.12. For the buckets it fluctuates between ~1.8 and ~3.10.
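If I read translate.py correctly, the reported perplexity is just the exponential of the average cross-entropy loss; a quick sketch of that conversion:
```python
import math

def to_perplexity(loss):
    # translate.py reports exp(loss), capped to avoid math range errors
    return math.exp(float(loss)) if loss < 300 else float("inf")

print(round(to_perplexity(0.756), 2))  # 2.13, matching the global figure above
```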
The translations aren't particularly good...
```
Reading model parameters from checkpoints/translate.ckpt-1058400
> Who is the president of the United States?
Qui est le président de la United States ?
> There are several people at the train station.
Il y a plusieurs personnes à la station .
> Yesterday, I went to the mall to buy new clothes.
_UNK , j’ai été _UNK à la maison de _UNK pour acheter des vêtements .
> Tom had a dog named Jerry.
Le capitaine a eu un nom de _UNK .
```
Could be worse, could be better...
FYI, I believe there should be an update to this model soon. See https://github.com/tensorflow/models/issues/814.
Going to close this now since it should be solved. If you are still having issues, feel free to reopen it. Thanks!
Model directory:
tutorials/rnn/translate
I followed this tutorial with the suggested default parameters and let the network train for 3 days, completing 486800 steps over more than 3 epochs. To train the model, I ran the training command from the tutorial.
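For reference, a sketch of what that invocation looks like, using translate.py's documented flags with placeholder paths (the exact values are illustrative, not a record of my command):
```sh
python translate.py \
  --data_dir [your_data_directory] \
  --train_dir [checkpoints_directory]
```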
But when I try out the resulting model in decode mode (see the sketch below), I get nothing but "_UNK"s as translations (corresponding to the bucket sizes).
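The decode-mode run referenced above would look roughly like this (`--decode` is translate.py's flag for interactive translation; paths are again placeholders):
```sh
python translate.py --decode \
  --data_dir [your_data_directory] \
  --train_dir [checkpoints_directory]
```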
I can see nothing wrong with the data files, the vocab files, or anything else created by the script. I made no changes to the code, except for changing two import lines in translate.py and seq2seq_model.py so that the scripts use the local libs (see the sketch below). Without this change, the script fails while looking for giga-fren.release2.fr.gz, which was renamed to giga-fren.release2.fixed.fr.gz, as correctly referenced by the current version of the tutorial.

Can someone confirm the problem? Basically, just have TF 0.12 installed, change the imports of data_utils and seq2seq_model to be local, and run the commands mentioned higher up to train and try the model. I don't understand what's wrong.
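If memory serves, the change amounts to replacing the package-level imports with local ones; treat the exact original import paths as an assumption:
```python
# In translate.py (and analogously in seq2seq_model.py):
#
# before -- package-level imports from the TensorFlow models tree:
#   from tensorflow.models.rnn.translate import data_utils
#   from tensorflow.models.rnn.translate import seq2seq_model
#
# after -- import the copies sitting next to the script:
import data_utils
import seq2seq_model
```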
PS: The perplexity during training is suspiciously low, too:
...
I don't think it's learning anything.