google / seq2seq

A general-purpose encoder-decoder framework for Tensorflow
https://google.github.io/seq2seq/
Apache License 2.0

Failed to reproduce the result for NMT #216

Open kyleyeung opened 7 years ago

kyleyeung commented 7 years ago

I've been using the same parameters and datasets as in the NMT tutorial, but the BLEU score ends up at around 5 instead of going up to 20+. Could anyone please give some advice on this?

export VOCAB_SOURCE=${DATA_PATH}/vocab.bpe.32000
export VOCAB_TARGET=${DATA_PATH}/vocab.bpe.32000
export TRAIN_SOURCES=${DATA_PATH}/train.tok.clean.bpe.32000.en
export TRAIN_TARGETS=${DATA_PATH}/train.tok.clean.bpe.32000.de
export DEV_SOURCES=${DATA_PATH}/newstest2013.tok.bpe.32000.en
export DEV_TARGETS=${DATA_PATH}/newstest2013.tok.bpe.32000.de

export DEV_TARGETS_REF=${DATA_PATH}/newstest2013.tok.de
export TRAIN_STEPS=1000000

python -m bin.train \
  --config_paths="
      ./example_configs/nmt_large.yml,
      ./example_configs/train_seq2seq.yml,
      ./example_configs/text_metrics_bpe.yml" \
  --model_params "
      vocab_source: $VOCAB_SOURCE
      vocab_target: $VOCAB_TARGET" \
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_files:
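
(The command above is cut off after source_files:. For reference, the NMT tutorial's full invocation continues roughly as sketched below; $MODEL_DIR and the dev input pipeline block are taken from the tutorial rather than from the poster's exact command, so treat this as an approximation.)

python -m bin.train \
  --config_paths="
      ./example_configs/nmt_large.yml,
      ./example_configs/train_seq2seq.yml,
      ./example_configs/text_metrics_bpe.yml" \
  --model_params "
      vocab_source: $VOCAB_SOURCE
      vocab_target: $VOCAB_TARGET" \
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $TRAIN_SOURCES
      target_files:
        - $TRAIN_TARGETS" \
  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES
      target_files:
        - $DEV_TARGETS" \
  --batch_size 32 \
  --train_steps $TRAIN_STEPS \
  --output_dir $MODEL_DIR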

ghost commented 7 years ago

I might have a similar problem.

When training a small network (the example small config, 200,000 training sentences, a 2,000-sentence dev set, a 2,000-sentence test set, 178,500 steps, batch size 32) on CPU with the old repo (before the continuous_train_and_eval fix), I managed to get a BLEU of around 15. When I tried to replicate this experiment afterwards with the new repo, I only got around 5 BLEU.

My guess is that there might be something wrong with the fixed training schedule (as this is the only thing that differed between my different runs), but I'm not entirely sure.

matthias-samwald commented 7 years ago

@kyleyeung @milanv1 I have a similar problem (https://github.com/google/seq2seq/issues/197). As mentioned in the comments at https://github.com/google/seq2seq/issues/181#issuecomment-299858690, I suspect that tf-seq2seq is only using a small part of the training data, and the problem could be related to the data loading and queuing pipeline. However, I have little knowledge of how this works in Tensorflow, so I can only guess. Unfortunately, the original developers have been inactive here for the last 2-3 weeks.

kyleyeung commented 7 years ago

@matthias-samwald Thanks! I actually saw your comments about the queuing pipeline, but it's hard for me to dig into that mechanism as well. Gonna try the previous repo mentioned by @milanv1.

kyleyeung commented 7 years ago

I tried different versions of train.py from before and after the training-schedule fix, on GPU under both tf 1.0 and tf 1.1; the BLEU scores still topped out at around 7.5.

matthias-samwald commented 7 years ago

@kyleyeung Thanks for reporting. I guess I will stop trying for now and revisit this project for official updates once in a while. Unfortunately, tf-seq2seq seems unusable to me in its current state.

matthias-samwald commented 7 years ago

Okay, instead of giving up I also tried switching to the version from before the "continuous eval" patch. Tensorflow 1.1 does not produce the errors during evaluation that motivated that patch, which is great. Furthermore, the loss curve without the patch now looks much better and does not indicate early overfitting (only 1 GPU-day into training, so I can't yet say how good the end results will be).

ghost commented 7 years ago

@matthias-samwald

Sounds promising! Do you have an example of how to implement this?

matthias-samwald commented 7 years ago

@milanv1 I just followed your advice that the previous version worked better, so I checked out the last commit before the continuous train-and-eval patches were committed (https://github.com/google/seq2seq/commit/93c600a708a3fdd0473c3b3ce64122f3150bc4ef).
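
(For anyone wanting to try the same thing, a minimal sketch of pinning the repo to that commit; the hash is the one linked above, and the editable install is an assumption based on the repo's setup.py.)

git clone https://github.com/google/seq2seq.git
cd seq2seq
git checkout 93c600a708a3fdd0473c3b3ce64122f3150bc4ef
pip install -e .   # editable install so bin.train runs from this revision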

ghost commented 7 years ago

@matthias-samwald

Oh okay, my bad. I thought you meant you had tried a third kind of schedule, "before continuous eval", but I read your message the wrong way.

So upgrading to Tensorflow 1.1 and using the previous repo avoids the crashes at evaluation? I'm currently training with Tensorflow 1.0.1 and the previous repo, but I had to reduce my dev set again (from 2,000 sentences to 96) or training would crash at evaluation.

matthias-samwald commented 7 years ago

@milanv1 No crashes so far, using the same training and validation data (from the NMT tutorial) that previously caused the crashes during evaluation.

ghost commented 7 years ago

@matthias-samwald

Very curious about the results; could you post your results here once training is finished?

matthias-samwald commented 7 years ago

@milanv1 Sure. I'm currently training the large model, modified to use layer-norm LSTM cells with recurrent dropout. Current validation BLEU is 7.5 at step 58k.

matthias-samwald commented 7 years ago

@milanv1 Sorry, I cheered too early. Overnight training crashed with an InvalidArgumentError...

kyleyeung commented 7 years ago

@milanv1 @matthias-samwald Could it be that the BLEU metric reported in the tutorial is different from what we see in Tensorboard? Looking at the scores, the values around 6-8 could plausibly be a BLEU-1 score, and the values around 20 could be a BLEU-4 score?
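
(One way to check this: score the decoded dev-set output directly with the Moses multi-bleu.perl script, which prints the corpus-level 4-gram BLEU together with the individual 1-4 gram precisions. In the sketch below, PRED_FILE is a placeholder for wherever the decoded translations were written, and the script path assumes the copy under bin/tools/ that the tutorial references.)

# Prints a line of the form
#   BLEU = <score>, <p1>/<p2>/<p3>/<p4> (BP=..., ratio=..., ...)
# where <score> is the usual 4-gram BLEU and p1..p4 are the per-n-gram
# precisions, so a BLEU-1 vs. BLEU-4 mix-up would show up immediately.
./bin/tools/multi-bleu.perl ${DEV_TARGETS_REF} < ${PRED_FILE}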

ghost commented 7 years ago

@matthias-samwald Did switching to the old version work for you? I am also seeing the BLEU score saturate at around 5.0. I am currently at 300K steps.

matthias-samwald commented 7 years ago

@jongsae No, because then other bugs re-emerged. For the time being, I gave up on tf-seq2seq.

ghost commented 7 years ago

@matthias-samwald Well, the fact that the documentation page shows BLEU and log perplexity converging seems to mean that there was a version that at least worked. I was about to launch 93c600a, 761c393, and 731de14. Is there any commit that you have already tried and checked?

matthias-samwald commented 7 years ago

The older version before the patch worked in some settings (obviously those used by the developers), but seemingly not in some others, like mine.

ghost commented 7 years ago

@matthias-samwald I noticed that none of 93c600a, 761c393, and 731de14 works. They all crash with the InvalidArgumentError you mentioned in your previous comment. I'm leaving this here for future reference.

ghost commented 7 years ago

@matthias-samwald We used 333fcee and got it working. You have to specify the model's hyperparameters manually instead of using the .yml files (e.g., nmt_large.yml). The following is one example:

python -m bin.train \
  --train_source $TRAIN_SOURCES \
  --train_target $TRAIN_TARGETS \
  --dev_source $DEV_SOURCES \
  --dev_target $DEV_TARGETS \
  --vocab_source $VOCAB_SOURCE \
  --vocab_target $VOCAB_TARGET \
  --batch_size 64 \
  --train_steps $TRAIN_STEPS \
  --output_dir $MODEL_DIR \
  --hparams '
    embedding.dim: 512
    embedding.share: false
    encoder.type: "BidirectionalRNNEncoder"
    encoder.rnn_cell.cell_spec: {"class": "BasicLSTMCell", "num_units": 512}
    encoder.rnn_cell.residual_dense: false
    encoder.rnn_cell.dropout_input_keep_prob: 0.8
    encoder.rnn_cell.dropout_output_keep_prob: 1.0
    encoder.rnn_cell.residual_combiner: "add"
    encoder.rnn_cell.num_layers: 2
    encoder.rnn_cell.residual_connections: true
    bridge_spec: {"class": "ZeroBridge"}
    attention.dim: 512
    attention.score_type: "bahdanau"
    decoder.rnn_cell.cell_spec: {"class": "BasicLSTMCell", "num_units": 512}
    decoder.rnn_cell.residual_connections: true
    decoder.rnn_cell.residual_combiner: "add"
    decoder.rnn_cell.dropout_input_keep_prob: 0.8
    decoder.rnn_cell.dropout_output_keep_prob: 1.0
    decoder.rnn_cell.num_layers: 4
    decoder.rnn_cell.residual_dense: false
    optimizer.name: "Adam"
    optimizer.lr_stop_decay_at: 1000000000.0
    optimizer.lr_start_decay_at: 0
    optimizer.clip_gradients: 5.0
    optimizer.lr_decay_steps: 100
    optimizer.learning_rate: 0.0001
    optimizer.lr_min_learning_rate: 1e-12
    optimizer.lr_staircase: false
    optimizer.lr_decay_rate: 0.99
    optimizer.lr_decay_type: ""
    inference.beam_search.length_penalty_weight: 0.0
    inference.max_decode_length: 100
    inference.beam_search.choose_successors_fn: "choose_top_k"
    inference.beam_search.beam_width: 0
    source.reverse: true
    source.max_seq_len: 50
    target.max_seq_len: 50
  '

In this example script I enabled residual connections, but you can disable them. When I ran the residual-enabled and residual-disabled versions, the residual-enabled model performed better (it reached a BLEU score above 20.0), while the disabled one only reached about 19.3. We used one Tesla P100 per model and training took about three days. Both seem to be converging to around 20, which makes me believe that the 333fcee version works.
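
(Presumably the residual-disabled run only differs by flipping the two residual flags inside --hparams; everything else in the command above stays the same.)

    # inside --hparams, to disable residual connections:
    encoder.rnn_cell.residual_connections: false
    decoder.rnn_cell.residual_connections: false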

CruelPaw commented 6 years ago

@matthias-samwald I have the same problem. I use nmt_large.yml and my BLEU score is around 3. I think your suspicion may be right: it seems the model (and its input pipeline) is re-created before and after each evaluation. Since "eval_every_n_steps" is 1000 and the batch size is 32, I suspect the model only ever uses the first 32,000 lines of data. When I changed "eval_every_n_steps" to 10000, the BLEU score ended up at 11-12; when I changed it to 100000, the BLEU score was 13.76 at the first evaluation point. So I think you are right: we should make "eval_every_n_steps" very large to work around this problem. But I don't know how the plot in the tutorial was drawn.
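
(A rough sketch of that workaround. Whether eval_every_n_steps is passed as a flag to bin.train or set inside train_seq2seq.yml depends on the commit you are on, so the spelling below is an assumption based on the 1000-step default mentioned above.)

python -m bin.train \
  --config_paths="
      ./example_configs/nmt_large.yml,
      ./example_configs/train_seq2seq.yml,
      ./example_configs/text_metrics_bpe.yml" \
  --eval_every_n_steps 100000 \
  ...   # remaining flags unchanged from the original training command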

raburabu91 commented 6 years ago

@YFSAO I have the same problem. I use nmt_large.yml and my BLEU score is always lower than 3. I also checked the sample file, and it seems the sentences were drawn from only the first few lines. Now I'm following your method of making "eval_every_n_steps" very large, but the run has not finished yet. Are there any other ways to solve this problem? I'm worried it takes too long to wait for training to end before evaluation begins. Does it work well for you?