google / seq2seq

A general-purpose encoder-decoder framework for Tensorflow
https://google.github.io/seq2seq/
Apache License 2.0

BLEU during evaluation in training phase different from inference #214

Open · sld opened this issue 7 years ago

sld commented 7 years ago

Hello!

I'm trying to calculate BLEU in two ways:

  1. During evaluation in training, by passing the dev set.
  2. Using bin.infer with a checkpoint (also on the dev set).

I expect equal results because the checkpoint and the dev set are identical in 1. and 2. But for some reason I get very different results: 3.03 (train eval) vs. 2.26 (inference).

Here is my train script:

python -m bin.train \
  --config_paths="
      ./seq2seq/example_configs/reverse_squad.yml,
      ./seq2seq/example_configs/train_seq2seq.yml,
      ./seq2seq/example_configs/text_metrics_raw_bleu.yml" \
  --model_params "
      vocab_source: $VOCAB_SOURCE
      vocab_target: $VOCAB_TARGET" \
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $TRAIN_SOURCES
      target_files:
        - $TRAIN_TARGETS" \
  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES
      target_files:
        - $DEV_TARGETS" \
  --batch_size 64 \
  --train_steps $TRAIN_STEPS \
  --output_dir $MODEL_DIR \
  --eval_every_n_steps=1000 \
  --keep_checkpoint_max=50

And the inference eval script:

python -m bin.infer \
  --tasks "
    - class: DecodeText" \
  --model_dir $MODEL_DIR \
  --batch_size 256 \
  --checkpoint_path $MODEL_DIR/model.ckpt-13001 \
  --input_pipeline "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES" \
  > ${PRED_DIR}/predictions.txt

./seq2seq/bin/tools/multi-bleu.perl ${DEV_TARGETS_REF} < ${PRED_DIR}/predictions.txt

https://github.com/google/seq2seq/blob/master/seq2seq/metrics/bleu.py#L62 - this is the file where the BLEU calculation is performed during training. I tried dumping the hypothesis and reference files, and the BLEU computed during training looks right.

I suspect there may be some issue with model initialization in infer mode, because the model output during train eval and the model output during inference are different.
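
(For anyone who wants to reproduce the check above, here is a minimal sketch of re-scoring dumped hypothesis/reference files with the same helper the training-time metric uses. It assumes the moses_multi_bleu function in seq2seq/metrics/bleu.py keeps its usual signature, and the file names are hypothetical placeholders for the dumped files.)

# Minimal sketch: re-score dumped hypothesis/reference files the same way the
# training-time BLEU metric does. Assumes moses_multi_bleu(hypotheses,
# references, lowercase=False) from seq2seq/metrics/bleu.py; file names below
# are hypothetical.
import numpy as np
from seq2seq.metrics.bleu import moses_multi_bleu

with open("eval_hypotheses.txt") as f:
  hypotheses = np.array([line.strip() for line in f])
with open("eval_references.txt") as f:
  references = np.array([line.strip() for line in f])

# moses_multi_bleu wraps the same multi-bleu.perl scoring used by the eval
# metric, so this should match the score logged during training evaluation.
print(moses_multi_bleu(hypotheses, references, lowercase=False))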

parajain commented 7 years ago

I am also facing a similar issue. During training, BLEU and the sample predictions are much better compared to inference.

parajain commented 7 years ago

Looks like those are from the training data, so they are expected to be better. https://github.com/google/seq2seq/blob/93c600a708a3fdd0473c3b3ce64122f3150bc4ef/seq2seq/training/hooks.py#L141

In that case there is no issue.

sld commented 7 years ago

Evaluation during training runs on the dev set; it is configured by passing --input_pipeline_dev. Also, evaluation does not happen in training/hooks.py; it happens here: https://github.com/google/seq2seq/blob/master/seq2seq/metrics/metric_specs.py#L173 So no evaluation is run on the training data.

  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES
      target_files:
        - $DEV_TARGETS" \

pooyadavoodi commented 7 years ago

Isn't evaluation done the same way as training, i.e., feeding all the target tokens to the decoder and comparing the predictions with the targets? Inference, on the other hand, is done by feeding the previously predicted token back to the decoder in order to predict the next token.

If I am right, the difference between evaluation and inference makes sense.
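
(To make the distinction concrete, here is a minimal, self-contained sketch using the TF 1.x tf.contrib.seq2seq helpers. The shapes, vocab size, and start/end token ids are illustrative assumptions, not values taken from this repo's configs.)

import tensorflow as tf

batch_size, max_len, embed_dim, vocab_size = 4, 10, 32, 7063
embedding = tf.get_variable("embedding", [vocab_size, embed_dim])
target_ids = tf.placeholder(tf.int32, [batch_size, max_len])
target_lengths = tf.placeholder(tf.int32, [batch_size])

# Train/eval path ("teacher forcing"): the decoder reads the ground-truth
# target embeddings at every step.
train_helper = tf.contrib.seq2seq.TrainingHelper(
    inputs=tf.nn.embedding_lookup(embedding, target_ids),
    sequence_length=target_lengths)

# Inference path: the decoder embeds its own previous prediction and feeds it
# back as the next input, so errors can compound step by step.
infer_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding=embedding,
    start_tokens=tf.fill([batch_size], 1),  # assumed SEQUENCE_START id
    end_token=2)                            # assumed SEQUENCE_END id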

tobyyouup commented 7 years ago

@pooyadavoodi I have also run into this problem. Evaluation is indeed feeding all the target tokens to the decoder and comparing the predictions with the targets.

I think this way of evaluating is like cheating, because at test time you feed the previously predicted token to the decoder.

So the evaluation score cannot be used as an indicator of how well the model is training. Why is evaluation implemented like this?

pooyadavoodi commented 7 years ago

You are right. The seq2seq encoder-decoder algorithm is a little different from typical classification algorithms. For both training and validation, prediction is done based on a given target (besides using the target to compute the loss), whereas in typical classification models the target is only used to compute the loss.

A true evaluation of a trained model is done by running inference and computing a BLEU score (not by validation). In fact, I have seen models with very good validation perplexity that obtain a bad BLEU score.

lifeiteng commented 7 years ago

@pooyadavoodi Evaluation and inference both use GreedyEmbeddingHelper. When decoding, GreedyEmbeddingHelper samples the decoder's next_inputs (not the ground truth), so the results should be the same.

pooyadavoodi commented 7 years ago

I am not familiar with this code, but doesn't this use the target sequences (ground truth) as an input to the decoder: https://github.com/google/seq2seq/blob/master/seq2seq/models/basic_seq2seq.py#L82-L83

lifeiteng commented 7 years ago

Those lines are in the _decode_train code.
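
(For reference, a paraphrased sketch of the dispatch those two methods hang off of; this is approximate rather than a verbatim copy of basic_seq2seq.py. As jbdel's modification below suggests, only INFER takes the _decode_infer path by default, so EVAL goes through _decode_train and feeds the ground-truth targets.)

# Paraphrased sketch, not verbatim repo code: _decode_dispatch is a made-up
# name for the branch that chooses between the two decode paths.
import tensorflow as tf

def _decode_dispatch(self, decoder, bridge, encoder_output, features, labels):
  if self.mode == tf.contrib.learn.ModeKeys.INFER:
    # Greedy path: GreedyEmbeddingHelper feeds back previous predictions.
    return self._decode_infer(decoder, bridge, encoder_output, features,
                              labels)
  # TRAIN and EVAL: TrainingHelper feeds the ground-truth targets.
  return self._decode_train(decoder, bridge, encoder_output, features, labels)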

jbdel commented 7 years ago

Hello,

I'm actually trying to have evaluation (during training) sample like GreedyEmbeddingHelper does, instead of using the ground truth like TrainingHelper does.

However, doing so with

if self.mode == tf.contrib.learn.ModeKeys.INFER or self.mode == tf.contrib.learn.ModeKeys.EVAL:
  return self._decode_infer(decoder, bridge, encoder_output, features,
                            labels)

returns an InvalidArgumentError: tensorflow.python.framework.errors_impl.InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [672,7063] and labels shape [1152]

It's hard to know why it does that. Does anybody have an idea?
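
(A guess at the cause, sketched below with made-up numbers: if the loss is still computed against the padded dev-set targets while the logits now come from _decode_infer, the decoded length and the target length differ, so the flattened first dimensions no longer match.)

# Hypothetical sketch of the mismatch (batch size and lengths are illustrative,
# not derived from the error message): greedy decoding stops at its own length,
# but the labels keep the padded dev-target length, so flattening both for the
# softmax loss gives different first dimensions and an InvalidArgumentError.
batch_size = 4
decoded_len = 5     # length _decode_infer happened to stop at
target_len = 9      # padded length of the dev-set targets

flat_logits_dim = batch_size * decoded_len   # first dim of [*, vocab] logits
flat_labels_dim = batch_size * target_len    # first dim of the flat labels

print(flat_logits_dim, "vs", flat_labels_dim)  # 20 vs 36: same kind of mismatch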