sld opened this issue 7 years ago
I am also facing a similar issue: during training, the BLEU score and the sample predictions are much better than at inference time.
It looks like those samples come from the training data, so I expected them to be better: https://github.com/google/seq2seq/blob/93c600a708a3fdd0473c3b3ce64122f3150bc4ef/seq2seq/training/hooks.py#L141
In that case there is no issue.
Evaluation during training runs on the dev set, which is passed via --input_pipeline_dev.
Also, evaluation does not go through training/hooks.py; it happens here: https://github.com/google/seq2seq/blob/master/seq2seq/metrics/metric_specs.py#L173
So there is no evaluation on the training data.
```
--input_pipeline_dev "
  class: ParallelTextInputPipeline
  params:
    source_files:
      - $DEV_SOURCES
    target_files:
      - $DEV_TARGETS" \
```
Isn't evaluation done the same way as training, i.e. by feeding all the target tokens to the decoder and comparing the predictions with the targets? Inference, by contrast, feeds the previously predicted token back to the decoder in order to predict the next token.
If I am right, the difference between evaluation and inference makes sense.
@pooyadavoodi I have also found this problem: evaluation really does feed all the target tokens to the decoder and compare the predictions with the targets.
I think this way of evaluating is like cheating, because at test time you feed the previously predicted token to the decoder.
So the evaluation score cannot be used as an indicator of how well the model is training. Why is evaluation implemented like this?
You are right. A seq2seq encoder-decoder is a little different from typical classification models. For both training and validation, prediction is conditioned on the given target (besides using the target to compute the loss), whereas in typical classification models the target is only used to compute the loss.
A true evaluation of a trained model is done with inference and a BLEU score (not validation). In fact, I have seen models with very good validation perplexity that obtain a bad BLEU score.
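To make the distinction concrete, here is a toy, pure-Python sketch (not the seq2seq library's API) of teacher-forced decoding versus autoregressive decoding. The `next_token` "model" and its deliberate mistake are invented for illustration:

```python
# Toy "model": predicts the next token as (prev + 1) % VOCAB,
# except it makes a mistake after token 3, to show how errors compound.
VOCAB = 10

def next_token(prev):
    if prev == 3:
        return 7  # deliberate wrong transition
    return (prev + 1) % VOCAB

def teacher_forced_predictions(target):
    """Evaluation-style decoding: each step is conditioned on the
    ground-truth previous token, so one mistake stays local."""
    preds, prev = [], 0  # 0 acts as the start-of-sequence token
    for t in target:
        preds.append(next_token(prev))
        prev = t  # feed the ground truth back in
    return preds

def autoregressive_predictions(length):
    """Inference-style decoding: each step is conditioned on the model's
    own previous prediction, so one mistake derails all later steps."""
    preds, prev = [], 0
    for _ in range(length):
        p = next_token(prev)
        preds.append(p)
        prev = p  # feed the prediction back in
    return preds

target = [1, 2, 3, 4, 5, 6]
print(teacher_forced_predictions(target))      # [1, 2, 3, 7, 5, 6] - one error
print(autoregressive_predictions(len(target))) # [1, 2, 3, 7, 8, 9] - compounding
```

This is why a teacher-forced validation score (perplexity or per-token accuracy) can look much better than the BLEU obtained by free-running inference on the same data.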
@pooyadavoodi
Evaluation and inference both use `GreedyEmbeddingHelper`. When decoding, `GreedyEmbeddingHelper` samples the decoder's `next_inputs` (not the ground truth), so the results should be the same.
I am not familiar with this code, but doesn't this use the target sequences (the ground truth) as an input to the decoder? https://github.com/google/seq2seq/blob/master/seq2seq/models/basic_seq2seq.py#L82-L83
That is the `_decode_train` code.
Hello,
I'm trying to run evaluation (during training) without the ground truth, i.e. by sampling like `GreedyEmbeddingHelper` does, instead of using `TrainingHelper`. However, when I change the mode check to

```python
if self.mode == tf.contrib.learn.ModeKeys.INFER or self.mode == tf.contrib.learn.ModeKeys.EVAL:
    return self._decode_infer(decoder, bridge, encoder_output, features,
                              labels)
```

it raises an `InvalidArgumentError`:

```
tensorflow.python.framework.errors_impl.InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [672,7063] and labels shape [1152]
```

It's hard to know why it does that. Does anybody have an idea?
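One plausible reading of that error (the batch size of 96 below is only a guess that happens to divide both numbers): with `TrainingHelper` the decoder runs for exactly as many steps as there are target tokens, but a sampling helper stops at EOS (or at a maximum length), so the flattened logits and labels no longer line up:

```python
# Hypothetical numbers consistent with the error message:
# logits shape [672, 7063] vs labels shape [1152].
batch = 96    # an assumed batch size that divides both 672 and 1152
vocab = 7063

decoded_len = 672 // batch   # 7 steps produced by greedy decoding
target_len = 1152 // batch   # 12 steps in the ground-truth labels

logits_shape = (batch * decoded_len, vocab)  # (672, 7063)
labels_shape = (batch * target_len,)         # (1152,)
print(logits_shape, labels_shape)

# The token-level cross-entropy loss requires one logit row per label,
# so decoding a different number of steps than the targets makes the
# first dimensions disagree.
```

If this is the cause, possible directions are padding/cropping the predictions to the target length before the loss, or skipping the token-level loss entirely when decoding without the ground truth and reporting a sequence-level metric (e.g. BLEU) instead.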
Hello!
I'm trying to calculate BLEU in two ways:
I expect equal results, because the checkpoint and the dev set are identical in 1. and 2. But for some reason I get very different results: 3.03 (train eval) vs 2.26 (inference).
Here is my train script:
And infer eval script:
https://github.com/google/seq2seq/blob/master/seq2seq/metrics/bleu.py#L62 - this is the file where the BLEU calculation is performed at the train stage. I tried dumping the hypothesis and reference files, and the BLEU computed at train time looks correct.
I suspect there may be some issue with model initialization in infer mode, because the model output in train eval and the model output in infer are different.
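One way to narrow this down is to dump the hypothesis files produced by both paths for the same checkpoint and dev set, and find the first line where they diverge (the file names here are assumptions, substitute your own dumps):

```python
def first_divergence(path_a, path_b):
    """Return (line_number, line_a, line_b) for the first differing line,
    or None if the files match line for line."""
    with open(path_a) as fa, open(path_b) as fb:
        for i, (a, b) in enumerate(zip(fa, fb), start=1):
            if a.strip() != b.strip():
                return i, a.strip(), b.strip()
    return None

# Example usage (hypothetical file names):
# diff = first_divergence("hypotheses_train_eval.txt", "hypotheses_infer.txt")
# print(diff)
```

If the hypotheses already differ, the problem is in decoding (e.g. which variables or helper the infer graph uses), not in the BLEU computation; if they match, the discrepancy is in how the two paths score the outputs.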