tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

Lower loss does not necessarily imply high BLEU? #903

Closed lkluo closed 6 years ago

lkluo commented 6 years ago

I trained a model with millions of sentence pairs and tested it on thousands of held-out sentence pairs. Then I happened to continue the training on the test data (in effect over-fitting the model to the test data only). The loss continued to decrease to almost zero (0.02). However, I obtained a lower BLEU on the test data (also used as training data for the new model) compared to the initial model, no matter what beam size was chosen. This is not a problem directly related to tensor2tensor, but it somewhat puzzled me. Is it the auto-regressive decoding that matters?

martinpopel commented 6 years ago

In general, lower training loss does not necessarily imply higher test-set BLEU. There are several possible causes:

However, your case seems strange. Training on the test data (only?) should result in overfitting, and your BLEU should be very high (a decent T2T model should learn to replicate the training-data translations, unless you use very high dropout or regularization, or too low max_length, or unless the training diverged). I suggest you measure the real BLEU (with t2t-translate-all) for the train/test set and also inspect the translated sentences. What are the differences from the reference translations in the training data? This way you should at least find out which of the three causes are relevant in your case.
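For completeness, a minimal sketch of doing the same check from Python (the file names are placeholders; if I remember the helper correctly, bleu_wrapper returns BLEU on a 0-1 scale, so multiply by 100 to match what t2t-bleu prints):

```python
# Sketch: score an already-decoded translation file against its reference.
# File names below are placeholders; t2t-bleu wraps the same bleu_hook utilities.
from tensor2tensor.utils import bleu_hook

bleu = bleu_hook.bleu_wrapper("references_T1.txt",    # one reference per line
                              "translations_T1.txt",  # decoded output, same order
                              case_sensitive=False)
print("BLEU = %.2f" % (100 * bleu))
```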

lkluo commented 6 years ago

@martinpopel: Thanks, Martin, for the quick response. Allow me to provide more information about the experiment.

  1. I trained a model (M1) on a large corpus; training was stopped when the loss was around 1.0. I have a test set (T1) with thousands of sentence pairs. The BLEU for T1 under M1 is, e.g., 48.
  2. I continued training on T1 only, restoring the checkpoint from M1. After a few steps the loss dropped to 0.02, yielding a new model M2. The BLEU for T1 under M2 is 43. I used the transformer_big_single_gpu setting (same as M1), which implies a dropout rate of 0.1.
  3. All BLEU scores were computed with t2t-bleu.
  4. I compared the translated sentences with the references and found that some "bad" translations actually express the basic meaning of the source sentences (or differ from the reference only in word choice). One more finding was that the translations from M1 and M2 are more or less the same; that is, M2 does not over-fit T1 even though it was trained on T1 and the loss converged to almost zero. (A small comparison sketch follows this list.)
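A small sketch of the kind of comparison I did, assuming the source, reference, M1 output and M2 output are stored line-aligned in plain-text files (all file names are hypothetical):

```python
# Print source, reference and the two model outputs side by side, showing
# only the sentences where M1 and M2 disagree. File names are placeholders.
files = ["T1.source.txt", "T1.reference.txt", "T1.m1_output.txt", "T1.m2_output.txt"]
handles = [open(f, encoding="utf-8") for f in files]

for i, lines in enumerate(zip(*handles), start=1):
    src, ref, m1, m2 = (line.strip() for line in lines)
    if m1 != m2:
        print("#%d" % i)
        print("  SRC:", src)
        print("  REF:", ref)
        print("  M1 :", m1)
        print("  M2 :", m2)
```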

I expected that the BLEU for T1 under M2 would be far greater than under M1, since, as you mentioned, the Transformer should be able to replicate the training-data translations.

martinpopel commented 6 years ago

This is strange; I have no clear explanation, just some thoughts. 47.5 and 48 are very close, and both are quite high (for some language pairs and domains, a second human reference may get BLEU < 30). What are the differences between the MT output and the reference? Maybe the translations are just too short (perhaps because of max_length), so it is mostly the brevity penalty that keeps the BLEU below 100. This could be one explanation.
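For reference, a minimal sketch of the brevity penalty as defined for BLEU (the standard definition, not code taken from t2t-bleu): it is 1 when the hypotheses are at least as long as the references and exp(1 - ref_len/hyp_len) otherwise, so systematically short outputs lower BLEU even when every produced n-gram matches.

```python
import math

def brevity_penalty(ref_length, hyp_length):
    """Standard BLEU brevity penalty over total corpus lengths (in tokens)."""
    if hyp_length >= ref_length:
        return 1.0
    return math.exp(1.0 - float(ref_length) / hyp_length)

# Example: hypotheses 10% shorter than the references already cost ~10% of the score.
print(brevity_penalty(ref_length=1000, hyp_length=900))  # ~0.895
```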

Another explanation: the "thousands of sentences" in T1 are noisy, i.e. sometimes the same source has different translations, so it is impossible to overfit enough to fully replicate the training data. However, in this case I would expect a higher training loss than 0.02 (but I am not sure: with the teacher-forced, non-autoregressive way of computing the loss, a single word from the reference translation in the gold prefix may be enough to disambiguate which of the possible reference translations is meant, so for the rest of the sentence the loss is almost zero).
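As a back-of-the-envelope illustration (assuming the reported 0.02 is the average per-token cross-entropy): under teacher forcing it only means the model puts roughly exp(-0.02), about 98%, probability on each reference token given the gold prefix; it says nothing about the sequences produced when the decoder has to feed back its own predictions during beam search.

```python
import math

# If 0.02 is the average per-token cross-entropy computed with teacher forcing,
# the model assigns roughly this probability to each reference token when it is
# conditioned on the gold prefix (not on its own previous predictions).
avg_token_loss = 0.02
print(math.exp(-avg_token_loss))  # ~0.980
```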

lkluo commented 6 years ago

@martinpopel: Sorry for the typo: the M2 BLEU should be around 43 when the loss approaches 0.02. I appreciate your explanations.

Yes, it is strange. I may have to repeat the experiment in case I took some inappropriate step. Indeed, T1 is noisy in that the training data contain different translations of the same source sentences. This could be one of the root causes.

lkluo commented 6 years ago

I think I have found the root cause: the data fed into the model during training were noisy (i.e., the target sentences were not exactly the references). The model did overfit and did replicate the training data (unluckily, the noisy data I mistakenly used). It was a silly mistake.

The explanations regarding training loss and actual BLEU are quite helpful, thanks @martinpopel.