Closed. lkluo closed this issue 6 years ago.

I trained a model with millions of sentence pairs, and tested the model on thousands of testing sentences. Then I happened to continue the training with the testing data (sort of over-fitting the model to the testing data only), and the loss continued to decrease to almost zero (0.02). However, I obtained a lower BLEU on the testing data (also used as training data for the new model) compared to the initial model, no matter what beam size was chosen. This is not a problem directly related to tensor2tensor, but it somewhat puzzles me. Is it the auto-regressive decoding that matters?
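For concreteness: in T2T, "continuing the training" is typically just re-running t2t-trainer with --output_dir pointed at the existing checkpoint directory. A minimal sketch; the paths, problem name and step count are placeholder assumptions, not the actual values used here:

```bash
# Hypothetical paths/names; adjust to your setup.
# Pointing --output_dir at a directory that already contains checkpoints
# makes t2t-trainer resume from the latest one, so training continues
# on whatever data is currently in --data_dir (here: the test set).
t2t-trainer \
  --data_dir=$TEST_DATA_DIR \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_big_single_gpu \
  --output_dir=$TRAIN_DIR \
  --train_steps=250000
```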
In general, lower training loss does not necessarily imply higher test-set BLEU; there are several possible causes.
However, your case seems strange. Training on the test data (only?) should result in overfitting and your BLEU should be very high (a decent T2T model should learn to replicate the training data translations, unless using very high dropout or regularization, or too low max_length, or unless the training diverged).
I suggest you measure the real BLEU (with t2t-translate-all) for the train and test sets, and also inspect the translated sentences. What are the differences from the reference translations in the training data? This way you should at least see which of the possible causes are relevant for your case.
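For reference, the real BLEU can be measured with the two-step decode-then-score pipeline below (t2t-translate-all automates the decoding step over all checkpoints in a directory). A minimal sketch; the paths, problem name and hparams set are placeholder assumptions:

```bash
# Hypothetical paths/names. Decode the plain-text source file with the
# trained model, then score the output against the reference with t2t-bleu
# (this is the "real" BLEU, unlike the approx_bleu reported during training).
t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_big_single_gpu \
  --output_dir=$TRAIN_DIR \
  --decode_hparams="beam_size=4,alpha=0.6" \
  --decode_from_file=test.src \
  --decode_to_file=translation.out

t2t-bleu --translation=translation.out --reference=test.ref
```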
@martinpopel: Thanks Martin, for the quick response. Allow me to provide more information about the experiment:

- M1: the initial model, trained on millions of sentence pairs with the transformer_big_single_gpu setting.
- M2: M1 further trained on the test set T1 (thousands of sentences) with the same setting (same as M1), implying a dropout rate of 0.1 (see the dropout note below).
- BLEU was measured with t2t-bleu. It is expected that the BLEU for T1 based on M2 should be far greater than that of M1, as you mentioned that the Transformer should be able to replicate the training data translations.
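A side note on the dropout point above: if the goal is to overfit and replicate the training data, the dropouts can be switched off via hparams overrides. A hedged sketch; the hparams names below are the standard T2T transformer ones, everything else is a placeholder:

```bash
# Hypothetical invocation: disable the main dropouts so nothing fights
# the intended overfitting. These hparam names come from T2T's transformer
# hparams sets; paths and problem name are placeholders.
t2t-trainer \
  --data_dir=$TEST_DATA_DIR \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_big_single_gpu \
  --hparams="layer_prepostprocess_dropout=0.0,attention_dropout=0.0,relu_dropout=0.0" \
  --output_dir=$TRAIN_DIR
```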
This is strange; I have no clear explanation, just some thoughts: 47.5 and 48 are very close, and both are quite high (for some language pairs and domains, even a second human reference may score BLEU < 30). What are the differences between the MT output and the reference? Maybe the translations are just too short (perhaps because of max_length), so it is mostly the brevity penalty which keeps the BLEU below 100. This could be one explanation.
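To make the brevity-penalty effect concrete: BLEU multiplies the n-gram precision by BP = exp(1 - r/c) whenever the candidate length c falls below the reference length r, so output that is 10% too short is scaled by exp(1 - 1/0.9) ≈ 0.89 even if every n-gram matches. A quick length check (file names are placeholder assumptions):

```bash
# Hypothetical file names. A token-count ratio well below 1.0 means the
# brevity penalty, not translation quality, is dragging BLEU down.
hyp=$(awk '{n += NF} END {print n}' translation.out)
ref=$(awk '{n += NF} END {print n}' test.ref)
echo "length ratio (hyp/ref): $(echo "scale=3; $hyp / $ref" | bc)"
```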
Another explanation: the "thousands of sentences" in T1 are noisy, i.e. sometimes the same source has different translations, so it is impossible to overfit enough to fully replicate the training data. However, in this case I would expect a higher training loss than 0.02 (but I am not sure: with the non-autoregressive, teacher-forced way of computing the loss, a single word from the reference translation may be enough to disambiguate which of the possible reference translations should be used, so for the rest of the decoding the loss is almost zero).
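One quick way to quantify that kind of noise, assuming the training corpus lives in two parallel plain-text files (hypothetical names train.src/train.tgt):

```bash
# Count distinct source sentences that are paired with more than one
# distinct target. paste/cut use tabs by default; sort -u removes exact
# duplicate pairs first, so uniq -d reports only truly conflicting sources.
paste train.src train.tgt | sort -u | cut -f1 | uniq -d | wc -l
```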
@martinpopel: Sorry for the typo: the M2 BLEU should be around 43 when the loss approaches 0.02. I appreciate your explanations.
Yes, it is strange. I may have to repeat the experiment in case some step was done inappropriately. Indeed, T1 is noisy in that there are different translations for the same source sentences in the training data. This could be one of the root causes.
I think I have found the root cause: the data fed into the model training was noisy (i.e., the target sentences were not exactly the references). The model did overfit and replicated the training data, unluckily the noisy data I had mistakenly used. It was a silly mistake.
The explanation regarding training loss and actual BLEU was quite helpful, thanks @martinpopel.