Closed LeeSureman closed 3 years ago
Do you see similar validation and test BLEU results in your experiments?
Yes, we observed the same. I guess the fairseq validation score is not accurate. We suspected something might be going wrong during detokenization at validation time, but we used --eval-bleu-print-samples and checked the output, and it looked okay. So we couldn't reach any conclusion about why the validation BLEU score is so low.
If you find the issue, feel free to report it.
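For anyone who wants to probe the detokenization suspicion, here is a minimal sketch of how much the unit being scored can move the number. Everything in it is an assumption for illustration: `bleu` is a simplified, unsmoothed, single-sentence BLEU (not fairseq's scorer and not the evaluation script's), `remove_subword_markers` assumes SentencePiece-style pieces marked with `▁`, and the Java strings are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of length n in tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Single-sentence BLEU on whitespace tokens, no smoothing (sketch only)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty only applies when the hypothesis is not longer than the reference.
    brevity_penalty = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


def remove_subword_markers(text, marker="\u2581"):
    """Undo a SentencePiece-style segmentation (assumed marker: U+2581)."""
    return text.replace(" ", "").replace(marker, " ").strip()


# Hypothetical segmented outputs; not taken from any real run.
ref_pieces = "▁public ▁int ▁get Count ( ) ▁{ ▁return ▁count ; ▁}"
hyp_pieces = "▁public ▁int ▁get Size ( ) ▁{ ▁return ▁count ; ▁}"

score_on_pieces = bleu(hyp_pieces, ref_pieces)
score_on_words = bleu(remove_subword_markers(hyp_pieces),
                      remove_subword_markers(ref_pieces))
print(f"BLEU on subword pieces:   {score_on_pieces:.3f}")
print(f"BLEU on detokenized text: {score_on_words:.3f}")
```

The same prediction gets two different scores depending on whether it is scored on pieces or on detokenized text, so it may be worth confirming that validation and the test script score the same kind of string. This does not by itself explain the gap reported here.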
I find that there is a big gap between the validation BLEU and the test BLEU. For example, on Java, the best validation BLEU is about 7, but the test BLEU is about 18+. Why? (The validation BLEU is the one shown by fairseq-train, and the test BLEU is the one reported by the extra evaluation script.)