wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

The large bleu gap in validation and test dataset #22

Closed LeeSureman closed 3 years ago

LeeSureman commented 3 years ago

I find that there is a large gap between the validation BLEU and the test BLEU. For example, on the Java language, the best validation BLEU is about 7, but the test BLEU is about 18+. Why? (The validation BLEU is reported by fairseq-train, and the test BLEU is computed by the separate evaluation script.)

LeeSureman commented 3 years ago

Did you observe similar validation and test BLEU results in your experiments?

wasiahmad commented 3 years ago

Yes, we observed the same. I suspect the fairseq validation BLEU is not accurate. We thought something might be going wrong during detokenization at validation time, but we used --eval-bleu-print-samples to inspect the outputs, and they looked okay. So we could not determine why the validation BLEU score is so low.
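One thing worth keeping in mind when comparing the two numbers: BLEU is a geometric mean of n-gram precisions, so it is very sensitive to which token stream it is computed over (subword units vs. detokenized words), and a single token mismatch already costs several n-grams. Below is a minimal corpus-BLEU sketch in pure Python (uniform 4-gram weights, single reference, no smoothing; this is an illustration, not fairseq's or sacrebleu's exact implementation) showing how one substituted token in a 10-token sequence drops the score from 100 to about 71:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus BLEU with uniform n-gram weights, single reference, no smoothing."""
    clipped = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # total hypothesis n-grams, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            clipped[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += max(len(hyp) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0  # some n-gram order has zero matches (no smoothing)
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

# Hypothetical Java-like token sequences for illustration:
ref = "public int add ( int a , int b )".split()
hyp = "public int sum ( int a , int b )".split()
print(round(corpus_bleu([ref], [ref]), 2))  # → 100.0 (identical)
print(round(corpus_bleu([hyp], [ref]), 2))  # → 70.71 (one token differs)
```

So if the validation loop scores a different tokenization of the hypotheses than the test script (e.g. subword pieces vs. detokenized output), the two BLEU numbers are not directly comparable even when the underlying predictions are identical.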

If you find the issue, feel free to report it.