microsoft / ProphetNet

A research project for natural language generation, containing the official implementations by MSRA NLC team.

Wrong Tokenization in SquadQG Evaluation Scripts #50

Open hzhwcmhf opened 2 years ago

hzhwcmhf commented 2 years ago

Thanks for the great work.

I am reproducing the results reported in GLGE, but I find that the SquadQG evaluation script seems to use the wrong tokenization.

In `/script/evaluate/qg/eval_on_unilm_qg.py`, the generated text is post-processed by `fix_tokenization`:

https://github.com/microsoft/ProphetNet/blob/0a1b59cb95783319b7b58ede65b768587dc49daf/GLGE_baselines/script/script/evaluate/qg/eval_on_unilm_qg.py#L40-L117

For example, it turns `. . .` into `...`, `"` into `''`, and `1 , 000` into `1,000`.
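
As a rough illustration (my own approximation, not the code in the repo; see the link above for the real logic), the rewrites behave like this:

```python
import re

# Rough approximation of a few rewrites from fix_tokenization in
# eval_on_unilm_qg.py; the actual function is linked above.
def fix_tokenization_sketch(text):
    text = text.replace(" . . . ", " ... ")        # spaced ellipsis -> "..."
    text = text.replace('"', "''")                 # double quote -> two apostrophes
    text = re.sub(r"(\d) , (\d)", r"\1,\2", text)  # "1 , 000" -> "1,000"
    return text

print(fix_tokenization_sketch('has over 100 , 000 people . . . " end'))
# -> has over 100,000 people ... '' end
```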

However, the original data does not look like the text after `fix_tokenization`. Here are some samples from the test set:

```
What did Harff define as " short - lived outbursts by mobs . . . ? "
Who sang " Girls Love Beyoncé " in 2013 ?
What city in Montana has over 100 , 000 people ?
```
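
Because the references keep these spaced tokens, applying the rewrites to the hypotheses makes the n-grams diverge. A small illustration with NLTK (just to show the effect; this is not the GLGE scoring code, and the smoothing differs):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# The reference keeps the spaced tokens; the "fixed" hypothesis does not,
# so n-gram overlap drops even though the text is otherwise identical.
reference = "What city in Montana has over 100 , 000 people ?".split()
hyp_raw   = "What city in Montana has over 100 , 000 people ?".split()
hyp_fixed = "What city in Montana has over 100,000 people ?".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], hyp_raw,   smoothing_function=smooth))  # 1.0
print(sentence_bleu([reference], hyp_fixed, smoothing_function=smooth))  # lower
```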

Moreover, I reproduced MASS-base and found that the results are higher if `fix_tokenization` is disabled:

| | BLEU | METEOR | ROUGE-L |
| --- | --- | --- | --- |
| MASS-base reported in GLGE | 20.1 | 24.4 | 49.4 |
| MASS-base reproduced with `fix_tokenization` | 20.69 | 24.92 | 49.21 |
| MASS-base reproduced without `fix_tokenization` | 22.54 | 25.03 | 50.27 |
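
To be clear, by "without `fix_tokenization`" I mean scoring the raw decoded lines instead of the rewritten ones, roughly like the sketch below (the flag and helper are only illustrative, not code from the evaluation script):

```python
# Illustrative ablation switch, not code from the repo: score either the raw
# decoded line or the line after the fix_tokenization-style rewrites.
def postprocess(line, apply_fix=False, fix_fn=None):
    line = line.strip()
    if apply_fix and fix_fn is not None:
        line = fix_fn(line)  # e.g. fix_tokenization from eval_on_unilm_qg.py
    return line

# "with fix_tokenization":    postprocess(line, apply_fix=True,  fix_fn=fix_tokenization)
# "without fix_tokenization": postprocess(line, apply_fix=False)
```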

I wonder whether I am missing something, or whether the reported results use the wrong tokenization. I also hope that, if possible, the model outputs can be released to support fair and detailed comparisons.

Looking forward to your reply.