microsoft / ProphetNet

A research project for natural language generation, containing the official implementations by MSRA NLC team.

Wrong Tokenization in SquadQG Evaluation Scripts #50

Open hzhwcmhf opened 2 years ago

hzhwcmhf commented 2 years ago

Thanks for the great work.

I am reproducing the results reported in GLGE, but I find that the SquadQG evaluation script seems to use the wrong tokenization.

In `/script/evaluate/qg/eval_on_unilm_qg.py`, the generated text is post-processed by `fix_tokenization`:

https://github.com/microsoft/ProphetNet/blob/0a1b59cb95783319b7b58ede65b768587dc49daf/GLGE_baselines/script/script/evaluate/qg/eval_on_unilm_qg.py#L40-L117

For example, it turns `. . .` into `...`, `"` into `''`, and `1 , 000` into `1,000`.
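
As a rough illustration (my own approximation, not the code in the repo; see the link above for the real logic), the rewrites behave like this:

```python
import re

# Rough approximation of a few rewrites from fix_tokenization in
# eval_on_unilm_qg.py; the actual function is linked above.
def fix_tokenization_sketch(text):
    text = text.replace(" . . . ", " ... ")        # spaced ellipsis -> "..."
    text = text.replace('"', "''")                 # double quote -> two apostrophes
    text = re.sub(r"(\d) , (\d)", r"\1,\2", text)  # "1 , 000" -> "1,000"
    return text

print(fix_tokenization_sketch('has over 100 , 000 people . . . " end'))
# -> has over 100,000 people ... '' end
```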

However, the original data does not look like the text after `fix_tokenization`. Here are some samples from the test set:

```
What did Harff define as " short - lived outbursts by mobs . . . ? "
Who sang " Girls Love Beyoncé " in 2013 ?
What city in Montana has over 100 , 000 people ?
```
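
Because the references keep these spaced tokens, applying the rewrites to the hypotheses makes the n-grams diverge. A small illustration with NLTK (just to show the effect; this is not the GLGE scoring code, and the smoothing differs):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# The reference keeps the spaced tokens; the "fixed" hypothesis does not,
# so n-gram overlap drops even though the text is otherwise identical.
reference = "What city in Montana has over 100 , 000 people ?".split()
hyp_raw   = "What city in Montana has over 100 , 000 people ?".split()
hyp_fixed = "What city in Montana has over 100,000 people ?".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], hyp_raw,   smoothing_function=smooth))  # 1.0
print(sentence_bleu([reference], hyp_fixed, smoothing_function=smooth))  # lower
```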

Moreover, I reproduced MASS-base and found that the results are higher if `fix_tokenization` is disabled:

| | BLEU | METEOR | ROUGE-L |
| --- | --- | --- | --- |
| MASS-base reported in GLGE | 20.1 | 24.4 | 49.4 |
| MASS-base reproduced with `fix_tokenization` | 20.69 | 24.92 | 49.21 |
| MASS-base reproduced without `fix_tokenization` | 22.54 | 25.03 | 50.27 |
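
To be clear, by "without `fix_tokenization`" I mean scoring the raw decoded lines instead of the rewritten ones, roughly like the sketch below (the flag and helper are only illustrative, not code from the evaluation script):

```python
# Illustrative ablation switch, not code from the repo: score either the raw
# decoded line or the line after the fix_tokenization-style rewrites.
def postprocess(line, apply_fix=False, fix_fn=None):
    line = line.strip()
    if apply_fix and fix_fn is not None:
        line = fix_fn(line)  # e.g. fix_tokenization from eval_on_unilm_qg.py
    return line

# "with fix_tokenization":    postprocess(line, apply_fix=True,  fix_fn=fix_tokenization)
# "without fix_tokenization": postprocess(line, apply_fix=False)
```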

I wonder whether I am missing something, or whether the reported results use the wrong tokenization. I also hope that, if possible, the model outputs can be released to support fair and detailed comparisons.

Looking forward to your reply.