I am reproducing the results reported in GLGE, but I find that the SquadQG evaluation script seems to use the wrong tokenization.

In /script/evaluate/qg/eval_on_unilm_qg.py, the generated text is post-processed by `fix_tokenization`: https://github.com/microsoft/ProphetNet/blob/0a1b59cb95783319b7b58ede65b768587dc49daf/GLGE_baselines/script/script/evaluate/qg/eval_on_unilm_qg.py#L40-L117

For example, it turns `. . .` into `...`, `"` into `''`, and `1 , 000` into `1,000`.
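For concreteness, here is a minimal Python sketch of the kind of rewrites involved. This is *not* the actual `fix_tokenization` (the real implementation is at the linked L40-L117); it only reproduces the three examples above, and the function name is mine:

```python
import re

def fix_tokenization_sketch(text: str) -> str:
    # Illustrative only -- NOT the GLGE implementation.
    text = text.replace(". . .", "...")            # collapse spaced ellipsis
    text = text.replace('"', "''")                 # straight quote -> PTB-style quote
    text = re.sub(r"(\d) , (\d)", r"\1,\2", text)  # "1 , 000" -> "1,000"
    return text

print(fix_tokenization_sketch("What city in Montana has over 100 , 000 people ?"))
# -> What city in Montana has over 100,000 people ?
```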
However, the original data do not look like text after `fix_tokenization`. Here are some samples from the test set:

- What did Harff define as " short - lived outbursts by mobs . . . ? "
- Who sang " Girls Love Beyoncé " in 2013 ?
- What city in Montana has over 100 , 000 people ?
Moreover, I reproduced MASS-base and found that the results are higher if `fix_tokenization` is disabled:
| | BLEU | METEOR | ROUGE-L |
| --- | --- | --- | --- |
| MASS-base reported in GLGE | 20.1 | 24.4 | 49.4 |
| MASS-base reproduced with `fix_tokenization` | 20.69 | 24.92 | 49.21 |
| MASS-base reproduced without `fix_tokenization` | 22.54 | 25.03 | 50.27 |
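The direction of the gap is what one would expect from a tokenization mismatch: the references stay in the raw tokenized form, so any hypothesis rewritten by `fix_tokenization` loses n-gram matches. A toy check (using `nltk`'s sentence-level BLEU here, not the GLGE evaluation scripts) illustrates this:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Reference kept in the raw tokenized form, as in the test-set samples above.
ref = "What city in Montana has over 100 , 000 people ?".split()
raw_hyp = "What city in Montana has over 100 , 000 people ?".split()
fixed_hyp = "What city in Montana has over 100,000 people ?".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([ref], raw_hyp, smoothing_function=smooth))    # 1.0 -- exact match
print(sentence_bleu([ref], fixed_hyp, smoothing_function=smooth))  # < 1.0 -- n-grams around "100,000" no longer match
```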
I wonder whether I am missing something, or whether the reported results use the wrong tokenization?
I also hope that, if possible, the model outputs can be released to support fair and detailed comparisons.
Thanks for the great work.
Looking forward to your reply.