Hi @shmsw25, in the evaluation script, `F1 bleu1`~`F1 bleu4` are computed: https://github.com/shmsw25/AmbigQA/blob/f0f17a2614808447660e98026a907bb38d1862b0/ambigqa_evaluate_script.py#L28
This makes me a bit confused, since the paper reports only a single `F1 BLEU`. Is the paper's `F1 BLEU` an average over all n-gram BLEUs? What is the correct way to compare the results from the evaluation script with the scores reported in the paper?
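For concreteness, here is a minimal sketch of the interpretation I have in mind. The function name and the simple macro-average are my own assumptions for illustration, not something taken from the script or the paper:

```python
# Hypothetical sketch, not from the repo: does the paper's single "F1 BLEU"
# correspond to the macro-average of the four per-n-gram F1 BLEU scores
# that the evaluation script prints?
def average_f1_bleu(f1_bleu_scores):
    """Average a list of per-n-gram F1 BLEU scores (e.g. F1 bleu1..bleu4)."""
    return sum(f1_bleu_scores) / len(f1_bleu_scores)

# Example: plug in the four numbers printed by ambigqa_evaluate_script.py
# print(average_f1_bleu([f1_bleu1, f1_bleu2, f1_bleu3, f1_bleu4]))
```

If the paper's number is computed differently (e.g. only `F1 bleu4`, or some weighted combination), it would be great to know which one to report.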