Hi @shmsw25, in the evaluation script, `F1 bleu1`~`F1 bleu4` are computed: https://github.com/shmsw25/AmbigQA/blob/f0f17a2614808447660e98026a907bb38d1862b0/ambigqa_evaluate_script.py#L28
This makes me a bit confused, since the paper reports only a single `F1 BLEU`. Is the paper's `F1 BLEU` an average over all n-gram BLEUs? What is the correct way to compare the results from the evaluation script with the scores reported in the paper?
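For concreteness, here is a minimal sketch of the interpretation I have in mind. The function name and the simple macro-average are my own assumptions for illustration, not something taken from the script or the paper:

```python
# Hypothetical sketch, not from the repo: does the paper's single "F1 BLEU"
# correspond to the macro-average of the four per-n-gram F1 BLEU scores
# that the evaluation script prints?
def average_f1_bleu(f1_bleu_scores):
    """Average a list of per-n-gram F1 BLEU scores (e.g. F1 bleu1..bleu4)."""
    return sum(f1_bleu_scores) / len(f1_bleu_scores)

# Example: plug in the four numbers printed by ambigqa_evaluate_script.py
# print(average_f1_bleu([f1_bleu1, f1_bleu2, f1_bleu3, f1_bleu4]))
```

If the paper's number is computed differently (e.g. only `F1 bleu4`, or some weighted combination), it would be great to know which one to report.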