Hi, thanks for organizing this workshop!
I believe there is an error in the "Final results (F1 | EM)" table for the XOR QA baselines, specifically for baseline (1): multilingual DPR + multilingual seq2seq (CORA without iterative training).
https://github.com/mia-workshop/MIA-Shared-Task-2022/blob/02ed407f6b0891373446d6844e50c66a54a32a46/README.md?plain=1#L239-L250
The Macro-Average EM for (1) should be 29.1, not 26.8; 26.8 is the BLEU score.
I checked the numbers using:
```bash
python mia-organizer/eval_scripts/eval_xor_full.py \
  --data_file mia-organizer/data/eval/mia_2022_dev_xorqa.jsonl \
  --pred_file data/baseline1_mdpr_mgen/baseline2_xor_dev_results.json
```
yielding the results:
```
avg f1: 38.87920157968395
avg em: 29.10566839259774
avg bleu: 26.76662074445292
```
Averaging the per-language rows in that column also gives 29.1.
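For completeness, the macro-average here is just the unweighted mean of the per-language EM column. A minimal sketch of that check in Python, using hypothetical placeholder values rather than the table's actual numbers:

```python
# Hypothetical per-language EM scores; substitute the real per-language
# EM column from the README's "Final results (F1 | EM)" table for
# baseline (1) to reproduce the 29.1 macro-average.
per_language_em = {
    "ar": 32.0,  # placeholder
    "bn": 25.0,  # placeholder
    "fi": 30.0,  # placeholder
    # ... remaining XOR QA languages from the table
}

# Macro-average: unweighted mean over languages.
macro_avg_em = sum(per_language_em.values()) / len(per_language_em)
print(f"Macro-average EM: {macro_avg_em:.1f}")
```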
Oh, that's a great catch! You're right. I'll fix the README, thanks!