Open Pzzzzz5142 opened 2 years ago
Just wanted to note that this affects more than just en-zh. German-English is also affected when using the default scripts provided by WMT.
Thanks for reporting this. Sorry for the delay; I was on vacation and away from GitHub. I will try to fix this issue soon and release a new version.
Thanks, @khayrallah for the pointer!
You are right: the WMT21 test references have multiple translators, unlike in previous years.
What is causing the delay is that not all files have multiple refs, and when we do have multiple refs, not all translators translate every segment. I will need a bit more time to fix it properly.
$ for i in ~/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.*xml;
do basename $i; grep -o 'translator="[^"]*"' $i | sort | uniq -c ; done
newstest2021.cs-en.xml
167 translator="A"
62 translator="B"
newstest2021.de-en.xml
67 translator="A"
61 translator="B"
newstest2021.de-fr.xml
61 translator="A"
newstest2021.en-cs.xml
201 translator="A"
68 translator="B"
newstest2021.en-de.xml
74 translator="A"
68 translator="C"
68 translator="D"
newstest2021.en-ha.xml
3524 translator="A"
newstest2021.en-is.xml
65 translator="A"
newstest2021.en-ja.xml
65 translator="A"
newstest2021.en-ru.xml
77 translator="A"
68 translator="B"
newstest2021.en-zh.xml
77 translator="A"
68 translator="B"
newstest2021.fr-de.xml
74 translator="A"
newstest2021.ha-en.xml
3559 translator="A"
newstest2021.is-en.xml
47 translator="A"
newstest2021.ja-en.xml
81 translator="A"
newstest2021.ru-en.xml
116 translator="A"
107 translator="B"
newstest2021.zh-en.xml
165 translator="A"
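For anyone who needs a workaround in the meantime, here is a minimal sketch of how the two cases described above (several translators in one file, and translators who skip some segments) could be handled when reading the XML directly. This is not mtdata's code. The element layout (`doc` containing `src` and `ref translator="..."`, each holding `seg id="..."` elements) is my assumption about the WMT21 format, and `SAMPLE` and `read_segments` are made-up names for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file; the real WMT21 layout is assumed to look like this.
SAMPLE = """<dataset id="demo">
  <doc id="doc1" origlang="en">
    <src lang="en"><p><seg id="1">Hello.</seg><seg id="2">Bye.</seg></p></src>
    <ref lang="de" translator="A"><p><seg id="1">Hallo.</seg><seg id="2">Servus.</seg></p></ref>
    <ref lang="de" translator="B"><p><seg id="1">Guten Tag.</seg></p></ref>
  </doc>
</dataset>"""


def read_segments(xml_text):
    """Return (sources, aligned_refs).

    aligned_refs maps each translator name to a list with one entry per
    source segment; a segment the translator did not cover is None.
    """
    root = ET.fromstring(xml_text)
    keys, sources = [], []
    refs = {}  # translator -> {(doc_id, seg_id): text}
    for doc in root.iter("doc"):
        doc_id = doc.get("id")
        src = doc.find("src")
        if src is None:
            continue  # nothing to align against in this document
        for seg in src.iter("seg"):
            keys.append((doc_id, seg.get("id")))
            sources.append(seg.text or "")
        for ref in doc.findall("ref"):
            table = refs.setdefault(ref.get("translator", ""), {})
            for seg in ref.iter("seg"):
                table[(doc_id, seg.get("id"))] = seg.text or ""
    # Align every translator's segments to the source order, padding gaps with None.
    aligned = {name: [table.get(k) for k in keys] for name, table in refs.items()}
    return sources, aligned


sources, aligned = read_segments(SAMPLE)
print(aligned["B"])  # ['Guten Tag.', None]
```

Keying on the document id plus segment id, rather than on position, is what keeps a partially translated reference lined up with the source.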
Thanks for the update! It might be a good idea to add a note on the main WMT page, since it links to mtdata as the way to download the WMT data.
Thanks for the suggestion! I have sent a pull request to the WMT22 page. Once it is merged, there will be a note under the “limitations” section.
Here is my mtdata.recipes.wmt22-constrained.yaml config.
When I download the test set using the following command,
it raises an error. Here is the error log:
2022-06-07 15:19:36 data.add_parts_sequential:329 ERROR:: Unable to add Statmt-newstest_enzh-2021-eng-zho: /Users/pzzzzz/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.en-zh.xml has unequal number of segs: 1845 == 2847?
It seems that the 2021 en-zh test set has multiple reference sentences for each source sentence, so the assert statement raises the error above.
The code that causes this issue is at sgm.py, line 79.
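To illustrate how such a mismatch can arise, here is a small sketch that counts source and reference segments the naive way, pooling all `ref` elements together. It is not the code from sgm.py, which I have not reproduced here; the XML layout, `SAMPLE`, and `count_segs` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature file with two translators; the layout is assumed.
SAMPLE = """<dataset id="demo">
  <doc id="doc1" origlang="en">
    <src lang="en"><p><seg id="1">One.</seg><seg id="2">Two.</seg></p></src>
    <ref lang="zh" translator="A"><p><seg id="1">a1</seg><seg id="2">a2</seg></p></ref>
    <ref lang="zh" translator="B"><p><seg id="1">b1</seg></p></ref>
  </doc>
</dataset>"""


def count_segs(xml_text):
    """Return (source count, pooled reference count, per-translator counts)."""
    root = ET.fromstring(xml_text)
    n_src = sum(1 for src in root.iter("src") for _ in src.iter("seg"))
    per_translator = Counter()
    for ref in root.iter("ref"):
        per_translator[ref.get("translator", "")] += sum(1 for _ in ref.iter("seg"))
    n_ref = sum(per_translator.values())
    return n_src, n_ref, dict(per_translator)


n_src, n_ref, by_translator = count_segs(SAMPLE)
print(n_src, n_ref, by_translator)  # 2 3 {'A': 2, 'B': 1}
```

With references pooled across translators, an equality check between the two counts fails as soon as a second translator appears, even though each translator's own count never exceeds the source count.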