thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

Cannot Download wmt21 en2zh test data #116

Open Pzzzzz5142 opened 2 years ago

Pzzzzz5142 commented 2 years ago

Here is my mtdata.recipes.wmt22-constrained.yaml config:


- id: wmt22-zhen-t
  langs: zho-eng
  desc: WMT 22 General MT
  url: https://www.statmt.org/wmt22/translation-task.html
  dev:
  test:
    - Statmt-newstest_enzh-2021-eng-zho
  train:

When I download the test set using the following command,

mtdata get-recipe -ri wmt22-zhen-t -o .

it raises an error. Here is the error log:

2022-06-07 15:19:36 data.add_parts_sequential:329 ERROR:: Unable to add Statmt-newstest_enzh-2021-eng-zho: /Users/pzzzzz/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.en-zh.xml has unequal number of segs: 1845 == 2847?

It seems that the 2021 en2zh test set has multiple reference sentences for each source sentence, so the assert statement raises the error above.

The code that causes this issue is at sgm.py line 79.

srcs = list(xpath_all(tree.getroot(), xpath=".//src//seg"))
tgts = list(xpath_all(tree.getroot(), xpath=".//ref//seg"))
assert len(srcs) == len(tgts), f'{data} has unequal number of segs: {len(srcs)} == {len(tgts)}?'
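
For illustration, the mismatch can be reproduced on a toy document. This is a minimal sketch, not mtdata's code: the XML layout below (a doc holding one src and several ref elements tagged with a translator attribute) is an assumption modeled on the XPath expressions above, and the sentences are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy file: one document, two source segments, two translators.
TOY_XML = """<dataset>
  <doc id="d1">
    <src lang="en"><p><seg id="1">Hello.</seg><seg id="2">Bye.</seg></p></src>
    <ref lang="zh" translator="A"><p><seg id="1">你好。</seg><seg id="2">再见。</seg></p></ref>
    <ref lang="zh" translator="B"><p><seg id="1">您好。</seg><seg id="2">再会。</seg></p></ref>
  </doc>
</dataset>"""

root = ET.fromstring(TOY_XML)

# Same selection as ".//src//seg" and ".//ref//seg", written with iter().
srcs = [seg for src in root.iter("src") for seg in src.iter("seg")]
tgts = [seg for ref in root.iter("ref") for seg in ref.iter("seg")]

# Every extra translator adds another full set of ref segments,
# so a plain length comparison no longer holds.
print(len(srcs), len(tgts))
```

With two translators the ref count is twice the src count, which is the kind of inequality the assert rejects.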

khayrallah commented 2 years ago

Just wanted to note that this affects more than just enzh. German-English is also affected when using the default scripts provided by WMT.

thammegowda commented 2 years ago

Thanks for reporting this. Sorry for the delay; I was on vacation and away from GitHub. I will try to fix this issue soon and release a new version.

thammegowda commented 2 years ago

Thanks, @khayrallah, for the pointer!

You are right: WMT21 test refs have multiple translators, unlike previous years.

What is causing the delay is that not all files have multiple refs, and when we do have multiple refs, not all translators translate every segment. I will need a bit more time to fix it properly.

$ for i in ~/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.*xml;
  do basename "$i"; grep -o 'translator="[^"]*"' "$i" | sort | uniq -c; done

newstest2021.cs-en.xml
    167 translator="A"
     62 translator="B"
newstest2021.de-en.xml
     67 translator="A"
     61 translator="B"
newstest2021.de-fr.xml
     61 translator="A"
newstest2021.en-cs.xml
    201 translator="A"
     68 translator="B"
newstest2021.en-de.xml
     74 translator="A"
     68 translator="C"
     68 translator="D"
newstest2021.en-ha.xml
   3524 translator="A"
newstest2021.en-is.xml
     65 translator="A"
newstest2021.en-ja.xml
     65 translator="A"
newstest2021.en-ru.xml
     77 translator="A"
     68 translator="B"
newstest2021.en-zh.xml
     77 translator="A"
     68 translator="B"
newstest2021.fr-de.xml
     74 translator="A"
newstest2021.ha-en.xml
   3559 translator="A"
newstest2021.is-en.xml
     47 translator="A"
newstest2021.ja-en.xml
     81 translator="A"
newstest2021.ru-en.xml
    116 translator="A"
    107 translator="B"
newstest2021.zh-en.xml
    165 translator="A"
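
One way to cope with uneven coverage is to pair segments per translator by segment id and skip whatever a translator did not translate. The following is a rough sketch under stated assumptions, not the fix that went into mtdata: the function name pairs_by_translator, the toy XML, and the reliance on a seg id attribute are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical toy file: translator B skips segment 2 of d1 and all of d2.
TOY_XML = """<dataset>
  <doc id="d1">
    <src lang="en"><p><seg id="1">Hello.</seg><seg id="2">Bye.</seg></p></src>
    <ref lang="zh" translator="A"><p><seg id="1">你好。</seg><seg id="2">再见。</seg></p></ref>
    <ref lang="zh" translator="B"><p><seg id="1">您好。</seg></p></ref>
  </doc>
  <doc id="d2">
    <src lang="en"><p><seg id="1">Thanks.</seg></p></src>
    <ref lang="zh" translator="A"><p><seg id="1">谢谢。</seg></p></ref>
  </doc>
</dataset>"""


def pairs_by_translator(root):
    """Return {translator: [(src_text, ref_text), ...]} aligned by seg id."""
    out = defaultdict(list)
    for doc in root.iter("doc"):
        # Source segments of this document, keyed by their id attribute.
        src_segs = {
            seg.get("id"): seg.text or ""
            for src in doc.iter("src")
            for seg in src.iter("seg")
        }
        for ref in doc.iter("ref"):
            translator = ref.get("translator", "")
            ref_segs = {seg.get("id"): seg.text or "" for seg in ref.iter("seg")}
            for seg_id, src_text in src_segs.items():
                # Keep only segments this translator actually translated.
                if seg_id in ref_segs:
                    out[translator].append((src_text, ref_segs[seg_id]))
    return dict(out)


pairs = pairs_by_translator(ET.fromstring(TOY_XML))
for translator, items in pairs.items():
    print(translator, len(items))
```

Aligning inside each doc keeps segment ids from colliding across documents, and each translator ends up with its own parallel set whose two sides always have equal length.
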

khayrallah commented 2 years ago

Thanks for the update! It might be a good idea to add a note on the main WMT page, since it links to mtdata as the way to download the WMT data.

thammegowda commented 2 years ago

Thanks for the suggestion! I have sent a pull request to the WMT22 page. Once it is merged, a note will appear under the “limitations” section.