mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0
1.03k stars 162 forks source link

Check if all docs have domain attribute #267

Closed jkawamoto closed 1 month ago

jkawamoto commented 2 months ago

I looked through the newest CI errors and could reproduce them locally. The failure of test_dataset::test_process_to_text occurs when wmt22 is chosen (since this test randomly selects 10 datasets, the failure doesn’t always happen). The failure is because the domain files are shorter than the other files in some language pairs. For example, ru-en has this problem:

$ pwd
~/.sacrebleu/wmt22

$ wc -l wmt22.ru-en.domain wmt22.ru-en.src wmt22.ru-en.ALMAnaCH-Inria
    1512 wmt22.ru-en.domain
    2016 wmt22.ru-en.src
    2016 wmt22.ru-en.ALMAnaCH-Inria
    5544 total

Then, I checked the original file at https://github.com/wmt-conference/wmt22-news-systems/blob/main/xml/wmttest2022.ru-en.all.xml, and found that not all doc elements have the domain attribute.

_unwrap_wmt21_or_later should check if the length of domains matches the lengths of the other data.

martinpopel commented 2 months ago

Thanks for finding the source of the CI errors. I agree it is our priority to fix them, so we should perhaps merge this as soon as possible. That said, I think I have fixed this already in March in a more general way. Unfortunately, I didn't have time to polish it into a PR and since then @mjpost pushed another solution (not so general) and I have forgotten all the details. I am still too busy to look at this, but if there is a volunteer, please try to merge (and review) https://github.com/martinpopel/sacreBLEU/tree/xml-domains

jkawamoto commented 2 months ago

Thank you for the detailed explanation. I will review the branch you have been working on.

martinpopel commented 1 month ago

Thank you for merging my commits. Please, fix also the failing test, i.e. add domain=ALL to the expected output.

jkawamoto commented 1 month ago

I'm trying to fix the errors, but it might take some time.

martinpopel commented 1 month ago

I would hope it is a matter of adding domain=ALL to the two lines here and here (or alternatively prevent the printing). Unfortunately, I am not capable to try it now.

jkawamoto commented 1 month ago

Thank you for your advice. I was able to pass the tests by adding domain=ALL to the four lines and removing a couple of lines. I’m not sure if removing those lines is acceptable. I’ll look for another way to solve the CI failures.

jkawamoto commented 1 month ago

The CI errors are fixed. Could you please review this PR?

martinpopel commented 1 month ago

Thank you very much, once again. The last commits were exactly the missing pieces which prevented me to finish the PR in March. I am happy I could merge it into the master now.

jkawamoto commented 1 month ago

My pleasure. I’m also glad that this PR was finally merged.