masakhane-io / masakhane-mt

Machine Translation for Africa
MIT License
278 stars 206 forks

Create test sets for all languages. #158

Open cdleong opened 3 years ago

cdleong commented 3 years ago

Following #157, check which languages are not covered in https://github.com/juliakreutzer/masakhane/tree/master/jw300_utils/test, and create custom test sets for those. @juliakreutzer I think I can give this a go, but do I need to open a pull request against... your forked version of masakhane-mt?

Alternate language code list, looks the same: https://opus.nlpl.eu/opusapi/?languages=True&corpus=JW300
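The coverage check described above comes down to a set difference between the JW300 language codes and the codes that already have a test set. A minimal sketch, assuming both lists have already been collected by hand or by script; the function name `uncovered_languages`, the `exclude` parameter, and the sample codes are illustrative and not part of the repository:

```python
def uncovered_languages(corpus_codes, covered_codes, exclude=("en",)):
    """Return corpus language codes that have no test set yet, sorted.

    `exclude` holds codes to skip regardless of coverage; English is
    skipped by default on the assumption that it is the other side of
    every language pair.
    """
    missing = set(corpus_codes) - set(covered_codes) - set(exclude)
    return sorted(missing)


# Illustrative codes only, not the real JW300 language list.
corpus = ["af", "ady", "am", "bfi", "en"]
covered = ["af", "am"]
print(uncovered_languages(corpus, covered))
```

The real inputs would be the JW300 language list linked above and the codes found in the existing test directory.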

juliakreutzer commented 3 years ago

Yes, or we incorporate the code here completely.

cdleong commented 3 years ago

Ah, well, we'd have to update the notebooks as well, since they point directly to the forked version.

cdleong commented 3 years ago

Some languages, e.g. ady, lack alignment files for English: https://opus.nlpl.eu/JW300.php

cdleong commented 3 years ago

Did every language code that starts with the letter "a". Here are the ones that weren't already in there: test_letter_a_new.zip

cdleong commented 3 years ago

Got to bfi before I started actually practicing "quality at a glance" and looking at the data. Turns out bfi is just... English data?

cdleong commented 3 years ago

Oh, it's "British Sign Language". What the heck? https://en.wikipedia.org/wiki/British_Sign_Language

cdleong commented 3 years ago

test_ba_thru_btg_new.zip: codes ba through btg that were not already in the global test set.

juliakreutzer commented 3 years ago

Oh yeah, maybe we should keep a blacklist of all the language codes that have issues according to the tables in the appendix of https://arxiv.org/abs/2103.12028. Btw, the sources of the test set were selected based on how many African languages they had been translated into, so there is a bias towards frequent/general sentences. This is important to keep in mind as we extend the test set to more languages, since the initial set of languages influenced which sentences were picked.
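One possible shape for such a list is a small mapping from language code to the reason it is skipped, applied before any test set is generated. A rough sketch, assuming the only entries so far are the two codes flagged earlier in this thread; the names `EXCLUDED_CODES` and `split_by_exclusion` are hypothetical, and codes taken from the paper's appendix would still need to be added by hand:

```python
# Codes to skip, each with the reason it was flagged.
# Only the two codes mentioned in this thread are listed here.
EXCLUDED_CODES = {
    "ady": "no alignment files for English",
    "bfi": "British Sign Language; the data is English text",
}


def split_by_exclusion(codes, excluded=EXCLUDED_CODES):
    """Split codes into (kept, skipped), where skipped maps code -> reason."""
    kept = [code for code in codes if code not in excluded]
    skipped = {code: excluded[code] for code in codes if code in excluded}
    return kept, skipped


kept, skipped = split_by_exclusion(["ady", "af", "bfi", "bem"])
print(kept)
for code, reason in skipped.items():
    print(f"skipping {code}: {reason}")
```

Keeping the reason next to each code makes it easier to revisit an entry later, for example if alignment files become available.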