Open cdleong opened 3 years ago
Yes, or we incorporate the code here completely.
Ah, well we'd have to update the notebooks as well, as they point directly to the forked version
Some languages, e.g. ady, lack alignment files for English: https://opus.nlpl.eu/JW300.php
test_letter_a_new.zip Did every language code that starts with the letter "a". Here are the ones that weren't already in there.
Got to bfi before I started actually practicing "quality at a glance" and looking at the data. Turns out bfi is just... English data?
Oh, it's "British Sign Language". What the heck? https://en.wikipedia.org/wiki/British_Sign_Language
test_ba_thru_btg_new.zip
ba thru btg codes, not already in the global test set
Oh yeah, maybe we should do a blacklist for all the language codes that have issues according to the tables in the appendix of https://arxiv.org/abs/2103.12028. Btw the sources of the test set were selected based on how many African languages they were translated into, so there is a bias towards frequent/general sentences. This is important to keep in mind as we extend the test set to more languages, since the initial set of languages shaped which sentences were selected.
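For what it's worth, the blacklist idea could be as small as the sketch below. This is only an illustration: `BLOCKLIST` and `filter_blocklisted` are hypothetical names, and the single entry `bfi` is there only because it turned out to be English data in this thread. The real entries would have to be copied from the appendix tables by hand.

```python
from typing import Iterable, List, Set

# Hypothetical blocklist. `bfi` is included only because this thread found it
# to contain English text; real entries would come from the paper's appendix.
BLOCKLIST: Set[str] = {"bfi"}


def filter_blocklisted(codes: Iterable[str], blocklist: Set[str] = BLOCKLIST) -> List[str]:
    """Return the codes that are not on the blocklist, preserving input order."""
    return [code for code in codes if code.strip().lower() not in blocklist]


# Example codes taken from this thread.
candidates = ["ady", "ba", "bfi", "btg"]
kept = filter_blocklisted(candidates)
print(kept)  # ['ady', 'ba', 'btg']
```

Keeping the blocklist as a plain set makes it easy to review in a pull request and to extend as more problem codes turn up.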
Following #157, check what languages are not covered in https://github.com/juliakreutzer/masakhane/tree/master/jw300_utils/test, and create custom test sets for those. @juliakreutzer I think I can give this a go, but do I need to do a pull request to... your forked version of masakhane-mt?
Alternate language code list, looks the same: https://opus.nlpl.eu/opusapi/?languages=True&corpus=JW300
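A rough sketch of the "which languages are not covered" check, in case it helps. Assumptions: the language list endpoint returns JSON shaped like `{"languages": [...]}` (worth verifying against the real response), and the set of already-covered codes is collected separately from the test directory. `uncovered_codes`, `sample_payload`, and `already_covered` are made-up names, and the sample values are illustrative, not real results.

```python
import json
from typing import Iterable, List


def uncovered_codes(all_codes: Iterable[str], covered: Iterable[str]) -> List[str]:
    """Return the sorted codes that appear in all_codes but not in covered."""
    return sorted(set(all_codes) - set(covered))


# Assumed response shape; check the real payload before relying on this key.
sample_payload = '{"languages": ["ady", "ba", "bfi", "en", "btg"]}'
jw300_codes = json.loads(sample_payload)["languages"]

# Hypothetical set of codes that already have a test set.
already_covered = {"en", "ba"}

missing = uncovered_codes(jw300_codes, already_covered)
print(missing)  # ['ady', 'bfi', 'btg']
```

The set difference leaves out the network call on purpose, so the same function works whether the code list comes from the API or from the table on the JW300 page.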