thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

The variant versions of one language are not available #64

Closed pluiez closed 3 years ago

pluiez commented 3 years ago

Hi,

It seems that mtdata currently does not support multiple variants of one language. For example, OPUS lists 4 variants for Portuguese: pt_PT, pt_BR, pt_br and pt. For Chinese there are 8 corresponding language-code variants. So in total there are 32 combinations for a Portuguese-Chinese corpus.

However, the TED2020-v1 corpus in pt-zh is different from the one in pt-zh_cn. Running mtdata list -l pt-zh shows that only TED2020-v1 from pt-zh is available; the one from pt-zh_cn is not included. So the corpus collection does not seem to gather corpora from all language-code variants, which makes the URL list incomplete.

kpu commented 3 years ago

This is why I wanted BCP47 #47.

By the way, the case shouldn't matter. pt_br shouldn't be separate from pt_BR; that is an OPUS bug. Same for zh_cn and zh_CN, zh_tw and zh_TW.
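A minimal sketch of the case-folding idea above, assuming OPUS-style codes of the form language_REGION. The helper name normalize_code is hypothetical and is not part of mtdata; it only illustrates how pt_br and pt_BR would collapse into one code.

```python
def normalize_code(code: str) -> str:
    """Lowercase the language part; uppercase any two-letter region part.

    Hypothetical helper for illustration only (not mtdata's API).
    """
    parts = code.replace("-", "_").split("_")
    lang = parts[0].lower()
    rest = [p.upper() if len(p) == 2 else p for p in parts[1:]]
    return "_".join([lang] + rest)


# The four Portuguese codes mentioned in this thread.
opus_codes = ["pt_PT", "pt_BR", "pt_br", "pt"]

# After normalization, pt_br and pt_BR are the same entry.
unique = sorted({normalize_code(c) for c in opus_codes})
print(unique)  # ['pt', 'pt_BR', 'pt_PT']
```

With this folding, the four Portuguese codes reduce to three distinct ones.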

thammegowda commented 3 years ago

This is one of the current limitations, partly due to our reliance on ISO 639-3, which does not support country-specific language variants, while OPUS and a few others use codes that are hard to map to ISO 639-3.

I had seen on OPUS that en is a superset of en_US and en_UK, so I was hoping that zh would be a superset of all of its variants, including zh_cn, but this is not the case with TED2020-v1. The language code zh comes before zh_cn on TED2020-v1, and our current system keeps the first one.

I thought of switching to BCP47 for future versions, but IMHO the BCP47 codes are way more complicated (most people are still using two-letter codes, and three-letter codes are themselves a hard sell). We are thinking about how to resolve this without complicating the CLI (mtdata list -l pt-zh), and will try to fix it in the next version.
Suggestions are welcome.

thammegowda commented 3 years ago

The fix I am thinking of: currently, the primary key in our index is (name, src, tgt), where src and tgt are ISO 639-3 codes. This doesn't allow duplicates, so we have a problem with country-specific variants.

We could make (name, src, tgt, srcvariant, tgtvariant) the primary key; here srcvariant and tgtvariant are BCP47 suffixes excluding the language code.
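To make the key change concrete, here is a small sketch using plain Python dicts as a stand-in for the index. The rows and the "source" strings are illustrative assumptions, not mtdata's real data structures; only the two key shapes come from the comment above.

```python
# Illustrative rows: (name, src, tgt, srcvariant, tgtvariant, source).
# The "source" strings are placeholders, not real URLs.
rows = [
    ("ted2020-1", "por", "zho", "", "", "listing for pt-zh"),
    ("ted2020-1", "por", "zho", "", "CN", "listing for pt-zh_cn"),
]

# Current scheme: key is (name, src, tgt); the first entry wins,
# so the zh_cn variant is silently dropped.
old_index = {}
for name, src, tgt, src_var, tgt_var, source in rows:
    old_index.setdefault((name, src, tgt), source)

# Proposed scheme: variants are part of the key, so both entries survive.
new_index = {}
for name, src, tgt, src_var, tgt_var, source in rows:
    new_index[(name, src, tgt, src_var, tgt_var)] = source

print(len(old_index), len(new_index))  # 1 2
```

The three-field key keeps only the first TED2020 entry, while the five-field key retains both.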

And then

mtdata list -l pt-zh or mtdata list -l por-zho should show all the variants of pt and zh (and similarly mtdata get)

Then mtdata list -l pt_br-zh_cn or even mtdata list -l por_br-zho_cn should work with the specific variants selected (_br, _cn).
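The lookup behaviour described above could be sketched roughly as follows. Everything here is an assumption for illustration: the function names, the tiny ISO_639_3 table, and the rule that a query without a variant matches every variant. It is not how mtdata actually implements the lookup.

```python
from typing import List, Optional, Tuple

# Tiny illustrative subset of a two-letter to ISO 639-3 mapping.
ISO_639_3 = {"pt": "por", "zh": "zho"}

Lang = Tuple[str, Optional[str]]  # (language code, variant or None)


def parse_lang(code: str) -> Lang:
    """Split 'pt_br' into ('por', 'BR'); 'pt' into ('por', None)."""
    lang, _, variant = code.partition("_")
    lang = lang.lower()
    lang = ISO_639_3.get(lang, lang)
    return lang, (variant.upper() or None)


def parse_pair(pair: str) -> Tuple[Lang, Lang]:
    """Split 'pt_br-zh_cn' into source and target; assumes '_' marks variants."""
    src, tgt = pair.split("-")
    return parse_lang(src), parse_lang(tgt)


def lang_matches(query: Lang, entry: Lang) -> bool:
    """A query without a variant matches any variant of the same language."""
    return query[0] == entry[0] and (query[1] is None or query[1] == entry[1])


def select(query_pair: str, entry_pairs: List[str]) -> List[str]:
    q_src, q_tgt = parse_pair(query_pair)
    selected = []
    for entry in entry_pairs:
        e_src, e_tgt = parse_pair(entry)
        if lang_matches(q_src, e_src) and lang_matches(q_tgt, e_tgt):
            selected.append(entry)
    return selected


entries = ["por-zho", "por_BR-zho", "por-zho_CN", "por_BR-zho_CN", "por_BR-zho_TW"]
print(select("pt-zh", entries))          # all five entries
print(select("pt_br-zh_cn", entries))    # ['por_BR-zho_CN']
print(select("por_br-zho_cn", entries))  # ['por_BR-zho_CN']
```

Treating a missing variant as a wildcard keeps the simple form (pt-zh) working as before, while the longer form narrows the selection.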

kpu commented 3 years ago

BCP47 is a complicated beast but I think it's the right thing to aim for.

Even BCP47 can be broken; try making a code for Russian transliterated to English (and yes it can depend on which language one is transliterating to, not just the script) using one of the many standards.

Agree it's definitely useful to be able to get data just by language code. As a native speaker of en-US, I still want parallel corpora between fr and en-150 in a fr->en translation system.

Conversely, separating zh-Hans from zh-Hant in training a system matters. There's also the fun of what happens when a user asks for zh-Hant and you want to return zh-HK and zh-TW. Long term that would be awesome, but short term we can live without it.

WMT Kazakh has Cyrillic, Latin, and Arabic scripts, and supporting those requires BCP47.

thammegowda commented 3 years ago

Added in version 0.3.0

mtdata list -l pt-zh | cut -f1
2021-10-21 17:38:42 main.lang_pair:98 INFO:: Suggestion: Use codes por-zho instead of pt-zh. Let's make a little space for all languages of our planet 😢.
2021-10-21 17:38:42 __init__.get_instance:48 INFO:: Loading index from cache /Users/tg/.mtdata/mtdata.index.0.3.0.pkl
2021-10-21 17:38:44 main.list_data:19 INFO:: Found 64
Statmt-news_commentary-14-por-zho
Statmt-news_commentary-15-por-zho
Statmt-news_commentary-16-por-zho
OPUS-gnome-1-por-zho_CN
OPUS-gnome-1-por_BR-zho_CN
OPUS-gnome-1-por_PT-zho_CN
OPUS-gnome-1-por-zho_HK
OPUS-gnome-1-por_BR-zho_HK
OPUS-gnome-1-por_PT-zho_HK
OPUS-gnome-1-por-zho_TW
OPUS-gnome-1-por_BR-zho_TW
OPUS-gnome-1-por_PT-zho_TW
OPUS-kde4-2-por-zho_CN
OPUS-kde4-2-por_BR-zho_CN
OPUS-kde4-2-por-zho_HK
OPUS-kde4-2-por_BR-zho_HK
OPUS-kde4-2-por-zho_TW
OPUS-kde4-2-por_BR-zho_TW
OPUS-kdedoc-1-por-zho_TW
OPUS-multiccaligned-1.1-por-zho_CN
OPUS-multiccaligned-1.1-por-zho_TW
OPUS-newscommentary-11-por-zho
OPUS-newscommentary-14-por-zho
OPUS-newscommentary-9.1-por-zho
OPUS-opensubtitles-1-por-zho
OPUS-opensubtitles-1-por_BR-zho
OPUS-opensubtitles-2016-por-zho
OPUS-opensubtitles-2016-por_BR-zho
OPUS-opensubtitles-2016-por-zho_TW
OPUS-opensubtitles-2016-por_BR-zho_TW
OPUS-opensubtitles-2018-por-zho_CN
OPUS-opensubtitles-2018-por_BR-zho_CN
OPUS-opensubtitles-2018-por-zho_TW
OPUS-opensubtitles-2018-por_BR-zho_TW
OPUS-php-1-por_BR-zho
OPUS-php-1-por_BR-zho_TW
OPUS-qed-2.0a-por-zho
OPUS-ted2020-1-por-zho
OPUS-ted2020-1-por_BR-zho
OPUS-ted2020-1-por-zho_CN
OPUS-ted2020-1-por_BR-zho_CN
OPUS-ted2020-1-por-zho_TW
OPUS-ted2020-1-por_BR-zho_TW
OPUS-tanzil-1-por-zho
OPUS-ubuntu-14.10-por-zho
OPUS-ubuntu-14.10-por_BR-zho
OPUS-ubuntu-14.10-por-zho_CN
OPUS-ubuntu-14.10-por_BR-zho_CN
OPUS-ubuntu-14.10-por_PT-zho_CN
OPUS-ubuntu-14.10-por-zho_HK
OPUS-ubuntu-14.10-por_BR-zho_HK
OPUS-ubuntu-14.10-por_PT-zho_HK
OPUS-ubuntu-14.10-por-zho_TW
OPUS-ubuntu-14.10-por_BR-zho_TW
OPUS-ubuntu-14.10-por_PT-zho_TW
OPUS-wikimatrix-1-por-zho
OPUS-bibleuedin-1-por-zho
OPUS-tico19-20201028-por-zho
OPUS-wikimedia-20210402-por-zho
Facebook-wikimatrix-1-por-zho
Neulab-tedtalks_train-1-por-zho
Neulab-tedtalks_test-1-por-zho
Neulab-tedtalks_dev-1-por-zho
LinguaTools-wikititles-2014-por-zho
Total 64 entries