thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

Language codes inaccurate for Chinese languages #65

Closed (kirianguiller closed this issue 3 years ago)

kirianguiller commented 3 years ago

Hi,

I noticed this morning that the language codes for Chinese languages are not very accurate.

Indeed, I can see in opus_to_iso3.py that the OPUS codes zh, zh_TW, zh_CN and zh_HK all map to "Chinese". However, they should map to Chinese, UNKNOWN, Mandarin Chinese and Yue Chinese, respectively.

/!\ There is a small problem with zh_TW, which has no ISO 639-3 code of its own. I just learned from the Wikipedia page (and some further research) that, unfortunately, the variety of Mandarin spoken in Taiwan has no separate ISO 639-3 classification. This is problematic because its written script differs from that of mainland Mandarin Chinese (Taiwan uses traditional characters, whereas mainland China uses simplified characters). Does anyone know whether there is a committee that can decide to add languages to the ISO 639-3 list?

Also, regarding Cantonese and Mandarin, we can't keep mapping zh_HK (Cantonese) to the same ISO 639-3 code as zh_CN (Mandarin), as they are drastically different (the two languages are not mutually intelligible, and they don't use the same written script). I will submit a PR soon to propose a fix :).
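To make the proposal concrete, the mapping described above could be sketched as a small lookup table. This is only an illustration, not mtdata's actual code: the names `PROPOSED_ISO3` and `to_iso3` are hypothetical, and the codes `zho`, `cmn` and `yue` are the ISO 639-3 identifiers for Chinese, Mandarin Chinese and Yue Chinese.

```python
# Hypothetical table illustrating the proposed mapping (not mtdata's real data).
# None marks an OPUS code that has no dedicated ISO 639-3 entry.
PROPOSED_ISO3 = {
    "zh": "zho",     # Chinese (macrolanguage)
    "zh_CN": "cmn",  # Mandarin Chinese
    "zh_HK": "yue",  # Yue Chinese (Cantonese)
    "zh_TW": None,   # Mandarin as used in Taiwan: no separate ISO 639-3 code
}


def to_iso3(opus_code):
    """Return the proposed ISO 639-3 code, or None if there is no dedicated code."""
    # Accept both zh-HK and zh_HK spellings by normalizing the separator.
    normalized = opus_code.replace("-", "_")
    return PROPOSED_ISO3.get(normalized)
```

With such a table, `to_iso3("zh_HK")` and `to_iso3("zh_CN")` no longer collapse to the same code, while `zh_TW` stays explicitly unresolved.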

kirianguiller commented 3 years ago

Ah, it looks like the script I referred to in my last message is not the one responsible for converting OPUS codes to ISO 639-3.

To decide which ISO 639-3 code each OPUS code maps to, you split the OPUS code on the dash (here is the code) and then look up the resulting prefix in a table. I don't know whether similar cases exist for other languages, but this is problematic for the Chinese languages: zh-TW, zh-CN and zh-HK can't be treated as one language. zh-TW and zh-CN could be seen as two versions of the same language written in two different scripts, but zh-HK is really different from the others (as Spanish is from French, for instance).

@thammegowda, do you have an idea of how we could overcome this problem? I will think about it too.
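A minimal sketch of why prefix-based lookup loses this distinction, assuming a one-entry table; `PREFIX_TABLE` and `prefix_lookup` are hypothetical names used for illustration and are not taken from mtdata:

```python
# Hypothetical prefix table: only the part before the dash is ever looked up.
PREFIX_TABLE = {"zh": "zho"}


def prefix_lookup(opus_code):
    """Split the code on the dash and resolve only its prefix."""
    prefix = opus_code.split("-")[0]
    return PREFIX_TABLE.get(prefix)


codes = ["zh", "zh-TW", "zh-CN", "zh-HK"]
resolved = {code: prefix_lookup(code) for code in codes}

# Every variant ends up with the same code because the region part is discarded.
for code, iso3 in resolved.items():
    print(code, "->", iso3)
```

Because the part after the dash is dropped before the lookup, no change to the table alone can separate zh-HK from zh-CN; the lookup key itself would have to keep the region or script information.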

kpu commented 3 years ago

This is really a duplicate of #47 and #64, yes?

kirianguiller commented 3 years ago

It is indeed. And it seems it has already been fixed and merged too. Thanks for pointing me to these issues; I didn't know what BCP codes were :)