kirianguiller closed this issue 3 years ago
Ah, it looks like the script I referred to in my last message is not the one responsible for converting OPUS codes to ISO 639-3.
To decide which ISO 639-3 code each OPUS code maps to, the OPUS code is split on the dash (here is the code) and the resulting prefix is looked up in a table. I don't know whether other languages have similar cases, but this is problematic for Chinese languages: zh-TW, zh-CN and zh-HK can't be treated as one language at all. zh-TW and zh-CN could be seen as two versions of the same language written in two different scripts, but zh-HK is really different from the others (as Spanish is from French, for instance).
@thammegowda, do you have an idea of how we could overcome this problem? I will think about it too.
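To make the problem concrete, here is a minimal sketch of the prefix lookup described above. The table contents and the function name `iso3_from_prefix` are hypothetical and only illustrate the behaviour; the project's real table and code may differ.

```python
# Hypothetical prefix table, for illustration only (not the project's real table).
PREFIX_TO_ISO3 = {"zh": "zho", "es": "spa", "fr": "fra"}


def iso3_from_prefix(opus_code):
    """Look up only the part of the OPUS code before the first dash."""
    prefix = opus_code.split("-")[0]
    return PREFIX_TO_ISO3.get(prefix)


# Every Chinese variant collapses to the same ISO 639-3 code.
for code in ["zh", "zh-TW", "zh-CN", "zh-HK"]:
    print(code, "->", iso3_from_prefix(code))
```

Because the region part is dropped before the lookup, all four codes end up with the same result, which is the behaviour reported here.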
This is really a duplicate of #47 and #64, yes?
It is indeed. And it seems it has already been fixed and merged too. Thanks for pointing out these issues, I didn't know what BCP codes were :)
Hi,
I noticed this morning that the language codes for Chinese languages are not very accurate.
Indeed, I can see in opus_to_iso3.py that the OPUS codes zh, zh_TW, zh_CN and zh_HK all redirect to "Chinese". However, they should redirect to Chinese, UNKNOWN, Mandarin Chinese and Yue Chinese, respectively.
/!\ : There is a small problem with zh_TW, which actually has no ISO 639-3 code. I just learned from the Wikipedia page (and further research) that, unfortunately, the Mandarin variety spoken in Taiwan has no ISO 639-3 classification of its own. This is problematic because its written script differs from that of mainland Mandarin Chinese (Taiwan uses traditional characters, whereas mainland China uses simplified characters). Does anyone know whether there is a committee that can decide to add languages to the ISO 639-3 list?
Also, regarding Cantonese and Mandarin, we can't keep mapping zh_HK (Cantonese) to the same ISO 639-3 code as zh_CN (Mandarin), as they are drastically different (the two languages are not mutually intelligible, and they don't use the same written script). I will submit a PR soon to propose a fix :).
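A rough sketch of one way to do this: look up the full OPUS code first and only fall back to the prefix when there is no specific entry. The names `FULL_CODE_TO_ISO3`, `PREFIX_TO_ISO3` and `opus_to_iso3` are hypothetical, and leaving zh_TW as unknown is an assumption that simply mirrors the point above; this is not the content of the actual PR.

```python
# Hypothetical tables, for illustration only.
# cmn = Mandarin Chinese, yue = Yue Chinese, zho = Chinese (ISO 639-3).
FULL_CODE_TO_ISO3 = {
    "zh_CN": "cmn",
    "zh_HK": "yue",
    "zh_TW": None,  # assumption: left unknown, as discussed above
}
PREFIX_TO_ISO3 = {"zh": "zho"}


def opus_to_iso3(opus_code):
    """Prefer an exact match on the full code, then fall back to the prefix."""
    normalized = opus_code.replace("-", "_")  # accept both zh-HK and zh_HK
    if normalized in FULL_CODE_TO_ISO3:
        return FULL_CODE_TO_ISO3[normalized]
    prefix = normalized.split("_")[0]
    return PREFIX_TO_ISO3.get(prefix)


for code in ["zh", "zh_TW", "zh_CN", "zh_HK"]:
    print(code, "->", opus_to_iso3(code))
```

With this ordering, codes without a specific entry still resolve through their prefix, so only the variants that need a distinct code have to be listed explicitly.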