Language code inconsistencies

learningequality / le-utils

Utilities and constants shared across Kolibri, Ricecooker, and Kolibri Studio

MIT License

Language code inconsistencies #23

Closed ivanistheone closed 7 years ago

ivanistheone commented 7 years ago

While working on the YouTube subs for the TE chef, @divad12 noticed inconsistencies in the short language codes defined in le_utils/resources/languagelookup.json. There is a mix of two-letter codes like pt-BR and three-letter codes like zul.

The conventions are consistent between chefs, the cc server, and Kolibri, but loading data from external sources can be problematic. For example, a YouTube video can provide subtitles for Zulu as zu, but we need to upload them as zul to the cc server so that Kolibri will recognize them.

Should we consider standardizing on two-letter codes? Possibly with a fallback/retro-compatible mode for three-letter codes? (Side question: what is the ka_name used for?)

A change to two-letter codes will require revisiting:

le_utils
Existing chefs
Existing content channels in the content-curation server
Frontend code
Other places?

jayoshih commented 7 years ago

I'm really hesitant about changing the language codes because the language code is used as the model id. Changing the codes could break content that uses 3-letter codes, since the corresponding id would no longer exist. Also, certain languages have 3-letter codes because the 2-letter representation has been taken (e.g. Finnish and Filipino), so I'm not sure how cleanly certain languages would translate to a new 2-letter scheme. An alternative solution could be to update the getlang method to look for a "closest match", or to have some sort of mapping from 2-letter codes to 3-letter codes.
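A minimal sketch of what such a "closest match" fallback might look like, assuming a small hypothetical table of internal codes. The names `INTERNAL_CODES`, `TWO_TO_THREE`, and `closest_code` are illustrative only and are not the existing le_utils API; the real data in languagelookup.json has many more entries.

```python
# Hypothetical subset of the internal codes (the model ids); the real data
# lives in le_utils/resources/languagelookup.json.
INTERNAL_CODES = {"en", "pt-BR", "zul", "fil"}

# Hypothetical mapping from external two-letter codes to internal
# three-letter ids.
TWO_TO_THREE = {"zu": "zul"}


def closest_code(code):
    """Return the internal code that best matches an external one, or None."""
    if code in INTERNAL_CODES:
        return code
    # Case-insensitive match, e.g. "pt-br" -> "pt-BR".
    by_lower = {c.lower(): c for c in INTERNAL_CODES}
    lowered = code.lower()
    if lowered in by_lower:
        return by_lower[lowered]
    # External two-letter code whose internal id is a three-letter code.
    if lowered in TWO_TO_THREE:
        return TWO_TO_THREE[lowered]
    # Fall back to the primary subtag, e.g. "en-GB" -> "en".
    primary = lowered.split("-")[0]
    if primary in by_lower:
        return by_lower[primary]
    return TWO_TO_THREE.get(primary)
```

With this approach the stored ids never change: `closest_code("zu")` resolves to the existing `zul` id, and a code with no plausible match returns `None` so the caller can decide how to handle it.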

jayoshih commented 7 years ago

@aronasorman Might be good to get your opinion on this too

ivanistheone commented 7 years ago

@jayoshih I see. If we can't change the existing data model, we should aim to provide helper functions as you suggested:

one for looking up by 2-letter codes (ISO 639-1)
one for looking up by 3-letter codes (ISO 639-2~=ISO 639-5 or ISO 639-3)
lookup by name?

If these lookup functions are the only "public" API, then we can use whatever internal format we want (e.g. keep the existing one).
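A rough sketch of how these lookup helpers could sit in front of the existing internal format. Everything here is an assumption for illustration: the `Language` record, its field names, and the `lookup_*` function names are hypothetical and do not reflect the actual le_utils schema or API.

```python
from collections import namedtuple

# Hypothetical record; internal_code is whatever id the data model already uses.
Language = namedtuple("Language", ["internal_code", "alpha2", "alpha3", "name"])

# Tiny illustrative table (Filipino has no two-letter ISO 639-1 code).
_LANGUAGES = [
    Language("en", "en", "eng", "English"),
    Language("zul", "zu", "zul", "Zulu"),
    Language("fil", None, "fil", "Filipino"),
]

# Indexes built once, so each lookup is a plain dictionary access.
_BY_ALPHA2 = {lang.alpha2: lang for lang in _LANGUAGES if lang.alpha2}
_BY_ALPHA3 = {lang.alpha3: lang for lang in _LANGUAGES}
_BY_NAME = {lang.name.lower(): lang for lang in _LANGUAGES}


def lookup_alpha2(code):
    """Look up a language by two-letter code; returns None if unknown."""
    return _BY_ALPHA2.get(code.lower())


def lookup_alpha3(code):
    """Look up a language by three-letter code; returns None if unknown."""
    return _BY_ALPHA3.get(code.lower())


def lookup_name(name):
    """Look up a language by its English name, ignoring case."""
    return _BY_NAME.get(name.strip().lower())
```

A chef importing YouTube subtitles could then call something like `lookup_alpha2("zu").internal_code` to get `zul` without knowing which format is used internally.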

Related to this, Jamie just posted on Slack a much longer list of varied languages (for the African Storybook channel), so we might also need to consider an "extensible language" setting.

rtibbles commented 7 years ago

I think having an API is a good approach. However, I would urge you both to consider that the data and the API should be accessible from both Python and JavaScript, as we will need all of this language data during content render to make decisions about text directionality within content renderers.

rtibbles commented 7 years ago

Also, for even more languages, cf. https://www.ethnologue.com/

ivanistheone commented 7 years ago

Fixed in https://github.com/learningequality/le-utils/pull/28