Standardize Languages - Githubissues

Current Thinking (2021 July 22)

Language: ISO 639-3 (eng, hin, etc.)
Script: ISO 15924 only for Chinese (hans vs. hant)
- Also see the IANA Subtag Registry
Kosa: Construct a flat list of languages on the server side, but content is divided by text vs. audio/video. Non-Chinese text and audio happen to have the same keys. Chinese text content is either keyed as zho-hant or zho-hans, but nothing else. Audio content is keyed as cmn, yue, nan, or hak.
Mobile App: Users select a language and the app constructs a "preferred languages list" behind the scenes. All languages have a backup language of English. Selecting a Chinese language also presents the user with a script selection box, with options of Traditional or Simplified. For Chinese users, the most-preferred language is always zho-hant or zho-hans and the second-most-preferred language is the spoken Chinese language (one of cmn, yue, nan, or hak). Chinese users also get a final backup of English.
Thinking: This system allows Kosa to serve content with a flat language key for anything, greatly simplifying how it tracks languages and preventing a language tree from emerging anywhere in the API. The "preferred languages list" allows us to (a) back up everything with English content and (b) add flexible language preferences and new script options later, if required. The language-selection algorithm can be dumb-but-flexible, allowing us to avoid lookup trees entirely.
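
A minimal sketch of the "preferred languages list" and the dumb-but-flexible selection described above, written in Python purely for illustration (the app itself is Dart). The function names, the `traditional`/`simplified` option strings, and the shape of the `available` mapping are assumptions, not part of Kosa or the mobile app.

```python
from typing import Dict, List, Optional

# Spoken Chinese languages and script keys as named in the notes above.
SPOKEN_CHINESE = {"cmn", "yue", "nan", "hak"}
# Assumption: the script selection box yields one of these two option strings.
CHINESE_SCRIPT_KEYS = {"traditional": "zho-hant", "simplified": "zho-hans"}
FALLBACK = "eng"


def preferred_languages(selected: str, script: Optional[str] = None) -> List[str]:
    """Build the ordered preference list for a user's language selection."""
    if selected in SPOKEN_CHINESE:
        if script not in CHINESE_SCRIPT_KEYS:
            raise ValueError("Chinese selections need a script: traditional or simplified")
        # Script key first (text), spoken language second (audio), English last.
        return [CHINESE_SCRIPT_KEYS[script], selected, FALLBACK]
    if selected == FALLBACK:
        return [FALLBACK]
    return [selected, FALLBACK]


def pick_content(available: Dict[str, str], preferences: List[str]) -> Optional[str]:
    """Walk the flat preference list and return the first match, if any."""
    for key in preferences:
        if key in available:
            return available[key]
    return None
```

Because text is keyed by zho-hant/zho-hans and audio by cmn/yue/nan/hak, the same flat list resolves both kinds of content without a lookup tree: `pick_content` on text matches the first entry, and on audio falls through to the second.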

[ ] embed standardized language names in Dart and Ruby/Clojure using ICU libraries (?)
[ ] a minimum required set of languages includes:
- pali
- english
- espanol
- italiano
- simplified chinese
- francais
- portugues
- srpsko-hrvatski (serbo-croatian)

The complete list of languages currently supported by Pariyatti:

It seems that ISO 639-3 (an extension of ISO 639-2) has reasonably comprehensive support:

My current thinking is ISO 639-3 + (optional) region specifier. Alternatively, some BCP 47 subset... but it's just so complicated.

Wikipedia uses a number of hacks to get around BCP 47 limitations:

My first round of research turned up this:

A Language should have three fields: IANA code, English name ("Hindi"), Actual name ("हिंदी")
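
As a rough illustration of that three-field record, here is a sketch in Python. The class name, field names, and the `LANGUAGES` lookup are hypothetical, and the sample codes use ISO 639-3 values to match the current thinking at the top of this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    code: str          # language code, e.g. "hin" (ISO 639-3 assumed here)
    english_name: str  # name in English, e.g. "Hindi"
    native_name: str   # name in the language itself, e.g. "हिंदी"


# Hypothetical flat lookup keyed by code; real entries would come from CLDR/ICU data.
LANGUAGES = {
    lang.code: lang
    for lang in (
        Language("eng", "English", "English"),
        Language("hin", "Hindi", "हिंदी"),
    )
}
```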

Prefer tag combinations that are the nearest matches to the Gettext locale standard, wherever possible:

Ooooohhhkkkayyyyy. It looks like THIS is maybe the standard way to do this? At least according to friends at Wikipedia:

This list is available through ICU libraries. This CLDR format also contains the language name equivalents (अंग्रेज़ी / English vs. Hindi / हिंदी vs. every other possible combination).

The canonical ICU webpage is here: http://site.icu-project.org/home

The Ruby library is listed here (gem icu): http://site.icu-project.org/related