pariyatti / kosa

Digital library service
GNU Affero General Public License v3.0

Standardize Languages #8

Closed: deobald closed this issue 2 years ago

deobald commented 3 years ago

Current Thinking (2021 July 22)

  1. Language: ISO 639-3 (eng, hin, etc.)
  2. Script: ISO 15924 only for Chinese (hans vs. hant)
  3. Kosa: Construct a flat list of languages on the server side, with content divided into text vs. audio/video. For non-Chinese languages, text and audio happen to share the same keys. Chinese text content is keyed as either zho-hant or zho-hans, and nothing else. Chinese audio content is keyed as cmn, yue, nan, or hak.
  4. Mobile App: Users select a language and the app constructs a "preferred languages list" behind the scenes. Every language has English as a backup. Selecting a Chinese language also presents the user with a script selection box offering Traditional or Simplified. For Chinese users, the most-preferred language is always zho-hant or zho-hans, the second-most-preferred is the spoken Chinese language (one of cmn, yue, nan, or hak), and English is the final backup.
  5. Thinking: This system allows Kosa to serve content with a flat language key for anything, greatly simplifying how it tracks languages and preventing a language tree from emerging anywhere in the API. The "preferred languages list" allows us to (a) back up everything with English content and (b) add flexible language preferences and new script options later, if required. The language-selection algorithm can be dumb-but-flexible, allowing us to avoid lookup trees entirely.
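A minimal sketch of the "preferred languages list" and the dumb-but-flexible selection described in items 4 and 5. This is an illustration only, not Kosa's or the mobile app's actual code: the function names `preferred_languages` and `pick_content`, the `script` argument values, and the shape of the `available` dict are all assumptions made for the example.

```python
from typing import Optional

# Spoken Chinese languages named in the issue (ISO 639-3 codes).
SPOKEN_CHINESE = {"cmn", "yue", "nan", "hak"}
# Flat text keys for Chinese content; the "hant"/"hans" argument
# values are hypothetical inputs from the script selection box.
CHINESE_SCRIPT_KEYS = {"hant": "zho-hant", "hans": "zho-hans"}
FALLBACK = "eng"


def preferred_languages(language: str, script: Optional[str] = None) -> list[str]:
    """Build the ordered preferred languages list for one user selection."""
    if language in SPOKEN_CHINESE:
        if script not in CHINESE_SCRIPT_KEYS:
            raise ValueError("Chinese selections need a script: 'hant' or 'hans'")
        # Script key first (text), spoken language second (audio/video).
        prefs = [CHINESE_SCRIPT_KEYS[script], language]
    else:
        prefs = [language]
    # Everything is backed up by English.
    if FALLBACK not in prefs:
        prefs.append(FALLBACK)
    return prefs


def pick_content(available: dict[str, str], prefs: list[str]) -> Optional[str]:
    """Return the content under the first preferred key that exists, else None."""
    for key in prefs:
        if key in available:
            return available[key]
    return None


print(preferred_languages("hin"))          # ['hin', 'eng']
print(preferred_languages("yue", "hant"))  # ['zho-hant', 'yue', 'eng']
```

Because the server only ever sees flat keys, selection is a linear scan over the list and no language tree is needed on either side.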

Requirements

The complete list of languages currently supported by Pariyatti:

It seems that ISO 639-3 (an extension of ISO 639-2) has reasonably comprehensive support:

My current thinking is ISO 639-3 + (optional) region specifier. Alternatively, some BCP 47 subset... but it's just so complicated.

Wikipedia uses a number of hacks to get around BCP 47 limitations:

Examples explaining why flattening Chinese languages won't work:

  1. Taiwan speaks cmn, nan, hak but always uses zho-hant
  2. Fujian / Guangdong (China) speak nan and hak but always use zho-hans
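The two examples above can be checked mechanically. The sketch below encodes only those two observations (the `OBSERVATIONS` structure is a hypothetical layout, not data from Kosa) and shows that some spoken languages map to more than one script key, so the script cannot be derived from the spoken language alone.

```python
from collections import defaultdict

# (region, spoken languages, text key), copied from the two examples above.
OBSERVATIONS = [
    ("Taiwan", ["cmn", "nan", "hak"], "zho-hant"),
    ("Fujian / Guangdong", ["nan", "hak"], "zho-hans"),
]

# Collect every text key seen for each spoken language.
scripts_by_spoken = defaultdict(set)
for _region, spoken_codes, text_key in OBSERVATIONS:
    for code in spoken_codes:
        scripts_by_spoken[code].add(text_key)

# Spoken languages that appear with more than one script.
ambiguous = sorted(code for code, keys in scripts_by_spoken.items() if len(keys) > 1)
print(ambiguous)  # ['hak', 'nan']
```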

Chinese scripts can be decoded here:

https://www.chineseconverter.com/en/convert/find-out-if-simplified-or-traditional-chinese


Old notes from Asana:

1:

My first round of research turned up this:

A Language should have three fields: IANA code, English name ("Hindi"), Actual name ("हिंदी")
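As a rough sketch, the three fields could be modelled like this. The class and field names are assumptions for illustration, not Kosa's schema; "hi" is the IANA registry subtag for Hindi.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    """Hypothetical record holding the three fields described above."""
    code: str          # IANA language subtag, e.g. "hi"
    english_name: str  # name in English, e.g. "Hindi"
    native_name: str   # name as written in the language itself


hindi = Language(code="hi", english_name="Hindi", native_name="हिंदी")

# A flat lookup keyed by code; no nesting is required.
by_code = {lang.code: lang for lang in [hindi]}
print(by_code["hi"].native_name)  # हिंदी
```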

IANA tag registry is here: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Prefer tag combinations that are the nearest matches to the Gettext locale standard, wherever possible:

https://www.gnu.org/software/gettext/manual/html_node/Locale-Names.html#Locale-Names


2:

Ooooohhhkkkayyyyy. It looks like THIS is maybe the standard way to do this? At least according to friends at Wikipedia:

https://github.com/unicode-org/cldr/tree/release-37/common/main


3:

The canonical ICU webpage is here: http://site.icu-project.org/home

The Ruby library is listed here (gem icu): http://site.icu-project.org/related

There is a Dart package: https://pub.dev/packages/icu


4: (post-Asana)

Clojure: https://github.com/Vincit/satakieli (wraps ICU4J)

Java: http://site.icu-project.org (ICU4J)

deobald commented 2 years ago

This is completed for Kosa with the inclusion of a complete superset of both dhamma.org and pariyatti.org languages (as a flattened list).