rhdunn / cainteoir-engine

The Cainteoir Text-to-Speech core engine
http://reecedunn.co.uk/cainteoir/
GNU General Public License v3.0

support RFC4646 and CLDR for language tag processing #20

Closed rhdunn closed 12 years ago

rhdunn commented 12 years ago

RFC4646: http://www.ietf.org/rfc/rfc4646.txt

CLDR: http://cldr.unicode.org/

Language codes (en, zh, af, owl, etc.), script codes (Hans, Latn, Ogam, etc.) and territory codes (CN, US, TK, etc.).
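As a rough illustration of the decoding step, the sketch below splits a simple `language[-script][-region]` tag by subtag shape (4 letters for a script, 2 letters or 3 digits for a region). The names `LanguageTag` and `parse_tag` are made up for this example and are not the engine's API; extended language subtags, variants, extensions and private-use subtags are deliberately ignored.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str] = None
    region: Optional[str] = None


def parse_tag(tag: str) -> LanguageTag:
    """Split a simple language[-script][-region] tag by subtag shape."""
    parts = tag.split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        is_script = len(part) == 4 and part.isalpha()
        is_region = (len(part) == 2 and part.isalpha()) or (
            len(part) == 3 and part.isdigit()
        )
        if is_script and script is None and region is None:
            script = part.title()   # scripts are conventionally title-cased
        elif is_region and region is None:
            region = part.upper()   # regions are conventionally upper-cased
        else:
            break                   # anything else is out of scope for this sketch
    return LanguageTag(language, script, region)
```

For example, `parse_tag("zh-hans-cn")` gives `LanguageTag("zh", "Hans", "CN")`, and `parse_tag("en")` leaves the script and region as `None`.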

Language code aliases (e.g. i-klingon => tlh, spa => es) -- map to their primary Unicode/ISO-639 code (shortest code).
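A minimal sketch of the alias step, assuming a hand-written lookup table. `ALIASES` and `resolve_alias` are hypothetical names; a real table would be generated from the IANA registry and CLDR data rather than typed in.

```python
# Hypothetical alias table holding only the two examples given above.
ALIASES = {
    "i-klingon": "tlh",
    "spa": "es",
}


def resolve_alias(code: str) -> str:
    """Return the preferred (shortest) code for an alias, or the code itself."""
    key = code.lower()
    return ALIASES.get(key, key)
```

Codes that are not in the table pass through unchanged (lower-cased), so the function is safe to call on every incoming tag.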

Expand to full canonical form (e.g. en => en-Latn-US; es-MX => es-Latn-MX).
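One possible shape for the expansion, assuming a per-language table of default script and region. `DEFAULTS` and `expand` are illustrative names, and the two table rows are hand-written; CLDR's likely-subtags data would be the real source.

```python
# Hypothetical defaults: language -> (script, region).
DEFAULTS = {
    "en": ("Latn", "US"),
    "es": ("Latn", "ES"),
}


def expand(tag: str) -> str:
    """Fill in a missing script and/or region from the defaults table."""
    parts = tag.split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part.title()
        elif len(part) == 2 and part.isalpha():
            region = part.upper()
    default_script, default_region = DEFAULTS.get(language, (None, None))
    script = script or default_script
    region = region or default_region
    # Languages without an entry in the table keep whatever subtags they had.
    return "-".join(p for p in (language, script, region) if p)
```

With this table, `expand("en")` gives `en-Latn-US` and `expand("es-MX")` gives `es-Latn-MX`; an explicit subtag always wins over the default.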

Equivalent language forms (e.g. en <> en-US <> en-Latn-US).

Compare language codes -- exact match; partial match.
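A sketch of what the two match levels could look like, assuming "partial" means that one tag's subtags are a prefix of the other's. The `Match` enum and `compare` function are invented for this example.

```python
from enum import Enum


class Match(Enum):
    NONE = 0
    PARTIAL = 1
    EXACT = 2


def compare(a: str, b: str) -> Match:
    """Classify two tags as an exact, partial (prefix) or non-match."""
    pa = a.lower().split("-")
    pb = b.lower().split("-")
    if pa == pb:
        return Match.EXACT
    shorter, longer = sorted((pa, pb), key=len)
    if longer[: len(shorter)] == shorter:
        return Match.PARTIAL
    return Match.NONE
```

Under this definition `en` partially matches `en-US`, but `en-US` does not match `en-Latn-US` because the script sits between the two shared subtags. Treating those as equivalent would require expanding both tags to their full form before comparing.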

Related languages (language tree), e.g. Middle Irish Gaelic is ancestral to Scottish Gaelic ==> use for sharing language rules (e.g. between Afrikaans and Dutch).
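The language tree could be modelled as a child-to-parent map that is walked to find a language whose rules can be reused. This is only a sketch: `PARENT` and `fallback_chain` are hypothetical names, and the three entries are hand-written from the examples above (`gd` Scottish Gaelic, `ga` Irish, `mga` Middle Irish, `af` Afrikaans, `nl` Dutch).

```python
# Hypothetical child -> parent relationships for rule sharing.
PARENT = {
    "gd": "mga",
    "ga": "mga",
    "af": "nl",
}


def fallback_chain(code: str) -> list:
    """Return the language followed by its ancestors, nearest first."""
    chain = [code]
    seen = {code}
    while chain[-1] in PARENT:
        parent = PARENT[chain[-1]]
        if parent in seen:   # guard against accidental cycles in the table
            break
        chain.append(parent)
        seen.add(parent)
    return chain
```

A rule loader could then try each entry of `fallback_chain("af")`, i.e. `["af", "nl"]`, and use the first language that has rules available.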

rhdunn commented 12 years ago

RFC4647: http://www.ietf.org/rfc/rfc4647.txt (Matching of language tags)
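For reference, the lookup scheme in RFC 4647 (section 3.4) truncates the requested range from the end until something matches. The sketch below assumes plain ranges without `*` wildcards; `lookup` is an illustrative name, not an existing function in the engine.

```python
def lookup(language_range, available, default=None):
    """Return the best available tag for a range by progressive truncation."""
    by_lower = {tag.lower(): tag for tag in available}
    parts = language_range.lower().split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in by_lower:
            return by_lower[candidate]
        parts.pop()
        # A single-character subtag (such as "x") left at the end is removed
        # together with the subtag that followed it.
        while parts and len(parts[-1]) == 1:
            parts.pop()
    return default
```

So a request for `en-Latn-US` against voices tagged `en` and `de` resolves to `en`, and a request with no match falls through to `default`.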

NOTE: The CLDR uses underscore ('_') instead of hyphen ('-') to split language code segments, so both need to be supported.
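Supporting both separators can be as small as splitting on either character before any further processing. A minimal sketch, with `split_subtags` and `normalize_separators` as made-up helper names:

```python
import re


def split_subtags(tag: str) -> list:
    """Split a tag on either '-' (RFC style) or '_' (CLDR style)."""
    return re.split(r"[-_]", tag)


def normalize_separators(tag: str) -> str:
    """Rewrite a CLDR-style tag to use hyphens throughout."""
    return "-".join(split_subtags(tag))
```

After this step `zh_Hans_CN` and `zh-Hans-CN` are the same string, so the rest of the code only has to handle hyphens.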

rhdunn commented 12 years ago

NOTE: This should also support POSIX/Linux locale codes -- specifically:

1/ locale@script -- e.g. en@boldquot, uz@cyrillic, sr@ije, sr@latin, sr@Latn

2/ locale.codepage -- e.g. nb_NO.ISO-8859-1
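A sketch of how these locale names might be taken apart, assuming the usual `language[_territory][.codeset][@modifier]` layout. `PosixLocale`, `parse_posix_locale`, `to_language_tag` and the `MODIFIER_SCRIPTS` table are all hypothetical; only modifiers that clearly name a script are mapped, and others (such as `boldquot` or `ije`) are dropped from the resulting tag.

```python
from typing import NamedTuple, Optional


class PosixLocale(NamedTuple):
    locale: str
    codeset: Optional[str]
    modifier: Optional[str]


# Hypothetical mapping from @modifier names to script codes.
MODIFIER_SCRIPTS = {"cyrillic": "Cyrl", "latin": "Latn", "latn": "Latn"}


def parse_posix_locale(name: str) -> PosixLocale:
    """Split 'locale.codeset@modifier' into its three pieces."""
    base, _, modifier = name.partition("@")
    locale, _, codeset = base.partition(".")
    return PosixLocale(locale, codeset or None, modifier or None)


def to_language_tag(name: str) -> str:
    """Convert a POSIX locale name to a hyphenated language tag."""
    loc = parse_posix_locale(name)
    parts = loc.locale.replace("_", "-").split("-")
    script = MODIFIER_SCRIPTS.get((loc.modifier or "").lower())
    if script:
        parts.insert(1, script)   # the script goes right after the language
    return "-".join(parts)
```

With these assumptions `sr@latin` and `sr@Latn` both become `sr-Latn`, and `nb_NO.ISO-8859-1` becomes `nb-NO` with the codeset available separately from `parse_posix_locale`.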

rhdunn commented 12 years ago

See http://www.iana.org/assignments/language-subtag-registry for a list of tags (language, region, script).
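The registry is a plain-text file of `Field: value` records separated by `%%` lines, with long values continued on indented lines. A rough reader could look like the sketch below; `parse_registry` and `subtags_of_type` are illustrative names, and `SAMPLE` is a hand-written, abbreviated excerpt in that layout rather than a copy of the live file.

```python
# Hand-written sample in the registry's record layout (abbreviated).
SAMPLE = """\
%%
Type: language
Subtag: en
Description: English
Suppress-Script: Latn
%%
Type: script
Subtag: Latn
Description: Latin
%%
Type: region
Subtag: US
Description: United States
"""


def parse_registry(text: str) -> list:
    """Parse records into dicts mapping each field name to a list of values."""
    records = []
    current = {}
    last_field = None
    for line in text.splitlines():
        if line.strip() == "%%":
            if current:
                records.append(current)
            current, last_field = {}, None
        elif line[:1].isspace() and last_field:
            # Continuation line: append to the most recent value.
            current[last_field][-1] += " " + line.strip()
        elif ":" in line:
            field, _, value = line.partition(":")
            last_field = field.strip()
            # Fields such as Description may repeat, so values are kept in a list.
            current.setdefault(last_field, []).append(value.strip())
    if current:
        records.append(current)
    return records


def subtags_of_type(records: list, kind: str) -> list:
    """Collect the Subtag values of all records with the given Type."""
    return [r["Subtag"][0] for r in records
            if r.get("Type") == [kind] and "Subtag" in r]
```

Calling `subtags_of_type(parse_registry(SAMPLE), "script")` yields the script codes in the sample, which is the kind of list needed to validate the script subtag of an incoming tag.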

rhdunn commented 12 years ago

RFC5646: http://www.ietf.org/rfc/rfc5646.txt

NOTE: This replaces RFC4646.

rhdunn commented 12 years ago

The core language, script and region decoding has been implemented with tests, along with support for espeak tags and the IANA language subtag registry. The remaining parts (along with POSIX/Linux locale codes) should be implemented as required.