Using qualified language names instead of language code

Omikhleia commented 1 year ago

Most of the code (and the manual also states it) assumes document.language is an ISO 639 language code (e.g. fr, en...). There are a number of cases where this is not sufficient for actual typography.

- Hyphenation. E.g. Serbian is "sr", which currently has Cyrillic hyphenation patterns, but the written language is "digraphic": it uses either the Cyrillic ("sr-Cyrl") or the Latin ("sr-Latn") script. Likewise for Azeri and a number of others.
- Internationalization (fluent and friends)
  - For the same reason as above, obviously
  - But also because regional variants of a base language may have other habits, e.g. "es" and "pt" vs. "es-AR" and "pt-BR" respectively, etc.
- Number formatting #1630 (and, slightly related, #1248, for cases where either Latin digits or a different native script are used)
- Smart typography. E.g. see smartquotes.sile (a dependency of my markdown package): "fr-FR" and "fr-CH" would need to be distinguished, as would "en-US" and "en-GB".
- ...

This also indirectly relates to #1367 and #1157.

Near duplicate of #1368.

alerque commented 1 year ago

Yes, all true. In fact I thought I had an issue for tracking this but I don't see it. I suppose this comment is what I was thinking of.

The next step is probably to figure out whether changing document.language to a region-qualified language identifier by default is going to be a breaking change.

Omikhleia commented 1 year ago

Breaking for the user? No, it shouldn't be. I've been using "en-GB" and "fr-CA" for fun, with just one clever but ugly hack to the existing code. The question now is how far we are ready to go in refactoring the internal logic of several things to avoid that really ugly hack... I can push a draft branch ~~by the week-end~~, annotated with comments, if you think that might help in assessing the problem and finding a good solution.

alerque commented 3 months ago

> Breaking for the user? No, it shouldn't. I've been using "en-GB" and "fr-CA" for the fun,

I'd love for you to be right here, but I'm having trouble visualizing it. Using fully qualified names like en_GB in a document and shimming them to work in SILE with 'simple' names is a relatively easy automatic downgrade. What I can't picture is the other way around, where a document (like almost all of them in existence right now) specifies a simple name and we need to upgrade it to a fully qualified one. For this not to be a breaking change, we'll need a function to cast up an otherwise ambiguous language code into the most likely fully qualified name. No? Doable, just not simple. Or am I missing something here?
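To make the "cast up" idea concrete, here is a minimal sketch in Python (not SILE code). The table and the name `cast_up` are hypothetical; the three entries mirror what CLDR's "likely subtags" data suggests for these languages, but a real implementation would load the full data set rather than hard-code it.

```python
# Hypothetical excerpt of a "likely subtags" table (illustrative only).
LIKELY_SUBTAGS = {
    "en": "en-Latn-US",
    "fr": "fr-Latn-FR",
    "sr": "sr-Cyrl-RS",
}


def cast_up(tag: str) -> str:
    """Expand a bare language code to its most likely fully qualified tag.

    Tags that already carry a qualifier, and codes missing from the
    table, are returned unchanged.
    """
    if "-" in tag:
        return tag
    return LIKELY_SUBTAGS.get(tag.lower(), tag)


print(cast_up("sr"))     # bare code: expanded from the table
print(cast_up("en-GB"))  # already qualified: left alone
```

The hard part is not the lookup but deciding where the table comes from and who maintains it, which is presumably the "not simple" bit.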

Omikhleia commented 3 months ago

> Using fully qualified names like en_GB

By the way, that would be en-GB if we stick to the BCP47 format, which I would recommend: it is what the Web standards mostly use, and it is a slightly different format from "locale codes" (which is what your en_GB could be). There are rules for mapping one to the other, as well as canonicalization rules, and this is actually what our existing ICU wrapper does.
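As a rough illustration of the simplest part of that mapping, the Python sketch below turns an underscore-separated locale code into a BCP47-style tag with conventional casing (lowercase language, title-case script, uppercase region). The function name `to_bcp47` is hypothetical, and this is not what the ICU wrapper does internally: it ignores POSIX suffixes such as `.UTF-8` or `@latin`, deprecated subtags, and the rest of real canonicalization.

```python
def to_bcp47(code: str) -> str:
    """Normalize separators and casing of a simple language[-script][-region] code."""
    parts = code.replace("_", "-").split("-")
    out = [parts[0].lower()]                # language subtag: lowercase
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            out.append(part.title())        # script subtag, e.g. Latn
        elif len(part) == 2 and part.isalpha():
            out.append(part.upper())        # region subtag, e.g. GB
        else:
            out.append(part.lower())        # anything else: lowercase
    return "-".join(out)


print(to_bcp47("en_GB"))
print(to_bcp47("SR_latn_rs"))
```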

> I'm having trouble visualizing the other way around where a document (like almost all of them in existence right now) specify a simple name and we need to upgrade it to a fully qualified one.

But we don't necessarily have to upgrade documents: en is valid BCP47 for "Standard English". One only needs to upgrade to, say, en-US, en-GB or en-CA in order to enable features specific to those variants (if any exist); the bare 2-letter code remains valid (and is usually considered to mean en-US).

In most cases, the 2-letter code is the canonical form of the "main language", e.g. fr is French for France (hence fr-FR does not really exist per se, but fr-CH, fr-CA etc. do have a meaning); es is always understood as es-ES (Castilian) and only needs extra qualification when referring to a variant such as es-MX (Spanish from Mexico).

In other words, it seems to me that the crux of the matter is not to enforce fully qualified names (you wouldn't want to enforce the use of the very qualified but cumbersome en-Latn-US for standard English in Latin script) but just to support them, with fallback to the shortest supported form (which is what my WIP PR #1641 did).
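The fallback idea can be sketched as a truncation lookup, in the spirit of the "lookup" scheme of RFC 4647: drop trailing subtags until something supported is found. This is a Python illustration only, not the code of PR #1641; `fallback_chain`, `resolve` and the `supported` set are hypothetical names.

```python
from typing import Iterable, List, Optional


def fallback_chain(tag: str) -> List[str]:
    """List candidate tags from most to least specific, e.g. a-b-c, a-b, a."""
    parts = tag.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def resolve(tag: str, supported: Iterable[str]) -> Optional[str]:
    """Return the first candidate that is supported, or None if none is."""
    supported = set(supported)
    for candidate in fallback_chain(tag):
        if candidate in supported:
            return candidate
    return None


# Assume hyphenation patterns exist only for these tags (made-up set).
patterns = {"en", "fr", "sr-Latn", "sr"}
print(resolve("fr-CH", patterns))       # no fr-CH patterns: falls back to fr
print(resolve("sr-Latn-RS", patterns))  # stops at sr-Latn before reaching sr
```

With this shape, existing documents that say `en` or `fr` keep working unchanged, and qualified tags only matter where a more specific resource actually exists.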

There are only a few cases where the non-qualified name is ambiguous (sr could be sr-Latn or sr-Cyrl) but there is usually a default interpretation.

Or did I misunderstand your question?

Omikhleia commented 3 months ago

And I stand corrected:

Sometimes we might have to go the other way and map a 2-letter language such as "pt" to something "more qualified", simply because the files we need to load require it.
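A minimal Python sketch of that reverse mapping, assuming the resources on disk exist only under qualified names. Everything here is hypothetical: the function `pick_resource`, the `PREFERRED_VARIANT` table, and in particular the choice of which variant a bare code maps to, which is a policy decision rather than a fact.

```python
from typing import Iterable, Optional

# Hypothetical policy table: which qualified variant a bare code stands for.
PREFERRED_VARIANT = {"pt": "pt-PT", "en": "en-US"}


def pick_resource(tag: str, available: Iterable[str]) -> Optional[str]:
    """Pick the best available resource name for a language tag."""
    available = set(available)
    if tag in available:
        return tag
    # First try less specific forms by dropping trailing subtags.
    parts = tag.split("-")
    while len(parts) > 1:
        parts.pop()
        candidate = "-".join(parts)
        if candidate in available:
            return candidate
    # Only the bare language is left: upgrade it via the policy table.
    base = parts[0]
    preferred = PREFERRED_VARIANT.get(base)
    if preferred in available:
        return preferred
    # Last resort: any variant of the same language, chosen deterministically.
    variants = sorted(a for a in available if a.split("-")[0] == base)
    return variants[0] if variants else None


files = {"pt-PT", "pt-BR", "en-US", "en-GB"}
print(pick_resource("pt", files))     # bare code upgraded through the table
print(pick_resource("pt-BR", files))  # exact match wins
```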

(For the curious-minded, this screenshot is from the CSL locales, using BCP47 but with even extra explicit qualification)

sile-typesetter / sile

Using qualified language names instead of language code #1631