Omikhleia opened 1 year ago
Yes, all true. In fact I thought I had an issue for tracking this but I don't see it. I suppose this comment is what I was thinking of.
The next step is probably to figure out if changing document.language to a region qualified language identifier by default is going to be a breaking change.
Breaking for the user? No, it shouldn't.
I've been using "en-GB" and "fr-CA" for the fun, with just one clever but ugly hack to the existing code.
The question now is how far we are ready to go in refactoring the internal logic of several things to avoid the really ugly hack... I can push a draft branch by the weekend, annotated with comments, if you think that might help in ascertaining the problem and finding a good solution.
> Breaking for the user? No, it shouldn't. I've been using "en-GB" and "fr-CA" for the fun,
I'd love for you to be right here, but I'm having trouble visualizing it. Using fully qualified names like `en_GB` in a document and shimming it to work in SILE with 'simple' names is a relatively easy automatic downgrade. I'm having trouble visualizing the other way around, where a document (like almost all of them in existence right now) specifies a simple name and we need to upgrade it to a fully qualified one. In order for this not to be a breaking change we'll need a function to cast an otherwise ambiguous language code up into the most likely fully qualified name. No? Doable, just not simple. Or am I missing something here?
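For illustration, the "cast up" step could look something like the rough Python sketch below. The `upgrade` name and the mapping table are hypothetical, purely for discussion -- nothing here is SILE's actual code:

```python
# Hypothetical sketch: "cast up" a bare ISO 639 code to a likely
# fully qualified BCP47 tag. The table is illustrative only.
LIKELY_QUALIFIED = {
    "en": "en-US",
    "fr": "fr-FR",
    "pt": "pt-PT",
    "sr": "sr-Cyrl",
}

def upgrade(tag):
    """Return a plausible fully qualified tag for a bare code,
    or the tag unchanged if it is already qualified or unknown."""
    if "-" in tag:
        return tag  # already qualified, pass through
    return LIKELY_QUALIFIED.get(tag, tag)

print(upgrade("en"))     # -> "en-US" (a guess at the likely variant)
print(upgrade("en-GB"))  # -> "en-GB" (already qualified)
```

The hard part, of course, is deciding what goes in that table -- which is exactly the ambiguity being discussed.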
> Using fully qualified names like `en_GB`

By the way, that would be `en-GB` if we stick to the BCP47 format -- which I would recommend: it is what the Web standards mostly use, and it's a slightly different format from "locale codes" (which is what your `en_GB` could be); and there are rules for mapping one to the other (as well as canonicalization rules, and this is what our existing ICU wrapper actually does).
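The locale-code-to-BCP47 mapping is mostly mechanical; a minimal sketch (the function name is mine, and this deliberately ignores the script/variant subtleties a real canonicalizer such as ICU's would handle):

```python
def locale_to_bcp47(locale):
    """Convert a POSIX-style locale code such as "en_GB" or
    "fr_CA.UTF-8" to a BCP47-style tag such as "en-GB" or "fr-CA".
    Encoding (".UTF-8") and modifier ("@latin") suffixes are dropped."""
    locale = locale.split(".")[0].split("@")[0]
    parts = locale.split("_")
    lang = parts[0].lower()
    if len(parts) > 1:
        return lang + "-" + parts[1].upper()
    return lang

print(locale_to_bcp47("en_GB"))       # -> "en-GB"
print(locale_to_bcp47("fr_CA.UTF-8")) # -> "fr-CA"
```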
> I'm having trouble visualizing the other way around where a document (like almost all of them in existence right now) specifies a simple name and we need to upgrade it to a fully qualified one.
But we don't necessarily have to upgrade documents: `en` is valid BCP47 for "Standard English" -- one only needs to upgrade to, say, `en-US`, `en-GB` or `en-CA` in order to enable features specific to the variants (if any exist), but the bare 2-letter code is still valid (usually considered to mean `en-US`).
In most cases, the 2-letter code is the canonical form of the "main language": e.g. `fr` is French for France (hence `fr-FR` does not really exist per se, but `fr-CH`, `fr-CA` etc. do have a meaning); `es` is always understood as `es-ES` (Castilian) and only needs extra qualification when referring to a variant such as `es-MX` (Spanish from Mexico).
In other terms, it seems to me that the crux of the matter is not to enforce fully qualified names (you wouldn't want to enforce the very qualified but cumbersome `en-Latn-US` for standard English in Latin script) but just to support them, with a fallback to the shortest supported form (which is what my WIP PR #1641 did).
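A fallback of that shape is roughly what the standard BCP47 "lookup" matching does: truncate subtags from the right until something supported is found. A minimal sketch, assuming a flat set of supported tags (the `resolve` name is mine, not the PR's actual code):

```python
def resolve(tag, supported):
    """Walk a BCP47 tag from most to least specific until a
    supported form is found, e.g. en-Latn-US -> en-Latn -> en.
    Returns None if not even the bare language code is supported."""
    parts = tag.split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in supported:
            return candidate
        parts.pop()
    return None

print(resolve("en-Latn-US", {"en", "fr", "en-GB"}))  # -> "en"
print(resolve("fr-CA", {"fr", "fr-CA"}))             # -> "fr-CA"
```

So a document saying `en-Latn-US` degrades gracefully to plain `en` behaviour when no variant-specific support exists.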
There are only a few cases where the non-qualified name is ambiguous (`sr` could be `sr-Latn` or `sr-Cyrl`), but there is usually a default interpretation.
Or did I misunderstand your question?
And I stand corrected:
Sometimes we might have to map a 2-letter language such as `pt` the other way round, to something "more qualified", just because the files we may need to load want it.
(For the curious-minded, this screenshot is from the CSL locales, which use BCP47 but with even extra explicit qualification.)
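That reverse case could be sketched as follows -- the preference table and file-name pattern below mimic how a data set like the CSL locales ships only fully qualified files, but they are assumptions for illustration, not SILE code:

```python
# Illustrative: a data set shipping only fully qualified locale files
# forces us to qualify bare codes before loading anything.
PREFERRED = {
    "pt": "pt-PT",
    "en": "en-US",
    "fr": "fr-FR",
}

def locale_file_for(tag):
    """Map a (possibly bare) tag to the qualified file name an
    external data set expects (hypothetical naming scheme)."""
    qualified = PREFERRED.get(tag, tag)
    return "locales-%s.xml" % qualified

print(locale_file_for("pt"))     # -> "locales-pt-PT.xml"
print(locale_file_for("de-DE"))  # -> "locales-de-DE.xml"
```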
Most of the code (and the manual also states it) assumes `document.language` is an ISO 639 language code (e.g. `fr`, `en`...). There are a number of cases where this is not sufficient for actual typography (nor for localization, i.e. `fluent` and friends). This also indirectly relates to #1367 and #1157.
Near duplicate of #1368.