tc39 / ecma402

Status, process, and documents for ECMA 402
https://tc39.es/ecma402/

Consider relaxing locale resolution for `Intl.Segmenter` #895

Open jedel1043 opened 4 months ago

jedel1043 commented 4 months ago

Alternative name: Make %Intl.Segmenter%.[[AvailableLocales]] be the full set of all syntactically valid locales (with some caveats like canonicalization/extensions)

Rationale

While most Intl services require a locale to behave correctly at runtime, the Segmenter service is in an unusual position: it supports almost any locale you throw at it, and the provided locale serves only as a hint for segmenting certain special cases.

This apparently makes it difficult for libraries such as ICU4X to determine whether a locale is in their [[AvailableLocales]] list. In ICU4X, only a few locales ("km", "lo", "my", "th") are "supported" in the sense that the data provider loads some amount of data for them. The rest of the locales are still very much "supported"; they just don't load locale-specific data at runtime. (Asking for @sffc's help to add more context about this.)

What then? Well, if virtually all locales are "supported" by Segmenter, why not just consider all syntactically valid locales (see alternative name) as supported locales for that service? This would mean making APIs such as Intl.Segmenter.supportedLocalesOf always return every locale passed to them, which doesn't sound too bad for a service that is basically a low-level text-processing utility.
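To make the proposal concrete, here is a minimal sketch comparing today's behaviour with the relaxed one. `relaxedSupportedLocalesOf` is a hypothetical helper, not part of ECMA-402 or any implementation; it assumes that "relaxed" means canonicalizing the requested tags and returning all of them, and it ignores the extension caveats mentioned in the alternative name.

```javascript
// Hypothetical helper (assumption): models the proposed behaviour, where every
// syntactically valid locale counts as supported after canonicalization.
// Intl.getCanonicalLocales throws a RangeError for malformed tags.
function relaxedSupportedLocalesOf(locales) {
  return Intl.getCanonicalLocales(locales);
}

const requested = ["EN-us", "th", "tlh"];

// Proposed: all requested locales, canonicalized.
const proposed = relaxedSupportedLocalesOf(requested);

// Current: whatever subset the implementation's locale data covers.
// The result varies by engine and by the ICU data it ships with.
const current = Intl.Segmenter.supportedLocalesOf(requested);

console.log("proposed:", proposed);
console.log("current: ", current);
```

Under this reading, the current result is always a subset of the proposed one; the open question in the thread is whether the difference between the two carries useful information.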

anba commented 4 months ago

Implementations return all locales supported by ICU4C, which seems like a reasonable thing to do, because there's at least some guarantee that segmentation works for these locales. Returning everything could give the false impression that any locale works here, including locales like Klingon (tlh), Egyptian (egy), Akkadian (akk), etc.

eemeli commented 4 months ago

One option would be to return an explicit `und` for locales that are supported, but for which no additional data is needed.
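One possible reading of this suggestion, sketched below, is that locale resolution would report `und` whenever no locale-specific data is involved. `resolveSegmenterLocale` and `TAILORED_LANGUAGES` are hypothetical names used only for illustration; the language list is the ICU4X set quoted earlier in the thread, and a real implementation would derive it from its own data.

```javascript
// Assumption: the languages for which locale-specific segmentation data is
// loaded, taken from the ICU4X example earlier in this thread.
const TAILORED_LANGUAGES = new Set(["km", "lo", "my", "th"]);

// Hypothetical helper: returns the canonical tag when tailored data exists
// for its language, and "und" when the generic algorithm would be used.
function resolveSegmenterLocale(tag) {
  const locale = new Intl.Locale(tag); // throws RangeError on malformed tags
  return TAILORED_LANGUAGES.has(locale.language) ? locale.toString() : "und";
}

for (const tag of ["th-TH", "en-US", "tlh"]) {
  console.log(tag, "->", resolveSegmenterLocale(tag));
}
```

Note that this sketch does not distinguish between locales that the generic algorithm handles well and locales it handles poorly, which is the concern raised in the previous comment.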

sffc commented 4 months ago

Text processing utilities, including Segmenter and Collator, work based on scripts and properties more than locales. It doesn't make a whole lot of sense to ask a Segmenter or a Collator "what locales do you support", because they support all locales written in scripts that are encoded in Unicode.

It's a known issue that Segmenter favors majority languages in scripts over minority languages written in the same script (such as Cantonese (yue)). However, CLDR has data for yue in other services, and both Firefox and Safari report that yue is supported in Segmenter, even though it is not really supported that well. (Chrome does not ship with yue.)

> Intl.DateTimeFormat.supportedLocalesOf(["yue", "zh"])
Array [ "yue", "zh" ]

> Intl.Segmenter.supportedLocalesOf(["yue", "zh"])
Array [ "yue", "zh" ]

It's not entirely clear to me why each component has its own list, especially since, as @anba notes, in practice they all just return the list of locales in ICU, even if they don't make sense for a particular component. If we were designing this from scratch, I feel like better behavior would be a single Intl.supportedLocalesOf and leave it at that.

sffc commented 4 months ago

Additional context: https://github.com/unicode-org/icu4x/issues/3284

The CLDR design group agreed earlier this year that type: "grapheme" segmenters should not take a locale parameter at all; they are purely algorithmic based on Unicode properties. The other types of segmenters may use the locale hint to tailor behavior, but it is only a hint, and the fallback is always algorithmic. This is very different from other components such as DateTimeFormat, which has an actual failure mode of falling back to the system locale if it can't find data in the requested locale.
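The claim about grapheme segmentation can be checked informally with the existing API. The sketch below assumes the ECMA-402 spelling of the option, `granularity: "grapheme"`, and uses a sample string chosen here for illustration (a base letter with a combining accent, followed by a flag emoji). `graphemes` is a helper name introduced for this example.

```javascript
// Splits text into grapheme clusters using the given locale.
function graphemes(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

// "e" + combining acute accent, then the regional indicator pair for a flag.
const sample = "e\u0301\u{1F1EF}\u{1F1F5}";

const withEnglish = graphemes(sample, "en");
const withJapanese = graphemes(sample, "ja");

// Both calls should yield the same two clusters, because the split is driven
// by Unicode properties rather than by locale data.
console.log(withEnglish.length, withJapanese.length);
console.log(withEnglish.join("|") === withJapanese.join("|"));
```

This only demonstrates the behaviour for one input and two locales; it is not evidence that every implementation ignores the locale for every grapheme input.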