unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Canonicalize extensions in `icu_locid_transform` #3483

Open robertbastian opened 1 year ago

robertbastian commented 1 year ago

`LocaleCanonicalizer` is effectively only a `LanguageIdentifierCanonicalizer` at the moment, because it does not canonicalize any extensions. Unicode extensions, for example, can be canonicalized (`de-u-co-standard` -> `de`).

Canonicalizing extensions lets us avoid one lookup in fallback (i.e. the one for `de-u-co-standard`, which will always fail).
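
To make the intended transformation concrete, here is a minimal sketch that works on plain strings. It assumes a well-formed, lowercase BCP 47 tag, and `strip_default_collation` is a hypothetical helper, not an existing `icu_locid_transform` API. It only handles the single `co-standard` case mentioned above; a real implementation would presumably be data-driven.

```rust
/// Hypothetical helper (not part of ICU4X): removes the `co-standard`
/// keyword from the Unicode (`-u-`) extension of a BCP 47 tag.
/// Assumes the input is well-formed and lowercase.
fn strip_default_collation(tag: &str) -> String {
    let subtags: Vec<&str> = tag.split('-').collect();
    let mut out: Vec<&str> = Vec::new();
    let mut i = 0;
    while i < subtags.len() {
        let s = subtags[i];
        if s == "x" {
            // Private use: everything after `-x-` is copied verbatim.
            out.extend_from_slice(&subtags[i..]);
            break;
        }
        if s != "u" {
            out.push(s);
            i += 1;
            continue;
        }
        // The Unicode extension runs until the next singleton (1-char subtag).
        let start = i + 1;
        let mut end = start;
        while end < subtags.len() && subtags[end].len() > 1 {
            end += 1;
        }
        let ext = &subtags[start..end];
        let mut kept: Vec<&str> = Vec::new();
        let mut j = 0;
        while j < ext.len() {
            // A group is either leading attributes or a 2-char key plus its
            // type subtags; it ends right before the next 2-char key.
            let mut k = j + 1;
            while k < ext.len() && ext[k].len() != 2 {
                k += 1;
            }
            let group = &ext[j..k];
            let is_default =
                group.len() == 2 && group[0] == "co" && group[1] == "standard";
            if !is_default {
                kept.extend_from_slice(group);
            }
            j = k;
        }
        // Drop the `u` singleton entirely if no keywords or attributes remain.
        if !kept.is_empty() {
            out.push("u");
            out.extend(kept);
        }
        i = end;
    }
    out.join("-")
}
```

With this sketch, `de-u-co-standard` becomes `de`, while `de-u-co-phonebk` is left untouched, so fallback would never need to look up the `co-standard` form.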

sffc commented 1 year ago

Related: there is a comment that doesn't seem to have a tracking issue in the LocaleCanonicalizer:

    /// Some BCP47 canonicalization data is not part of the CLDR json package. Because
    /// of this, some canonicalizations are not performed, e.g. the canonicalization of
    /// `und-u-ca-islamicc` to `und-u-ca-islamic-civil`. This will be fixed in a future
    /// release once the missing data has been added to the CLDR json data.

One key use case here is mapping from deprecated variants to Unicode extensions, like `de-PHONEBOOK` to `de-u-co-phonebk`.

However, I'm not sure whether `de-u-co-standard` to `de` counts as "canonicalization". It could perhaps be seen as analogous to minimizing likely subtags.

sffc commented 1 year ago

Conclusion: Add a new function.

Good first bug; @zbraniecki happy to mentor.

LGTM: @zbraniecki @sffc @robertbastian

sffc commented 5 months ago

Minimizing extensions is closely related to the fallback work, #3867. Re-triaging the issue accordingly.