Canonicalize extensions in `icu_locid_transform`

unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.

https://icu4x.unicode.org

Canonicalize extensions in `icu_locid_transform` #3483

Open robertbastian opened 1 year ago

robertbastian commented 1 year ago

`LocaleCanonicalizer` is really only a `LanguageIdentifierCanonicalizer` at the moment, as it does not canonicalize any extensions. However, Unicode extensions, for example, can be canonicalized (`de-u-co-standard` -> `de`).

Canonicalizing extensions lets us avoid one lookup in fallback (i.e. the one for de-u-co-standard which will always fail).
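To make the idea concrete, here is a minimal sketch in plain Rust that strips Unicode extension keywords whose value equals an assumed default. It does not use the `icu_locid_transform` API: `strip_default_keywords` and the `DEFAULTS` table are hypothetical, only the `co` -> `standard` entry comes from the example above, and the parser assumes a single trailing `-u-` extension with no attributes.

```rust
/// Hypothetical default values for Unicode extension keywords.
/// Only `co` -> `standard` comes from the example in this issue;
/// real defaults would have to come from CLDR data.
const DEFAULTS: &[(&str, &str)] = &[("co", "standard")];

/// Removes `-u-` keywords whose value equals the assumed default.
/// Simplification: expects one trailing `-u-` extension; inputs with
/// attributes or other singletons are returned unchanged.
pub fn strip_default_keywords(locale: &str) -> String {
    let Some((base, ext)) = locale.split_once("-u-") else {
        return locale.to_string();
    };

    // Group subtags into (key, value subtags). Keys are two characters long.
    let mut keywords: Vec<(&str, Vec<&str>)> = Vec::new();
    for subtag in ext.split('-') {
        if subtag.len() == 2 {
            keywords.push((subtag, Vec::new()));
        } else if subtag.len() == 1 {
            // Another extension singleton: out of scope for this sketch.
            return locale.to_string();
        } else if let Some(last) = keywords.last_mut() {
            last.1.push(subtag);
        } else {
            // Attribute before the first key: out of scope for this sketch.
            return locale.to_string();
        }
    }

    // Drop every keyword that only restates the default.
    keywords.retain(|(key, value)| {
        let joined = value.join("-");
        !DEFAULTS.iter().any(|(k, v)| k == key && *v == joined)
    });

    if keywords.is_empty() {
        return base.to_string();
    }

    // Reassemble the remaining keywords in their original order.
    let mut out = format!("{base}-u");
    for (key, value) in keywords {
        out.push('-');
        out.push_str(key);
        for v in value {
            out.push('-');
            out.push_str(v);
        }
    }
    out
}
```

With this sketch, `de-u-co-standard` becomes `de`, while `de-u-co-phonebk` is left untouched, so fallback never has to look up the redundant form.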

Discuss with:

@sffc
@zbraniecki

Optional:

@robertbastian

sffc commented 1 year ago

Related: there is a comment in the `LocaleCanonicalizer` docs that doesn't seem to have a tracking issue:

    /// Some BCP47 canonicalization data is not part of the CLDR json package. Because
    /// of this, some canonicalizations are not performed, e.g. the canonicalization of
    /// `und-u-ca-islamicc` to `und-u-ca-islamic-civil`. This will be fixed in a future
    /// release once the missing data has been added to the CLDR json data.

One key use case here is mapping from deprecated variants to Unicode extensions, like `de-PHONEBOOK` to `de-u-co-phonebk`.

However, I'm not sure if de-u-co-standard to de is "canonicalization". It could be seen as minimizing likely subtags, perhaps.
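For comparison, the alias-style cases mentioned in this comment (`de-PHONEBOOK` -> `de-u-co-phonebk` and `und-u-ca-islamicc` -> `und-u-ca-islamic-civil`) could be sketched roughly as follows. This is plain Rust, not the `icu_locid_transform` API: `canonicalize_aliases` and both tables are hypothetical and contain only the two examples from this thread. The sketch assumes at most one trailing `-u-` extension without attributes, and it lets an explicit keyword win over one derived from a deprecated variant, which is a choice made for illustration only.

```rust
/// Hypothetical table: deprecated variant -> (key, value).
const VARIANT_ALIASES: &[(&str, &str, &str)] = &[("phonebook", "co", "phonebk")];
/// Hypothetical table: (key, deprecated value) -> replacement value.
const VALUE_ALIASES: &[(&str, &str, &str)] = &[("ca", "islamicc", "islamic-civil")];

fn variant_alias(subtag: &str) -> Option<(&'static str, &'static str)> {
    VARIANT_ALIASES
        .iter()
        .find(|(old, _, _)| old.eq_ignore_ascii_case(subtag))
        .map(|(_, key, value)| (*key, *value))
}

fn value_alias(key: &str, value: &str) -> Option<&'static str> {
    VALUE_ALIASES
        .iter()
        .find(|(k, v, _)| *k == key && *v == value)
        .map(|(_, _, new)| *new)
}

/// Rewrites deprecated variants and deprecated keyword values.
/// Simplification: one trailing `-u-` extension, no attributes.
pub fn canonicalize_aliases(locale: &str) -> String {
    let (base, ext) = match locale.split_once("-u-") {
        Some((b, e)) => (b, Some(e)),
        None => (locale, None),
    };

    // Parse existing keywords into (key, value) pairs.
    let mut keywords: Vec<(String, String)> = Vec::new();
    if let Some(ext) = ext {
        for subtag in ext.split('-') {
            if subtag.len() == 2 {
                keywords.push((subtag.to_string(), String::new()));
            } else if let Some((_, value)) = keywords.last_mut() {
                if !value.is_empty() {
                    value.push('-');
                }
                value.push_str(subtag);
            }
        }
    }

    // Replace deprecated keyword values.
    for (key, value) in keywords.iter_mut() {
        if let Some(new) = value_alias(key.as_str(), value.as_str()) {
            *value = new.to_string();
        }
    }

    // Move deprecated variants from the base into keywords.
    let mut kept: Vec<&str> = Vec::new();
    for subtag in base.split('-') {
        match variant_alias(subtag) {
            Some((key, value)) => {
                // An explicit keyword with the same key takes precedence.
                if !keywords.iter().any(|(k, _)| k.as_str() == key) {
                    keywords.push((key.to_string(), value.to_string()));
                }
            }
            None => kept.push(subtag),
        }
    }

    // Emit keywords sorted by key.
    keywords.sort_by(|a, b| a.0.cmp(&b.0));
    let mut out = kept.join("-");
    if !keywords.is_empty() {
        out.push_str("-u");
        for (key, value) in &keywords {
            out.push('-');
            out.push_str(key);
            if !value.is_empty() {
                out.push('-');
                out.push_str(value);
            }
        }
    }
    out
}
```

Unlike dropping `-u-co-standard`, these rewrites only replace a deprecated spelling with its current equivalent and do not depend on any default values.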

sffc commented 1 year ago

@zbraniecki - The operation seems fairly straightforward.
@sffc - The Intl Locale Info proposal will require that we add this data (default keyword values). The default data can be a list in descending order of preference.
@zbraniecki - While trimming default values seems like a stable operation, it may make an explicit static value a variable; for example, if I have en-US-u-hc-h12, that canonicalizes to en-US, but if 2 versions later we change the -u-hc default, we've lost that information.
@robertbastian - It's more like removing likely subtags.
@zbraniecki - This thread is about canonicalize, not minimize.
@robertbastian - The intent was to make this operation available somewhere; if not canonicalize, maybe minimize.
@sffc - Can we introduce this as a new operation? Given that minimize/maximize is very much focused on LSRV.
@zbraniecki - It could be an option on the existing function.
@robertbastian - Data slicing works better with a new operation.
@zbraniecki - Do we care about data slicing here?
@sffc - Yes

Conclusion: Add a new function.
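The stability concern from the discussion can be illustrated with a small sketch in plain Rust. Everything here is hypothetical: the `Defaults` struct, `minimize`, `resolve_hc`, and in particular the idea that a later data version changes the `hc` default to `h23`. The sketch only handles `hc` as the sole trailing keyword.

```rust
/// Hypothetical snapshot of default keyword values for one data version.
pub struct Defaults {
    pub hc: &'static str,
}

/// Strips a trailing `-u-hc-<default>` if it matches the given defaults.
pub fn minimize(locale: &str, defaults: &Defaults) -> String {
    let suffix = format!("-u-hc-{}", defaults.hc);
    match locale.strip_suffix(suffix.as_str()) {
        Some(base) => base.to_string(),
        None => locale.to_string(),
    }
}

/// Returns the explicit `hc` value if present, else the version's default.
pub fn resolve_hc(locale: &str, defaults: &Defaults) -> String {
    match locale.split_once("-u-hc-") {
        Some((_, value)) => value.to_string(),
        None => defaults.hc.to_string(),
    }
}
```

If a locale is minimized against `Defaults { hc: "h12" }` and stored as `en-US`, then resolving it later against a hypothetical `Defaults { hc: "h23" }` yields `h23`, whereas the unminimized `en-US-u-hc-h12` would still resolve to `h12`. That is the information loss described above, and one reason to keep this as a separate, opt-in operation.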

Good first bug; @zbraniecki happy to mentor.

LGTM: @zbraniecki @sffc @robertbastian

sffc commented 5 months ago

Minimizing extensions is closely related to the fallback work, #3867. Re-triaging the issue accordingly.