unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Replace metazones with a more compact identifier #528

Closed sffc closed 2 years ago

sffc commented 3 years ago

The metazone identifier, like "Mexico_Pacific", is internal to CLDR. Using that string as the map key in the data provider for symbol lookup is not ideal, in large part because there's no reason to require a heap allocation for what should be an internal identifier string.

I would prefer to use something like the 5-character BCP47 time zone name. At first, I suggested mapping from metazone to golden zone using metaZones.json, and then from golden zone to BCP47 zone using bcp47/timezone.xml, such that "Mexico_Pacific" maps to America/Mazatlan, which maps to "mxmzt". However, @macchiati pointed out that the mapping from metazone to golden zone is not stable. For example, the golden zone for Indochina was changed from Asia/Saigon to Asia/Bangkok.

Instead, we should introduce new short names or integers to identify metazones in locale data. I would like it if we could add these short names or integers directly to CLDR.

sffc commented 3 years ago

@yumaoka says:

We define golden zones for mapping a zone display name to a time zone.

For example, when we parse the date/time string "March 5, 2021 12:05:56 PM Eastern Standard Time", we need to map "Eastern Standard Time" to a time zone object in ICU. The display name "Eastern Standard Time" is "currently" shared by multiple time zones including "America/New_York", "America/Detroit", "America/Toronto" and others. CLDR defines "America/New_York" as the golden zone for the metazone "America_Eastern" ("Eastern Standard Time" is this metazone's long standard time name in the English locale), and "America/Toronto" as the regional golden zone for Canada.

When you parse "March 5, 2021 12:05:56 PM Eastern Standard Time", the time zone object created in the resulting calendar object depends on the locale of the date format object:

  • en: America/New_York
  • en_US: America/New_York
  • en_CA: America/Toronto
  • en_GB: America/New_York
  • en_JM: America/Jamaica
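
The locale-dependent resolution above can be sketched as a lookup with a regional override and a fallback to the metazone's default golden zone. This is only an illustration: the `GoldenZone` struct, the `GOLDEN_ZONES` table, and the `golden_zone` function are hypothetical names, not ICU or ICU4X APIs, and the table holds just the entries mentioned in this comment (the JM row is inferred from the parse results listed above).

```rust
/// Hypothetical table row: a metazone, an optional region override,
/// and the golden zone chosen for that combination.
struct GoldenZone {
    metazone: &'static str,
    region: Option<&'static str>,
    zone: &'static str,
}

// Only the entries mentioned in this thread; the real data lives in CLDR.
const GOLDEN_ZONES: &[GoldenZone] = &[
    GoldenZone { metazone: "America_Eastern", region: None, zone: "America/New_York" },
    GoldenZone { metazone: "America_Eastern", region: Some("CA"), zone: "America/Toronto" },
    // Assumption: inferred from the en_JM result listed above.
    GoldenZone { metazone: "America_Eastern", region: Some("JM"), zone: "America/Jamaica" },
];

/// Returns the regional golden zone if one exists for `region`,
/// otherwise the metazone's default golden zone.
fn golden_zone(metazone: &str, region: Option<&str>) -> Option<&'static str> {
    let mut fallback = None;
    for row in GOLDEN_ZONES.iter().filter(|g| g.metazone == metazone) {
        match (row.region, region) {
            // A region-specific entry wins immediately.
            (Some(r), Some(wanted)) if r == wanted => return Some(row.zone),
            // Remember the region-less default in case no override matches.
            (None, _) => fallback = Some(row.zone),
            _ => {}
        }
    }
    fallback
}
```

With this table, a region of `CA` resolves to America/Toronto, while `US`, `GB`, or no region at all fall back to America/New_York.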

I think there are some design problems in time zone parsing, and maintaining the golden zone data is sometimes very complicated. With the current architecture, I think more zones will be disqualified as golden zones.

If ICU4X does not support time zone parsing from display names, then you don't need to carry golden zone data. This data is not used for formatting. I think ICU's time zone parsing implementation is not a good fit for ICU4X: it creates a trie including all possible time zone display names at runtime. I'd suggest that ICU4X support these display names only for formatting, not for parsing.

sffc commented 3 years ago

Tentative agreement: time zone input should have three fields:

We will discuss this further on 2021-03-19

sffc commented 3 years ago

Blocked on #561

dminor commented 3 years ago

@nordzilla We plan to release 0.3 in about 3 weeks. Do you expect this to be unblocked in time for the 0.3 release, or should we punt to 0.4?

sffc commented 3 years ago

The data is available:

https://github.com/unicode-org/cldr-json/blob/wip-alpha3/cldr-json/cldr-core/supplemental/metaZones.json

sffc commented 2 years ago

Updated link: https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-core/supplemental/metaZones.json

sffc commented 2 years ago

@samchen61661 The next step here is to change the maps to use TinyAsciiStr instead of str. We can discuss today.
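
As a rough illustration of why a fixed-width ASCII key avoids the heap allocation mentioned at the top of this issue, here is a minimal sketch. `ShortId` and `example_map` are hypothetical stand-ins written with the standard library only; they are not the `TinyAsciiStr` API. The key "mxmzt" is the BCP47 zone ID quoted earlier in the thread, used purely as a sample value.

```rust
use std::collections::BTreeMap;

/// Hypothetical fixed-width ASCII identifier: up to 8 bytes, zero-padded,
/// stored inline so it is `Copy` and never touches the heap.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct ShortId([u8; 8]);

impl ShortId {
    /// Accepts 1 to 8 ASCII bytes with no NUL; anything else is rejected.
    fn new(s: &str) -> Option<Self> {
        let bytes = s.as_bytes();
        if bytes.is_empty() || bytes.len() > 8 || !s.is_ascii() || bytes.contains(&0) {
            return None;
        }
        let mut buf = [0u8; 8];
        buf[..bytes.len()].copy_from_slice(bytes);
        Some(ShortId(buf))
    }

    /// Returns the identifier without its zero padding.
    fn as_str(&self) -> &str {
        let len = self.0.iter().position(|&b| b == 0).unwrap_or(8);
        // The bytes were validated as ASCII in `new`, so this cannot fail.
        std::str::from_utf8(&self.0[..len]).unwrap_or("")
    }
}

/// Builds a sample map keyed by the compact identifier instead of a string.
fn example_map() -> BTreeMap<ShortId, &'static str> {
    let mut map = BTreeMap::new();
    // Sample key only: "mxmzt" is the 5-character ID quoted in this thread.
    if let Some(key) = ShortId::new("mxmzt") {
        map.insert(key, "Mexico_Pacific");
    }
    map
}
```

Because the key is a plain 8-byte value, comparing and copying it is cheap, and a long name such as "Mexico_Pacific" simply does not fit and is rejected by `ShortId::new`.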