unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Time zone variant calculator: does it let us fully handle zoned datetime formatting? #5466

Open sffc opened 3 months ago

sffc commented 3 months ago

At its core, ICU4X time zones have 4 fields, which fully determine the strings to be selected for formatting:

pub struct CustomTimeZone {
    pub gmt_offset: Option<GmtOffset>,
    pub time_zone_id: Option<TimeZoneBcp47Id>,
    pub metazone_id: Option<MetazoneId>,
    pub zone_variant: Option<ZoneVariant>,
}

Let's say someone gives us an IXDTF string like: 2024-08-29T11:53:18-0700[America/Los_Angeles]

From this string, we can already populate two fields:

We have MetazoneCalculator, which takes the time portion of the string and lets us calculate the metazone field:

However, how do we calculate the ZoneVariant field?

I learned today that tzif files, at least version 2 and 3 files, contain a footer that looks like this:

$ tail -n1 /usr/share/zoneinfo/America/Los_Angeles 
PST8PDT,M3.2.0,M11.1.0

The "8" in that footer means that this time zone has a standard offset of 8 hours behind UTC. (note that the offset is negated from what we normally see)

Does this mean that we could build a table with standard offsets and use that table to generate zone variants? For example, we could create a data file with the following data, which can all be generated from the TZDB:

Time Zone ID          Standard Offset
America/Los_Angeles   -8
America/Chicago       -6
Asia/Kabul            +4:30
Asia/Manila           +8
...                   ...

Then, when reading the IXDTF string, we use the following algorithm to select the zone variant:

  1. Look up the Standard Offset from the IXDTF string's Time Zone Identifier.
  2. If the Standard Offset matches the IXDTF string's Offset: set zone_variant to Standard.
  3. Else, if the Standard Offset is 1 less than the IXDTF string's Offset: set zone_variant to Daylight.
  4. Else, leave the zone_variant undefined.
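
A minimal sketch of the four steps above, for illustration only. The `STANDARD_OFFSETS` table, the `zone_variant` function, and the use of minutes east of UTC are assumptions made for this example; they are not existing ICU4X APIs, and the sketch hard-codes the 1-hour daylight shift exactly as the steps are written.

```rust
/// The two variants discussed in this thread.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ZoneVariant {
    Standard,
    Daylight,
}

/// Hypothetical stand-in for the generated table:
/// IANA time zone ID -> standard offset in minutes east of UTC.
const STANDARD_OFFSETS: &[(&str, i32)] = &[
    ("America/Los_Angeles", -8 * 60),
    ("America/Chicago", -6 * 60),
    ("Asia/Kabul", 4 * 60 + 30),
    ("Asia/Manila", 8 * 60),
];

/// Selects the zone variant from the zone ID and offset of an IXDTF string.
pub fn zone_variant(time_zone_id: &str, offset_minutes: i32) -> Option<ZoneVariant> {
    // Step 1: look up the standard offset; unknown zones stay undefined.
    let standard = STANDARD_OFFSETS
        .iter()
        .find(|(id, _)| *id == time_zone_id)
        .map(|(_, minutes)| *minutes)?;
    if offset_minutes == standard {
        // Step 2: the offset matches the standard offset.
        Some(ZoneVariant::Standard)
    } else if offset_minutes == standard + 60 {
        // Step 3: the standard offset is 1 hour less than the given offset.
        Some(ZoneVariant::Daylight)
    } else {
        // Step 4: leave the variant undefined.
        None
    }
}
```

For the example string, `zone_variant("America/Los_Angeles", -7 * 60)` selects `Daylight`.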

Mechanically, we can generate this table by using a combination of our own tzif crate, which contains a struct ZoneVariantInfo with this information pre-parsed, and a tzif source, which could potentially be jiff_tzdb.

Note: the Time Zone ID would probably be stored in BCP-47 and Standard Offset would be bitpacked to an i8. It's possible we could stuff this data into one of our existing data structs to be more efficient.

Note: I assume that this mapping of time zone IDs to standard offsets is fairly stable over time, such that we do not need to worry about shipping updates at a cadence different than normal CLDR data updates.

Please help me understand: is the proposed algorithm correct and robust, or is it flawed in some edge cases?

@nekevss @leftmostcat @nordzilla @yumaoka @justingrant

srl295 commented 3 months ago

Else, if the Standard Offset is 1 less than the IXDTF string's Offset: set zone_variant to Daylight.

Instead of '1 less' couldn't you query the tz data to look for a transition from that data and use it? In other words, couldn't your table have both a standard offset and a daylight offset?

Time Zone ID          Standard Offset   Daylight Offset
America/Los_Angeles   8                 7

Actually, querying the offset table for that exact time 2024-08-29T11:53:18 for America/Los_Angeles should result in an offset of -0700 from GMT.

sffc commented 3 months ago

My goal is, assuming that an IXDTF string is correct (has the correct offset for the given date, time, and time zone), format that data without relying directly on the TZDB at runtime.

I can store both the standard offset and daylight offset for each time zone. I guess my questions then would be:

  1. Does each IANA zone have a stable mapping of what offset is "standard" and which offset is "daylight"?
  2. Is the daylight offset ever not 1 hour more than the standard offset?

srl295 commented 3 months ago

@sffc

  1. yes. In tzdb it's the SAVE column
  2. in modern zones I'm not sure, but it's not a good reason to hard code it.

sffc commented 3 months ago

Actually I guess the counter example is when a city switches from one metazone to another metazone, not just changing its transition dates, such as what happened last year in Chihuahua, Mexico, which switched from Mountain Time to Central Time

https://www.timeanddate.com/time/zone/mexico/chihuahua

So maybe this mapping needs to be from metazones, not time zones, to what their standard and daylight offsets are?

srl295 commented 3 months ago

Actually I guess the counter example is when a city switches from one metazone to another metazone, not just changing its transition dates, such as what happened last year in Chihuahua, Mexico, which switched from Mountain Time to Central Time

https://www.timeanddate.com/time/zone/mexico/chihuahua

So maybe this mapping needs to be from metazones, not time zones, to what their standard and daylight offsets are?

a metazone's offsets are valid for that zone for a certain time period. So the Mexico_Pacific and America_Central offsets will be different.

https://github.com/eggert/tz/blob/main/northamerica#L2731-L2732

            <timezone type="America/Chihuahua">
                <usesMetazone to="1998-04-05 09:00" mzone="America_Central"/>
                <usesMetazone to="2022-10-30 08:00" from="1998-04-05 09:00" mzone="Mexico_Pacific"/>
                <usesMetazone from="2022-10-30 08:00" mzone="America_Central"/>
            </timezone>

sffc commented 3 months ago

Does a particular metazone always have the same offsets corresponding to its standard and daylight variants?

sffc commented 3 months ago

It seems that ICU4C determines the zone variant by reading "is the current datetime DST or not" from the TZDB.

That bit appears fetchable from tzif, and it is in the tzif crate:

https://unicode-org.github.io/icu4x/rustdoc/tzif/data/tzif/struct.LocalTimeTypeRecord.html

I think my previous question though is still a valid question to ask. Does a particular metazone always have the same offsets corresponding to its standard and daylight variants? That could perhaps be data that could be added to CLDR.

Also, regarding whether the DST shift should be fixed at 1 hour: it seems that the ICU4C code currently assumes this in multiple places, such as https://github.com/unicode-org/icu/blob/eda184e6af63d6eee1b3a59c61d1695eef44fcb4/icu4c/source/i18n/timezone.cpp#L1241

BurntSushi commented 3 months ago

Also, regarding whether the DST shift should be fixed at 1 hour: it seems that the ICU4C code currently assumes this in multiple places

My favorite counter-example to this is Antarctica/Troll, which uses a DST shift of 2 hours:

$ tail -n1 /usr/share/zoneinfo/Antarctica/Troll
<+00>0<+02>-2,M3.5.0/1,M10.5.0/3

And then there is also the case of Ireland, whose DST shift is inverted from what's typical:

$ tail -n1 /usr/share/zoneinfo/Europe/Dublin
IST-1GMT0,M10.5.0,M3.5.0/1

As you noted, TZ strings invert the sign. So Europe/Dublin uses +0100 for standard time and +0000 for DST.
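
To make the sign inversion concrete, here is a rough sketch that pulls the standard and daylight offsets out of the footers shown above. `PosixOffsets`, `skip_name`, `parse_offset`, and `parse_posix_tz` are names invented for this example (this is not the tzif crate's API), the transition rules after the first comma are ignored, and offsets are returned in seconds east of UTC. When the daylight offset is omitted, the POSIX default of one hour ahead of standard time is applied.

```rust
/// Offsets in seconds east of UTC (the opposite sign from POSIX TZ strings).
#[derive(Debug, PartialEq)]
pub struct PosixOffsets {
    pub standard: i32,
    pub daylight: Option<i32>,
}

/// Skips an abbreviation: either `<...>` quoted or a run of ASCII letters.
fn skip_name(s: &str) -> Option<&str> {
    if let Some(rest) = s.strip_prefix('<') {
        let end = rest.find('>')?;
        Some(&rest[end + 1..])
    } else {
        let end = s.find(|c: char| !c.is_ascii_alphabetic()).unwrap_or(s.len());
        if end == 0 { None } else { Some(&s[end..]) }
    }
}

/// Parses `[+-]hh[:mm[:ss]]`; returns (seconds east of UTC, remaining input).
fn parse_offset(s: &str) -> Option<(i32, &str)> {
    let (sign, body) = match *s.as_bytes().first()? {
        b'-' => (1, &s[1..]),  // POSIX "-" means east of UTC
        b'+' => (-1, &s[1..]), // POSIX "+" means west of UTC
        _ => (-1, s),          // an unsigned POSIX offset is west of UTC
    };
    let end = body
        .find(|c: char| !(c.is_ascii_digit() || c == ':'))
        .unwrap_or(body.len());
    let mut seconds = 0;
    for (field, unit) in body[..end].split(':').zip([3600, 60, 1]) {
        seconds += field.parse::<i32>().ok()? * unit;
    }
    Some((sign * seconds, &body[end..]))
}

/// Extracts the offsets from a POSIX TZ string such as a TZif footer.
pub fn parse_posix_tz(tz: &str) -> Option<PosixOffsets> {
    // Everything after the first comma is the transition rule; ignore it here.
    let rule_start = tz.find(',').unwrap_or(tz.len());
    let s = skip_name(&tz[..rule_start])?;
    let (standard, s) = parse_offset(s)?;
    if s.is_empty() {
        return Some(PosixOffsets { standard, daylight: None });
    }
    let s = skip_name(s)?;
    let daylight = if s.is_empty() {
        standard + 3600 // POSIX default: one hour ahead of standard time
    } else {
        parse_offset(s)?.0
    };
    Some(PosixOffsets { standard, daylight: Some(daylight) })
}
```

With this sketch, the Los Angeles footer yields a daylight offset one hour ahead of standard, the Troll footer yields a two-hour shift, and the Dublin footer yields a daylight offset one hour behind standard.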

nekevss commented 3 months ago

FWIW, here's a markdown table of the output of find -L /usr/share/zoneinfo/ -maxdepth 3 -type f,l | xargs tail -n1. Although, I think it does pull in some noise from /usr/share/zoneinfo/right/.

nekevss commented 3 months ago

The sign convention in the POSIX tz string has already been noted, but I just found the quote below in the TZ Variable section of the GNU C Library manual.

This is positive if the local time zone is west of the Prime Meridian and negative if it is east. The hour must be between 0 and 24, and the minute and seconds between 0 and 59.

robertbastian commented 2 months ago

I think the question "how to set the ZoneVariant" is an XY problem. For formatting, we need a way to look up a time zone name given an offset (this is the only use case for ZoneVariant). The straightforward solution to this would be to instead of

"ampa": {
   "dt": "Pacific Daylight Time",
   "st": "Pacific Standard Time"
}

store

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-8:00": "Pacific Standard Time"
}

This doesn't require any additional lookup at runtime, as we already have the offset, and naturally handles any kind of DST (even multiple).
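
A rough sketch of what the offset-keyed lookup could look like. The `SPECIFIC_NAMES` table and `specific_name` function are hypothetical names for this example, and offsets are stored as minutes east of UTC rather than as strings.

```rust
/// Hypothetical display-name data: (metazone ID, UTC offset in minutes, name).
const SPECIFIC_NAMES: &[(&str, i16, &str)] = &[
    ("ampa", -7 * 60, "Pacific Daylight Time"),
    ("ampa", -8 * 60, "Pacific Standard Time"),
];

/// Looks up the specific name directly from the metazone and the offset,
/// without going through a separate zone-variant field.
pub fn specific_name(metazone: &str, offset_minutes: i16) -> Option<&'static str> {
    SPECIFIC_NAMES
        .iter()
        .find(|(mz, offset, _)| *mz == metazone && *offset == offset_minutes)
        .map(|(_, _, name)| *name)
}
```

An offset with no entry for the metazone returns `None`, at which point a formatter would presumably fall back to another time zone style.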

nordzilla commented 2 months ago

From @robertbastian

I think the question "how to set the ZoneVariant" is an XY problem. For formatting, we need a way to look up a time zone name given an offset (this is the only use case for ZoneVariant). The straightforward solution to this would be to instead of

"ampa": {
   "dt": "Pacific Daylight Time",
   "st": "Pacific Standard Time"
}

store

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-8:00": "Pacific Standard Time"
}

This doesn't require any additional lookup at runtime, as we already have the offset, and naturally handles any kind of DST (even multiple).


I agree; I think data in this format would be ideal.

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-8:00": "Pacific Standard Time"
}

This data could be added to supplemental/metaZones.xml in CLDR.

However, there are a few things to consider:


1) Has a metazone ever changed its associated time variants?

If not, the data is straightforward, exactly as shown above.

If so, this data could still reasonably be captured and added to the file.

Consider a hypothetical situation where America_Central (amce) decided to move its standard-time offset for all of its associated time zones by half an hour for one year, and then changed it back to the way it was before:

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-8:00": "Pacific Standard Time"
},
"amce": {
  "usesTimeVariants": {
    "-5:00": "Central Daylight Time",
    "-6:00": "Central Standard Time",
    "_to": "2024-09-06 00:00"
  },
  "usesTimeVariants": {
    "-5:00": "Central Daylight Time",
    "-5:30": "Central Standard Time",
    "_from": "2024-09-06 00:00",
    "_to": "2025-09-06 00:00"
  },
  "usesTimeVariants": {
    "-5:00": "Central Daylight Time",
    "-6:00": "Central Standard Time",
    "_from": "2025-09-06 00:00"
  },
},

This format seems reasonable and has the same structure as the mapping from time zone IDs to metazones in the same file.


2) What would happen if a time zone within an associated metazone observes the same time-variant offsets, but transitions among them at different datetimes than other zones within that metazone?

One relevant example of this is the recent proposal for some of the West Coast states to observe permanent Daylight Saving Time:

https://www.opb.org/article/2024/02/20/oregon-bill-to-end-daylight-saving-time-fails-legislature/

If this were the case, then the offset would remain UTC-7 year round, and those time zones, e.g. America/Los_Angeles would just format to Pacific Daylight Time year round.

This all seems okay to me.


3) What would happen if an individual time zone wants to use different offsets than the current time-variant offsets established by the metazone?

I am not aware of any such case, but I think there are two reasonable solutions:

A) That time zone could switch to a different metazone (either new or preexisting) that matches its desired offsets. This happens all the time.

B) We could add that offset data to CLDR.

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-7:30": "Pacific Cool New Time",
   "-8:00": "Pacific Standard Time"
},

The time zones that use the prior offsets would go on as usual, and the time zone with the new offset would have its new localized name.

I recall a conversation with @sffc years ago that perhaps daylight_time and standard_time are not great identifiers within the icu4x code base, because sometimes it's formatted as "Summer Time" for example, and in the future it may be possible that there are more than 2 variants.

A format such as this would allow us to be agnostic of naming conventions, instead tying the internationalized name of the variant to an offset.

However, there are a few more considerations to take into account in this case:

3.1) What if a time zone wants to add a new offset, but have the same localized name as another offset?

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-7:30": "Pacific Standard Time",
   "-8:00": "Pacific Standard Time"
},

This probably wouldn't cause a data ambiguity issue, but I think it would be incredibly confusing, as "Pacific Standard Time" would now be semantically ambiguous.

This should not be allowed.

3.2) What if a metazone wants to add a new localized name for an offset that is already present?

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-7:00": "Pacific Cool New Time",
   "-8:00": "Pacific Standard Time"
},

This would cause a data issue and should not be allowed.


Conclusion

I don't feel that I have the cycles to take on this work myself right now, but I would support collaborating on making this data available (if people agree it is sound).

Here is an example of when the short metazone identifiers were added to that same CLDR file: https://unicode-org.atlassian.net/browse/CLDR-14607

Filing an issue on Jira would be a good next step if we reach a consensus here.

robertbastian commented 2 months ago

All questions of the form "what if a timezone wants to do something different than the rest of the metazone" should be answered by creating a new metazone. My expectation is that all zones in a metazone fully agree on offsets today and in the future, but maybe that's not guaranteed.

nordzilla commented 2 months ago

All questions of the form "what if a timezone wants to do something different than the rest of the metazone" should be answered by creating a new metazone. My expectation is that all zones in a metazone fully agree on offsets today and in the future, but maybe that's not guaranteed.

That would be much simpler and more stringent. I would agree with imposing these restrictions. I was just trying to think of all the cases.

sffc commented 2 months ago

My favorite counter-example to this is Antarctica/Troll, which uses a DST shift of 2 hours:

Another counter-example to the 60-minute transition: https://www.atlasobscura.com/places/lord-howe-islands-time

sffc commented 2 months ago

I agree with the workaround of creating a new metazone if the offset invariants ever break down. Metazones are purely a CLDR/ICU construction, not TZDB, so we have a lot of latitude for how we handle them.

For example, if all US West Coast states decided to abolish daylight saving time and that Pacific Time should be GMT-7 instead of GMT-8 (a proposal I don't support but which is good for illustrative purposes), then we would need to create a new metazone such as amp2 meaning "version 2 of ampa".

It is highly likely that such changes already occurred in the last 50 years, and we should probably look for them in datagen.

sffc commented 2 months ago

As far as data sources are concerned, it seems perfectly fine to me for this data to be derived from TZDB. Currently ICU4C uses TZDB to determine which zone variant to use when formatting, so if ICU4X used TZDB during datagen, then we should be able to guarantee consistency with ICU4C. ICU4X could manually spawn new "private use" metazones as needed.

sffc commented 2 months ago

OK, one other issue I realized. There are numerous countries that use their own country name as the metazone. The first one I pulled is "kyrg", Kyrgyzstan:

https://en.wikipedia.org/wiki/Kyrgyzstan_Time

Kyrgyzstan has switched between UTC+5 and UTC+6 multiple times, but presumably the metazone has not changed.

justingrant commented 2 months ago

https://en.wikipedia.org/wiki/Kyrgyzstan_Time

Kyrgyzstan has switched between UTC+5 and UTC+6 multiple times, but presumably the metazone has not changed.

Yeah, this was gonna be my concern: cases where oddball metazones are tidally locked to a country. I assume this fact means that the "use the offset only" idea won't work?

sffc commented 2 months ago

Yeah, this was gonna be my concern: cases where oddball metazones are tidally locked to a country. I assume this fact means that the "use the offset only" idea won't work?

I think it can still "work"; it's just something we need to factor in. A few ways of resolving this:

  1. Should Kyrgyzstan even have a specific (offset-based) time zone name, since it doesn't have a useful meaning? It is a generic (location-based) time zone name, not a specific time zone name. We could just remove it and fall back to the generic time zone name.
  2. If we need to have a specific time zone name, we could just add both UTC+5 and UTC+6 as offsets with the same name.
  3. Or, we could split it into two metazones.

sffc commented 2 months ago

One other note: I very frequently encounter people using "PST" to mean Pacific Time, not specifically Pacific Standard Time, and similarly with EST and CST and others. For example, it is very common to see people say "let's meet in San Francisco on September 7 at 10am PST", and if you show up at that time according to the TZDB/CLDR definition, unless it is a time zone nerds meetup, you will be an hour late.

What this means: this is all so imprecise anyway, so let's just land something reasonable and otherwise encourage people to use city-based time zone names. Maybe CLDR can focus on adding a short location format, such as "LA Time" or "NYC Time" to use instead of the ambiguous things it currently uses.

justingrant commented 2 months ago

Maybe CLDR can focus on adding a short location format, such as "LA Time" or "NYC Time" to use instead of the ambiguous things it currently uses.

Normal people (other than those who are super-familiar with how IANA timezones work, which is a very small Venn diagram overlap with "normal people") don't use "LA Time" or "NYC time". So I'm not sure it'd make sense to add that to CLDR. I understand the desire for consistency, but this seems to be a case where there's no evading the inconsistency of human language use.

sffc commented 2 months ago

My hypothesis is that "normal people" would understand what you meant by "LA Time", even if they haven't often seen it before, and it is also the most unambiguous definition for an i18n library to produce.

yumaoka commented 2 months ago

Random comment for earlier replies.

I think the concept of ZoneVariant in the struct is problematic.

robertbastian commented 2 months ago

Random observation:

Same time                                     Formatted with generic TZ
2024-07-01T12:00:00-06:00[America/Denver]     12:00 Mountain Time
2024-07-01T11:00:00-07:00[America/Phoenix]    11:00 Mountain Time
2024-07-01T18:00:00Z                          18:00 UTC

nordzilla commented 2 months ago

From @robertbastian:

Random observation:

Same time                                     Formatted with generic TZ
2024-07-01T12:00:00-06:00[America/Denver]     12:00 Mountain Time
2024-07-01T11:00:00-07:00[America/Phoenix]    11:00 Mountain Time
2024-07-01T18:00:00Z                          18:00 UTC

These are all technically correct, though confusing. They're both Mountain Time. It's just that Denver is in Mountain Daylight Time and Phoenix is in Mountain Standard Time because Arizona does not observe DST.

I would argue that this is a reason why populating the ZoneVariant struct whenever possible is worthwhile.


EDIT:

Though, to clarify, the above "Mountain Time" formats are "Generic non-location format".

The UTS-35 spec defines several formats with fallbacking:

Generic non-location format

Examples: "Pacific Time" (long), "PT" (short)

Generic partial location format

Examples: "Pacific Time (Canada)" (long), "PT (Whitehorse)" (short)

Generic location format

Examples: "France Time", "Italy Time"

Specific non-location format

Examples: "Pacific Standard Time" (long), "PST" (short), "Pacific Daylight Time" (long), "PDT" (short)

Localized GMT format

Examples: "GMT+03:30" (long), "GMT+3:30" (short), "UTC-03.00" (long), "UTC" (for zero offset)

ISO 8601 time zone formats

Examples: "-0800" (basic), "-08:00" (extended), "Z" (for UTC)

It was years ago, so I'm not sure if the current implementations within ICU4X are exactly the same, but I tried to implement the fallbacking rules according to the spec.

The above strings have enough information available to utilize either Generic location format e.g. Phoenix Time, or Generic partial location format e.g. Mountain Time (Denver).

justingrant commented 2 months ago

Generic partial location format

Examples: "Pacific Time (Canada)" (long), "PT (Whitehorse)" (short)

FWIW, I think this is a nice solution to the problem described above: if there's a colloquial name for a time zone like "Pacific Time", it's still used, but with a disambiguator for less common cases like Arizona.

sffc commented 2 months ago

The observation about generic non-location being ambiguous is well known and largely working as intended. It should only be used if the location of the event is known from context. Here is the language I wrote for how to select your time zone style in semantic skeleta:

sffc commented 2 months ago

Example use cases where generic time zone style is acceptable:

Note: In most or all of these cases, it would be acceptable to say "local time" or simply drop the qualifier.

Example where generic time is not acceptable and a different style should be used, unless the location is otherwise known from context:

My point is that there are enough legitimate use cases for generic non-location format, but since it could introduce ambiguity, it should only be used if the developer opts in.

robertbastian commented 2 months ago

Generic partial location format

Examples: "Pacific Time (Canada)" (long), "PT (Whitehorse)" (short)

This seems to be the non-ambiguous version of the generic non-location format. We don't seem to support this in ICU4X, however?


What we need for full correctness is a ZoneVariantCalculator that maps (TimeZoneBcp47Id, DateTime<Iso>) -> (UtcOffset, Option<UtcOffset>). It would do this by storing a sequence of ISO minutes with associated offsets for each zone, similar to MetaZonePeriodsV1.

If there is sufficient overlap between the offset list and the metazone list for each location, they could be combined, as the bulk of these structures will be the keys.
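
A rough sketch of the shape such a calculator could take. Everything here is an assumption for illustration: `OffsetSeconds` stands in for `UtcOffset`, a plain `i64` minute count stands in for `DateTime<Iso>`, a string slice stands in for `TimeZoneBcp47Id`, and the returned pair is read as (standard offset, daylight offset if one is observed in that period).

```rust
/// Offset in seconds east of UTC; a stand-in for `UtcOffset`.
pub type OffsetSeconds = i32;

/// One period of a zone's history, valid from `since_minutes` onward
/// (minutes since some fixed epoch, playing the role of "ISO minutes").
pub struct OffsetPeriod {
    pub since_minutes: i64,
    pub standard: OffsetSeconds,
    pub daylight: Option<OffsetSeconds>,
}

pub struct ZoneVariantCalculator<'a> {
    /// Per-zone period lists, each sorted by `since_minutes` ascending.
    pub zones: &'a [(&'a str, &'a [OffsetPeriod])],
}

impl ZoneVariantCalculator<'_> {
    /// Returns the (standard, daylight) offsets in effect at `minutes`.
    pub fn offsets_at(
        &self,
        zone: &str,
        minutes: i64,
    ) -> Option<(OffsetSeconds, Option<OffsetSeconds>)> {
        let (_, periods) = self.zones.iter().find(|(id, _)| *id == zone)?;
        // Count the periods that start at or before `minutes` (binary search).
        let idx = periods.partition_point(|p| p.since_minutes <= minutes);
        // `idx == 0` means the time precedes the first known period.
        let period = periods.get(idx.checked_sub(1)?)?;
        Some((period.standard, period.daylight))
    }

    /// Classifies a given offset: `Some(false)` for standard, `Some(true)`
    /// for daylight, `None` if it matches neither.
    pub fn is_daylight(&self, zone: &str, minutes: i64, offset: OffsetSeconds) -> Option<bool> {
        let (standard, daylight) = self.offsets_at(zone, minutes)?;
        if offset == standard {
            Some(false)
        } else if Some(offset) == daylight {
            Some(true)
        } else {
            None
        }
    }
}
```

Because each period carries its own offsets, a zone that changes its standard offset over time (as in the Kyrgyzstan example) only needs an extra period, not a new metazone.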

robertbastian commented 2 months ago

Re generic partial location format, it sounds like we're meant to detect when a metazone is ambiguous, and add the location to it. We can do that; I've found a lot of ambiguous metazones in #5515. We can extend the return value of MetazoneCalculator with an is_ambiguous flag, in which case the formatter would add the location (or the offset if locations aren't available).
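
A small sketch of how a formatter might consume such a flag. `MetazoneInfo` and `format_generic` are hypothetical names, and the hard-coded English "name (location)" and "GMT+hh:mm" patterns are placeholders for what would really come from locale data.

```rust
/// Hypothetical result of a metazone lookup, extended with the flag.
pub struct MetazoneInfo<'a> {
    pub generic_name: &'a str,
    pub is_ambiguous: bool,
}

/// Formats a generic name, adding a disambiguator when the metazone is
/// ambiguous: the location if available, otherwise the offset.
pub fn format_generic(
    info: &MetazoneInfo,
    location: Option<&str>,
    offset_minutes: i32,
) -> String {
    if !info.is_ambiguous {
        return info.generic_name.to_string();
    }
    match location {
        Some(loc) => format!("{} ({})", info.generic_name, loc),
        None => {
            let sign = if offset_minutes < 0 { '-' } else { '+' };
            let abs = offset_minutes.abs();
            format!(
                "{} (GMT{}{:02}:{:02})",
                info.generic_name,
                sign,
                abs / 60,
                abs % 60
            )
        }
    }
}
```

Under these assumptions, an ambiguous "Mountain Time" with location "Phoenix" formats as "Mountain Time (Phoenix)", and without a location it falls back to "Mountain Time (GMT-07:00)".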

sffc commented 2 months ago

What we need for full correctness is a ZoneVariantCalculator that maps (TimeZoneBcp47Id, DateTime<Iso>) -> (UtcOffset, Option<UtcOffset>). It would do this by storing a sequence of ISO minutes with associated offsets for each zone, similar to MetaZonePeriodsV1.

If there is sufficient overlap between the offset list and the metazone list for each location, they could be combined, as the bulk of these structures will be the keys.

LGTM