Open sffc opened 3 months ago
Else, if the Standard Offset is 1 less than the IXDTF string's Offset: set zone_variant to Daylight.
Instead of '1 less' couldn't you query the tz data to look for a transition from that data and use it? In other words, couldn't your table have both a standard offset and a daylight offset?
Time Zone ID | Standard Offset | Daylight Offset |
---|---|---|
America/Los_Angeles | 8 | 7 |
Actually, querying the offset table for that exact time `2024-08-29T11:53:18` for `America/Los_Angeles` should result in an offset of 0700 from GMT.
My goal is, assuming that an IXDTF string is correct (has the correct offset for the given date, time, and time zone), format that data without relying directly on the TZDB at runtime.
I can store both the standard offset and daylight offset for each time zone. I guess my questions then would be:
@sffc
Actually, I guess the counterexample is when a city switches from one metazone to another rather than just changing its transition dates. This happened last year in Chihuahua, Mexico, which switched from Mountain Time to Central Time:
https://www.timeanddate.com/time/zone/mexico/chihuahua
So maybe this mapping needs to be from metazones, not time zones, to what their standard and daylight offsets are?
A metazone's offsets are valid for that zone for a certain time period. So the `Mexico_Pacific` and `America_Central` offsets will be different.
https://github.com/eggert/tz/blob/main/northamerica#L2731-L2732
<timezone type="America/Chihuahua">
<usesMetazone to="1998-04-05 09:00" mzone="America_Central"/>
<usesMetazone to="2022-10-30 08:00" from="1998-04-05 09:00" mzone="Mexico_Pacific"/>
<usesMetazone from="2022-10-30 08:00" mzone="America_Central"/>
</timezone>
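To make the period semantics concrete, here is a minimal sketch of resolving a metazone from `usesMetazone` entries like the ones above. The struct and function names are hypothetical (this is not the ICU4X `MetazoneCalculator` API), and it assumes the `from`/`to` timestamps are in UTC.

```rust
/// One `usesMetazone` entry: the metazone applies for `from <= t < to`.
/// `None` means the period is open-ended on that side.
struct MetazonePeriod {
    from: Option<&'static str>,
    to: Option<&'static str>,
    mzone: &'static str,
}

/// The America/Chihuahua periods copied from the CLDR snippet above.
const CHIHUAHUA: [MetazonePeriod; 3] = [
    MetazonePeriod { from: None, to: Some("1998-04-05 09:00"), mzone: "America_Central" },
    MetazonePeriod { from: Some("1998-04-05 09:00"), to: Some("2022-10-30 08:00"), mzone: "Mexico_Pacific" },
    MetazonePeriod { from: Some("2022-10-30 08:00"), to: None, mzone: "America_Central" },
];

/// Returns the metazone in effect at `utc`, a "YYYY-MM-DD HH:MM" string.
/// Fixed-width timestamps in this format sort lexicographically, so plain
/// string comparison is enough for this sketch.
fn metazone_at(periods: &[MetazonePeriod], utc: &str) -> Option<&'static str> {
    periods
        .iter()
        .find(|p| p.from.map_or(true, |f| f <= utc) && p.to.map_or(true, |t| utc < t))
        .map(|p| p.mzone)
}
```

With this data, a 2010 timestamp resolves to `Mexico_Pacific` and a 2024 timestamp resolves to `America_Central`, so any offset table keyed by metazone has to be consulted after this lookup.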
Does a particular metazone always have the same offsets corresponding to its standard and daylight variants?
It seems that ICU4C determines the zone variant by reading "is the current datetime DST or not" from the TZDB.
That bit appears fetchable from tzif, and it is in the tzif crate:
https://unicode-org.github.io/icu4x/rustdoc/tzif/data/tzif/struct.LocalTimeTypeRecord.html
I think my previous question though is still a valid question to ask. Does a particular metazone always have the same offsets corresponding to its standard and daylight variants? That could perhaps be data that could be added to CLDR.
Also, regarding whether the DST shift should be fixed at 1 hour: it seems that the ICU4C code currently assumes this in multiple places, such as https://github.com/unicode-org/icu/blob/eda184e6af63d6eee1b3a59c61d1695eef44fcb4/icu4c/source/i18n/timezone.cpp#L1241
My favorite counter-example to this is `Antarctica/Troll`, which uses a DST shift of 2 hours:
$ tail -n1 /usr/share/zoneinfo/Antarctica/Troll
<+00>0<+02>-2,M3.5.0/1,M10.5.0/3
And then there is also the case of Ireland, whose DST shift is inverted from what's typical:
$ tail -n1 /usr/share/zoneinfo/Europe/Dublin
IST-1GMT0,M10.5.0,M3.5.0/1
As you noted, TZ strings invert the sign. So `Europe/Dublin` uses `+0100` for standard time and `+0000` for DST.
FWIW, here's a markdown table of the output of `find -L /usr/share/zoneinfo/ -maxdepth 3 -type f,l | xargs tail -n1`. Although, I think it does pull in some noise from `/usr/share/zoneinfo/right/`.
The sign in the POSIX TZ string has already been noted, but I just found the quote below in the TZ Variable section of the GNU C Library manual.
This is positive if the local time zone is west of the Prime Meridian and negative if it is east. The hour must be between 0 and 24, and the minute and seconds between 0 and 59.
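The sign convention is easy to get backwards, so here is a small sketch that converts the offset field of a POSIX TZ string into the usual "east is positive" form. The function name is hypothetical, and for brevity it only handles `[+-]hh[:mm]` and ignores a seconds component if one is present.

```rust
/// Converts the offset field of a POSIX TZ string (e.g. the "-1" in
/// "IST-1GMT0" or the "-2" in "<+00>0<+02>-2") to minutes east of UTC.
fn posix_offset_to_utc_minutes(field: &str) -> Option<i32> {
    let (sign, digits) = match field.strip_prefix('-') {
        Some(rest) => (-1, rest),
        None => (1, field.strip_prefix('+').unwrap_or(field)),
    };
    let mut parts = digits.split(':');
    let hours: i32 = parts.next()?.parse().ok()?;
    let minutes: i32 = match parts.next() {
        Some(m) => m.parse().ok()?,
        None => 0,
    };
    // POSIX counts time *west* of the Prime Meridian as positive, so flip
    // the sign to get the conventional UTC offset.
    Some(-sign * (hours * 60 + minutes))
}
```

For the strings quoted earlier, `"-1"` (Dublin's `IST`) becomes +60 minutes and `"-2"` (Troll's `<+02>`) becomes +120 minutes, while an `8` for Los Angeles becomes -480.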
I think the question "how to set the `ZoneVariant`" is an XY problem. For formatting, we need a way to look up a time zone name given an offset (this is the only use case for `ZoneVariant`). The straightforward solution to this would be, instead of
"ampa": {
"dt": "Pacific Daylight Time",
"st": "Pacific Standard Time"
}
store
"ampa": {
"-7:00": "Pacific Daylight Time",
"-8:00": "Pacific Standard Time"
}
This doesn't require any additional lookup at runtime, as we already have the offset, and naturally handles any kind of DST (even multiple).
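As a rough illustration of that lookup, the sketch below keys the display names by UTC offset in minutes. The type alias and function names are hypothetical and are not part of ICU4X; the data simply mirrors the `ampa` example above.

```rust
use std::collections::BTreeMap;

/// Display names for one metazone, keyed by UTC offset in minutes.
type OffsetNames = BTreeMap<i32, &'static str>;

/// Hypothetical data for the `ampa` metazone, mirroring the JSON above.
fn ampa_names() -> OffsetNames {
    BTreeMap::from([
        (-7 * 60, "Pacific Daylight Time"),
        (-8 * 60, "Pacific Standard Time"),
    ])
}

/// Looks up the specific name for an offset taken straight from the
/// IXDTF string; neither the TZDB nor a ZoneVariant is consulted.
fn specific_name(names: &OffsetNames, offset_minutes: i32) -> Option<&'static str> {
    names.get(&offset_minutes).copied()
}
```

An offset that is not in the map returns `None`, at which point a formatter would presumably fall back to another style such as the localized GMT format.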
In reply to @robertbastian's proposal above:
I agree; I think data in this format would be ideal.
"ampa": {
"-7:00": "Pacific Daylight Time",
"-8:00": "Pacific Standard Time"
}
This data could be added to `supplemental/metaZones.xml` in CLDR.
However, there are a few things to consider:
1) Has a metazone ever changed its associated time variants?
If not, the data is straightforward, exactly as shown above.
If so, this data could still reasonably be captured and added to the file.
Consider a hypothetical situation where `America_Central` (`amce`) decided to move its standard-time offset for all of its associated time zones by half an hour for one year, and then changed it back to the way it was before:
"ampa": {
  "-7:00": "Pacific Daylight Time",
  "-8:00": "Pacific Standard Time"
},
"amce": {
  "usesTimeVariants": [
    {
      "-5:00": "Central Daylight Time",
      "-6:00": "Central Standard Time",
      "_to": "2024-09-06 00:00"
    },
    {
      "-5:00": "Central Daylight Time",
      "-5:30": "Central Standard Time",
      "_from": "2024-09-06 00:00",
      "_to": "2025-09-06 00:00"
    },
    {
      "-5:00": "Central Daylight Time",
      "-6:00": "Central Standard Time",
      "_from": "2025-09-06 00:00"
    }
  ]
},
This format seems reasonable and has the same structure as the mapping from time zone IDs to metazones in the same file.
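A lookup against such time-bounded data could be sketched as follows. All names here are hypothetical, the data is the made-up `amce` example from above, and timestamps are compared as fixed-width `YYYY-MM-DD HH:MM` strings.

```rust
/// One hypothetical `usesTimeVariants` period for a metazone.
struct VariantPeriod {
    from: Option<&'static str>,            // inclusive start, None = open
    to: Option<&'static str>,              // exclusive end, None = open
    names: &'static [(i32, &'static str)], // (offset in minutes, name)
}

/// The hypothetical `amce` data from the example above.
const AMCE: [VariantPeriod; 3] = [
    VariantPeriod {
        from: None,
        to: Some("2024-09-06 00:00"),
        names: &[(-300, "Central Daylight Time"), (-360, "Central Standard Time")],
    },
    VariantPeriod {
        from: Some("2024-09-06 00:00"),
        to: Some("2025-09-06 00:00"),
        names: &[(-300, "Central Daylight Time"), (-330, "Central Standard Time")],
    },
    VariantPeriod {
        from: Some("2025-09-06 00:00"),
        to: None,
        names: &[(-300, "Central Daylight Time"), (-360, "Central Standard Time")],
    },
];

/// Finds the period containing `when`, then the name for the given offset.
fn name_at(periods: &[VariantPeriod], when: &str, offset_minutes: i32) -> Option<&'static str> {
    let period = periods
        .iter()
        .find(|p| p.from.map_or(true, |f| f <= when) && p.to.map_or(true, |t| when < t))?;
    period
        .names
        .iter()
        .find(|(offset, _)| *offset == offset_minutes)
        .map(|(_, name)| *name)
}
```

The cost over the flat map is one extra period search per lookup, which is the same shape of work as the existing time-zone-to-metazone resolution.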
2) What would happen if a time zone within a metazone observes the same time-variant offsets, but transitions among them at different datetimes than other zones within that metazone?
One relevant example of this is the recent proposal for some of the West Coast states to observe permanent Daylight Saving Time:
https://www.opb.org/article/2024/02/20/oregon-bill-to-end-daylight-saving-time-fails-legislature/
If this were the case, then the offset would remain `UTC-7` year round, and those time zones, e.g. `America/Los_Angeles`, would just format to `Pacific Daylight Time` year round.
This all seems okay to me.
3) What would happen if an individual time zone wants to use different offsets than the current time-variant offsets established by the metazone?
I am not aware of any such case, but I think there are two reasonable solutions:
A) That time zone could switch to a different metazone (either new or preexisting) that matches its desired offsets. This happens all the time.
B) We could add that offset data to CLDR.
"ampa": {
"-7:00": "Pacific Daylight Time",
"-7:30": "Pacific Cool New Time",
"-8:00": "Pacific Standard Time"
},
The time zones that use the prior offsets would go on as usual, and the time zone with the new offset would have its new localized name.
I recall a conversation with @sffc years ago that perhaps `daylight_time` and `standard_time` are not great identifiers within the `icu4x` code base, because sometimes it's formatted as "Summer Time" for example, and in the future it may be possible that there are more than 2 variants.
A format such as this would allow us to be agnostic of naming conventions, instead tying the internationalized name of the variant to an offset.
However, there are a few more considerations to take into account in this case:
3.1) What if a time zone wants to add a new offset, but have the same localized name as another offset?
"ampa": {
"-7:00": "Pacific Daylight Time",
"-7:30": "Pacific Standard Time",
"-8:00": "Pacific Standard Time"
},
This probably wouldn't cause a data ambiguity issue, but I think it would be incredibly confusing, as "Pacific Standard Time" would now be semantically ambiguous.
This should not be allowed.
3.2) What if a metazone wants to add a new localized name for an offset that is already present?
"ampa": {
"-7:00": "Pacific Daylight Time",
"-7:00": "Pacific Cool New Time",
"-8:00": "Pacific Standard Time"
},
This would cause a data issue and should not be allowed.
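Both restrictions (3.1 and 3.2) are cheap to enforce when the data is generated. The following is only a sketch with a hypothetical function name, taking one metazone's entries as `(offset in minutes, name)` pairs:

```rust
use std::collections::HashSet;

/// Rejects data where one offset has two names (3.2) or where one name is
/// shared by two offsets (3.1).
fn validate(entries: &[(i32, &str)]) -> Result<(), String> {
    let mut offsets = HashSet::new();
    let mut names = HashSet::new();
    for &(offset, name) in entries {
        // `insert` returns false when the value was already present.
        if !offsets.insert(offset) {
            return Err(format!("offset {offset} has more than one name"));
        }
        if !names.insert(name) {
            return Err(format!("name {name:?} is used for more than one offset"));
        }
    }
    Ok(())
}
```

The well-formed `ampa` data passes, while the two problem examples above are rejected: the 3.2 example on the repeated `-7:00` key and the 3.1 example on the repeated "Pacific Standard Time".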
Conclusion
I don't feel that I have the cycles to take on this work myself right now, but I would support collaborating on making this data available (if people agree it is sound).
Here is an example of when the short metazone identifiers were added to that same CLDR file: https://unicode-org.atlassian.net/browse/CLDR-14607
Filing an issue on Jira would be a good next step if we reach a consensus here.
All questions of the form "what if a timezone wants to do something different than the rest of the metazone" should be answered by creating a new metazone. My expectation is that all zones in a metazone fully agree on offsets today and in the future, but maybe that's not guaranteed.
That would be much simpler and more stringent. I would agree with imposing these restrictions. I was just trying to think of all the cases.
My favorite counter-example to this is Antarctica/Troll, which uses a DST shift of 2 hours:
Another counter-example to the 60-minute transition: https://www.atlasobscura.com/places/lord-howe-islands-time
I agree with the workaround of creating a new metazone if the offset invariants ever break down. Metazones are purely a CLDR/ICU construction, not TZDB, so we have a lot of latitude for how we handle them.
For example, if all US West Coast states decided to abolish daylight saving time and that Pacific Time should be GMT-7 instead of GMT-8 (a proposal I don't support but which is good for illustrative purposes), then we would need to create a new metazone such as `amp2`, meaning "version 2 of `ampa`".
It is highly likely that such changes already occurred in the last 50 years, and we should probably look for them in datagen.
As far as data sources are concerned, it seems perfectly fine to me for this data to be derived from TZDB. Currently ICU4C uses TZDB to determine which zone variant to use when formatting, so if ICU4X used TZDB during datagen, then we should be able to guarantee consistency with ICU4C. ICU4X could manually spawn new "private use" metazones as needed.
OK, one other issue I realized. There are numerous countries that use their own country name as the metazone. The first one I pulled is "kyrg", Kyrgyzstan:
https://en.wikipedia.org/wiki/Kyrgyzstan_Time
Kyrgyzstan has switched between UTC+5 and UTC+6 multiple times, but presumably the metazone has not changed.
Yeah, this was gonna be my concern: cases where oddball metazones are tidally locked to a country. I assume this fact means that the "use the offset only" idea won't work?
I think it can still "work"; it's just something we need to factor in. A few ways of resolving this:
One other note: I very frequently encounter people using "PST" to mean Pacific Time, not specifically Pacific Standard Time, and similarly with EST and CST and others. For example, it is very common to see people say "let's meet in San Francisco on September 7 at 10am PST", and if you show up at that time according to the TZDB/CLDR definition, unless it is a time zone nerds meetup, you will be an hour late.
What this means: this is all so imprecise anyway, so let's just land something reasonable and otherwise encourage people to use city-based time zone names. Maybe CLDR can focus on adding a short location format, such as "LA Time" or "NYC Time" to use instead of the ambiguous things it currently uses.
Normal people (other than those who are super-familiar with how IANA timezones work, which is a very small Venn diagram overlap with "normal people") don't use "LA Time" or "NYC time". So I'm not sure it'd make sense to add that to CLDR. I understand the desire for consistency, but this seems to be a case where there's no evading the inconsistency of human language use.
My hypothesis is that "normal people" would understand what you meant by "LA Time", even if they haven't often seen it before, and it is also the most unambiguous definition for an i18n library to produce.
Random comment for earlier replies.
The IANA TZ Database source files have DST flags, but that information is lost in the standard zone data binaries. If you just look at the content of a zone data binary file, you cannot tell whether a given time is in DST or not. Of course, you can guess whether it is DST by looking at the offsets around that time. For example, the UTC offset of America/Los_Angeles on 2024-09-01T00:00:00Z is UTC-07:00, but there is no information in the zone data binary about whether that is DST or not. ICU wants to keep this information to support the old TimeZone API, so the ICU zone compiler was modified to store the flag along with the zone offset transition data.
The IANA TZ Database contains DST offsets that are not exactly 1 hour. For example, Australia/Lord_Howe advances 30 minutes in DST. Many other zones have historically used DST changes other than 1 hour.
A metazone is not associated with specific UTC offsets; it is associated with a set of names. Because North America and Europe assign names associated with standard offsets, you might think standard offsets and metazones are related. Someone commented that metazones with multiple historic standard offsets are oddballs, but I would say North America and Europe are actually the exceptions.
I think the concept of ZoneVariant in the struct is problematic.
Random observation:
Same time | Formatted with generic TZ |
---|---|
2024-07-01T12:00:00-06:00[America/Denver] | 12:00 Mountain Time |
2024-07-01T11:00:00-07:00[America/Phoenix] | 11:00 Mountain Time |
2024-07-01T18:00:00Z | 18:00 UTC |
In reply to @robertbastian's observation above:
These are all technically correct, though confusing. They're both Mountain Time. It's just that Denver is in Mountain Daylight Time and Phoenix is in Mountain Standard Time because Arizona does not observe DST.
I would argue that this is a reason why populating the `ZoneVariant` struct whenever possible is worthwhile.
EDIT:
Though, to clarify, the above "Mountain Time" formats are "Generic non-location format".
The UTS-35 spec defines several formats with fallbacking:
Generic non-location format
Examples: "Pacific Time" (long), "PT" (short)
Generic partial location format
Examples: "Pacific Time (Canada)" (long), "PT (Whitehorse)" (short)
Generic location format
Examples: "France Time", "Italy Time"
Specific non-location format
Examples: "Pacific Standard Time" (long), "PST" (short), "Pacific Daylight Time" (long), "PDT" (short)
Localized GMT format
Examples: "GMT+03:30" (long), "GMT+3:30" (short), "UTC-03.00" (long), "UTC" (for zero offset)
ISO 8601 time zone formats
Examples: "-0800" (basic), "-08:00" (extended), "Z" (for UTC)
It was years ago, so I'm not sure if the current implementations within ICU4X are exactly the same, but I tried to implement the fallbacking rules according to the spec.
The above strings have enough information available to utilize either Generic location format e.g. Phoenix Time, or Generic partial location format e.g. Mountain Time (Denver).
Generic partial location format
Examples: "Pacific Time (Canada)" (long), "PT (Whitehorse)" (short)
FWIW, I think this is a nice solution to the problem described above: if there's a colloquial name for a time zone like "Pacific Time", it's still used, but with a disambiguator for less common cases like Arizona.
The observation about generic non-location being ambiguous is well known and largely working as intended. It should only be used if the location of the event is known from context. Here is the language I wrote for how to select your time zone style in semantic skeleta:
Example use cases where generic time zone style is acceptable:
Note: In most or all of these cases, it would be acceptable to say "local time" or simply drop the qualifier.
Example where generic time is not acceptable and a different style should be used, unless the location is otherwise known from context:
My point is that there are enough legitimate use cases for generic non-location format, but since it could introduce ambiguity, it should only be used if the developer opts in.
Generic partial location format
Examples: "Pacific Time (Canada)" (long), "PT (Whitehorse)" (short)
This seems to be the non-ambiguous version of the generic non-location format. We don't seem to support this in ICU4X, however?
What we need for full correctness is a `ZoneVariantCalculator` that maps `(TimeZoneBcp47Id, DateTime<Iso>) -> (UtcOffset, Option<UtcOffset>)`. It would do this by storing a sequence of ISO minutes with associated offsets for each zone, similar to `MetaZonePeriodsV1`.
If there is sufficient overlap between the offset list and the metazone list for each location, they could be combined, as the bulk of these structures will be the keys.
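A minimal sketch of what such a calculator might look like for a single zone is below. Every type and method name is a hypothetical stand-in rather than an ICU4X API, and it assumes the returned pair means (standard offset, optional daylight offset) for the period containing the given time.

```rust
/// Offset in minutes east of UTC; hypothetical stand-in for `UtcOffset`.
type OffsetMinutes = i32;

/// A period starting at `since` (minutes since some fixed epoch) during
/// which a zone uses `standard` and, if it observes DST, `daylight`.
struct OffsetPeriod {
    since: i64,
    standard: OffsetMinutes,
    daylight: Option<OffsetMinutes>,
}

/// The periods for one zone, sorted by `since`.
struct ZoneOffsets {
    periods: Vec<OffsetPeriod>,
}

impl ZoneOffsets {
    /// Binary-searches for the period that contains `minutes`.
    fn offsets_at(&self, minutes: i64) -> Option<(OffsetMinutes, Option<OffsetMinutes>)> {
        // Number of periods that start at or before `minutes`.
        let idx = self.periods.partition_point(|p| p.since <= minutes);
        let period = self.periods.get(idx.checked_sub(1)?)?;
        Some((period.standard, period.daylight))
    }
}

#[derive(Debug, PartialEq)]
enum Variant {
    Standard,
    Daylight,
}

/// Classifies the offset from the IXDTF string against the stored pair.
fn classify(offsets: (OffsetMinutes, Option<OffsetMinutes>), actual: OffsetMinutes) -> Option<Variant> {
    if actual == offsets.0 {
        Some(Variant::Standard)
    } else if Some(actual) == offsets.1 {
        Some(Variant::Daylight)
    } else {
        None
    }
}
```

Because the daylight offset is stored explicitly, nothing here assumes a 1-hour shift, so a 2-hour shift like `Antarctica/Troll` or a 30-minute shift like `Australia/Lord_Howe` is handled the same way as the common case.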
Re generic partial location format, it sounds like we're meant to detect when a metazone is ambiguous, and add the location to it. We can do that; I've found a lot of ambiguous metazones in #5515. We can extend the return value of `MetazoneCalculator` with an `is_ambiguous` flag, in which case the formatter would add the location (or the offset if locations aren't available).
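The formatter side of that idea could be as small as the sketch below. The struct and function are hypothetical, and the plain `Name (Qualifier)` pattern is an assumption made for illustration; real output would use the locale's pattern data.

```rust
/// Hypothetical result of a metazone lookup, extended with the flag.
struct MetazoneInfo {
    name: &'static str,
    is_ambiguous: bool,
}

/// Appends the location, or the offset if no location name is available,
/// only when the metazone name alone would be ambiguous.
fn format_generic(info: &MetazoneInfo, location: Option<&str>, offset: &str) -> String {
    if !info.is_ambiguous {
        return info.name.to_string();
    }
    match location {
        Some(loc) => format!("{} ({})", info.name, loc),
        None => format!("{} ({})", info.name, offset),
    }
}
```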
What we need for full correctness is a `ZoneVariantCalculator` that maps `(TimeZoneBcp47Id, DateTime<Iso>) -> (UtcOffset, Option<UtcOffset>)`. It would do this by storing a sequence of ISO minutes with associated offsets for each zone, similar to `MetaZonePeriodsV1`.
If there is sufficient overlap between the offset list and the metazone list for each location, they could be combined, as the bulk of these structures will be the keys.
LGTM
At its core, ICU4X time zones have 4 fields, which fully determine the strings to be selected for formatting:
Let's say someone gives us an IXDTF string like:
2024-08-29T11:53:18-0700[America/Los_Angeles]
From this string, we can already populate two fields:
We have MetazoneCalculator, which takes the time portion of the string and lets us calculate the metazone field:
However, how do we calculate the ZoneVariant field?
I learned today that tzif files, at least version 2 and 3 files, contain a footer that looks like this:
The "8" in that footer means that this time zone has a standard offset of 8 hours behind UTC. (note that the offset is negated from what we normally see)
Does this mean that we could build a table with standard offsets and use that table to generate zone variants? For example, we could create a data file with the following data, which can all be generated from the TZDB:
Then, when reading the IXDTF string, we use the following algorithm to select the zone variant:
Mechanically, we can generate this table by using a combination of our own tzif crate, which contains a struct ZoneVariantInfo with this information pre-parsed, and a tzif source, which could potentially be jiff_tzdb.
Note: the Time Zone ID would probably be stored in BCP-47 and Standard Offset would be bitpacked to an i8. It's possible we could stuff this data into one of our existing data structs to be more efficient.
Note: I assume that this mapping of time zone IDs to standard offsets is fairly stable over time, such that we do not need to worry about shipping updates at a cadence different than normal CLDR data updates.
Please help me understand: is the proposed algorithm correct and robust, or is it flawed in some edge cases?
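For reference, the "1 less" rule can be sketched as below so that the edge cases are easy to test. Only the daylight branch is quoted in this thread; the equality branch for standard time is an assumption, the names are hypothetical, and offsets are expressed in minutes east of UTC rather than in the negated TZ-string form.

```rust
#[derive(Debug, PartialEq)]
enum ZoneVariant {
    Standard,
    Daylight,
}

/// Guesses the zone variant from a zone's standard offset and the offset
/// in the IXDTF string, both in minutes east of UTC.
fn guess_variant(standard: i32, ixdtf_offset: i32) -> Option<ZoneVariant> {
    if ixdtf_offset == standard {
        // Assumed first step: matching the standard offset means standard time.
        Some(ZoneVariant::Standard)
    } else if standard == ixdtf_offset - 60 {
        // The quoted step: the standard offset is 1 hour less than the string's.
        Some(ZoneVariant::Daylight)
    } else {
        // Anything else cannot be classified by this rule.
        None
    }
}
```

`guess_variant(-480, -420)` classifies the Los Angeles example as daylight, but the rule returns `None` for a 2-hour shift (`guess_variant(0, 120)`, as in `Antarctica/Troll`), for a 30-minute shift (`guess_variant(630, 660)`, as in `Australia/Lord_Howe`), and for a zone whose "standard" offset is the larger one (`guess_variant(60, 0)`, as in `Europe/Dublin`).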
@nekevss @leftmostcat @nordzilla @yumaoka @justingrant