Reconsider time zone canonicalization behavior given forking of IANA Time Zone Database

tc39 / proposal-temporal

Provides standard objects and functions for working with dates and times.

https://tc39.es/proposal-temporal/docs/

Reconsider time zone canonicalization behavior given forking of IANA Time Zone Database #2509

Closed justingrant closed 1 year ago

justingrant commented 1 year ago

While working on #2493, I learned that the IANA Time Zone Database has been forked due to a disagreement between that database's maintainer and some prominent users of the database.

Background

The two forks differ as follows:

"Primary" fork - Many time zones that have had the same rules since 1970 have been merged into one canonical identifier, with the old identifiers remaining as links. Examples include: Europe/Copenhagen => Europe/Berlin and Atlantic/Reykjavik => Africa/Abidjan. There are many more examples like this. This fork is preferred by the TZDB maintainer, and therefore is exposed by the official IANA downloads of TZDB releases.
"Unmerged" fork - The merges described above are reverted. This fork is preferred by reps from Java, NetBSD, and probably others too. It's available via downloads from the fork repo (https://github.com/JodaOrg/global-tz), or by building TZDB from source using the new PACKRATLIST build option. That build option was added by the maintainer to ensure that both forks could be built out of the same repo. See discussion here and here.

You can read more about the fork in the TZDB mailing list archives. A few relevant threads:

The fork seems to represent a philosophical difference about the purpose of the TZDB. One camp (which includes the maintainer) sees the goal of TZDB as simply providing a way to convert post-1970 zoned timestamps into exact instants, and wants to reduce the TZDB size and maintenance hassle of dealing with pre-1970 data. The other camp (supporting the unmerged fork) adds additional use cases:

a) Resolving pre-1970 zoned timestamps to instants, even if those pre-1970 data are known to be less reliable and more subject to revision.
b) Providing metadata that may be useful in the future in case countries change their time zones or DST rules, even if no such changes have happened since 1970. The unmerged fork guarantees at least one canonical zone per ISO 3166-1 country code, which is sensible because time zone and DST changes typically happen at the country-code level except for the largest countries.
c) Reducing "canonicalization confusion" where users set one zone and end up with a zone that seems completely different, and maybe even in a different continent like Iceland => Cote d'Ivoire. This seems particularly sensitive in the case of Denmark, Sweden, and other European countries being canonicalized to Germany, which for obvious reasons may trigger historical sensitivity.

I'm not sure how much Temporal cares about pre-1970 dates, but the latter two issues seem quite important to Temporal users. The second one will make calendaring apps more resilient to country-level timezone/DST changes, while the third will prevent developer confusion and consternation.

Also, given the complaints about the changes, it's possible that the TZDB may revert these changes in the future, which would cause further churn.

Options

Anyway, now that we know this fork exists, we need to figure out what to do about it in the Temporal spec. Options include:

1. Recommend that implementers use the Primary Fork

Pro: this is the status quo, so probably easiest to do
Con: breaks (b) and (c) above; risks more confusion if changes are reverted later

2. Recommend that implementers use the Unmerged Fork

Pro: Better backwards compatibility with existing timestamps; less geopolitical confusion going forward
Con: larger TZDB data because it includes pre-1970 rules for many more zones; requires changing implementer build processes; may cause bug reports because JS output will vary from other sources (e.g. Wikipedia) that use the main fork.

3. Don't recommend anything; implementers are free to choose.

Pro: allows implementer flexibility
Con: code will work differently across implementations, which already causes problems even before these controversial merges. See https://github.com/tc39/ecma402/issues/272#issuecomment-423928522, https://bugs.chromium.org/p/chromium/issues/detail?id=580195, etc.

4. Stop canonicalizing time zones (thanks to @pipobscure for this suggestion)

Pro: supports (b) and (c) above without needing to pick a fork; makes ISO strings round-trippable even if canonicalization has changed since the string was stored; avoids test breakages and other results caused by canonicalization changes; solves "wrong canonical spelling" bugs like this chromium bug; ensures that code works more similarly across implementations and across time; more consistent equality comparisons with other Temporal types that use an equals method; avoids triggering geopolitical sensitivities caused by modifying user input to point to an unexpected country or name.
Con: Probably requires adding a Temporal.TimeZone.equals method to help users identify equivalent time zones like Asia/Calcutta vs. Asia/Kolkata; may require modifying existing ICU behavior (per this comment, it sounds like Firefox already does similar mods).

Discussion

Of the above options, my strong preference is for (4), because it solves both the forking issue as well as the existing canonicalization issues like Calcutta vs. Kolkata. Also, I think retaining user input as-is will be quite helpful to reduce confusion in cases where code takes input from some other source, modifies that data, and then sends or stores the modified data. If the time zone identifier varies a lot between the original and modified ZDT, I think that will generate user confusion that avoiding canonicalization would prevent.

If we want to go with (4), here's a few questions to answer:

i) How should Intl.DateTimeFormat.p.resolvedOptions().timeZone behave? Should it also stop canonicalizing? If yes, should it add a new canonicalTimeZone property?
ii) Will there be any change to user-visible output of Intl.DateTimeFormat.p.format or Date.p.toLocaleString? I suspect that the answer is "no" because localized descriptions of time zones don't usually surface the IANA identifiers, but not 100% sure about this.
iii) What changes (if any) would be required to CLDR and/or ICU to support this change?
iv) Even if we avoid the canonicalization mess, there's still the pre-1970-data question. The unmerged fork will have it, the merged fork will only have it for the merged zones. This would mean, for example, that Europe/Copenhagen pre-1970 results could vary by fork. So which fork should we recommend that implementers use? I don't have a strong opinion here. It'd be nice to understand the size of pre-1970 data to know how much smaller browser downloads would get if this data were removed.
v) Should case differences still be canonicalized, e.g. Europe/Paris vs. europe/paris? My opinion: yes, we should canonicalize.
vi) Should spelling differences due to renaming also be canonicalized, e.g. Asia/Calcutta vs. Asia/Kolkata. My opinion: no, because by not canonicalizing id in this case we can avoid user complaints like this chromium bug, and we can ensure future compatibility & round-trippability even if zones are renamed in the future. Note that equals should probably report these as true though. (See below.)
vii) Should we add a TimeZone.p.equals method? I think we should, both for consistency across Temporal types and to help code be robust in the face of past or future renames of cities which seems to happen fairly often globally. JS code should be able to ask "Is this date in the India time zone" without having to worry that that code will be broken by a past or future rename.
viii) If we add equals should we also add a method that tests if all rules are the same across time zones, e.g. Atlantic/Reykjavik vs. Africa/Abidjan? I don't think this is needed. Userland code can always use getNextTransition in a loop to check for this kind of equality, and if there's user demand we could always add it in a later release.
ix) How should UTC zone be handled? I think this is straightforward: all zones whose canonical identifier is Etc/UTC should resolve to UTC in ECMAScript, matching current behavior. There's no value in changing this existing behavior.
x) In order for the PACKRATLIST option to work, TZDB data must provide a way to differentiate "merged" links like Atlantic/Reykjavik => Africa/Abidjan from "renamed" links like Asia/Calcutta vs. Asia/Kolkata. How does this differentiation work, and does it work for all links or are there gaps? It sounds like @anba may know how this works.

If we add equals, here's a suggestion for its behavior:

It should accept objects or strings.
If the receiver and/or the argument is a custom zone, use its id property.
Treat different casings as equal, e.g. Europe/Paris vs. europe/paris.
Treat different spellings of the same location as equal, e.g. Asia/Calcutta vs. Asia/Kolkata, because they represent the same thing with different spelling.
If both receiver and argument canonicalize to Etc/UTC then treat them as equal.
DO NOT treat different locations (like Atlantic/Reykjavik vs. Africa/Abidjan) as equal, even if all their time zone transitions are the same, because future changes could make those locations have different time zone rules. Per above, if users want to evaluate "all rules are the same" they can do this in userland by comparing time zone transitions in a loop. Although honestly I'm skeptical that this will be a popular use case. Who cares if the rules are equal?
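
A rough sketch of the suggested semantics, assuming a tiny hand-written table of identifiers and synonyms. The names KNOWN_IDS, SYNONYMS, resolveSynonym and timeZoneEquals are made up for illustration; they are not part of Temporal, and a real engine would derive the data from TZDB/CLDR.

```javascript
// Hypothetical data for illustration only, not real engine tables.
const KNOWN_IDS = [
  "Asia/Kolkata", "Asia/Calcutta", "Europe/Paris",
  "Atlantic/Reykjavik", "Africa/Abidjan",
  "UTC", "Etc/UTC", "Etc/GMT",
];

// Only renames and UTC aliases are listed here. "Merged" links such as
// Atlantic/Reykjavik => Africa/Abidjan are deliberately left out.
const SYNONYMS = new Map([
  ["Asia/Calcutta", "Asia/Kolkata"],
  ["Etc/UTC", "UTC"],
  ["Etc/GMT", "UTC"],
]);

// Case-normalize the identifier, then follow a synonym link if one exists.
function resolveSynonym(id) {
  const cased = KNOWN_IDS.find((known) => known.toLowerCase() === id.toLowerCase());
  if (cased === undefined) throw new RangeError(`Unknown time zone: ${id}`);
  return SYNONYMS.get(cased) ?? cased;
}

// Accepts strings or objects; objects are compared by their id property.
function timeZoneEquals(a, b) {
  const idOf = (tz) => (typeof tz === "string" ? tz : tz.id);
  return resolveSynonym(idOf(a)) === resolveSynonym(idOf(b));
}

console.log(timeZoneEquals("Europe/Paris", "europe/paris"));
console.log(timeZoneEquals("Asia/Calcutta", "Asia/Kolkata"));
console.log(timeZoneEquals("Atlantic/Reykjavik", "Africa/Abidjan"));
```

With this table, casing differences and renames compare as equal, while different locations that merely share rules do not.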

Pinging @jasonwilliams @ptomato @sffc @gibson042 @pipobscure for your opinions.

gibson042 commented 1 year ago

I like option 4, but would rename it to something like "Canonicalization doesn't follow Links" and would also make clear that Link chains terminating in "Etc/UTC" or "Etc/GMT" are still followed and canonicalized to "UTC" (unless that option splits into e.g. 4a and 4b with different conclusions regarding this behavior).

i) How should Intl.DateTimeFormat.p.resolvedOptions().timeZone behave? Should it also stop canonicalizing? If yes, should it add a new canonicalTimeZone property?

That definitely gets into backwards compatibility territory, and it is plausible—if not likely—that some existing code already uses dtf.resolvedOptions().timeZone to get the host-reported canonical spelling of a time zone name. So I don't think changing that is on the table.
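
For reference, the existing behavior can be observed with a snippet like the one below. The helper name resolvedTimeZone is made up for illustration. The spelling returned depends on the engine and its time zone data, so no expected output is shown.

```javascript
// Returns the time zone identifier as the host's Intl implementation reports it.
function resolvedTimeZone(input) {
  return new Intl.DateTimeFormat("en", { timeZone: input }).resolvedOptions().timeZone;
}

// Whether these come back as Asia/Calcutta or Asia/Kolkata varies by engine.
const fromOldName = resolvedTimeZone("Asia/Calcutta");
const fromLowercase = resolvedTimeZone("asia/calcutta");
console.log(fromOldName, fromLowercase);
```

Any code that already reads this property today sees whichever spelling the host picked, which is the backwards compatibility concern.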

ii) Will there be any change to user-visible output of Intl.DateTimeFormat.p.format or Date.p.toLocaleString? I suspect that the answer is "no" because localized descriptions of time zones don't usually surface the IANA identifiers, but not 100% sure about this.

https://github.com/tc39/ecma402/issues/119 does propose exposing the IANA names, but I don't recall any existing text requiring implementations to do so (although they certainly could anyway; ECMA-402 gives formatters lots of flexibility).

iii) What changes (if any) would be required to CLDR and/or ICU to support this change?

I defer to @sffc.

iv) Even if we avoid the canonicalization mess, there's still the pre-1970-data question. The unmerged fork will have it, the merged fork will only have it for the merged zones. This would mean, for example, that Europe/Copenhagen pre-1970 results could vary by fork. So which fork should we recommend that implementers use? I don't have a strong opinion here. It'd be nice to understand the size of pre-1970 data to know how much smaller browser downloads would get if this data were removed.

I don't have a strong opinion here either.

v) Should case differences still be canonicalized, e.g. Europe/Paris vs. europe/paris? My opinion: yes, we should canonicalize.

Yes. Canonicalization still happens, it just doesn't follow Links (except for special-casing GMT and UTC).

vi) Should spelling differences due to renaming also be canonicalized, e.g. Asia/Calcutta vs. Asia/Kolkata. My opinion: no, because by not canonicalizing id in this case we can avoid user complaints like this chromium bug, and we can ensure future compatibility & round-trippability even if zones are renamed in the future. Note that equals should probably report these as true though. (See below.)

No. Doing this would make the behavior less comprehensible and would sacrifice potential benefits.

vii) Should we add a TimeZone.p.equals method? I think we should, both for consistency across Temporal types and to help code be robust in the face of past or future renames of cities which seems to happen fairly often globally. JS code should be able to ask "Is this date in the India time zone" without having to worry that that code will be broken by a past or future rename.

There should definitely be some way to identify that there is a Link chain establishing equality between two time zones with different names, and ideally a way to determine its directionality (e.g., detecting that Atlantic/Reykjavik is a Link to Africa/Abidjan rather than the reverse).

viii) If we add equals should we also add a method that tests if all rules are the same across time zones, e.g. Atlantic/Reykjavik vs. Africa/Abidjan? I don't think this is needed. Userland code can always use getNextTransition in a loop to check for this kind of equality, and if there's user demand we could always add it in a later release.

Agreed; not needed at this time.
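
A sketch of what that userland loop could look like. To keep it self-contained it does not use Temporal: the zones are plain stand-in objects with a getNextTransition method modeled loosely on the one mentioned above, plus a hypothetical getOffsetAt helper, both working in epoch milliseconds. makeFakeZone and sameRulesBetween are made-up names, and the sketch assumes every transition changes the offset.

```javascript
// Build a stand-in zone from a sorted list of [epochMs, offsetMinutes] transitions.
function makeFakeZone(initialOffset, transitions) {
  return {
    getOffsetAt(ms) {
      let offset = initialOffset;
      for (const [at, off] of transitions) if (at <= ms) offset = off;
      return offset;
    },
    getNextTransition(ms) {
      const next = transitions.find(([at]) => at > ms);
      return next ? next[0] : null;
    },
  };
}

// True if both zones have the same offsets and transitions in [startMs, endMs].
function sameRulesBetween(a, b, startMs, endMs) {
  let cursor = startMs;
  while (true) {
    if (a.getOffsetAt(cursor) !== b.getOffsetAt(cursor)) return false;
    const nextA = a.getNextTransition(cursor);
    const nextB = b.getNextTransition(cursor);
    if (nextA !== nextB) {
      // The zones diverge at the earlier transition, if it falls inside the range.
      const earliest = Math.min(nextA ?? Infinity, nextB ?? Infinity);
      return earliest > endMs;
    }
    if (nextA === null || nextA > endMs) return true;
    cursor = nextA;
  }
}

const zoneA = makeFakeZone(0, [[1000, 60], [2000, 0]]);
const zoneB = makeFakeZone(0, [[1000, 60], [2000, 0]]);
const zoneC = makeFakeZone(0, [[1000, 60], [3000, 0]]);
console.log(sameRulesBetween(zoneA, zoneB, 0, 5000)); // true
console.log(sameRulesBetween(zoneA, zoneC, 0, 5000)); // false
```

The result depends on the range checked: zoneA and zoneC agree up to 1500 but not up to 5000.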

ix) How should UTC zone be handled? I think this is straightforward: all zones whose canonical identifier is Etc/UTC should resolve to UTC in ECMAScript, matching current behavior. There's no value in changing this existing behavior.

Note that current behavior also maps "Etc/GMT" to "UTC": https://tc39.es/proposal-temporal/#sup-canonicalizetimezonename

x) In order for the PACKRATLIST option to work, TZDB data must provide a way to differentiate "merged" links like Atlantic/Reykjavik => Africa/Abidjan from "renamed" links like Asia/Calcutta vs. Asia/Kolkata. How does this differentiation work, and does it work for all links or are there gaps? It sounds like @anba may know how this works.

AFAICT, there's no explicit differentiation... rather, just a zone.tab file identifying for each ISO 3166-1 alpha-2 country code the corresponding time zones, which can themselves identify Links (as is currently the case for DE/IS/etc.).

If we add equals, here's a suggestion for its behavior:

It should accept objects or strings.

If the receiver and/or the argument is a custom zone, use its id property.

Disagree on this; custom time zones should be compared by referential object identity. A custom time zone that happens to have id "UTC" is not equal to the built-in UTC time zone, and two custom time zones with the same id can have very different behavior. This probably means that distinct objects representing built-in time zones should also be reported non-equal, and it might therefore make sense to act on strings only (since authors already have === and Object.is for comparing objects). But I'd prefer to keep object input, such that Temporal.TimeZone.equals(zdt.timeZone, zdt.timeZone) is always acceptable (with an internal implementation that uses Link-aware canonicalization when both inputs are strings and otherwise validates both inputs as time zones using standard behavior such as ToTemporalTimeZone [or a more appropriate alternative without e.g. timeZone and ToString fallbacks] and compares the result using SameValue).

Treat different casings as equal, e.g. Europe/Paris vs. europe/paris.

If both receiver and argument canonicalize to Etc/UTC then treat them as equal.

:+1:

Treat different spellings of the same location as equal, e.g. Asia/Calcutta vs. Asia/Kolkata, because they represent the same thing with different spelling.

DO NOT treat different locations (like Atlantic/Reykjavik vs. Africa/Abidjan) as equal, even if all their time zone transitions are the same, because future changes could make those locations have different time zone rules. Per above, if users want to evaluate "all rules are the same" they can do this in userland by comparing time zone transitions in a loop. Although honestly I'm skeptical that this will be a popular use case. Who cares if the rules are equal?

I also disagree on these last points, although really it seems to be mostly a question of modeling. I don't think Temporal should classify Links as "alternate spelling" vs. "real", but rather just treat any Link relationship as establishing equality. In an implementation that uses the standard IANA time zone data, Atlantic/Reykjavik and Africa/Abidjan are equal until and unless a policy change causes them to diverge.

gilmoreorless commented 1 year ago

Treat different spellings of the same location as equal, e.g. Asia/Calcutta vs. Asia/Kolkata, because they represent the same thing with different spelling.

DO NOT treat different locations (like Atlantic/Reykjavik vs. Africa/Abidjan) as equal, even if all their time zone transitions are the same, because future changes could make those locations have different time zone rules. Per above, if users want to evaluate "all rules are the same" they can do this in userland by comparing time zone transitions in a loop. Although honestly I'm skeptical that this will be a popular use case. Who cares if the rules are equal?

I also disagree on these last points, although really it seems to be mostly a question of modeling. I don't think Temporal should classify Links as "alternate spelling" vs. "real", but rather just treat any Link relationship as establishing equality. In an implementation that uses the standard IANA time zone data, Atlantic/Reykjavik and Africa/Abidjan are equal until and unless a policy change causes them to diverge.

I agree with @gibson042, partly for a far more practical reason: maintenance. The IANA source doesn't have any API differences between "links due to similar clocks" and "links due to renames". The backward file was tidied up in the wake of this forking discussion — it now has commented groups of links based on their reasons. But this is only a convention in a single file, and not guaranteed to be a stable API.

If Temporal was to distinguish between the two cases in an API, there would need to be a stable maintenance process for adding brand new links to the correct category.

justingrant commented 1 year ago

I also disagree on these last points, although really it seems to be mostly a question of modeling. I don't think Temporal should classify Links as "alternate spelling" vs. "real", but rather just treat any Link relationship as establishing equality. In an implementation that uses the standard IANA time zone data, Atlantic/Reykjavik and Africa/Abidjan are equal until and unless a policy change causes them to diverge.

My assumption is that, ideally, there'd be two categories of links:

Links that NOW AND FOREVER IN THE FUTURE have the same rules as the canonical name, like Asia/Calcutta vs. Asia/Kolkata
Links that RIGHT NOW BUT MAYBE NOT IN THE FUTURE have the same rules as the canonical name, like Atlantic/Reykjavik and Africa/Abidjan

The first type of link (let's call them "synonyms") conveys no semantic value. Programs will never behave differently depending on which ones you use (other than when comparing the id strings themselves).

The second type of link (let's call them "merges") conveys semantically different information that could change the behavior of future programs beyond string comparison.

The particular use case I had in mind where it's helpful to know that difference is when a program has logic like this: "I want to do special processing for timestamps for X" (where "X" is a particular country like India or Sweden). Like this:

// `equals` here is the method proposed above, not an existing API.
if (Temporal.TimeZone.from('Europe/Copenhagen').equals(zdt.timeZoneId)) {
  // do Denmark-specific stuff
} else {
  // non-Denmark-specific logic
}

It would be bad if future changes in the spelling of the desired English transliteration of "Copenhagen" caused the code above to break. So it's probably good practice for any code that checks for a specific time zone (or that wants to compare two ZDT timestamps to know if they're semantically identical) to use equals instead of comparing id.

But it'd also be bad if the price of protecting against future spelling changes meant that you'd mistakenly run jurisdiction-specific logic for other jurisdictions that coincidentally share the same time zone rules.

It's true that, continuing that example above, if Denmark split into multiple time zones then the code above would break. But I think this is OK, because the change happened in Denmark so of course Denmark-specific code will need to change. My main concern is that if you treat all aliases the same, then equals becomes riskier because you can never predict what other semantically-different zones are being lumped into the same bucket.

So I do think there's a case that being able to distinguish these cases is important. But...

I agree with @gibson042, partly for a far more practical reason: maintenance. The IANA source doesn't have any API differences between "links due to similar clocks" and "links due to renames". The backward file was tidied up in the wake of this forking discussion — it now has commented groups of links based on their reasons. But this is only a convention in a single file, and not guaranteed to be a stable API.

One possible (needs validation) solution using existing data would be to use zone.tab which includes pre-merge data. If a link from backward is also present in zone.tab then it's a merge, otherwise it's a synonym. I haven't done the work to validate that this will work perfectly, though!

If Temporal was to distinguish between the two cases in an API, there would need to be a stable maintenance process for adding brand new links to the correct category.

Agree, if the approach above won't work. We'd want to work with the IANA folks (or maybe ICU/CLDR?) to ensure that distinction is maintained in the future via some other solution.

There's less than 300 total links so this isn't a lot of ongoing maintenance work (would probably add <1hr/year of work to someone's plate) but someone would have to be willing to commit to doing the work long-term.

BTW, I'd volunteer to make an initial PR into TZDB, if it's decided that this split would be good to maintain AND if the data files need to change somehow.

gibson042 commented 1 year ago

One possible (needs validation) solution using existing data would be to use zone.tab which includes pre-merge data. If a link from backward is also present in zone.tab then it's a merge, otherwise it's a synonym. I haven't done the work to validate that this will work perfectly, though!

You'd also need to consider backzone, because e.g. Africa/Timbuktu does not appear in zone.tab but is a "merge" (to use your term) of Africa/Abidjan in the primary data but (presumably) a synonym of Africa/Bamako in the pre-1970 data, and I think the same applies to everything in the "Non-zone.tab locations with timestamps since 1970 that duplicate those of an existing location" section mentioned below.

if the approach above won't work. We'd want to work with the IANA folks (or maybe ICU/CLDR?) to ensure that distinction is maintained in the future via some other solution.

That seems like a goal that exceeds the scope of Temporal v1.

BTW, I'd volunteer to make an initial PR into TZDB, if it's decided that this split would be good to maintain AND if the data files need to change somehow.

AFAICT tzdata Links are all created equal—the only existing data that could be used is unstructured section-heading comment text like "Pre-2013 practice, which typically had a Zone per zone.tab line" and "Non-zone.tab locations with timestamps since 1970 that duplicate those of an existing location". So I guess you'd be proposing something like a new merged file that exclusively contains the content from those section(s) and a Temporal equality comparison that ignores its contents?

gilmoreorless commented 1 year ago

BTW, I'd volunteer to make an initial PR into TZDB, if it's decided that this split would be good to maintain AND if the data files need to change somehow.

It's probably best to read this whole discussion thread first: https://mm.icann.org/pipermail/tz/2021-November/031074.html That thread is what eventually produced the current grouped-under-comment-headings format of backward, despite calls for the changes to be easier to determine programmatically.

I would definitely like a change to the current format (I commented in that linked thread). But part of the reason the tzdb structure doesn't change often is the sheer number and variety of downstream consumers that have to be able to handle any new format.

justingrant commented 1 year ago

Yep, you're right: backzone was needed too. I'm building a quick proof-of-concept to understand how Intl is currently canonicalizing Links in the IANA Time Zone Database. Will share shortly. So far I see two results:

It looks like the existing TZDB does contain enough info to differentiate merges from synonyms, with a small number of possible exceptions that I'm digging into now.
Intl currently ignores the majority of non-UTC links today. Part of this probably reflects that the TZDB has not been updated recently in browsers. It also means that there would be a lot of broken apps if implementations were to match IANA behavior, which means that the status quo may also break the web. :-(

Will share more results when I finish the investigation.

justingrant commented 1 year ago

Initial investigation is complete. Results are here: https://4rylir.csb.app (full-screen view) and https://codesandbox.io/s/iana-vs-es-4rylir (source code). You can filter or sort to understand the various kinds of links.

Summary

Safari and Chrome report the same results, while Firefox is a bit different.
Of 244 Links in the latest 2022g release of TZDB, only 84 (98 in Firefox) Links are followed by Intl and the Temporal polyfill. The rest of the Links are:
- Not followed at all - 132 (118 in FF)
- Followed one step down a 2-step link chain, but not all the way - 3 (13 in FF)
- Canonicalized to a different name than IANA does - 10 (none in FF; these probably migrated to the "one step down the link chain" category above)
- UTC aliases - 15
Browsers (esp. non-FF) seem to be applying many overrides of the TZDB canonicalization.
Browsers seem to be far behind the latest TZDB data, implying a huge amount of canonicalization churn when they catch up... unless they choose to apply even more overrides.
Given the massive amount of churn that will happen when (if?) the latest TZDB is applied to browsers, I strongly recommend that no engine should upgrade TZDB until we figure out the longer-term plan to deal with the TZDB fork and ongoing changes.
It looks straightforward (see below for details) to use existing TZDB data (not the section comments) to automatically categorize each Link as a "synonym" or "merge", with almost no special-casing required.

Categorizing Synonyms vs. Merges

I took a first pass at classifying links as synonyms or merges based on the following algorithm:

If a Link is in backward but is also a Zone in backzone, then by definition it is a merge because it had separate pre-1970 rules.
If a Link goes through an intermediate second link before being resolved to a final name, then the first step is a synonym and the second is a merge. Presumably because there'd be no need for a 2-step process if the first was just a rename/synonym.
- It's possible that 2-step merges may be unnecessary using the PACKRATLIST (old style) data; I didn't check this because I just used the raw TZDB source files instead of generated files.
Because the second step in a 2-step link chain is always a merge, then the second step alone is also a merge. For example, there's a 2-step merge of Antarctica/South_Pole => Antarctica/McMurdo => Pacific/Auckland in the latest TZDB. Therefore, Antarctica/McMurdo => Pacific/Auckland is a merge.
Links resolving to UTC are always synonyms. In the web app linked below I broke them out into their own category for ease of filtering them out.
Any other Link not falling into the criteria above was categorized as a synonym.
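
A minimal sketch of the rules above, assuming tiny hand-written stand-ins for the backward Links and the backzone Zones rather than data parsed from real TZDB files. The names LINKS, BACKZONE_ZONES, UTC_TARGETS, resolveFinal and classifyLink are made up for illustration, the Etc/Universal entry is added only to exercise the UTC rule, and the order in which the rules are checked is an assumption.

```javascript
// Illustrative excerpt of Links (link name => direct target).
const LINKS = new Map([
  ["Asia/Calcutta", "Asia/Kolkata"],
  ["Europe/Copenhagen", "Europe/Berlin"],
  ["Antarctica/South_Pole", "Antarctica/McMurdo"],
  ["Antarctica/McMurdo", "Pacific/Auckland"],
  ["Etc/Universal", "Etc/UTC"],
]);

// Illustrative set of names that exist as separate Zones in backzone.
const BACKZONE_ZONES = new Set(["Europe/Copenhagen"]);

// Final targets treated as UTC for this sketch (an assumption).
const UTC_TARGETS = new Set(["Etc/UTC", "Etc/GMT"]);

// Follow the link chain to its final name, guarding against cycles.
function resolveFinal(id, links) {
  const seen = new Set();
  while (links.has(id) && !seen.has(id)) {
    seen.add(id);
    id = links.get(id);
  }
  return id;
}

// Classify the step from a Link to its direct target.
function classifyLink(link, links, backzoneZones) {
  const target = links.get(link);
  if (target === undefined) throw new RangeError(`${link} is not a Link`);
  if (UTC_TARGETS.has(resolveFinal(link, links))) return "utc";
  if (backzoneZones.has(link)) return "merge"; // had separate pre-1970 rules
  if (links.has(target)) return "synonym"; // first step of a 2-step chain
  const isSecondStep = [...links.values()].includes(link);
  return isSecondStep ? "merge" : "synonym";
}

for (const link of LINKS.keys()) {
  console.log(link, classifyLink(link, LINKS, BACKZONE_ZONES));
}
```

On this sample data, Asia/Calcutta and Antarctica/South_Pole come out as synonyms, Europe/Copenhagen and Antarctica/McMurdo as merges, and Etc/Universal as a UTC alias.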

I manually verified all 86 synonyms identified by the algorithm above. There were these patterns:

Places that were renamed, e.g. Calcutta => Kolkata with the old name kept around for backwards compatibility.
Different ways of describing the same place, e.g. America/Indianapolis vs. America/Indiana/Indianapolis
Backwards compatibility mappings of old-style zone names that predated adoption of the "CONTINENT/CITY" format, e.g. Iceland => Atlantic/Reykjavik, or PRC => Asia/Shanghai. While a pedantic person could argue that these are not synonyms, in practice these IDs are deprecated and shouldn't be used anyways, so IMO a synonym seems reasonable here.
Only 4 synonyms didn't fit the categories above. These are borderline cases that could be either merges or synonyms. They could be special-cased into merges by engines and/or CLDR, or could be left as synonyms because all of them are nearby locations in the same country that could be reasonably treated as the same zone and seem very unlikely to vary their zones from each other in the future.
- America/Atka => America/Adak - These are two sparsely populated Aleutian islands. Definitely different places, but merged in TZDB. I found no docs about why they were merged.
- America/Fort_Wayne => America/Indiana/Indianapolis - different cities in the same state that have the same rules. I could find no documentation about why America/Fort_Wayne has its own Link.
- America/Santa_Isabel => America/Tijuana - according to the TZDB's history docs, adding America/Santa_Isabel was a mistake based on bad source info, and this Link reverted this mistake.
- Pacific/Yap => Pacific/Truk (or its synonym Pacific/Chuuk) - These are two small islands in Micronesia which are close to each other and which, even in backzone seem to have had the same time zone rules. I found no docs explaining why they were merged.

I also manually checked through the Links identified as merges, and I was unable to find any that looked like they should be synonyms.

sffc commented 1 year ago

My initial reaction is that it's not the job of Temporal to tell implementations what they can/should and can't/shouldn't do in this area. I can at least say that any solution that involves "don't canonicalize time zone names" likely means that ICU's time zone utilities can no longer be used for data storage; they can be used for calculations, but Temporal glue code will need to be implemented to conform to the spec rather than just following ICU behavior as we've been doing for a long time.

justingrant commented 1 year ago

My initial reaction is that it's not the job of Temporal to tell implementations what they can/should and can't/shouldn't do in this area.

Before I did this research I probably would have agreed with you, but now that I've dug into the problem I'm quite concerned about the impact of canonicalization on the stability of ECMAScript code across engines and across time. From what I've seen, canonicalization changes very frequently, and implementations seem to vary quite a bit in how they apply canonicalization.

This has really made me question the value of exposing canonicalized IDs to userland developers. We've already seen (in this repo, in Chrome's bugs, etc.) user complaints about canonicalization when differences are usually limited to only minor variations like Calcutta vs. Kolkata. And that's with almost 2/3 of Links in the current IANA TZDB not being followed by engines to IANA's canonical IDs.

If engines start resolving Canadian time zones to Panama, Iceland to Cote d'Ivoire, and Stockholm to Berlin, we can expect many more complaints, user confusion, broken tests, etc.

Who'd be a good person to talk with to understand how ICU currently approaches this problem? How do they determine which Links to follow and which to ignore?

likely means that ICU's time zone utilities can no longer be used for data storage; they can be used for calculations, but Temporal glue code will need to be implemented to conform to the spec

I assume that implementations would need to store both the caller's (case-normalized) original string input as well as a pointer to the data structure that ICU uses to represent a canonicalized time zone. Is that what you mean by "storage"?

The stored string would be used by #2482's ToTemporalTimeZoneIdentifier, which in turn powers TimeZone's id and toString, ZDT's toString, etc. The ICU pointer would be used for all calculations. Does that match what you had in mind?

If we also wanted to offer a TimeZone.p.equals and if it only returned true for synonyms, then presumably there'd need to be support added (to ICU? by implementations?) to compare two time zones for "synonym equality" per discussion above. This wouldn't be needed if we don't offer this method, or if it compares only the id or ICU's fully-canonicalized identifier.

Other than above, what other glue code would be needed?

sffc commented 1 year ago

@yumaoka and @pedberg-icu know the most about ICU4C time zone handling.

For ICU4X, we currently persist time zones by BCP-47 ID. We can (or will be able to) take IANA strings and map them to BCP-47, and then we look up the canonical ID to go in the other direction. There is an issue (https://github.com/unicode-org/icu4x/issues/2909) discussing which source of truth we should use for canonicalization.

I'm currently neutral on the actual usability issue. I'm just pointing out that we're in effect moving more responsibility out of ICU[4X] and into the Temporal glue code. This logic about how to compare time zones for equality, what form of canonicalization to apply to them, etc., is not easy, as your OP shows. ICU/CLDR already solves these problems in its own way, as it has been doing for a long time. Moving these problems into Temporal glue code just makes Temporal harder to implement and harder to test. If the champions think that the problem is big enough to warrant the additional (nontrivial) implementation cost, so be it.

ptomato commented 1 year ago

If the champions think that the problem is big enough to warrant the additional (nontrivial) implementation cost, so be it.

I don't, for one! I think the TZDB fork is a problem which JS implementations can coordinate among themselves to solve. Pulling the responsibility for solving the problem into our domain will delay the proposal, while delivering an incomplete solution (because this is a problem that applies outside of Temporal as well, and those parts we can't solve.)

sffc commented 1 year ago

Question. Can this behavior be changed as a Temporal V2 follow-up?

Logistically, I think it's fair to say that moving forward with this change is going to delay Temporal implementations by another several months, given that we need to discuss this in various venues to achieve consensus, then write the spec text, then the tests, then the ICU functions discussed above, then in-flight implementations need to be updated.

justingrant commented 1 year ago

An appendix to the synonym vs. merge investigation above: CLDR helpfully provides synonym data here. Example:

        "inccu": {
          "_description": "Kolkata, India",
          "_alias": "Asia/Calcutta Asia/Kolkata"
        },

If CLDR is the source of truth for time zone identifiers, then it's easy to distinguish merges from aliases.
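A minimal sketch of how the `_alias` data could separate synonyms from merges, assuming the JSON shape shown in the excerpt. Only the `inccu` entry comes from the excerpt; the other two entries and the helper names `buildAliasIndex` and `areSynonyms` are hypothetical and exist only for illustration.

```javascript
// Sample shaped like the CLDR JSON excerpt. Only "inccu" is taken from the
// excerpt; the "deber" and "dkcph" entries are illustrative assumptions.
const sampleCldrZones = {
  inccu: { _description: "Kolkata, India", _alias: "Asia/Calcutta Asia/Kolkata" },
  deber: { _description: "Berlin, Germany", _alias: "Europe/Berlin" },
  dkcph: { _description: "Copenhagen, Denmark", _alias: "Europe/Copenhagen" },
};

// Hypothetical helper: map each lower-cased IANA name to its short CLDR key.
function buildAliasIndex(zones) {
  const index = new Map();
  for (const [key, entry] of Object.entries(zones)) {
    for (const name of entry._alias.split(/\s+/).filter(Boolean)) {
      index.set(name.toLowerCase(), key);
    }
  }
  return index;
}

// Hypothetical helper: two names are synonyms only if they share a CLDR key.
function areSynonyms(index, a, b) {
  const keyA = index.get(a.toLowerCase());
  const keyB = index.get(b.toLowerCase());
  return keyA !== undefined && keyA === keyB;
}

const aliasIndex = buildAliasIndex(sampleCldrZones);
console.log(areSynonyms(aliasIndex, "Asia/Calcutta", "Asia/Kolkata"));      // true
console.log(areSynonyms(aliasIndex, "Europe/Copenhagen", "Europe/Berlin")); // false
```

Under this model a TZDB merge such as Europe/Copenhagen => Europe/Berlin never shows up as a synonym, because the two names sit under different keys.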

TZDB fork is a problem which JS implementations can coordinate among themselves to solve.

My concern is that implementations have had years to do this coordination... and haven't done it. With Temporal V1 we have a one-time opportunity to reduce churn in the ecosystem forever... and from what I've seen coming down the road from IANA, avoiding the whole "what's the right canonical ID?" question forever (at least for Temporal) seems appealing.

For ICU4X, we currently persist time zones by BCP-47 ID.

Is the current plan for V8 to implement Temporal using ICU4C or ICU4X?

Question. Can this behavior be changed as a Temporal V2 follow-up?

TimeZone.p.equals could be deferred to V2. But in V1...

We'd need to decide whether TimeZone.p.id and ZDT.p.timeZoneId are canonicalized, and if so how. I doubt that changing this in V2 would be web-compatible. (One could argue that canonicalization is already unpredictable so making changes in V2 could be OK, but most of the web has kept Asia/Calcutta stable for over a decade. So I'm not sure that argument could get consensus.)
ZDT.p.equals needs to have an opinion about what time zone equality means. I doubt that it'd be web-compatible for the code below to return true in V1 and false in V2.
```
zdt = Temporal.ZonedDateTime.from('2020-01-01T00:00[Europe/Copenhagen]');
zdt.equals('2020-01-01T00:00[Europe/Berlin]');
```

One approach that I think might be web-compatible would be to not canonicalize TimeZone.p.id and ZDT.p.timeZoneId at all in V1 (except setting them to 'UTC' for backwards compat). Given that we'd document that all id comparisons should be case-insensitive, it might be web-compatible to do case-normalization on the identifier so that case-sensitive comparisons would work too. Not sure about this though.
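A rough sketch of the case-normalization idea, without following any Link. The `availableTimeZones` array is a stand-in for whatever list an engine actually supports, and `normalizeCase` is a hypothetical helper name.

```javascript
// Stand-in for the engine's list of supported identifiers (assumption).
const availableTimeZones = ["UTC", "Asia/Calcutta", "Asia/Kolkata", "Europe/Copenhagen"];

// Lookup table keyed by the lower-cased identifier.
const byLowerCase = new Map(availableTimeZones.map((id) => [id.toLowerCase(), id]));

// Hypothetical helper: return the identifier in its listed case.
// No Link is followed, so Asia/Calcutta stays Asia/Calcutta.
function normalizeCase(input) {
  const id = byLowerCase.get(input.toLowerCase());
  if (id === undefined) throw new RangeError(`Unknown time zone: ${input}`);
  return id;
}

console.log(normalizeCase("asia/calcutta"));     // "Asia/Calcutta"
console.log(normalizeCase("EUROPE/COPENHAGEN")); // "Europe/Copenhagen"
```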

Logistically, I think it's fair to say that moving forward with this change is going to delay Temporal implementations by another several months

Yep, agree. Although if we went with the "don't canonicalize IDs except UTC" solution above, that would require zero changes from ICU, and would only require a small change from implementers which could be bundled with the changes in #2482 which will already change how TimeZone slots are stored and used. The delta of additional implementer effort seems quite small.

But I agree that once we start asking for any different canonicalization behavior, this would introduce delay. Which might be an argument for the "no-canonicalize" solution or the "full canonicalize" status quo as the best options for V1.

sffc commented 1 year ago

If we let ICU keep canonicalizing the .id and .timeZoneId values, which are known to be variable over time, then a change where we standardize on one particular canonicalization solution over another is likely to be web-compatible.

In other words, if we went with option 3 now, we could adopt options 1 or 2 (or even 4) later.

Option 4 has implementation concerns just like options 1 and 2. The laundry list of 10 questions in the OP is well thought out, but they are questions we need to resolve if we were to implement option 4, and, again, Temporal needs to persist the user-specified time zone alongside the ICU time zone (unless it computes the ICU time zone on the fly when it is needed).

My concern is that implementations have had years to do this coordination... and haven't done it. With Temporal V1 we have a one-time opportunity to reduce churn in the ecosystem forever... and from what I've seen coming down the road from IANA, avoiding the whole "what's the right canonical ID?" question forever (at least for Temporal) seems appealing.

I don't think Temporal is the right vehicle to force this type of ecosystem change. Temporal is already a really tall order for implementations. I do hope that implementations would be more amenable to solving the problem if there were a future proposal narrowly focused on this problem space.

justingrant commented 1 year ago

Sharing more stuff I've learned: CLDR metadata, not IANA TZDB, is currently the source of time zone canonicalization mappings in ECMAScript engines, per this comment:

From ICU’s point of view, which one is main one, and which one is specified by Link - is not important, because we don’t really expose the zoneinfo data directly to API. CLDR defines a set of “canonical zone IDs” for stability reason - and for example, both Europe/Berlin and Europe/Oslo are “canonical” zones. We don’t handle them one is an alias of another.

I think this means that we don't really care that much about the TZDB fork, as long as:

Engines continue to use CLDR metadata to drive canonicalization behavior.
CLDR does not change its canonicalization model to follow IANA's aggressive merging.
Engines like FF that override CLDR behavior (e.g. to fix Calcutta=>Kolkata) also don't follow IANA's aggressive merging.
There's nothing in the Temporal spec that forces engines to use TZDB's canonicalization.

The last bullet is a problem! Currently the spec says this:

If ianaTimeZone is a Link name, let ianaTimeZone be the String value of the corresponding Zone name as specified in the file backward of the IANA Time Zone Database.

If ianaTimeZone is "Etc/UTC" or "Etc/GMT", return "UTC".

This language, combined with other spec text encouraging use of the latest TZDB, will force implementers to use IANA's canonicalization strategy because the spec text is very prescriptive about use of backward which now (at least in the default IANA build) aggressively merges.

If we do want engines (and not Temporal) to decide how canonicalization should work, then this spec text needs to change. Right?
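To make the concern concrete, here is a small sketch of the two quoted spec steps. The `links` map is a tiny assumed sample standing in for the Link lines of `backward` (the Copenhagen entry reflects the merge described earlier in this issue), and `canonicalizeQuotedSteps` is a hypothetical name, not the spec's abstract operation.

```javascript
// Tiny assumed sample standing in for Link lines in TZDB's `backward` file.
const links = new Map([
  ["Asia/Calcutta", "Asia/Kolkata"],
  ["Europe/Copenhagen", "Europe/Berlin"], // a merge in the default build
  ["Etc/UCT", "Etc/UTC"],
]);

// Hypothetical helper mirroring only the two quoted steps.
function canonicalizeQuotedSteps(ianaTimeZone) {
  // Step 1: a Link name is replaced by the corresponding Zone name.
  if (links.has(ianaTimeZone)) ianaTimeZone = links.get(ianaTimeZone);
  // Step 2: the UTC-equivalent zones collapse to "UTC".
  if (ianaTimeZone === "Etc/UTC" || ianaTimeZone === "Etc/GMT") return "UTC";
  return ianaTimeZone;
}

console.log(canonicalizeQuotedSteps("Asia/Calcutta"));     // "Asia/Kolkata"
console.log(canonicalizeQuotedSteps("Etc/UCT"));           // "UTC"
console.log(canonicalizeQuotedSteps("Europe/Copenhagen")); // "Europe/Berlin"
```

The last call is the problematic one: an engine following these steps against the default build has no way to tell a rename from a merge.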

sffc commented 1 year ago

Yeah, it makes a lot of sense to solve this in the section of 402 you're pointing to. I think there's already an issue open for it.

littledan commented 1 year ago

Given that this is already visible in 402, should Temporal be concerned with this issue specifically? Implementations already manage to choose to do something or other. We should just make sure that, whatever the result is, we apply it to 402 and Temporal equally.

justingrant commented 1 year ago

Yeah, it makes a lot of sense to solve this in the section of 402 you're pointing to. I think there's already an issue open for it.

@sffc Are you thinking of https://github.com/tc39/ecma402/issues/272? That issue seems a bit wider than just canonicalization, although it touches on some of the same questions.

Given that this is already visible in 402, should Temporal be concerned with this issue specifically?

@littledan Currently the only way to see the canonical ID, DateTimeFormat.p.resolvedOptions().timeZone, is quite hard to discover and has very limited impact because localization output doesn't vary by alias. Unless developers are specifically poking into that API, canonicalization won't affect them at all.

In a Temporal world, canonical IDs will be highly visible in output of ZonedDateTime.p.toString, ZonedDateTime.timeZoneId, and TimeZone.p.id. These strings will be used in comparison logic, will be stored in logs and databases, and developers will (rightly or not) probably expect them to be the same over time.

So although canonicalization exists in 402 today, it will have a lot more visibility and impact once Temporal ships in engines. Hence my concern!
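For reference, this is the one place where the canonical ID is observable today. The snippet uses only the existing `Intl.DateTimeFormat` API; the value it prints depends on the engine and its data version, so no expected output is shown.

```javascript
// Ask the engine which identifier it resolves "Asia/Kolkata" to.
const kolkataFormat = new Intl.DateTimeFormat("en", { timeZone: "Asia/Kolkata" });
const resolvedId = kolkataFormat.resolvedOptions().timeZone;

// Engine-dependent: per the discussion above, engines differ on Calcutta vs. Kolkata.
console.log(resolvedId);
```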

Disagree on this; custom time zones should be compared by referential object identity.

@gibson042 After #2482, if an object is in a ZDT's [[TimeZone]] slot, will we know if it's a custom zone or not? I'm OK to use Object.is to compare custom time zones as long as built-in time zone objects can still use the built-in comparison behavior. I do think it's a slippery slope though. If I subclass TimeZone in order to add a new method but don't change any of the built-in behavior, would I break equals? I'd also be OK with simply using id, e.g. if CLDR knows the ID then canonicalize it, otherwise just compare the string as-is. I don't have a strong opinion here.

justingrant commented 1 year ago

Based on discussion above, and given CLDR's synonym-only canonicalization strategy, I think we can narrow the decision to two basic choices below.

Note that neither option requires any change to ICU or CLDR.

A. Status quo: Follow Links + change 402 to codify existing CLDR practice.

Implementations continue using CLDR, not IANA TZDB, to decide canonicalization.
Unrelated to the Temporal and/or 402 specs, someone (CLDR? ICU? Another proposal? Implementations conspiring together?) figures out a cross-implementation way to fix the 13 outdated canonicalizations like Asia/Calcutta.
Change CanonicalizeTimeZoneName to permit (require?) use of CLDR instead of IANA data.

Pro: Less spec churn; Somewhat easier to implement. Con: Changing canonical aliases will be much less web-compatible.

B. Don't follow non-UTC Links when exposing time zone identifiers from Temporal objects

ZonedDateTime.timeZoneId, TimeZone.p.id, and toString/toJSON of both types would return the original identifier, normalized to the case present in AvailableTimeZones. (Case normalization is needed so that implementations can store a <10-bit enumeration instead of the user's input string.)
Link chains terminating in "Etc/UTC" or "Etc/GMT" are still followed and canonicalized to "UTC".
Add TimeZone.p.equals, using the same algorithm that ZonedDateTime.p.equals uses to compare time zones.
Everything else is the same as option (A), including using CLDR canonicalization for DateTimeFormat.p.resolvedOptions().timeZone and ZonedDateTime.p.equals. The general principle is that we retain (and reflect back) the identifier the caller provided, but when we want to *act* on that identifier by testing for equality, by doing math, by resolving options used for localization, or when emitting localized text, then Link following happens because all CLDR aliases act the same.

Pro: better web compatibility

Future changes to canonical aliases will break fewer apps, making it less likely that outdated aliases will be locked in place forever for fear of breaking the web.
Enables round-trip serialization of ZonedDateTime instances, even after a canonicalization change
Better interop between systems using different versions of CLDR (or between FF and Node/Chrome/Safari if they can't resolve their differences).

Con: More spec churn; Somewhat harder to implement.
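A sketch of what option (B) could look like from userland, assuming a tiny hand-written alias table. `SketchTimeZone` is a hypothetical class and not the Temporal API; the `key` values are placeholders for whatever canonical handle an implementation keeps internally.

```javascript
// Assumed alias table: lower-cased input -> { id to reflect back, canonical key }.
const zoneTable = new Map([
  ["asia/calcutta", { id: "Asia/Calcutta", key: "inccu" }],
  ["asia/kolkata", { id: "Asia/Kolkata", key: "inccu" }],
  ["europe/berlin", { id: "Europe/Berlin", key: "deber" }],
  ["etc/utc", { id: "UTC", key: "utc" }], // UTC Links are still followed
  ["utc", { id: "UTC", key: "utc" }],
]);

class SketchTimeZone {
  #id;  // identifier reflected back to the caller (case-normalized)
  #key; // canonical key used whenever the zone is acted upon

  constructor(input) {
    const entry = zoneTable.get(input.toLowerCase());
    if (entry === undefined) throw new RangeError(`Unknown time zone: ${input}`);
    this.#id = entry.id;
    this.#key = entry.key;
  }

  get id() { return this.#id; }
  toString() { return this.#id; }

  // Equality follows Links: synonyms compare equal even if their ids differ.
  equals(other) { return this.#key === other.#key; }
}

const calcutta = new SketchTimeZone("asia/calcutta");
const kolkata = new SketchTimeZone("Asia/Kolkata");
console.log(calcutta.id);                       // "Asia/Calcutta"
console.log(calcutta.equals(kolkata));          // true
console.log(new SketchTimeZone("Etc/UTC").id);  // "UTC"
```

Because the reflected id never changes when the alias table changes, a stored string round-trips even after a canonicalization update.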

In other words, if we went with option 3 now, we could adopt options 1 or 2 (or even 4) later.

Unfortunately, I don't think that (B) above is possible in a V2. For example, it would not be web-compatible to stop considering Asia/Calcutta and Asia/Kolkata as equivalent in ZonedDateTime.p.equals.

anba commented 1 year ago

A. Status quo: Follow Links + change 402 to codify existing CLDR practice.
* Implementations continue using CLDR, not IANA TZDB, to decide canonicalization.

Firefox doesn't use CLDR time zone canonicalisation, but IANA canonicalisation (including backzone) to follow ECMA-402 more closely, which only mentions IANA, but not CLDR. The overrides are in https://searchfox.org/mozilla-central/source/js/src/builtin/intl/TimeZoneDataGenerated.h.

* Change [`CanonicalizeTimeZoneName`](https://tc39.es/proposal-temporal/#sup-canonicalizetimezonename) to permit (require?) use of CLDR instead of IANA data.

CLDR has a stable time zone id policy, which can be problematic for some time zone ids. For example Europe/Kiev is forever the canonical id for Europe/Kyiv. This can lead to endless browser bug reports, similar to what happened for years on the IANA tz data mailing list. https://en.wikipedia.org/wiki/KyivNotKiev has more background information on this topic.

Yeah, it makes a lot of sense to solve this in the section of 402 you're pointing to. I think there's already an issue open for it.

@sffc Are you thinking of https://github.com/tc39/ecma402/issues/272? That issue seems a bit wider than just canonicalization, although it touches on some of the same questions.

https://github.com/tc39/ecma402/issues/272#issuecomment-423928522 has a link to this old bug report from bugs.ecmascript.org: https://tc39.es/archives/bugzilla/1892/.

Some missing bits which aren't yet covered here:

ICU doesn't actually use just CLDR time zone canonicalisation, but also adds its own backward compatibility data on top of it, see https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/tzcode/icuzones. Firefox has extra code to disable these non-IANA and non-CLDR time zone ids, but other browsers return ICU results unchanged. For example new Intl.DateTimeFormat("en", {timeZone: "BET"}) should throw, because "BET" is neither a valid IANA nor CLDR time zone id. So some sort of pre-/post-processing when using ICU is required anyway. (This is also an example where ICU differs from CLDR, e.g. SystemV time zones were removed from IANA in https://github.com/eggert/tz/commit/b3cf2ee42f0799e190c875f3af2ce6e5a7e287ce, ICU still keeps them as zones in icuzones, whereas CLDR uses links.)
ICU doesn't actually include any time zone transitions from backzone. For example new Intl.DateTimeFormat("en", {timeStyle: "full", timeZone: "Europe/Oslo"}).format(Date.UTC(1800, 0, 1)) returns "12:53:28 AM GMT+00:53:28". That's the offset for the IANA canonical time zone Europe/Berlin, Europe/Oslo has a different offset.

The overall situation is more like:

ICU canonicalises according to CLDR, but also applies its own backward compatibility zones/links.
ICU provides transition data for IANA canonical time zones (excluding backzone).
ICU provides localisations for CLDR canonical time zones resp. in most cases the time zone is actually mapped to a meta zone, also see https://github.com/unicode-org/cldr/blob/main/common/supplemental/metaZones.xml. For example Antarctica/McMurdo is a canonical CLDR time zone id, but it's mapped to the meta zone New_Zealand, which can give the (false) impression that it's treated as equivalent to Pacific/Auckland per the backward link from IANA. [1]

There are probably more special cases, too. For example take Canada/East-Saskatchewan: When using CLDR time zone information as the source of truth, TimeZoneIANANameComponent also needs to be changed to handle Canada/East-Saskatchewan, because that id is still valid for CLDR/ICU, but was removed some time ago from IANA, because the name is too long (exceeds the fourteen-character limit).
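A quick way to see the length problem, assuming the fourteen-character limit applies to each `/`-separated component. `overlongComponents` is a hypothetical helper, not part of any spec or library.

```javascript
// Limit cited in the comment above, applied per name component (assumption).
const MAX_COMPONENT_LENGTH = 14;

// Hypothetical helper: list the components of an id that exceed the limit.
function overlongComponents(id) {
  return id.split("/").filter((part) => part.length > MAX_COMPONENT_LENGTH);
}

console.log(overlongComponents("Canada/East-Saskatchewan")); // ["East-Saskatchewan"] (17 characters)
console.log(overlongComponents("Canada/Saskatchewan"));      // []
```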

[1] The meta zone mapping uses optional date information to handle the case when time zone rules change. When no date information is present, ICU restricts the range from 1970-01-01 to 9999-12-31, so it's best not to use dates more than fifty years in the past resp. dates too far into the future when testing this.

js> var dtf = new Intl.DateTimeFormat("en", {timeZone: "Antarctica/McMurdo", timeZoneName:"long"})
js> dtf.format(Date.UTC(1970, 0, 1))
"1/1/1970, New Zealand Standard Time"
js> dtf.format(Date.UTC(1970, 0, -1))
"12/30/1969, GMT+12:00"
js> dtf.format(Date.UTC(9999, 11, 31)) 
"12/31/9999, New Zealand Daylight Time"
js> dtf.format(Date.UTC(9999, 11, 31+1))   
"1/1/10000, GMT+13:00"

justingrant commented 1 year ago

Thanks, this is very useful info.

Firefox doesn't use CLDR time zone canonicalisation, but IANA canonicalisation (including backzone) to follow ECMA-402 more closely, which only mentions IANA, but not CLDR.

@anba - What is Firefox planning to do with the recent changes in IANA to merge unrelated zones together, for example, Europe/Stockholm => Europe/Berlin and Atlantic/Reykjavik => Africa/Abidjan? Are you planning to follow those links? Or are you planning to use the unmerged fork (https://github.com/JodaOrg/global-tz)? Or something else?

Once Temporal ships, these merges will be very problematic because time zone strings will be much more visible and will be persisted (e.g. in databases) and re-used far in the future. For example, imagine a calendar app that stores meeting times in a database using ZonedDateTime#toString. There's no guarantee that 2024-07-01T09:00[Atlantic/Reykjavik] and 2024-07-01T09:00[Africa/Abidjan] will refer to the same point in time in 2024. If Iceland or Côte d'Ivoire changes its time zone rules, then attendees will show up at the wrong time.

anba commented 1 year ago

Firefox examines the time zone information from backzone: any time zone rule within backzone will be treated as a canonical time zone id. Time zone links will also be canonicalised according to the information in backzone. For example backzone lists Atlantic/Reykjavik as a time zone rule, so Firefox treats it as a canonical time zone id. The link from Iceland will also be canonicalised according to the backzone info, i.e. it'll be canonicalised to Atlantic/Reykjavik.

For Atlantic/Reykjavik, this matches what ICU is already doing, therefore https://searchfox.org/mozilla-central/source/js/src/builtin/intl/TimeZoneDataGenerated.h doesn't include this mapping. (TimeZoneDataGenerated.h is generated by comparing the IANA rules and links, including backzone, against the time zone rules and links from ICU. We don't compare against CLDR, because ICU sometimes doesn't match CLDR time zone definitions.) But for example Asia/Chongqing is treated as a canonical time zone id, because there's a time zone rule for it in backzone and Asia/Chungking is canonicalised according to the backzone link to Asia/Chongqing. This doesn't match ICU, which treats both as links to Asia/Shanghai (matching the definitions in backward resp. common/bcp47/timezone.xml), therefore TimeZoneDataGenerated.h contains overrides to treat Asia/Chongqing as a zone and Asia/Chungking as a link to Asia/Chongqing.
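The comparison described above can be sketched roughly as follows. This is not Firefox's generator; the two maps are tiny samples built from the examples in this comment, and `computeOverrides` is a hypothetical name.

```javascript
// identifier -> canonical identifier, per IANA including backzone (sample).
const ianaWithBackzone = new Map([
  ["Atlantic/Reykjavik", "Atlantic/Reykjavik"],
  ["Asia/Chongqing", "Asia/Chongqing"],
  ["Asia/Chungking", "Asia/Chongqing"],
]);

// identifier -> canonical identifier, as ICU resolves it (sample).
const icuBehavior = new Map([
  ["Atlantic/Reykjavik", "Atlantic/Reykjavik"],
  ["Asia/Chongqing", "Asia/Shanghai"],
  ["Asia/Chungking", "Asia/Shanghai"],
]);

// Hypothetical helper: keep only the entries where ICU disagrees with the
// wanted result, since those are the ones needing an override.
function computeOverrides(wanted, actual) {
  const overrides = new Map();
  for (const [id, canonical] of wanted) {
    if (actual.get(id) !== canonical) overrides.set(id, canonical);
  }
  return overrides;
}

const overrides = computeOverrides(ianaWithBackzone, icuBehavior);
console.log([...overrides.keys()]); // Asia/Chongqing and Asia/Chungking, but not Atlantic/Reykjavik
```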

Using backzone avoids some potential issues, for example Europe/Ljubljana, Europe/Sarajevo, Europe/Skopje, and Europe/Zagreb are no longer canonicalised to Europe/Belgrade. Europe/Podgorica is still canonicalised to Europe/Belgrade, because there's no separate time zone rule for it in backzone. But that case is probably less complicated than the other cases, because there wasn't any open conflict between Serbia and Montenegro.

But just using backzone also means we have entries like Europe/Tiraspol as a canonical time zone id. Time zone transitions and date-time formatting will still handle it equivalently to Europe/Chisinau, though.

justingrant commented 1 year ago

That sounds like a good approach, and definitely better than the current main fork of TZDB. Do you know if what you're doing in FF varies from what https://github.com/JodaOrg/global-tz is doing? They sound quite similar.

justingrant commented 1 year ago

From Temporal and 402 meetings 2023-03-09, we'll follow up on this issue in two ways:

Editorial PR to make time zone canonicalization clearer/simpler in the spec, and to pave the way for...
Standalone proposal for normative changes to 402 to address the issues described above. Goal is to ask for Stage 1 of this proposal in March 2023 plenary.

In the meantime I'll close this issue to remove noise from the Temporal repo.

anba commented 1 year ago

That sounds like a good approach, and definitely better than the current main fork of TZDB. Do you know if what you're doing in FF varies from what https://github.com/JodaOrg/global-tz is doing? They sound quite similar.

I think TZDB with backzone is equivalent to global-tz with their backzone file. I can't easily tell if global-tz without their backzone is equivalent to TZDB with PACKRATLIST=zone.tab, because I don't want to go through each line of https://github.com/JodaOrg/global-tz/blob/main/actions.txt to check the computed zones and links. The NEWS file mentions that PACKRATDATA=backzone PACKRATLIST=zone.tab gives the same results as global-tz, though.

The aforementioned Europe/Tiraspol is an example where FF is different when compared against global-tz without their backzone file.

If we want to do exact comparisons, it's necessary to explicitly define which configuration is tested:

IANA TZDB: Configurations for PACKRATDATA and PACKRATLIST.
global-tz: With or without backzone?
CLDR: Only the data in common/bcp47/timezone.xml, or including <zoneAlias> from common/supplemental/supplementalMetadata.xml? Or the actual implementations in ICU4C, or ICU4J, or ICU4X? [1]

[1] It's likely that ICU4C and ICU4X will also have slightly different behaviour, because if ICU4X uses BCP-47 ids to store time zone ids, it can't represent the old and deprecated SystemV time zone ids, because those don't have a BCP-47 id. It could use <zoneAlias> to treat them as links, but it'll still be slightly different when compared to ICU4C, which is still supporting them as actual time zones. (Support for SystemV time zones doesn't matter at all for real-world usage, but when doing exact comparisons it'd be good to define which differences can be ignored.)