tc39 / proposal-temporal

Provides standard objects and functions for working with dates and times.
https://tc39.es/proposal-temporal/docs/

Reconsider time zone canonicalization behavior given forking of IANA Time Zone Database #2509

Closed justingrant closed 1 year ago

justingrant commented 1 year ago

While working on #2493, I learned that the IANA Time Zone Database has been forked due to a disagreement between that database's maintainer and some prominent users of the database.

Background

The two forks differ as follows:

You can read more about the fork in the TZDB mailing list archives. A few relevant threads:

The fork seems to represent a philosophical difference about the purpose of the TZDB. One camp (which includes the maintainer) sees the goal of TZDB as simply providing a way to convert post-1970 zoned timestamps into exact instants, and wants to reduce the TZDB size and maintenance hassle of dealing with pre-1970 data. The other camp (supporting the unmerged fork) adds additional use cases:

I'm not sure how much Temporal cares about pre-1970 dates, but the latter two issues seem quite important to Temporal users. The second one will make calendaring apps more resilient to country-level timezone/DST changes, while the third will prevent developer confusion and consternation.

Also, given the complaints about the changes, it's possible that the TZDB may revert these changes in the future, which would cause further churn.

Options

Anyway, now that we know this fork exists, we need to figure out what to do about it in the Temporal spec. Options include:

1. Recommend that implementers use the Primary Fork

2. Recommend that implementers use the Unmerged Fork

3. Don't recommend anything; implementers are free to choose.

4. Stop canonicalizing time zones (thanks to @pipobscure for this suggestion)

Discussion

Of the above options, my strong preference is for (4), because it solves both the forking issue as well as the existing canonicalization issues like Calcutta vs. Kolkata. Also, I think retaining user input as-is will be quite helpful to reduce confusion in cases where code takes input from some other source, modifies that data, and then sends or stores the modified data. If the time zone identifier varies a lot between the original and modified ZDT, I think that will generate user confusion that avoiding canonicalization would prevent.

If we want to go with (4), here are a few questions to answer:

If we add equals, here's a suggestion for its behavior:

Pinging @jasonwilliams @ptomato @sffc @gibson042 @pipobscure for your opinions.

gibson042 commented 1 year ago

I like option 4, but would rename it to something like "Canonicalization doesn't follow Links" and would also make clear that Link chains terminating in "Etc/UTC" or "Etc/GMT" are still followed and canonicalized to "UTC" (unless that option splits into e.g. 4a and 4b with different conclusions regarding this behavior).

  • i) How should Intl.DateTimeFormat.p.resolvedOptions().timeZone behave? Should it also stop canonicalizing? If yes, should it add a new canonicalTimeZone property?

That definitely gets into backwards compatibility territory, and it is plausible—if not likely—that some existing code already uses dtf.resolvedOptions().timeZone to get the host-reported canonical spelling of a time zone name. So I don't think changing that is on the table.

  • ii) Will there be any change to user-visible output of Intl.DateTimeFormat.p.format or Date.p.toLocaleString? I suspect that the answer is "no" because localized descriptions of time zones don't usually surface the IANA identifiers, but not 100% sure about this.

https://github.com/tc39/ecma402/issues/119 does propose exposing the IANA names, but I don't recall any existing text requiring implementations to do so (although they certainly could anyway; ECMA-402 gives formatters lots of flexibility).

  • iii) What changes (if any) would be required to CLDR and/or ICU to support this change?

I defer to @sffc.

  • iv) Even if we avoid the canonicalization mess, there's still the pre-1970-data question. The unmerged fork will have it, the merged fork will only have it for the merged zones. This would mean, for example, that Europe/Copenhagen pre-1970 results could vary by fork. So which fork should we recommend that implementers use? I don't have a strong opinion here. It'd be nice to understand the size of pre-1970 data to know how much smaller browser downloads would get if this data were removed.

I don't have a strong opinion here either.

  • v) Should case differences still be canonicalized, e.g. Europe/Paris vs. europe/paris? My opinion: yes, we should canonicalize.

Yes. Canonicalization still happens, it just doesn't follow Links (except for special-casing GMT and UTC).

  • vi) Should spelling differences due to renaming also be canonicalized, e.g. Asia/Calcutta vs. Asia/Kolkata? My opinion: no, because by not canonicalizing id in this case we can avoid user complaints like this chromium bug, and we can ensure future compatibility & round-trippability even if zones are renamed in the future. Note that equals should probably report these as true though. (See below.)

No. Doing this would make the behavior less comprehensible and would sacrifice potential benefits.

  • vii) Should we add a TimeZone.p.equals method? I think we should, both for consistency across Temporal types and to help code be robust in the face of past or future renames of cities which seems to happen fairly often globally. JS code should be able to ask "Is this date in the India time zone" without having to worry that that code will be broken by a past or future rename.

There should definitely be some way to identify that there is a Link chain establishing equality between two time zones with different names, and ideally a way to determine its directionality (e.g., detecting that Atlantic/Reykjavik is a Link to Africa/Abidjan rather than the reverse).

  • viii) If we add equals should we also add a method that tests if all rules are the same across time zones, e.g. Atlantic/Reykjavik vs. Africa/Abidjan? I don't think this is needed. Userland code can always use getNextTransition in a loop to check for this kind of equality, and if there's user demand we could always add it in a later release.

Agreed; not needed at this time.
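
The userland loop mentioned in (viii) could look roughly like the sketch below. It does not call Temporal; it assumes a hypothetical minimal zone shape with getNextTransition(epochMs) and getOffsetAt(epochMs), and the helper names (makeMockZone, haveSameRules) and the mock data are invented for illustration. It only compares rules from a chosen start instant forward.

```javascript
// Builds a mock zone from a base offset (minutes) and a sorted list of
// [epochMs, offsetMinutesAfterTransition] pairs. Hypothetical shape, not Temporal's API.
function makeMockZone(baseOffset, transitions) {
  return {
    getOffsetAt(epochMs) {
      let offset = baseOffset;
      for (const [at, after] of transitions) {
        if (at <= epochMs) offset = after;
      }
      return offset;
    },
    getNextTransition(epochMs) {
      const next = transitions.find(([at]) => at > epochMs);
      return next ? next[0] : null;
    },
  };
}

// Walks both zones transition by transition, starting at startMs.
function haveSameRules(a, b, startMs, maxSteps = 10000) {
  if (a.getOffsetAt(startMs) !== b.getOffsetAt(startMs)) return false;
  let cursor = startMs;
  for (let step = 0; step < maxSteps; step++) {
    const nextA = a.getNextTransition(cursor);
    const nextB = b.getNextTransition(cursor);
    if (nextA !== nextB) return false; // transitions at different instants
    if (nextA === null) return true; // both zones have no further transitions
    if (a.getOffsetAt(nextA) !== b.getOffsetAt(nextA)) return false;
    cursor = nextA;
  }
  return true; // no difference found within the step budget
}

// Made-up data: two zones with no transitions, one with two transitions.
const zoneA = makeMockZone(0, []);
const zoneB = makeMockZone(0, []);
const zoneC = makeMockZone(60, [[1000, 120], [2000, 60]]);
console.log(haveSameRules(zoneA, zoneB, 0), haveSameRules(zoneA, zoneC, 0));
```

The step budget is there because a real zone with recurring DST rules never runs out of transitions, so a real comparison would need some cutoff.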

  • ix) How should UTC zone be handled? I think this is straightforward: all zones whose canonical identifier is Etc/UTC should resolve to UTC in ECMAScript, matching current behavior. There's no value in changing this existing behavior.

Note that current behavior also maps "Etc/GMT" to "UTC": https://tc39.es/proposal-temporal/#sup-canonicalizetimezonename
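
A minimal sketch of "canonicalize case, don't follow Links, but still special-case UTC", as discussed in (v), (vi) and (ix). The identifier list is a tiny hypothetical stand-in for whatever an engine actually knows, and normalizeWithoutFollowingLinks is an invented name, not spec text.

```javascript
// Hypothetical, tiny stand-in for the identifiers an engine knows about.
const KNOWN_IDS = ['UTC', 'Etc/UTC', 'Etc/GMT', 'Asia/Calcutta', 'Asia/Kolkata', 'Europe/Paris'];
const BY_LOWERCASE = new Map(KNOWN_IDS.map((id) => [id.toLowerCase(), id]));
// Identifiers that the current spec text maps to "UTC".
const UTC_SPELLINGS = new Set(['Etc/UTC', 'Etc/GMT']);

function normalizeWithoutFollowingLinks(input) {
  // Case differences are still normalized...
  const id = BY_LOWERCASE.get(String(input).toLowerCase());
  if (id === undefined) throw new RangeError(`Unknown time zone: ${input}`);
  // ...and the UTC special case is kept, but no other Link is followed.
  return UTC_SPELLINGS.has(id) ? 'UTC' : id;
}

console.log(normalizeWithoutFollowingLinks('europe/paris'));
console.log(normalizeWithoutFollowingLinks('Asia/Calcutta'));
console.log(normalizeWithoutFollowingLinks('etc/gmt'));
```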

  • x) In order for the PACKRATLIST option to work, TZDB data must provide a way to differentiate "merged" links like Atlantic/Reykjavik => Africa/Abidjan from "renamed" links like Asia/Calcutta vs. Asia/Kolkata. How does this differentiation work, and does it work for all links or are there gaps? It sounds like @anba may know how this works.

AFAICT, there's no explicit differentiation... rather, just a zone.tab file identifying for each ISO 3166-1 alpha-2 country code the corresponding time zones, which can themselves identify Links (as is currently the case for DE/IS/etc.).

If we add equals, here's a suggestion for its behavior:

  • It should accept objects or strings.
  • If the receiver and/or the argument is a custom zone, use its id property.

Disagree on this; custom time zones should be compared by referential object identity. A custom time zone that happens to have id "UTC" is not equal to the built-in UTC time zone, and two custom time zones with the same id can have very different behavior. This probably means that distinct objects representing built-in time zones should also be reported non-equal, and it might therefore make sense to act on strings only (since authors already have === and Object.is for comparing objects). But I'd prefer to keep object input, such that Temporal.TimeZone.equals(zdt.timeZone, zdt.timeZone) is always acceptable (with an internal implementation that uses Link-aware canonicalization when both inputs are strings and otherwise validates both inputs as time zones using standard behavior such as ToTemporalTimeZone [or a more appropriate alternative without e.g. timeZone and ToString fallbacks] and compares the result using SameValue).
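
One way to read the comparison described above, as a rough sketch: strings are compared after following Links, and anything else is compared by identity. The Link table holds only the two pairs mentioned in this thread, timeZoneEquals is a plain function invented here rather than a Temporal API, and the input validation and case normalization the comment mentions are left out.

```javascript
// Illustrative Link table (Link name -> target); a real one would come from tzdata.
const LINKS = new Map([
  ['Asia/Calcutta', 'Asia/Kolkata'],
  ['Atlantic/Reykjavik', 'Africa/Abidjan'],
]);

// Follows a Link chain to its end, guarding against accidental cycles.
function followLinks(id) {
  const seen = new Set();
  let current = id;
  while (LINKS.has(current) && !seen.has(current)) {
    seen.add(current);
    current = LINKS.get(current);
  }
  return current;
}

function timeZoneEquals(a, b) {
  if (typeof a === 'string' && typeof b === 'string') {
    return followLinks(a) === followLinks(b); // Link-aware string comparison
  }
  return Object.is(a, b); // objects (e.g. custom zones): identity only
}

const customUtc = { id: 'UTC' };
console.log(timeZoneEquals('Asia/Calcutta', 'Asia/Kolkata'));
console.log(timeZoneEquals(customUtc, { id: 'UTC' }));
console.log(timeZoneEquals(customUtc, customUtc));
```

Under this reading, a custom object whose id happens to be "UTC" never equals the string "UTC", which is the behavior argued for above.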

  • Treat different casings as equal, e.g. Europe/Paris vs. europe/paris.
  • If both receiver and argument canonicalize to Etc/UTC then treat them as equal.

:+1:

  • Treat different spellings of the same location as equal, e.g. Asia/Calcutta vs. Asia/Kolkata, because they represent the same thing with different spelling.
  • DO NOT treat different locations (like Atlantic/Reykjavik vs. Africa/Abidjan) as equal, even if all their time zone transitions are the same, because future changes could make those locations have different time zone rules. Per above, if users want to evaluate "all rules are the same" they can do this in userland by comparing time zone transitions in a loop. Although honestly I'm skeptical that this will be a popular use case. Who cares if the rules are equal?

I also disagree on these last points, although really it seems to be mostly a question of modeling: I don't think Temporal should classify Links as "alternate spelling" vs. "real", but rather just treat any Link relationship as establishing equality. In an implementation that uses the standard IANA time zone data, Atlantic/Reykjavik and Africa/Abidjan are equal until and unless a policy change causes them to diverge.

gilmoreorless commented 1 year ago
  • Treat different spellings of the same location as equal, e.g. Asia/Calcutta vs. Asia/Kolkata, because they represent the same thing with different spelling.
  • DO NOT treat different locations (like Atlantic/Reykjavik vs. Africa/Abidjan) as equal, even if all their time zone transitions are the same, because future changes could make those locations have different time zone rules. Per above, if users want to evaluate "all rules are the same" they can do this in userland by comparing time zone transitions in a loop. Although honestly I'm skeptical that this will be a popular use case. Who cares if the rules are equal?

I also disagree on these last points, although really it seems to be mostly a question of modeling: I don't think Temporal should classify Links as "alternate spelling" vs. "real", but rather just treat any Link relationship as establishing equality. In an implementation that uses the standard IANA time zone data, Atlantic/Reykjavik and Africa/Abidjan are equal until and unless a policy change causes them to diverge.

I agree with @gibson042, partly for a far more practical reason: maintenance. The IANA source doesn't have any API differences between "links due to similar clocks" and "links due to renames". The backward file was tidied up in the wake of this forking discussion — it now has commented groups of links based on their reasons. But this is only a convention in a single file, and not guaranteed to be a stable API.

If Temporal was to distinguish between the two cases in an API, there would need to be a stable maintenance process for adding brand new links to the correct category.

justingrant commented 1 year ago

I also disagree on these last points, although really it seems to be mostly a question of modeling: I don't think Temporal should classify Links as "alternate spelling" vs. "real", but rather just treat any Link relationship as establishing equality. In an implementation that uses the standard IANA time zone data, Atlantic/Reykjavik and Africa/Abidjan are equal until and unless a policy change causes them to diverge.

My assumption is that, ideally, there'd be two categories of links:

The first type of link (let's call them "synonyms") conveys no semantic value. Programs will never behave differently depending on which ones you use (other than when comparing the id strings themselves).

The second type of link (let's call them "merges") conveys semantically different information that could change the behavior of future programs beyond string comparison.

The particular use case I had in mind, where it's helpful to know that difference, is when a program has logic like this: "I want to do special processing for timestamps for X" (where "X" is a particular country like India or Sweden). Like this:

if (Temporal.TimeZone.from('Europe/Copenhagen').equals(zdt.timeZoneId)) {
  // do Denmark-specific stuff
} else {
  // non-Denmark-specific logic
}

It would be bad if future changes in the spelling of the desired English transliteration of "Copenhagen" caused the code above to break. So it's probably good practice for any code that checks for a specific time zone (or that wants to compare two ZDT timestamps to know if they're semantically identical) to use equals instead of comparing id.

But it'd also be bad if the price of protecting against future spelling changes meant that you'd mistakenly run jurisdiction-specific logic for other jurisdictions that coincidentally share the same time zone rules.

It's true that, continuing that example above, if Denmark split into multiple time zones then the code above would break. But I think this is OK, because the change happened in Denmark so of course Denmark-specific code will need to change. My main concern is that if you treat all aliases the same, then equals becomes riskier because you can never predict what other semantically-different zones are being lumped into the same bucket.
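
To make the concern concrete, here is a small sketch contrasting the two possible equals behaviors. Both tables are illustrative one-entry stand-ins (the synonym/merge split itself is the assumption under discussion), and synonymEquals and anyLinkEquals are invented names.

```javascript
// "Synonyms": renames of the same location.
const SYNONYMS = new Map([['Asia/Calcutta', 'Asia/Kolkata']]);
// "Merges": different locations that currently share rules in the default IANA build.
const MERGES = new Map([['Europe/Copenhagen', 'Europe/Berlin']]);

// Equality that only looks through synonyms.
function synonymEquals(a, b) {
  const resolve = (id) => SYNONYMS.get(id) ?? id;
  return resolve(a) === resolve(b);
}

// Equality that looks through every Link, synonym or merge.
function anyLinkEquals(a, b) {
  const resolve = (id) => SYNONYMS.get(id) ?? MERGES.get(id) ?? id;
  return resolve(a) === resolve(b);
}

// Denmark-specific code would also fire for Berlin under anyLinkEquals.
console.log(synonymEquals('Europe/Copenhagen', 'Europe/Berlin'));
console.log(anyLinkEquals('Europe/Copenhagen', 'Europe/Berlin'));
console.log(synonymEquals('Asia/Calcutta', 'Asia/Kolkata'));
```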

So I do think there's a case that being able to distinguish these cases is important. But...

I agree with @gibson042, partly for a far more practical reason: maintenance. The IANA source doesn't have any API differences between "links due to similar clocks" and "links due to renames". The backward file was tidied up in the wake of this forking discussion — it now has commented groups of links based on their reasons. But this is only a convention in a single file, and not guaranteed to be a stable API.

One possible (needs validation) solution using existing data would be to use zone.tab which includes pre-merge data. If a link from backward is also present in zone.tab then it's a merge, otherwise it's a synonym. I haven't done the work to validate that this will work perfectly, though!

If Temporal was to distinguish between the two cases in an API, there would need to be a stable maintenance process for adding brand new links to the correct category.

Agree, if the approach above won't work. We'd want to work with the IANA folks (or maybe ICU/CLDR?) to ensure that distinction is maintained in the future via some other solution.

There are fewer than 300 total links so this isn't a lot of ongoing maintenance work (would probably add <1hr/year of work to someone's plate) but someone would have to be willing to commit to doing the work long-term.

BTW, I'd volunteer to make an initial PR into TZDB, if it's decided that this split would be good to maintain AND if the data files need to change somehow.

gibson042 commented 1 year ago

One possible (needs validation) solution using existing data would be to use zone.tab which includes pre-merge data. If a link from backward is also present in zone.tab then it's a merge, otherwise it's a synonym. I haven't done the work to validate that this will work perfectly, though!

You'd also need to consider backzone, because e.g. Africa/Timbuktu does not appear in zone.tab but is a "merge" (to use your term) of Africa/Abidjan in the primary data but (presumably) a synonym of Africa/Bamako in the pre-1970 data, and I think the same applies to everything in the "Non-zone.tab locations with timestamps since 1970 that duplicate those of an existing location" section mentioned below.
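
The heuristic from the last two comments can be sketched as a single lookup: a Link name that also appears in zone.tab or is defined in backzone is treated as a merge, anything else as a synonym. This is unvalidated, as noted above. The sets below are small illustrative samples, not the real file contents, and classifyLink is an invented helper.

```javascript
// Illustrative samples only; real sets would be parsed from the tzdata files.
const zoneTabIds = new Set(['Europe/Copenhagen', 'Atlantic/Reykjavik']);
const backzoneIds = new Set(['Africa/Timbuktu']);

// Assumption: presence in zone.tab or backzone means the name once had
// (or still has, in some builds) its own data, so the Link is a "merge".
function classifyLink(linkName, zoneTab, backzone) {
  if (zoneTab.has(linkName) || backzone.has(linkName)) return 'merge';
  return 'synonym';
}

for (const name of ['Europe/Copenhagen', 'Africa/Timbuktu', 'Asia/Calcutta']) {
  console.log(name, classifyLink(name, zoneTabIds, backzoneIds));
}
```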

if the approach above won't work. We'd want to work with the IANA folks (or maybe ICU/CLDR?) to ensure that distinction is maintained in the future via some other solution.

That seems like a goal that exceeds the scope of Temporal v1.

BTW, I'd volunteer to make an initial PR into TZDB, if it's decided that this split would be good to maintain AND if the data files need to change somehow.

AFAICT tzdata Links are all created equal—the only existing data that could be used is unstructured section-heading comment text like "Pre-2013 practice, which typically had a Zone per zone.tab line" and "Non-zone.tab locations with timestamps since 1970 that duplicate those of an existing location". So I guess you'd be proposing something like a new merged file that exclusively contains the content from those section(s) and a Temporal equality comparison that ignores its contents?

gilmoreorless commented 1 year ago

BTW, I'd volunteer to make an initial PR into TZDB, if it's decided that this split would be good to maintain AND if the data files need to change somehow.

It's probably best to read this whole discussion thread first: https://mm.icann.org/pipermail/tz/2021-November/031074.html That thread is what eventually produced the current grouped-under-comment-headings format of backward, despite calls for the changes to be easier to determine programmatically.

I would definitely like a change to the current format (I commented in that linked thread). But part of the reason the tzdb structure doesn't change often is the sheer number and variety of downstream consumers that have to be able to handle any new format.

justingrant commented 1 year ago

Yep, you're right: backzone was needed too. I'm building a quick proof-of-concept to understand how Intl is currently canonicalizing Links in the IANA Time Zone Database. Will share shortly. So far I see two results:

Will share more results when I finish the investigation.

justingrant commented 1 year ago

Initial investigation is complete. Results are here: https://4rylir.csb.app (full-screen view) and https://codesandbox.io/s/iana-vs-es-4rylir (source code). You can filter or sort to understand the various kinds of links.

Summary

Categorizing Synonyms vs. Merges

I took a first pass at classifying links as synonyms or merges based on the following algorithm:

I manually verified all 86 synonyms identified by the algorithm above. There were these patterns:

I also manually checked through the Links identified as merges, and I was unable to find any that looked like they should be synonyms.

sffc commented 1 year ago

My initial reaction is that it's not the job of Temporal to tell implementations what they can/should and can't/shouldn't do in this area. I can at least say that any solution that involves "don't canonicalize time zone names" likely means that ICU's time zone utilities can no longer be used for data storage; they can be used for calculations, but Temporal glue code will need to be implemented to conform to the spec rather than just following with ICU behavior as we've been doing for a long time.

justingrant commented 1 year ago

My initial reaction is that it's not the job of Temporal to tell implementations what they can/should and can't/shouldn't do in this area.

Before I did this research I probably would have agreed with you, but now that I've dug into the problem I'm quite concerned about the impact of canonicalization on the stability of ECMAScript code across engines and across time. From what I've seen, canonicalization changes very frequently, and implementations seem to vary quite a bit in how they apply canonicalization.

This has really made me question the value of exposing canonicalized IDs to userland developers. We've already seen (in this repo, in Chrome's bugs, etc.) user complaints about canonicalization when differences are usually limited to only minor variations like Calcutta vs. Kolkata. And that's with almost 2/3 of Links in the current IANA TZDB not being followed by engines to IANA's canonical IDs.

If engines start resolving Canadian time zones to Panama, Iceland to Cote d'Ivoire, and Stockholm to Berlin, we can expect many more complaints, user confusion, broken tests, etc.

Who'd be a good person to talk with to understand how ICU currently approaches this problem? How do they determine which Links to follow and which to ignore?

likely means that ICU's time zone utilities can no longer be used for data storage; they can be used for calculations, but Temporal glue code will need to be implemented to conform to the spec

I assume that implementations would need to store both the caller's (case-normalized) original string input as well as a pointer to the data structure that ICU uses to represent a canonicalized time zone. Is that what you mean by "storage"?

The stored string would be used by #2482's ToTemporalTimeZoneIdentifier, which in turn powers TimeZone's id and toString, ZDT's toString, etc. The ICU pointer would be used for all calculations. Does that match what you had in mind?
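
A rough sketch of the "store both" idea, assuming the input has already been case-normalized. The one-entry Link table stands in for ICU's canonicalization, a plain string stands in for the ICU pointer, and StoredTimeZone is an invented class, not proposed spec machinery.

```javascript
// Stand-in for ICU's canonicalization; a single illustrative entry.
const ICU_CANONICAL = new Map([['Asia/Calcutta', 'Asia/Kolkata']]);

class StoredTimeZone {
  #calculationId; // stand-in for the pointer to ICU's time zone data

  constructor(id) {
    this.id = id; // the caller's identifier, exposed unchanged
    this.#calculationId = ICU_CANONICAL.get(id) ?? id;
  }

  toString() {
    return this.id; // round-trips what the caller passed in
  }

  // Calculations would go through the canonical data, so two spellings agree.
  usesSameDataAs(other) {
    return this.#calculationId === other.#calculationId;
  }
}

const calcutta = new StoredTimeZone('Asia/Calcutta');
const kolkata = new StoredTimeZone('Asia/Kolkata');
console.log(calcutta.toString(), kolkata.toString(), calcutta.usesSameDataAs(kolkata));
```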

If we also wanted to offer a TimeZone.p.equals and if it only returned true for synonyms, then presumably there'd need to be support added (to ICU? by implementations?) to compare two time zones for "synonym equality" per discussion above. This wouldn't be needed if we don't offer this method, or if it compares only the id or ICU's fully-canonicalized identifier.

Other than above, what other glue code would be needed?

sffc commented 1 year ago

@yumaoka and @pedberg-icu know the most about ICU4C time zone handling.

For ICU4X, we currently persist time zones by BCP-47 ID. We can (or will be able to) take IANA strings and map them to BCP-47, and then we lookup the canonical ID to go in the other direction. There is an issue (https://github.com/unicode-org/icu4x/issues/2909) discussing which source of truth we should use for canonicalization.

I'm currently neutral on the actual usability issue. I'm just pointing out that we're in effect moving more responsibility out of ICU[4X] and into the Temporal glue code. This logic about how to compare time zones for equality, what form of canonicalization to apply to them, etc., is not easy, as your OP shows. ICU/CLDR already solves these problems in its own way, as it has been doing for a long time. Moving these problems into Temporal glue code just makes Temporal harder to implement and harder to test. If the champions think that the problem is big enough to warrant the additional (nontrivial) implementation cost, so be it.

ptomato commented 1 year ago

If the champions think that the problem is big enough to warrant the additional (nontrivial) implementation cost, so be it.

I don't, for one! I think the TZDB fork is a problem which JS implementations can coordinate among themselves to solve. Pulling the responsibility for solving the problem into our domain will delay the proposal, while delivering an incomplete solution (because this is a problem that applies outside of Temporal as well, and those parts we can't solve.)

sffc commented 1 year ago

Question. Can this behavior be changed as a Temporal V2 follow-up?

Logistically, I think it's fair to say that moving forward with this change is going to delay Temporal implementations by another several months, given that we need to discuss this in various venues to achieve consensus, then write the spec text, then the tests, then the ICU functions discussed above, then in-flight implementations need to be updated.

justingrant commented 1 year ago

An appendix to the synonym vs. merge investigation above: CLDR helpfully provides synonym data here. Example:

        "inccu": {
          "_description": "Kolkata, India",
          "_alias": "Asia/Calcutta Asia/Kolkata"
        },

If CLDR is the source of truth for time zone identifiers, then it's easy to distinguish merges from aliases.
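
As a sketch of how data in that shape could answer "are these two names synonyms?": index every name in _alias by its short ID, then compare short IDs. The first entry is the example above; the second is an entry of the same shape added here for contrast and not copied from CLDR. buildAliasIndex and areSynonyms are invented helper names.

```javascript
const cldrSample = {
  inccu: { _description: 'Kolkata, India', _alias: 'Asia/Calcutta Asia/Kolkata' },
  // Added for contrast, in the same shape as the entry above.
  deber: { _description: 'Berlin, Germany', _alias: 'Europe/Berlin' },
};

// Maps every IANA name listed in an _alias string to its short ID.
function buildAliasIndex(data) {
  const index = new Map();
  for (const [shortId, entry] of Object.entries(data)) {
    for (const name of entry._alias.split(' ')) {
      index.set(name, shortId);
    }
  }
  return index;
}

// Two names are synonyms if both are known and share a short ID.
function areSynonyms(index, a, b) {
  const idA = index.get(a);
  return idA !== undefined && idA === index.get(b);
}

const aliasIndex = buildAliasIndex(cldrSample);
console.log(areSynonyms(aliasIndex, 'Asia/Calcutta', 'Asia/Kolkata'));
console.log(areSynonyms(aliasIndex, 'Asia/Kolkata', 'Europe/Berlin'));
```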

TZDB fork is a problem which JS implementations can coordinate among themselves to solve.

My concern is that implementations have had years to do this coordination... and haven't done it. With Temporal V1 we have a one-time opportunity to reduce churn in the ecosystem forever... and from what I've seen coming down the road from IANA, avoiding the whole "what's the right canonical ID?" question forever (at least for Temporal) seems appealing.

For ICU4X, we currently persist time zones by BCP-47 ID.

Is the current plan for V8 to implement Temporal using ICU4C or ICU4X?

Question. Can this behavior be changed as a Temporal V2 follow-up?

TimeZone.p.equals could be deferred to V2. But in V1...

One approach that I think might be web-compatible would be to not canonicalize TimeZone.p.id and ZDT.p.timeZoneId at all in V1 (except setting them to 'UTC' for backwards compat). Given that we'd document that all id comparison should be case-insensitive, it might be web-compatible to do case-normalization on the identifier so that case-sensitive comparisons would work too. Not sure about this though.

Logistically, I think it's fair to say that moving forward with this change is going to delay Temporal implementations by another several months

Yep, agree. Although if we went with the "don't canonicalize IDs except UTC" solution above, that would require zero changes from ICU, and would only require a small change from implementers which could be bundled with the changes in #2482 which will already change how TimeZone slots are stored and used. The delta of additional implementer effort seems quite small.

But I agree that once we start asking for any different canonicalization behavior, this would introduce delay. Which might be an argument for the "no-canonicalize" solution or the "full canonicalize" status quo as the best options for V1.

sffc commented 1 year ago

If we let ICU keep canonicalizing the .id and .timeZoneId values, which are known to be variable over time, then a change where we standardize on one particular canonicalization solution over another is likely to be web-compatible.

In other words, if we went with option 3 now, we could adopt options 1 or 2 (or even 4) later.

Option 4 has implementation concerns just like options 1 and 2. The laundry list of 10 questions in the OP is well thought out, but they are questions we need to resolve if we were to implement option 4, and, again, Temporal needs to persist the user-specified time zone alongside the ICU time zone (unless it computes the ICU time zone on the fly when it is needed).

My concern is that implementations have had years to do this coordination... and haven't done it. With Temporal V1 we have a one-time opportunity to reduce churn in the ecosystem forever... and from what I've seen coming down the road from IANA, avoiding the whole "what's the right canonical ID?" question forever (at least for Temporal) seems appealing.

I don't think Temporal is the right vehicle to force this type of ecosystem change. Temporal is already a really tall order for implementations. I do hope that implementations would be more amenable to solving the problem if there were a future proposal narrowly focused on this problem space.

justingrant commented 1 year ago

Sharing more stuff I've learned: CLDR metadata, not IANA TZDB, is currently the source of time zone canonicalization mappings in ECMAScript engines, per this comment:

From ICU’s point of view, which one is main one, and which one is specified by Link - is not important, because we don’t really expose the zoneinfo data directly to API. CLDR defines a set of “canonical zone IDs” for stability reason - and for example, both Europe/Berlin and Europe/Oslo are “canonical” zones. We don’t handle them one is an alias of another.

I think this means that we don't really care that much about the TZDB fork, as long as:

The last bullet is a problem! Currently the spec says this:

  1. If ianaTimeZone is a Link name, let ianaTimeZone be the String value of the corresponding Zone name as specified in the file backward of the IANA Time Zone Database.
  2. If ianaTimeZone is "Etc/UTC" or "Etc/GMT", return "UTC".
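
Read literally, the two steps quoted above amount to something like the sketch below. The Link table is a two-entry illustration standing in for the backward file, and canonicalizeFollowingLinks is an invented name for this sketch, not the spec's abstract operation.

```javascript
// Illustrative stand-in for Link lines in the backward file (Link name -> Zone name).
const BACKWARD_LINKS = new Map([
  ['Asia/Calcutta', 'Asia/Kolkata'],
  ['Europe/Copenhagen', 'Europe/Berlin'],
]);

function canonicalizeFollowingLinks(ianaTimeZone) {
  let id = ianaTimeZone;
  // Step 1: a Link name is replaced by its corresponding Zone name.
  if (BACKWARD_LINKS.has(id)) id = BACKWARD_LINKS.get(id);
  // Step 2: the UTC spellings collapse to "UTC".
  if (id === 'Etc/UTC' || id === 'Etc/GMT') return 'UTC';
  return id;
}

console.log(canonicalizeFollowingLinks('Europe/Copenhagen'));
console.log(canonicalizeFollowingLinks('Etc/GMT'));
```

With a table built from the default IANA build, this is where a rename and a merge are treated the same way, which is the concern raised in this comment.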

This language, combined with other spec text encouraging use of the latest TZDB, will force implementers to use IANA's canonicalization strategy because the spec text is very prescriptive about use of backward which now (at least in the default IANA build) aggressively merges.

If we do want engines (and not Temporal) to decide how canonicalization should work, then this spec text needs to change. Right?

sffc commented 1 year ago

Yeah, it makes a lot of sense to solve this in the section of 402 you're pointing to. I think there's already an issue open for it.

littledan commented 1 year ago

Given that this is already visible in 402, should Temporal be concerned with this issue specifically? Implementations already manage to choose to do something or other. We should just make sure that, whatever the result is, we apply it to 402 and Temporal equally.

justingrant commented 1 year ago

Yeah, it makes a lot of sense to solve this in the section of 402 you're pointing to. I think there's already an issue open for it.

@sffc Are you thinking of https://github.com/tc39/ecma402/issues/272? That issue seems a bit wider than just canonicalization, although it touches on some of the same questions.

Given that this is already visible in 402, should Temporal be concerned with this issue specifically?

@littledan Currently the only way to learn the canonical ID is quite hard to discover (DateTimeFormat.p.resolvedOptions().timeZone), and it has very limited impact because localization output doesn't vary by alias. Unless developers are specifically poking into that API, canonicalization won't affect them at all.

In a Temporal world, canonical IDs will be highly visible in output of ZonedDateTime.p.toString, ZonedDateTime.timeZoneId, and TimeZone.p.id. These strings will be used in comparison logic, will be stored in logs and databases, and developers will (rightly or not) probably expect them to be the same over time.

So although canonicalization exists in 402 today, it will have a lot more visibility and impact once Temporal ships in engines. Hence my concern!
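
For reference, this is the existing 402 path for observing an engine's canonicalization today. It uses only Intl.DateTimeFormat and resolvedOptions(); resolvedTimeZone is just a local helper name. Which spelling comes back depends on the engine and its data, so no output is assumed here.

```javascript
// Returns whatever identifier the host reports for the given input.
function resolvedTimeZone(input) {
  return new Intl.DateTimeFormat('en', { timeZone: input }).resolvedOptions().timeZone;
}

const fromCalcutta = resolvedTimeZone('Asia/Calcutta');
const fromKolkata = resolvedTimeZone('Asia/Kolkata');

// Engine-dependent: each result may be either spelling.
console.log(fromCalcutta, fromKolkata);
```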

Disagree on this; custom time zones should be compared by referential object identity.

@gibson042 After #2482, if an object is in a ZDT's [[TimeZone]] slot, will we know if it's a custom zone or not? I'm OK to use Object.is to compare custom time zones as long as built-in time zone objects can still use the built-in comparison behavior. I do think it's a slippery slope though. If I subclass TimeZone in order to add a new method but don't change any of the built-in behavior, would I break equals? I'd also be OK with simply using id, e.g. if CLDR knows the ID then canonicalize it, otherwise just compare the string as-is. I don't have a strong opinion here.

justingrant commented 1 year ago

Based on discussion above, and given CLDR's synonym-only canonicalization strategy, I think we can narrow the decision to two basic choices below.

Note that neither option requires any change to ICU or CLDR.

A. Status quo: Follow Links + change 402 to codify existing CLDR practice.

Pro: Less spec churn; somewhat easier to implement.

Con: Changing canonical aliases will be much less web-compatible.

B. Don't follow non-UTC Links when exposing time zone identifiers from Temporal objects

Pro: better web compatibility

Con: More spec churn; Somewhat harder to implement.

In other words, if we went with option 3 now, we could adopt options 1 or 2 (or even 4) later.

Unfortunately, I don't think that (B) above is possible in a V2. For example, it would not be web-compatible to stop considering Asia/Calcutta and Asia/Kolkata as equivalent in ZonedDateTime.p.equals.

anba commented 1 year ago

A. Status quo: Follow Links + change 402 to codify existing CLDR practice.

  • Implementations continue using CLDR, not IANA TZDB, to decide canonicalization.

Firefox doesn't use CLDR time zone canonicalisation, but IANA canonicalisation (including backzone) to follow ECMA-402 more closely, which only mentions IANA, but not CLDR. The overrides are in https://searchfox.org/mozilla-central/source/js/src/builtin/intl/TimeZoneDataGenerated.h.

* Change [`CanonicalizeTimeZoneName`](https://tc39.es/proposal-temporal/#sup-canonicalizetimezonename) to permit (require?) use of CLDR instead of IANA data.

CLDR has a stable time zone id policy, which can be problematic for some time zone ids. For example Europe/Kiev is forever the canonical id for Europe/Kyiv. This can lead to endless browser bug reports, similar to what happened for years on the IANA tz data mailing list. https://en.wikipedia.org/wiki/KyivNotKiev has more background information on this topic.

Yeah, it makes a lot of sense to solve this in the section of 402 you're pointing to. I think there's already an issue open for it.

@sffc Are you thinking of https://github.com/tc39/ecma402/issues/272? That issue seems a bit wider than just canonicalization, although it touches on some of the same questions.

https://github.com/tc39/ecma402/issues/272#issuecomment-423928522 has a link to this old bug report from bugs.ecmascript.org: https://tc39.es/archives/bugzilla/1892/.


Some missing bits which aren't yet covered here:

  1. ICU doesn't actually use just CLDR time zone canonicalisation, but also adds its own backward compatibility data on top of it, see https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/tzcode/icuzones. Firefox has extra code to disable these non-IANA and non-CLDR time zone ids, but other browsers return ICU results unchanged. For example new Intl.DateTimeFormat("en", {timeZone: "BET"}) should throw, because "BET" is neither a valid IANA nor CLDR time zone id. So some sort of pre-/post-processing when using ICU is required anyway. (This is also an example where ICU differs from CLDR, e.g. SystemV time zones were removed from IANA in https://github.com/eggert/tz/commit/b3cf2ee42f0799e190c875f3af2ce6e5a7e287ce, ICU still keeps them as zones in icuzones, whereas CLDR uses links.)
  2. ICU doesn't actually include any time zone transitions from backzone. For example new Intl.DateTimeFormat("en", {timeStyle: "full", timeZone: "Europe/Oslo"}).format(Date.UTC(1800, 0, 1)) returns "12:53:28 AM GMT+00:53:28". That's the offset for the IANA canonical time zone Europe/Berlin; Europe/Oslo has a different offset.
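Both points can be probed in any engine with a small script. This is only a sketch: `isAcceptedTimeZone` and `formatIn1800` are hypothetical helper names, and the output depends on the engine and on the ICU and tzdata versions it ships, so no expected results are shown.

```javascript
// Returns true if the engine accepts the identifier, false if it throws RangeError.
function isAcceptedTimeZone(id) {
  try {
    new Intl.DateTimeFormat("en", { timeZone: id });
    return true;
  } catch (e) {
    if (e instanceof RangeError) return false;
    throw e;
  }
}

// Formats 1800-01-01T00:00Z in the given zone, exposing the pre-1970 offset in use.
function formatIn1800(id) {
  return new Intl.DateTimeFormat("en", { timeStyle: "full", timeZone: id })
    .format(Date.UTC(1800, 0, 1));
}

for (const id of ["BET", "Europe/Oslo", "Europe/Berlin"]) {
  console.log(id, isAcceptedTimeZone(id) ? formatIn1800(id) : "rejected");
}
```

If Europe/Oslo and Europe/Berlin print the same string, the engine is not using backzone transition data for Oslo.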

The overall situation is more like:

  1. ICU canonicalises according to CLDR, but also applies its own backward compatibility zones/links.
  2. ICU provides transition data for IANA canonical time zones (excluding backzone).
  3. ICU provides localisations for CLDR canonical time zones resp. in most cases the time zone is actually mapped to a meta zone, also see https://github.com/unicode-org/cldr/blob/main/common/supplemental/metaZones.xml. For example Antarctica/McMurdo is a canonical CLDR time zone id, but it's mapped to the meta zone New_Zealand, which can give the (false) impression that it's treated as equivalent to Pacific/Auckland per the backward link from IANA. [1]

There are probably more special cases, too. Take Canada/East-Saskatchewan, for example: when using CLDR time zone information as the source of truth, TimeZoneIANANameComponent also needs to be changed to handle Canada/East-Saskatchewan, because that id is still valid for CLDR/ICU but was removed from IANA some time ago because the name is too long (it exceeds the fourteen-character limit).
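The length problem is easy to see with a simplified check. The regular expression below is only an approximation of the name-component rule, written for this example (it ignores digits and the legacy names that the grammar handles separately), and `fitsComponentLimit` is a hypothetical name.

```javascript
// Simplified approximation: a leading letter, "." or "_", followed by up to
// 13 letters, ".", "-" or "_", for 14 characters in total per component.
const COMPONENT = /^[A-Za-z._][A-Za-z._-]{0,13}$/;

function fitsComponentLimit(id) {
  return id
    .split("/")
    .every((part) => part !== "." && part !== ".." && COMPONENT.test(part));
}

console.log("East-Saskatchewan".length);
console.log(fitsComponentLimit("Canada/Saskatchewan"));
console.log(fitsComponentLimit("Canada/East-Saskatchewan"));
```

A grammar that takes CLDR as the source of truth would need to relax this limit, or special-case the id, for Canada/East-Saskatchewan to parse.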


[1] The meta zone mapping uses optional date information to handle the case when time zone rules change. When no date information is present, ICU restricts the range from 1970-01-01 to 9999-12-31, so it's best not to use dates more than fifty years in the past resp. dates too far into the future when testing this.

js> var dtf = new Intl.DateTimeFormat("en", {timeZone: "Antarctica/McMurdo", timeZoneName:"long"})
js> dtf.format(Date.UTC(1970, 0, 1))
"1/1/1970, New Zealand Standard Time"
js> dtf.format(Date.UTC(1970, 0, -1))
"12/30/1969, GMT+12:00"
js> dtf.format(Date.UTC(9999, 11, 31)) 
"12/31/9999, New Zealand Daylight Time"
js> dtf.format(Date.UTC(9999, 11, 31+1))
"1/1/10000, GMT+13:00"

justingrant commented 1 year ago

Thanks, this is very useful info.

Firefox doesn't use CLDR time zone canonicalisation, but IANA canonicalisation (including backzone) to follow ECMA-402 more closely, which only mentions IANA, but not CLDR.

@anba - What is Firefox planning to do with the recent changes in IANA to merge unrelated zones together, for example, Europe/Stockholm => Europe/Berlin and Atlantic/Reykjavik => Africa/Abidjan? Are you planning to follow those links? Or are you planning to use the unmerged fork (https://github.com/JodaOrg/global-tz)? Or something else?

Once Temporal ships, these merges will be very problematic because time zone strings will be much more visible and will be persisted (e.g. in databases) and re-used far in the future. For example, imagine a calendar app that stores meeting times in a database using ZonedDateTime#toString. There's no guarantee that 2024-07-01T09:00[Atlantic/Reykjavik] and 2024-07-01T09:00[Africa/Abidjan] will refer to the same point in time in 2024. If Iceland or Côte d'Ivoire changes its time zone, then attendees will show up at the wrong time.
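To see why the merged identifiers are only interchangeable by coincidence, one can compare the UTC offsets the engine reports for each zone at a given instant. This is a sketch using only `Intl.DateTimeFormat`; `offsetMinutes` is a hypothetical helper, and the result reflects whatever tzdata the engine currently ships.

```javascript
// Hypothetical helper: UTC offset in minutes of `timeZone` at `epochMs`,
// derived by reading the wall-clock fields back as if they were UTC.
function offsetMinutes(timeZone, epochMs) {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hourCycle: "h23",
    year: "numeric",
    month: "numeric",
    day: "numeric",
    hour: "numeric",
    minute: "numeric",
    second: "numeric",
  }).formatToParts(epochMs);

  const f = {};
  for (const { type, value } of parts) f[type] = Number(value);

  const asUtc = Date.UTC(f.year, f.month - 1, f.day, f.hour, f.minute, f.second);
  return Math.round((asUtc - epochMs) / 60000);
}

const meeting = Date.UTC(2024, 6, 1, 9, 0); // 2024-07-01T09:00Z
for (const id of ["Atlantic/Reykjavik", "Africa/Abidjan"]) {
  console.log(id, offsetMinutes(id, meeting));
}
```

If the two offsets match today, that is a property of the current rules, not of the identifiers. A stored string that was rewritten from one id to the other silently follows the other country's future rule changes.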

anba commented 1 year ago

Firefox examines the time zone information from backzone; any time zone rule within backzone will be treated as a canonical time zone id. Time zone links will also be canonicalised according to the information in backzone. For example, backzone lists Atlantic/Reykjavik as a time zone rule, so Firefox treats it as a canonical time zone id. The link from Iceland will also be canonicalised according to the backzone info, i.e. it'll be canonicalised to Atlantic/Reykjavik.

For Atlantic/Reykjavik, this matches what ICU is already doing, therefore https://searchfox.org/mozilla-central/source/js/src/builtin/intl/TimeZoneDataGenerated.h doesn't include this mapping. (TimeZoneDataGenerated.h is generated by comparing the IANA rules and links, including backzone, against the time zone rules and links from ICU. We don't compare against CLDR, because ICU sometimes doesn't match CLDR time zone definitions.) But for example Asia/Chongqing is treated as a canonical time zone id, because there's a time zone rule for it in backzone and Asia/Chungking is canonicalised according to the backzone link to Asia/Chongqing. This doesn't match ICU, which treats both as links to Asia/Shanghai (matching the definitions in backward resp. common/bcp47/timezone.xml), therefore TimeZoneDataGenerated.h contains overrides to treat Asia/Chongqing as a zone and Asia/Chungking as a link to Asia/Chongqing.
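The override generation described above can be sketched as a diff between two zone/link tables. The data below is a tiny hand-written excerpt based only on the examples in this comment, and `canonicalize` and `computeOverrides` are hypothetical names; the real generator works from the full IANA files (including backzone) and ICU's data.

```javascript
// Wanted behaviour: IANA rules and links, including backzone (excerpt).
const iana = {
  zones: new Set(["Asia/Shanghai", "Asia/Chongqing", "Atlantic/Reykjavik"]),
  links: new Map([
    ["Asia/Chungking", "Asia/Chongqing"],
    ["Iceland", "Atlantic/Reykjavik"],
  ]),
};

// Actual behaviour: what ICU does for the same ids (excerpt).
const icu = {
  zones: new Set(["Asia/Shanghai", "Atlantic/Reykjavik"]),
  links: new Map([
    ["Asia/Chongqing", "Asia/Shanghai"],
    ["Asia/Chungking", "Asia/Shanghai"],
    ["Iceland", "Atlantic/Reykjavik"],
  ]),
};

// A zone is its own canonical id; a link resolves to its target.
function canonicalize(data, id) {
  if (data.zones.has(id)) return id;
  return data.links.get(id) ?? null;
}

// Record every id whose canonical form differs between the two tables.
function computeOverrides(wanted, actual) {
  const overrides = new Map();
  for (const id of [...wanted.zones, ...wanted.links.keys()]) {
    const target = canonicalize(wanted, id);
    if (canonicalize(actual, id) !== target) overrides.set(id, target);
  }
  return overrides;
}

const overrides = computeOverrides(iana, icu);
console.log([...overrides]);
```

With this excerpt, only Asia/Chongqing and Asia/Chungking end up in the override table; Atlantic/Reykjavik and Iceland already agree, so they need no entry.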

Using backzone avoids some potential issues, for example Europe/Ljubljana, Europe/Sarajevo, Europe/Skopje, and Europe/Zagreb are no longer canonicalised to Europe/Belgrade. Europe/Podgorica is still canonicalised to Europe/Belgrade, because there's no separate time zone rule for it in backzone. But that case is probably less complicated than the other cases, because there wasn't any open conflict between Serbia and Montenegro.

But just using backzone also means we have entries like Europe/Tiraspol as a canonical time zone id. Time zone transitions and date-time formatting will still handle it equivalently to Europe/Chisinau, though.

justingrant commented 1 year ago

That sounds like a good approach, and definitely better than the current main fork of TZDB. Do you know if what you're doing in FF varies from what https://github.com/JodaOrg/global-tz is doing? They sound quite similar.

justingrant commented 1 year ago

From Temporal and 402 meetings 2023-03-09, we'll follow up on this issue in two ways:

  1. Editorial PR to make time zone canonicalization clearer/simpler in the spec, and to pave the way for...
  2. Standalone proposal for normative changes to 402 to address the issues described above. Goal is to ask for Stage 1 of this proposal in March 2023 plenary.

In the meantime I'll close this issue to remove noise from the Temporal repo.

anba commented 1 year ago

That sounds like a good approach, and definitely better than the current main fork of TZDB. Do you know if what you're doing in FF varies from what https://github.com/JodaOrg/global-tz is doing? They sound quite similar.

I think TZDB with backzone is equivalent to global-tz with their backzone file. I can't easily tell if global-tz without their backzone is equivalent to TZDB with PACKRATLIST=zone.tab, because I don't want to go through each line of https://github.com/JodaOrg/global-tz/blob/main/actions.txt to check the computed zones and links. The News file mentions that PACKRATDATA=backzone PACKRATLIST=zone.tab gives the same results as global-tz, though.

The aforementioned Europe/Tiraspol is an example where FF is different when compared against global-tz without their backzone file.

If we want to do exact comparisons, it's necessary to explicitly define which configuration is tested:

  1. IANA TZDB: Configurations for PACKRATDATA and PACKRATLIST.
  2. global-tz: With or without backzone?
  3. CLDR: Only the data in common/bcp47/timezone.xml, or including <zoneAlias> from common/supplemental/supplementalMetadata.xml? Or the actual implementations in ICU4C, or ICU4J, or ICU4X? [1]

[1] It's likely that ICU4C and ICU4X will also have slightly different behaviour, because if ICU4X uses BCP-47 ids to store time zone ids, it can't represent the old and deprecated SystemV time zone ids, because those don't have a BCP-47 id. It could use <zoneAlias> to treat them as links, but it'll still be slightly different when compared to ICU4C, which is still supporting them as actual time zones. (Support for SystemV time zones doesn't matter at all for real-world usage, but when doing exact comparisons it'd be good to define which differences can be ignored.)