python / steering-council

Communications from the Steering Council

Submission for SC consideration: PEP 615 "Support for IANA Time Zones in the Standard Library" #22

Closed pganssle closed 4 years ago

pganssle commented 4 years ago

I would like to submit PEP 615, adding a new zoneinfo module to the standard library, for consideration to the Steering Council.

I have advertised the PEP on the datetime-SIG and python-dev mailing lists, and solicited outside feedback from the IANA tz mailing list. The discussion on the discourse has settled down, and it seems that all open issues are resolved.

A reference implementation for the zoneinfo module and the first-party PyPI package tzdata are available to use.

Would it be possible for the Steering Council to consider this proposal or assign a BDFL-delegate?

I realize that I haven't left much time for this, but I was hoping (for reasons of whimsy) to get this accepted on Sunday, April 5th either between 02:00-04:00 UTC or between 13:00 and 17:30 UTC, since those times represent ambiguous datetimes somewhere on earth (mostly in Australia). There is one other opportunity for this, which is that on Sunday April 19th, the hours between 01:00 and 03:00 UTC are ambiguous in Western Sahara. I believe these are the last ambiguous datetimes before the feature freeze for Python 3.9 -- feel free to ignore this as it is not as important as getting the implementation right, or respecting the steering council's time and other priorities, but I thought it might be fun.

Relevant links:

PEP 615: https://www.python.org/dev/peps/pep-0615/
Discussion: https://discuss.python.org/t/pep-615-support-for-the-iana-time-zone-database-in-the-standard-library/3468
Reference implementation: https://github.com/pganssle/zoneinfo
Reference implementation for tzdata: https://github.com/pganssle/tzdata

warsaw commented 4 years ago

Thanks very much for all your work on this PEP, and datetime/timezone support in Python. The SC discussed this ticket at today's meeting. I have a few questions, but no blocking dissent.

pganssle commented 4 years ago

Thanks for the quick turnaround on this!

  • The PEP seems to mention that the cache should be clearable, but I didn't see any discussion about the cache clearing API. Should that be explicitly defined?

Oops, that is an oversight! I meant to document this when I added the ability to clear the cache for individual keys. I'll add that to the PEP tonight. For now you can see what it looks like in the reference implementation - you can clear the entire cache or you can clear individual keys.
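A toy sketch of the cache-clearing semantics described here, as a stand-in for the reference implementation rather than its actual code (the `only_keys` parameter name is an assumption):

```python
# Hypothetical sketch: a cached constructor whose cache can be cleared
# wholesale or per key. Not the reference implementation's actual code.
class CachedZone:
    _cache = {}

    def __new__(cls, key):
        # The primary constructor returns the cached instance for a key.
        instance = cls._cache.get(key)
        if instance is None:
            instance = super().__new__(cls)
            instance.key = key
            cls._cache[key] = instance
        return instance

    @classmethod
    def clear_cache(cls, *, only_keys=None):
        # Clear everything, or just the named keys (assumed parameter name).
        if only_keys is None:
            cls._cache.clear()
        else:
            for key in only_keys:
                cls._cache.pop(key, None)
```

After `clear_cache`, constructing the same key again yields a fresh object, so `is` comparisons against pre-clear instances fail by design.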

  • Why does ZoneInfo.__str__() return the key rather than (or perhaps in addition to) define something like a .key attribute?

I think there's nothing stopping us from having a public .key attribute, but I also think that the key is the most natural string representation for a ZoneInfo, so I think it makes sense to return the key from __str__(), which would allow you to easily do stuff like this:

>>> dt = datetime(2020, 3, 30, 15, 3, 51, tzinfo=ZoneInfo("America/New_York"))
>>> f"{dt:%Y-%m-%d %H:%M:%S%z} [{dt.tzinfo}]"
'2020-03-30 15:03:51-0400 [America/New_York]'

It's also what pytz does (and I was never happy with what dateutil does, which is just returning the __repr__).

I left off exposing .key as unnecessary since the information is available by calling str(ZoneInfo), but in my original design, if no key were specified (e.g. ZoneInfo.from_file(some_file)), str() returned an empty string - making it easy to test whether str had failed. Now that __str__() falls back to __repr__(), it's a bit harder to test for that, so maybe it's best to add the .key attribute in so people can easily check if x.key is None.

  • The PEP is explicit that the __repr__() be the key, but doesn't explain why.

Do you mean that it's explicit about the __repr__ not being the key? I can clarify in the PEP, but I think the idea is that I wanted people to be able to serialize their ZoneInfo with str() and be guaranteed a failure if the original ZoneInfo falls back to the __repr__.

  • I was confused by the term "first party" to describe the tzdata module on PyPI. Isn't it a third party module by definition of not being in the stdlib?

I meant that it is "first party" in the sense that it is an official subproject of CPython / a PSF-owned project distributed using PyPI rather than in the main source tree. For example, hypothesis uses the same terminology to describe their first party extensions, which are part of hypothesis core, but not installed by default. I can clarify this in a footnote or change it if you'd prefer, but the concept I wanted to get across was that we wouldn't be picking an existing third party project and blessing it / adopting it, we'd be releasing a companion module.

  • What happens if no zone info is available? Does the ZoneInfo constructor fail? If so, what exception gets raised?

Yes, the constructor fails. I don't think that it's a good idea to make a distinction between "there is no ZoneInfo data available at all" and "this particular key is missing", because that gets into all kinds of shades of ambiguity, like "Does an empty /usr/share/zoneinfo directory count as being available?" or a weird situation where there are files in one of the directories on the time zone search path, but none of them are valid time zone files. Best to say, "We couldn't find the one you're looking for.", since in most cases that is enough for the user anyway.

With regards to the type of the error, I think the reference implementation raises ValueError, but I think KeyError, FileNotFoundError or a custom exception subtype would all also be valid choices IMO. I can add that to the PEP tonight - I'm thinking that a custom exception type would be best, but that exception type should probably be a subclass of one of the builtin exception types. I don't feel strongly about which one, and I'm inclined to not bother specifying in the PEP (just specify it in the documentation once the implementation is done), but I also am fine with just picking something now and sticking with it.
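A hedged sketch of the "custom exception subtype" option mentioned above. The name `ZoneInfoNotFoundError` and the `KeyError` base are one possible choice for illustration, not a decision recorded at this point in the thread:

```python
# Hypothetical: a custom exception subclassing one of the builtin types.
class ZoneInfoNotFoundError(KeyError):
    """Raised when no time zone data can be found for a requested key."""


def load_zone(key, available_keys):
    # Toy lookup: report only "we couldn't find the one you're looking for",
    # without distinguishing "no data at all" from "this key is missing".
    if key in available_keys:
        return key  # stand-in for the loaded zone data
    raise ZoneInfoNotFoundError(f"No time zone found with key {key!r}")
```

Subclassing a builtin exception keeps existing `except KeyError:` handlers working while still letting callers catch the zoneinfo-specific failure.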

warsaw commented 4 years ago

Oops, that is an oversight! I meant to document this when I added the ability to clear the cache for individual keys. I'll add that to the PEP tonight. For now you can see what it looks like in the reference implementation - you can clear the entire cache or you can clear individual keys.

The reference implementation API looks great to me. Thanks for updating the PEP to match that.

Now that str() falls back to repr(), it's a bit harder to test for that, so maybe it's best to add the .key attribute in so people can easily check if x.key is None.

Works for me! Thank you for the rationale behind the __str__() decision. I'd like to see that added to the PEP, but it's not a blocker. I recommend adding .key to the PEP.

Do you mean that it's explicit about the repr not being the key? I can clarify in the PEP, but I think the idea is that I wanted people to be able to serialize their ZoneInfo with str() and be guaranteed a failure if the original ZoneInfo falls back to the repr.

Oops, yes, I meant not. When you say "serialize ... with str()", given the pickleability of ZoneInfo options, what other kinds of serialization are you thinking about? Also, to guarantee a failure on fallback to __repr__(), wouldn't that be on deserialization/unpickling instead of the serialization/pickling side? The serialization wouldn't fail. Clarity in the PEP around this decision would help, thanks!

I meant that it is "first party" in the sense that it is an official subproject of CPython / a PSF-owned project distributed using PyPI rather than in the main source tree.

Cool, although perhaps some clarity in the PEP the first time you use this term would be helpful.

With regards to the type of the error, I think the reference implementation raises ValueError, but I think KeyError, FileNotFoundError or a custom exception subtype would all also be valid choices IMO.

My preference would be a custom subclass of KeyError.

Thanks again Paul!

pganssle commented 4 years ago

I have opened https://github.com/python/peps/pull/1350 to address this feedback. Thanks so much for the feedback @warsaw, this has clarified many points!

pganssle commented 4 years ago

When you say "serialize ... with str()", given the pickleability of ZoneInfo options, what other kinds of serialization are you thinking about? Also, to guarantee a failure on fallback to __repr__(), wouldn't that be on deserialization/unpickling instead of the serialization/pickling side? The serialization wouldn't fail. Clarity in the PEP around this decision would help, thanks!

Oh, I forgot to respond to this: it is already the case that pickling an object created with ZoneInfo.from_file (whether or not it has a key) will raise an exception, so it's not a question of serializing by pickle. What I meant was that these keys are valid keys for this data across many contexts, and you may need to pass them to something like a CLDR library, or you may want to serialize your datetimes in JSON as "local time, time-zone key" pairs, or something similar to that.
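A sketch of the "local time plus time-zone key" serialization mentioned above. The key here is written out by hand for illustration; with a ZoneInfo object you would read it from the tzinfo's key instead:

```python
import json
from datetime import datetime

# Serialize an aware datetime as a (local time, IANA key) pair.
# The key string is hard-coded here; in practice it would come from
# the tzinfo attached to the datetime.
dt = datetime(2020, 3, 30, 15, 3, 51)  # naive "local" time for the sketch
payload = json.dumps({"local": dt.isoformat(), "tz": "America/New_York"})

# On the receiving side, recover the pair; a real consumer would then
# re-attach a zone looked up by the key.
decoded = json.loads(payload)
restored = datetime.fromisoformat(decoded["local"])
```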

I agree that it would be better if it could fail when you try to serialize it, but I don't think we can know the difference between "I want to see this represented as a string" and "I want to serialize this as a string", and I don't think ZoneInfo.__str__ should raise an exception.

I've specified in the PEP that the .key attribute will be None if the key is unset, and I will include a section in the documentation on proper serialization techniques - using .key rather than str(ZoneInfo) for serialization will help, particularly if static typing is used.
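A hedged sketch of why a typed `.key` helps with serialization: with the attribute typed as `Optional[str]`, a type checker (and this runtime check) forces callers to handle the "constructed from a file, no key" case before serializing. The helper name is invented for illustration:

```python
from typing import Optional


def key_for_serialization(key: Optional[str]) -> str:
    # Hypothetical helper: refuse to serialize a zone that has no key.
    if key is None:
        raise ValueError("this zone has no key and cannot be serialized by key")
    return key
```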

brettcannon commented 4 years ago

Two questions.

One, there seems to be a decent amount of contorting going on to get the is semantics for comparisons, and the PEP tries to make the point that it's very important. So my question is: why even bother with the non-caching support? Going to the extent of having to support pickling of this special case seems unnecessary. If testing the module itself needs a private function that bypasses the cache then that's fine, but making it part of the public API, when a massive point is that you should always be using is, seems like asking for trouble: users may misuse it, or some bug may sneak in which breaks the promises.

Otherwise I say change the is rule to == as part of the PEP and drop the caching. If people care they can have their own utility function that caches object construction.

Two, "In order to avoid forcing all datetime users to import zoneinfo, the zoneinfo module would need to be lazily imported, which means that end-users would need to explicitly import datetime.zoneinfo". This doesn't justify why the zoneinfo module would "need" to be lazily imported. Is the module's import time so bad that it must be avoided at all costs if it isn't being used? If so then stating that would be good. Otherwise this just feels like an opinion, in which case I argue for putting zoneinfo under datetime. 😄

warsaw commented 4 years ago

I agree that it would be better if it could fail when you try to serialize it, but I don't think we can know the difference between "I want to see this represented as a string" and "I want to serialize this as a string", and I don't think ZoneInfo.__str__ should raise an exception.

I totally agree that ZoneInfo.__str__() should not raise an exception. When I talk about "serialization", I generally think about pickleability, not str-ability. Maybe it's just a matter of terminology? I think your rationale is sound. Thanks!

pganssle commented 4 years ago

One, there seems to be a decent amount of contorting going on to get the is semantics for comparisons, and the PEP tries to make the point that it's very important. So my question is: why even bother with the non-caching support? Going to the extent of having to support pickling of this special case seems unnecessary.

So I will say that my general design philosophy is that when there are different trade-offs to be made for different use cases, I prefer to have the default be the one that most people want, but I still want it to be possible to do the much more niche thing (this is one thing I really like about Rust's build system, by the way - Cargo.toml is a simple declarative file, but most people don't know - because they don't need to know - that you can fall back to a build.rs when the "one-size-fits-all" nature of Cargo.toml doesn't fit).

If maintenance of the no-cache and cache clearing options were a major issue, I'd probably agree that we should defer it until later, but we basically need these functions for testing purposes anyway, and the API and semantics are simple enough that I'd much rather nail down the specifics now and lock them into place than have them exposed but considered private and for testing purposes only. I agree that the special case of pickling seems a bit overboard, but anything we do is a choice anyway, and I think the special casing for the pickle behavior is the most intuitive (especially if most pickle operations are things like "pass this ZoneInfo between processes or subinterpreters").

If testing the module itself needs a private function that bypasses the cache then that's fine, but making it part of the public API, when a massive point is that you should always be using is, seems like asking for trouble: users may misuse it, or some bug may sneak in which breaks the promises.

Maybe I have overstated the case here, but failures in the is-based semantics are not world-ending; they'll just lead to slightly funky behavior around edge cases. To be honest, people usually have contradictory expectations about what things like "arithmetic across a DST boundary" should do in the first place, so I'm mainly trying to be consistent as much of the time as possible. You'll always have people confused by edge cases like this (US/Eastern is a link to America/New_York, but we have no easy way of knowing that):

>>> from zoneinfo import ZoneInfo
>>> from datetime import datetime, timedelta
>>> dt0 = datetime(2020, 3, 8, tzinfo=ZoneInfo("America/New_York"))
>>> dt1 = datetime(2020, 3, 9, tzinfo=ZoneInfo("US/Eastern"))
>>> (dt0 + timedelta(1)) - (dt0)
datetime.timedelta(days=1)
>>> (dt1 + timedelta(1)) - (dt1)
datetime.timedelta(days=1)
>>> dt0 == dt1
True
>>> (dt1 + timedelta(1)) - (dt0)
datetime.timedelta(seconds=82800)

I think it will be sufficiently rare that people try to use .no_cache at all, but if they do, the worst-case scenario is that people are surprised in a slightly different way than they would have been surprised in the first place (for example, I would expect a decent fraction of people to think that seconds=82800 is the "right" answer above).

So I take your point that someone may misuse these cache manipulation functions in their libraries and that it may introduce subtle bugs - it's a good point. My contention is that while it's the wrong thing to do, it's probably "within the noise" of bad things to do, and it's not the kind of thing you'd do by accident - anyone who doesn't know what .no_cache is for is going to pick the primary constructor, and hopefully the people who do know what it's for will only use it if they have a good reason.

Otherwise I say change the is rule to == as part of the PEP and drop the caching. If people care they can have their own utility function that caches object construction.

I don't really want to change how datetime works as part of this PEP: it would be a breaking change and wouldn't totally solve the problem anyway, plus it would introduce new problems having to do with equality semantics. I think the original use of is vs == came into play because:

  1. There isn't necessarily a natural way to compare time zones, and even if there is it would likely not be efficient.
  2. Arithmetic will always attach the same tzinfo object to the result of the arithmetic, so you'll end up with a bunch of the same object propagating around anyway.
  3. It was always assumed that these things would be singletons anyway, since they may be somewhat heavyweight objects (lots of transitions), and they may be expensive to construct (construction is far and away the most expensive operation in the reference implementation).

As for changing the semantics now: since changing is to == would be a backwards-incompatible change whether we make it now or later, I don't think deferring it makes things considerably worse, and I don't think there's a pressing need to do it at the moment.

Two, "In order to avoid forcing all datetime users to import zoneinfo, the zoneinfo module would need to be lazily imported, which means that end-users would need to explicitly import datetime.zoneinfo". This doesn't justify why the zoneinfo module would "need" to be lazily imported. Is the module's import time that bad as to need to avoid it at all costs if it isn't being used? If so then stating that would be good. Otherwise this just feels like an opinion in which case I argue for putting zoneinfo under datetime.

The justification for this is in the earlier section rejecting datetime.ZoneInfo:

Considering that zoneinfo will likely pull in additional, possibly more heavy-weight standard library modules, it would be preferable to allow the two to be imported separately -- particularly if potential "tree shaking" distributions are in Python's future.

It is a fair point that I do not expect zoneinfo to be a terribly heavyweight dependency here, but because the two are logically separable and somewhat different in "kind", I think it makes a lot of sense to keep them separate for the purposes of tree-shaking and minimizing the impact on applications sensitive to startup time. The issue is not so much that zoneinfo is heavy but that I expect datetime to be light: datetime.datetime and datetime.timedelta are pretty close to basic types. I would expect a lot of people to have uses for them in "tree-shaking" contexts even when they won't have any use for zoneinfo (e.g. embedded devices, WASM).

Right now, zoneinfo is pretty light-weight. If you are only looking at the C extension and you include anything that might be imported, above and beyond what comes in with datetime, it brings in struct, weakref and importlib.resources. If you include the pure Python implementation, it also pulls in bisect, collections and re (and eventually I'll drop the dependency on re). I think 3-6 additional modules isn't terrible, but currently the C extension for datetime eagerly pulls in exactly 1 additional module (math), so this is a pretty significant increase in the dependency tree for datetime.
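One way to check this kind of claim yourself: in a fresh interpreter, diff `sys.modules` before and after importing a module to see what it pulls in eagerly. The exact set varies by Python version and implementation, so this is a measurement sketch rather than a statement of the numbers above:

```python
import subprocess
import sys

# Run the measurement in a fresh interpreter so previously imported
# modules in this process don't mask the dependency tree.
code = (
    "import sys; before = set(sys.modules); import datetime; "
    "print(sorted(set(sys.modules) - before))"
)
result = subprocess.run(
    [sys.executable, "-c", code], capture_output=True, text=True, check=True
)
print(result.stdout)
```

The same probe with `import zoneinfo` (on Pythons that have it) shows the additional modules discussed above.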

Thanks for taking the time to review the proposal, sorry for writing a novel in response (you know you've given good feedback when every paragraph of yours needs 3 paragraphs of response :P)!

brettcannon commented 4 years ago

So I will say that my general design philosophy is that when there are different trade-offs to be made for different use cases, I prefer to have the default be the one that most people want, but I still want it to be possible to do the much more niche thing (this is one thing I really like about Rust's build system, by the way - Cargo.toml is a simple declarative file, but most people don't know - because they don't need to know - that you can fall back to a build.rs when the "one-size-fits-all" nature of Cargo.toml doesn't fit).

While I understand the sentiment, when it comes to the stdlib the "niche thing" inevitably becomes used enough to not be that niche in the grand scheme of things. 😉

Basically my worry is you will disappear, Paul, and then someday I will get a PR about how pickling is broken for non-cached objects, I will 🙄, and then have to try and support this niche concept that people really, probably don't want but have come to rely on anyway. I think my brain is working out the implementation and it has the terms "pickle", "subclassing", and "__new__" floating together and that just makes me paranoid something will go awry that's a pain to fix someday.

pganssle commented 4 years ago

Basically my worry is you will disappear, Paul, and then someday I will get a PR about how pickling is broken for non-cached objects, I will 🙄, and then have to try and support this niche concept that people really, probably don't want but have come to rely on anyway. I think my brain is working out the implementation and it has the terms "pickle", "subclassing", and "__new__" floating together and that just makes me paranoid something will go awry that's a pain to fix someday.

So it may be a bit of hubris - I certainly don't want to underestimate how complicated this stuff can be - but I think that the cache clearing and no_cache stuff is simple enough that it won't cause problems. The big issues that I'm highlighting in the PEP are mainly to avoid people stumbling into some confusing semantics like non-transitive datetime comparisons, but you won't stumble into using .no_cache() - you'd only use it if you had a deliberate need for it.

I want to be clear on what your proposal is here, though - is your proposal that we drop .no_cache() / .clear_cache() but leave the caching logic in place, or is your proposal that we replace the primary constructor with .no_cache()? The more complicated stuff that involves __new__, subclasses and pickling all comes from the primary constructor generating singletons, but I think that will lead to a far better user experience than the alternative, unless we change the nature of "same zone" vs "different zone" comparisons; changing the arithmetic and equality semantics of datetimes is a much bigger change than I'm prepared to make here, and is likely to have much more complicated consequences for the existing ecosystem than implementing the cache mechanism (which pytz and dateutil already do).

pganssle commented 4 years ago

I realize that I haven't left much time for this, but I was hoping (for reasons of whimsy) to get this accepted on Sunday, April 5th either between 02:00-04:00 UTC or between 13:00 and 17:30 UTC, since those times represent ambiguous datetimes somewhere on earth (mostly in Australia).

Oh no! I seem to have read the zdump output wrong here (it had been bugging me today because it wasn't adding up). The ambiguous times in Australia have already passed, because they were actually on Saturday in UTC. The South American ambiguous times are still happening between 02:00 and 04:00 UTC on Sunday, April 5th. That's 22:00-24:00 on Saturday, April 4th in ET and 19:00-21:00 PT.

Ah well.

pganssle commented 4 years ago

Just a gentle ping on this, as it seems that there has been no activity here or on the discourse for ~2 weeks. Are there further points I should address? Should I try for another round of general discussion?

Thanks!

brettcannon commented 4 years ago

@pganssle I believe @vstinner had some questions.

vstinner commented 4 years ago

Hi @pganssle,

I have a question on the usage of the "is" operator. I understand that it's used by the datetime module and that it's non-trivial to change datetime. My question is about the zoneinfo module.

How is the comparison between two zoneinfo objects implemented?

key = "Europe/Berlin"
zone1 = ZoneInfo.no_cache(key)
zone2 = ZoneInfo.no_cache(key)
print(zone1 == zone2)

Should I expect True here? What if a zoneinfo is loaded from a file: how is it compared to zone info created by ZoneInfo(key)?

My concern is about Python implementations which cannot implement the is operator the way CPython does.

pganssle commented 4 years ago

Should I expect True here? What if a zoneinfo is loaded from a file: how is it compared to zone info created by ZoneInfo(key)?

Currently I'm using the default __eq__, so two ZoneInfo objects are equal if they are the same object, and not equal otherwise, so:

>>> key = "Europe/Berlin"
>>> zone1 = ZoneInfo.no_cache(key)
>>> zone2 = ZoneInfo.no_cache(key)
>>> print(zone1 == zone2)
False
>>> zone1c = ZoneInfo(key)
>>> zone2c = ZoneInfo(key)
>>> zone1c == zone2c
True
>>> zone1 == zone1c
False

Other reasonable equality semantics would be to use self.key == other.key, or something more complicated (and more expensive to calculate) like "are all the transitions in this zone identical"?

I can't think of any particularly good reason to make a distinction between x == y and x is y, though it's hard to make a choice without knowing what people would want to use it for.

The only choice that users can't implement themselves using other exposed semantics would be having equality check for identical transitions, but that would mean that the hash would have to actually be a hash of the contents of the ZoneInfo, which would make hashing a more expensive operation (though we can always cache the hash code). The other issue there is that the operation would be opaque: if two objects have the same key but different transitions, you would have no way to determine why they are different.

In the end, the vast majority of people will be using ZoneInfo, and probably the most reasonable non-testing use case for ZoneInfo.no_cache() is building your own, separate cache, in which case the difference between x == y and x is y will be mostly academic, since x is y implies x == y with all the reasonable choices.

My concern is about Python implementations which cannot implement the is operator the way CPython does.

I don't think this matters much, because we are not doing anything special to try to "override" the is operator with the cache - we just actually return the same object from the constructor. So, for example, consider this function:

_CACHE = {}

def f(key):
    if key in _CACHE:
        return _CACHE[key]

    return _CACHE.setdefault(key, object())

You can guarantee that f(key) is f(key) for all values of key, regardless of the Python implementation, because the _CACHE dictionary holds a reference to a single object that it always returns.

vstinner commented 4 years ago

I'm confused by https://www.python.org/dev/peps/pep-0615/#deliberate-cache-invalidation

Two "America/New_York" zones are not equal but have the same key (and same str() value).

The only choice that users can't implement themselves using other exposed semantics would be having equality check for identical transitions, but that would mean that the hash would have to actually be a hash of the contents of the ZoneInfo, which would make hashing a more expensive operation (though we can always cache the hash code).

Yeah, I'm thinking about hash(tzdata data of this zone). Sure, it can be cached. I would expect:

>>> zone1 = ZoneInfo.no_cache(key)
>>> zone2 = ZoneInfo.no_cache(key)
>>> zone1 is zone2
False
>>> zone1 == zone2
True

My expectation is that if two zones are equal, two different (equal) datetime objects using them should behave the same.

I'm thinking of zones created by various ways: ZoneInfo(key), ZoneInfo.no_cache(key), ZoneInfo.from_file(), deserialization, etc.

the most reasonable non-testing use case for ZoneInfo.no_cache() is building your own, separate cache, in which case the difference between x == y and x is y will be mostly academic, since x is y implies x == y with all the reasonable choices.

I'm no longer sure that I understood the purpose of ZoneInfo.clear_cache(). Does it reload tzdata on disk and so newly created zones will get new transition data?

Maybe I should dig into the implementation to try to understand.

pganssle commented 4 years ago

Yeah, I'm thinking about hash(tzdata data of this zone). Sure, it can be cached. I would expect:

>>> zone1 = ZoneInfo.no_cache(key)
>>> zone2 = ZoneInfo.no_cache(key)
>>> zone1 is zone2
False
>>> zone1 == zone2
True

My expectation is that if two zones are equal, two different (equal) datetime objects using them should behave the same.

That is one of the reasons I think that this is a reasonable definition of equality, but in some ways equality is contextual. I think we have to answer "why would someone compare two ZoneInfo objects?" What do we do for something like this:

zi0 = ZoneInfo("America/New_York")
zi1 = ZoneInfo("US/Eastern")
zi0 == zi1

Those two will have identical transition data, but their key is different, and datetimes using both of them would use inter-zone semantics.

I think for the most part people care about "will Python consider these two things to be the same zone", in which case __eq__ meaning is makes the most sense. If someone wants to know whether two zones have all the same transition information, we can always implement an equivalent_to() method on ZoneInfo.
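A hypothetical sketch along the lines of the equivalent_to() idea: sample the offsets two tzinfo objects report over a range of instants. The real check would compare transition data directly; the function name, signature, and sampling approach here are all illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone, tzinfo


def equivalent_offsets(a: tzinfo, b: tzinfo, start: datetime, end: datetime,
                       step: timedelta = timedelta(hours=6)) -> bool:
    """Return True if a and b report the same UTC offset at every sampled
    instant in [start, end]. A coarse stand-in for comparing transitions."""
    t = start
    while t <= end:
        if a.utcoffset(t) != b.utcoffset(t):
            return False
        t += step
    return True
```

With something like this, "America/New_York" and "US/Eastern" would compare as equivalent even though their keys (and hence is-based identity) differ.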

Using value-based equality semantics would also introduce hash collisions that I don't think anyone would want:

>>> zis = {ZoneInfo("America/New_York"), ZoneInfo("US/Eastern")}
>>> zis
{ZoneInfo('America/New_York')}

I think people would find this very surprising and annoying, and other than "It seems like zones with identical behavior should be equal", I don't see any benefit to using this form of equality.

I'm no longer sure that I understood the purpose of ZoneInfo.clear_cache(). Does it reload tzdata on disk and so newly created zones will get new transition data?

It just clears the cache, so that calling ZoneInfo(key) after a ZoneInfo.clear_cache() behaves as if you had never called ZoneInfo(key) before. That does mean that if the time zone data on disk has changed, you'll hit the new time zone data. You can also use it for testing: clear the cache to ensure that you don't depend on state that may have been created by other tests running. I don't see people using this much outside of a testing context, but you could also imagine doing something like probing your system for changes to tzdata and firing off an "invalidate the cache" call whenever you see it change.

vstinner commented 4 years ago

Those two will have identical transition data, but their key is different, and datetimes using both of them would use inter-zone semantics.

The key can be taken into account in the comparison.

Here is a concrete implementation of what I was asking for: https://github.com/pganssle/zoneinfo/pull/42

With this PR, the 3 following zones are considered equal:

zone1 = zoneinfo.ZoneInfo.no_cache(key)
zone2 = zoneinfo.ZoneInfo.no_cache(key)
with open(zone1._file_path, "rb") as fp:
    zone3 = zoneinfo.ZoneInfo.from_file(fp, key=key)

So it becomes possible to merge zones created in various ways even if they are not exactly the same object (zone1 is zone2). The idea would be to create a cache which takes into account transition data, not only the key. It may help for datetime, which relies on "is" to compare tzinfo.

Currently, transition data is not serialized and I'm fine with that. But I can easily see a use case where someone would like to serialize that over the network and then would expect to retrieve the equality property. For example:

=> consider zone1 and zone2 as equal since their transition data is exactly the same.

What do you think?

I'm not proposing to rely on "zone1 == zone2" in zoneinfo or datetime, but to provide __eq__ and __hash__ methods for people who want to extend zoneinfo with their special use cases.
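A toy sketch of the value-based equality being proposed: compare (and hash) both the key and the transition data. The `Zone` class and its `transitions` field are invented stand-ins for the real per-zone data, not zoneinfo's API:

```python
# Hypothetical value-based equality: equal key AND equal transition data.
class Zone:
    def __init__(self, key, transitions):
        self.key = key
        self._transitions = tuple(transitions)

    def __eq__(self, other):
        if not isinstance(other, Zone):
            return NotImplemented
        return (self.key, self._transitions) == (other.key, other._transitions)

    def __hash__(self):
        # Hash must be consistent with __eq__; could be cached if the
        # transition data is large.
        return hash((self.key, self._transitions))
```

Under these semantics, two separately constructed zones with the same key and data compare equal and collapse in a set, while zones with different keys stay distinct even if their transitions match.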

pganssle commented 4 years ago

@vstinner So I can understand why someone might want to do this, but I think it would be extraordinarily rare, and having equality defined in a way that mostly, but not always, aligns with how these things affect datetime semantics is probably more likely to be misleading, while also making these objects more computationally expensive to create in the first place.

If we're just catering to the case where someone wants to merge different data sources if and only if they agree for all transitions, then I think that such a person could make use of an .equivalent_transitions() method, which tells you whether two time zones are equivalent, and that is a lot less ambiguous than defining __eq__ (and consequently __hash__). You may also want to take note of the discussion on the discourse thread, notably Stuart Bishop's response and some of the subsequent responses from that. I think his reasoning is sound - generally speaking people want a single zone, they don't care about the specific offsets that apply. In that view, if you have a bunch of time zones and you want to merge them, you should merge them by key.

That said, one place where I have seen people care about equality semantics is when trying to determine if a given zone represents UTC. People will probably want ZoneInfo("UTC") == datetime.timezone.utc to be True - though I'm not a huge fan of the semantics of datetime.timezone equality, since it ignores the tzname.

One thing mildly in favor of using __eq__ instead of equivalent_to is that if you did want to merge all zones by their contents (and not their key), using equivalent_to would be an O(n) operation, but looking up the hash would be O(1). If that's an actual real-world concern, though, we do have the option of changing __hash__ without changing __eq__ in future versions, so we don't necessarily need to implement it today.

For the most part, I don't think this matters too much for the 99.999% case, but I'm on the fence about it enough that maybe it's worth seeing if anyone on the discourse has strong opinions on it? I can write up a summary of the issues and options tonight or tomorrow morning and advertise it on discourse and the mailing list.

vstinner commented 4 years ago

Me:

I have a question on the usage of the "is" operator.

Paul provided a great rationale. In fact, the PEP is fine, it's a good thing to not define the equality operator: https://discuss.python.org/t/pep-615-support-for-the-iana-time-zone-database-in-the-standard-library/3468/63

warsaw commented 4 years ago

https://mail.python.org/archives/list/python-dev@python.org/thread/MDR2FL66R4T4VSLUI5XRFFUTKD43FMK4/