Make data classes hashable

MetRonnie commented 4 years ago

Close #162 (plus #166 and #50).

As noted in the discussion below, for the hash of two instances to be the same, the == operator has to evaluate as True. This is tricky for Durations with nominal units, as P1Y != P365D (explained in #167).

The behaviour settled on, which is not ideal, is to have P1Y tacitly equal to 365 days (or rather CALENDAR.ROUGH_DAYS_IN_YEAR) for the sake of the <, <=, > etc operators (this behaviour is pretty much unchanged from before), but not actually explicitly equal when it comes to ==.

MetRonnie commented 4 years ago

(The tests fail because of things like TimeZone inheriting from Duration. I'm checking the Duration tests pass locally as I go along)

MetRonnie commented 4 years ago

Given that we think P365D == P1Y should be False, what about P365D > P1Y or P365D < P1Y?

raise ValueError()?

What about P1Y == P1Y? I think that should be True... but my brain hurts

oliver-sanders commented 4 years ago

my brain hurts

Why are dates so a hard!

I think it's ok if these comparison methods come with the caveat that they operate under certain assumptions. Accurate comparisons require sensical entities to compare. I would be happy to extend that to __eq__, it's just that __hash__ gets in the way!

How about just changing the rule for equality and leaving the rest as it is (with a warning) since there is no correct solution:

P365D != P1Y
P1Y == P1Y
P365D < P1Y
P1Y <= P1Y

wxtim commented 4 years ago

Why are dates so a hard!

Did you remove the stone?

Isn't P365D < P1Y clear, but dependent on your calendar?

MetRonnie commented 4 years ago

Isn't P365D < P1Y clear, but dependent on your calendar?

It is dependent on the calendar but also on what time point you add it to:

2020-01-01T00Z + P1Y  = 2021-01-01T00Z
2020-01-01T00Z + P365 = 2020-12-31T00Z
2019-01-01T00Z + P365 = 2020-01-01T00Z

MetRonnie commented 4 years ago

Actually I'm thinking it might be safe, and perhaps desirable even, to let P1D exactly equal PT24H (and if day is an exact unit that implies week is an exact unit).

Currently when you subtract a TimePoint that is in a different time zone from a TimePoint that is in UTC, the former is converted to UTC to do the calculation. So you won't get P1D != PT24H there. (And we don't treat DST any differently to just normal time zones).

I can't think of a case where P1D != PT24H, and I suspect most users will assume P1D == PT24H anyway. I think only if we implement leap seconds will it be an issue.

From ISO 8601:

The duration of a calendar day is 24 hours; except if modified by:

the insertion or deletion of leap seconds, by decision of the International Earth Rotation Service (IERS), or

the insertion or deletion of other time intervals, as may be prescribed by local authorities to alter the time scale of local time.

I don't think the latter applies for a fixed Calendar in isodatetime

MetRonnie commented 4 years ago

How about just changing the rule for equality and leaving the rest as it is (with a warning) since there is no correct solution

When you say warning is that warnings.warn() or just leave a note of caution in the docstring? @oliver-sanders

oliver-sanders commented 4 years ago

Caution on the docstring I think. A formal warning is a bit over the top.

MetRonnie commented 4 years ago

Problem with making TimePoint hashable: TimePoints representing the same point in time but differing in time zone and whether they're in calendar date mode, ordinal date mode or week date mode all evaluate as equal. But the methods can return different values for each of these.

oliver-sanders commented 4 years ago

But the methods can return different values for each of these.

[insert cathartic scream]

Oh god, and it seemed so simple on the surface.

If it's that bad we might want to consider another approach, perhaps we should implement new equality methods which use a more loosely defined equality e.g. TimePoint.equivalent_to or something like that. But make __eq__ strict as per the pythonic definition.

Good idea, bad idea, better idea?

The good news is that Cylc hides the isodatetime classes behind its cycling classes. Whilst this is a bad architectural decision it does at least mean that the code base isn't doing comparisons on raw isodatetime objects so the impact of this to Cylc is likely to be minimal.

MetRonnie commented 4 years ago

For methods that depend on "variable" properties that don't affect the hash of the instance, could we just create a wrapper method that takes the variable properties as args?

Thus, two instances of different time zones, but representing the same point when converted to UTC, would be equal but the method to get the hours, minutes and seconds wouldn't get cached erroneously?

Something like this:

class TimePoint:
    ...
    def get_hours_mins_secs(self):
        return self._get_hours_mins_secs(self.time_zone)

    @lru_cache(1000)
    def _get_hours_mins_secs(self, _time_zone):
        return (self.hour_of_day, self.minute_of_hour, self.second_of_minute)

MetRonnie commented 4 years ago

Without caching, the performance has gone down; the tests on Travis are taking twice as long now.

test_time_point.py, before:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2198    0.185    0.000    4.826    0.002 test_time_point.py:72(_do_test_dates)
    19782    0.053    0.000    2.154    0.000 data.py:1522(__sub__)
   153818    0.268    0.000    1.879    0.000 data.py:357(__init__)
   156016    1.105    0.000    1.630    0.000 data.py:2416(_type_checker)
    35168    0.101    0.000    1.285    0.000 data.py:1280(__add__)
    87920    0.079    0.000    1.216    0.000 data.py:646(copy)
    90118    0.075    0.000    1.168    0.000 data.py:642(__init__)
    43960    0.218    0.000    0.968    0.000 data.py:1361(copy)
     4396    0.024    0.000    0.755    0.000 data.py:1466(__gt__)
    37366    0.027    0.000    0.622    0.000 case.py:847(assertEqual)
    46158    0.510    0.000    0.562    0.000 data.py:2070(get_calendar_date_from_week_date)
    17584    0.154    0.000    0.505    0.000 data.py:1369(get_props)
    19782    0.021    0.000    0.477    0.000 data.py:966(get_ordinal_date)
     8792    0.005    0.000    0.467    0.000 case.py:840(_baseAssertEqual)
    19782    0.013    0.000    0.447    0.000 data.py:2135(get_ordinal_date_from_week_date)
    24178    0.078    0.000    0.424    0.000 data.py:508(__mul__)
     4396    0.011    0.000    0.405    0.000 data.py:1379(__eq__)
    26376    0.031    0.000    0.375    0.000 data.py:939(get_calendar_date)
    56930    0.271    0.000    0.347    0.000 data.py:1601(tick_over)
     8792    0.012    0.000    0.319    0.000 data.py:1111(set_time_zone)
     8792    0.009    0.000    0.303    0.000 data.py:505(__sub__)

After

``` ncalls tottime percall cumtime percall filename:lineno(function) 2198 0.191 0.000 9.112 0.004 test_time_point.py:72(_do_test_dates) 43960 0.175 0.000 6.853 0.000 data.py:1467(__add__) 56940 0.441 0.000 6.042 0.000 data.py:1716(_tick_over) 56940 3.763 0.000 5.452 0.000 data.py:1800(_tick_over_day_of_month) 19782 0.047 0.000 3.599 0.000 data.py:1635(__sub__) 21535779 1.618 0.000 1.618 0.000 data.py:1085(day_of_month) 59294 0.089 0.000 0.616 0.000 data.py:421(__init__) 8792 0.042 0.000 0.616 0.000 data.py:1568(_cmp) 41762 0.367 0.000 0.545 0.000 data.py:2519(_type_checker) 46158 0.222 0.000 0.544 0.000 data.py:1540(_copy) 101056 0.256 0.000 0.416 0.000 data.py:493(_copy) 37366 0.027 0.000 0.406 0.000 case.py:847(assertEqual) 4396 0.004 0.000 0.395 0.000 data.py:1629(__gt__) 17584 0.175 0.000 0.346 0.000 data.py:1548(get_props) 8792 0.014 0.000 0.308 0.000 data.py:1290(to_time_zone) 8792 0.005 0.000 0.249 0.000 case.py:840(_baseAssertEqual) 4396 0.002 0.000 0.227 0.000 data.py:1620(__eq__) 2236984 0.214 0.000 0.214 0.000 {built-in method builtins.getattr} 24178 0.070 0.000 0.205 0.000 data.py:616(__mul__) 8792 0.008 0.000 0.160 0.000 data.py:613(__sub__) 1654601 0.156 0.000 0.156 0.000 {built-in method builtins.setattr} ```

Update: managed to get it back down

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2198    0.186    0.000    4.819    0.002 test_time_point.py:72(_do_test_dates)
    43960    0.101    0.000    2.729    0.000 data.py:1471(__add__)
    56942    0.216    0.000    2.066    0.000 data.py:1720(_tick_over)
    19782    0.045    0.000    1.896    0.000 data.py:1639(__sub__)
    56942    1.763    0.000    1.826    0.000 data.py:1805(_tick_over_day_of_month)
    59310    0.084    0.000    0.610    0.000 data.py:422(__init__)
     8792    0.039    0.000    0.548    0.000 data.py:1572(_cmp)
    41762    0.362    0.000    0.545    0.000 data.py:2523(_type_checker)
    46158    0.212    0.000    0.535    0.000 data.py:1544(_copy)
   101072    0.256    0.000    0.418    0.000 data.py:494(_copy)
    37366    0.026    0.000    0.372    0.000 case.py:847(assertEqual)
     4396    0.004    0.000    0.347    0.000 data.py:1633(__gt__)
    17584    0.173    0.000    0.347    0.000 data.py:1552(get_props)
     8792    0.011    0.000    0.259    0.000 data.py:1294(to_time_zone)
     8792    0.005    0.000    0.220    0.000 case.py:840(_baseAssertEqual)
  2237176    0.217    0.000    0.217    0.000 {built-in method builtins.getattr}
     4396    0.002    0.000    0.207    0.000 data.py:1624(__eq__)
    24178    0.069    0.000    0.204    0.000 data.py:618(__mul__)
  1654753    0.158    0.000    0.158    0.000 {built-in method builtins.setattr}
     8792    0.008    0.000    0.138    0.000 data.py:615(__sub__)
    10990    0.008    0.000    0.129    0.000 data.py:1160(get_ordinal_date)

but the overall test suite still takes about 35% longer on Travis. I guess a (small) part of that is the addition of more tests. Otherwise it might be Travis being funny; locally I get much the same speed for the non-runslow tests before and after.

oliver-sanders commented 4 years ago

(note I've renamed the milestone to 2.1 as this is a pretty significant change)

oliver-sanders commented 4 years ago

Let's see how Cylc Flow handles this 😈

metomi / isodatetime

Make data classes hashable #165