skyfielders / python-skyfield

Elegant astronomy for Python
MIT License
1.43k stars 213 forks source link

Add __hash__ method to Time class #395

Closed mixxen closed 4 years ago

mixxen commented 4 years ago

The Time class should contain the __hash__ method. This will allow it to be used with other functions that require it such as functools.lru_cache.

Here is an example implementation:

     def __hash__(self):
         return hash((self.whole, self.tt_fraction))

Here is a link to what was done for AstroPy: https://github.com/astropy/astropy/pull/8912/files

brandon-rhodes commented 4 years ago

I agree that a hash could be helpful! But since different combinations of whole + fraction can represent the same date, I'm not sure that simply hashing those two values would work? Instead they will probably need to be normalized.

glangford commented 4 years ago

Since hashing is dependent on equality, I am looking at the current implementation of equality testing for Time - which does not normalize, and does not take into account different combinations. If there was code out there comparing two dates which were identical, but with different combinations of whole + fraction they would be unequal.

So if __eq__ does not normalize, is just hashing these two values ok?

def __eq__(self, other_time):
        return self.__sub__(other_time) == 0.0

    def __sub__(self, other_time):
        if not isinstance(other_time, Time):
            return NotImplemented
        return (self.whole - other_time.whole
                + self.tt_fraction - other_time.tt_fraction)
brandon-rhodes commented 4 years ago

I am looking at the current implementation of equality testing for Time — which does not normalize, and does not take into account different combinations.

We can test whether it handles different combinations by instantiating two Time objects manually:

from skyfield.api import load, T0, Time

ts = load.timescale()
t0 = Time(ts, 2451545.5, 0.125)
t1 = Time(ts, 2451545.0, 0.625)
print(t0.whole, t0.tt_fraction)
print(t1.whole, t1.tt_fraction)
print(t0 - t1)
print(t0 == t1)

The output I get on my laptop is:

2451545.5 0.125
2451545.0 0.625
0.0
True

The code appears to be treating two different combinations of floating point number as representing the same time. Could you run this script on your own system to see how if behaves there? If subtraction and equality are broken, that definitely needs to be fixed before we proceed to hashing!

(In fact I may see a small improvement we can make to how it’s implemented, but I’ll wait until you’ve tested and re-read the code before discussing that.)

glangford commented 4 years ago

Thanks for that, I did misread the code! I do get the same output on my system, however using subtraction does not work correctly in general because of floating point representation. For example, try this

from skyfield.api import load, T0, Time

ts = load.timescale()
t0 = Time(ts, 2451545.5, 1.1)
t1 = Time(ts, 2451545.6, 1.0)
print(t0.whole, t0.tt_fraction)
print(t1.whole, t1.tt_fraction)
print(t0 - t1)
print(t0 == t1)

I get this output:

2451545.5 1.1
2451545.6 1.0
-9.313216864370588e-11
False
brandon-rhodes commented 4 years ago

True, but it works if you provide floats that in fact add to the same number :)

from skyfield.api import load, T0, Time
ts = load.timescale()
t0 = Time(ts, 2451545.5, 1.1)
t1 = Time(ts, 2451545.6, 0.9999999999068678)
print(t0.whole, t0.tt_fraction)
print(t1.whole, t1.tt_fraction)
print(t0 - t1)
print(t0 == t1)

But equality is indeed a problematic concept when dealing with floats. One way it could cause problems here is that two different Time objects, like the two we have created here, might be equal yet could give a slightly different coordinate when asked for a satellite or planetary position, because the two floats of which the time is composed could cause different rounding down inside the computation, even though the two floats might represent the same sum. So two times that are “equal” could produce coordinates that are different.

In any case: one possibility would be to hash the sum currently returned by the .tt attribute (which used to be the real value of the time, before I decided to add more digits with this split representation). In the vast majority of cases the hash will come out different for different times. Only for very closely spaced times would they hash the same and have to fall through to comparison. Does that sound like a reasonable first approach?

glangford commented 4 years ago

Ok, I am reading .tt and the code says

# Low-precision floats generated from internal float pairs.

    @property
    def tt(self):
        return self.whole + self.tt_fraction

So I agree about the sum, that makes sense - would the key need to be rounded to some number of digits? All of the articles I have seen make me nervous about using a floating point key, and I don't know all the ramifications.

brandon-rhodes commented 4 years ago

So I agree about the sum, that makes sense - would the key need to be rounded to some number of digits? All of the articles I have seen make me nervous about using a floating point key, and I don't know all the ramifications.

Yes, I often take several tries to get floating point reasoning right.

Rounding seems to only make floating point problems less frequent, rather than solving them. If you round a 16-digit number to 14 digits, for example, you now hide 9,999 out of 10,000 cases where a difference could be noticeable. But in that last case, where one 14-digit number ends and the next one begins, it’s the full noise in the 16 digits of precision that will determine which way a number rounds. So I tend to avoid rounding since that rare case will still come back to cause problems for a test or a user someday.

But implicit rounding will occur here in any case because extra digits in the fractional part will disappear when combined with the whole part of the date. I guess I'll implement it and see if it ever causes problems.

glangford commented 4 years ago

But implicit rounding will occur here in any case because extra digits in the fractional part will disappear when combined with the whole part of the date. I guess I'll implement it and see if it ever causes problems.

I will try and devote some time to review and testing (and I'm sure @mixxen will too).

brandon-rhodes commented 4 years ago

I hope to release a new version of Skyfield within the next few days, which will make this feature available, @mixxen. In the meantime feel free to test with:

pip install https://github.com/skyfielders/python-skyfield/archive/master.zip
mixxen commented 4 years ago

Thank you for the change. I think this will do for now, but at least for my use case (which is caching long-running calculations based on Time), I'd rather two equal Time objects not give the same hash (my example) instead of two unequal Time objects giving the same hash due to floating-point rounding (your commit). The former has the potential for longer run time, but the latter has the potential for wrong cached values.

Would it work to make -1 < tt_fraction < 1 and whole integer at construction time? For example:

    def __init__(self, ts, tt, tt_fraction=0.0):
        self.ts = ts
        self.whole = int(tt) + int(tt_fraction)
        self.tt_fraction = tt_fraction - int(tt_fraction)
        self.shape = getattr(tt, 'shape', ())
brandon-rhodes commented 4 years ago

@mixxen — Alas, it will always be the case that some unequal Time objects have the same hash, per the pigeonhole principle: there are only around 2^64 possible hashes but on the order of something like 2^128 possible times (minus some bits for the fact that the mantissas of whole and tt_fraction will be to some degree redundant).

This is the case with nearly all Python hash() values. There are far more integers than 2^64; far more strings than 2^64; and so forth.

Therefore, all uses of hashes in Python fall through to an equality check if the hashes match. Hashing is simply an optimization, allowing the equality check in some cases to be skipped.

I'd rather two equal Time objects not give the same hash (my example)

That, alas, would violate the Python standard for what __hash__() means.

https://docs.python.org/3/reference/datamodel.html#object.__hash__

“The only required property is that objects which compare equal have the same hash value”

Do you have in mind a specific case where the implementation I’ve committed makes your use case inconvenient? If so, simply provide some sample code with Time objects showing the inconvenient behavior, and I’ll try working on an improved implementation that resolves your problem while at the same time following the relevant standards necessary for tools like the Python dictionary to keep working correctly.

brandon-rhodes commented 4 years ago

(A bit of background: if you are curious why the Python standard insists on the above-quoted restriction on the behavior of hashes, see my old talk “The Mighty Dictionary”:

https://www.youtube.com/watch?v=oMyy4Sm0uBs

The recording is a bit grainy, but hopefully will convey the ideas around why hashes of equal quantities must be equal.)

mixxen commented 4 years ago

Thanks for your video, I learned something about python hashing. My thoughts were based on your code comments.

lru_cache worked with a simple test I've made (see below). Thanks again for the update, I hope it will be useful for others too.

#!/usr/bin/env python
# coding: utf-8

# In[1]:

from skyfield.api import load, T0, Time
from functools import lru_cache
import numpy as np

eps = np.finfo(float).eps
print(eps)

ts = load.timescale(builtin=True)
t0 = Time(ts, 2451545.5, 0.0)
t1 = Time(ts, 2451545.5, eps)
t2 = Time(ts, 2451545.5, eps * 2)
print(t0.whole, t0.tt_fraction)
print(t1.whole, t1.tt_fraction)
print(t2.whole, t2.tt_fraction)
print(t0 - t1)
print(t0 - t2)
print(t0 == t1)
print(t0 == t2)
print(t0.tt - t1.tt)
print(t0.tt - t2.tt)

# In[2]:

class TimeHash0(Time):

    def __init__(self, other):
        self.ts = other.ts
        self.whole = other.whole
        self.tt_fraction = other.tt_fraction
        self.shape = getattr(self.tt, 'shape', ())

    def __hash__(self):
        print('hash:', hash(self.tt))
        return hash(self.tt)

class TimeHash1(Time):

    def __init__(self, other):
        self.ts = other.ts
        self.whole = other.whole
        self.tt_fraction = other.tt_fraction
        self.shape = getattr(self.tt, 'shape', ())

    def __hash__(self):
        print('hash:', hash((self.whole, self.tt_fraction)))
        return hash((self.whole, self.tt_fraction))    

# In[3]:

@lru_cache(maxsize=32)
def blah(t):
    print('no cache')
    return t.tt_fraction

# In[4]:

print(blah(TimeHash0(t0)))
print(blah(TimeHash0(t1)))
print(blah(TimeHash0(t2)))
print(blah(TimeHash0(t0)))
print(blah(TimeHash0(t1)))
print(blah(TimeHash0(t2)))

# In[5]:

print(blah(TimeHash1(t0)))
print(blah(TimeHash1(t1)))
print(blah(TimeHash1(t2)))
print(blah(TimeHash1(t0)))
print(blah(TimeHash1(t1)))
print(blah(TimeHash1(t2)))

output:

2.220446049250313e-16
2451545.5 0.0
2451545.5 2.220446049250313e-16
2451545.5 4.440892098500626e-16
-2.220446049250313e-16
-4.440892098500626e-16
False
False
0.0
0.0

hash: 1152921504609298521
no cache
0.0
hash: 1152921504609298521
no cache
2.220446049250313e-16
hash: 1152921504609298521
no cache
4.440892098500626e-16
hash: 1152921504609298521
0.0
hash: 1152921504609298521
2.220446049250313e-16
hash: 1152921504609298521
4.440892098500626e-16

hash: -9153018218721566478
no cache
0.0
hash: -9153018218167313678
no cache
2.220446049250313e-16
hash: -9153018219830072078
no cache
4.440892098500626e-16
hash: -9153018218721566478
0.0
hash: -9153018218167313678
2.220446049250313e-16
hash: -9153018219830072078
4.440892098500626e-16
brandon-rhodes commented 4 years ago

I'm glad that the hashes are now supporting LRU caching for you! Hopefully it supports many other Time object users in the future.