pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API/BUG: hashing of datetimes is based on UTC values #16372

Open · jreback opened this issue 7 years ago

jreback commented 7 years ago

These are 3 different 'views' of the same time. We DO disambiguate these elsewhere in pandas, so we should do so when hashing as well.

xref #16346

In [1]: import pandas as pd

In [2]: from pandas.util import hash_pandas_object

In [8]: hash_pandas_object(pd.date_range('20130101', periods=3, tz='UTC').tz_convert('US/Eastern'))
Out[8]: 
2012-12-31 19:00:00-05:00     4326795898974544501
2013-01-01 19:00:00-05:00     2833560015380952180
2013-01-02 19:00:00-05:00    14913883737423839247
Freq: D, dtype: uint64

In [9]: hash_pandas_object(pd.date_range('20130101', periods=3, tz='UTC'))
Out[9]: 
2013-01-01 00:00:00+00:00     4326795898974544501
2013-01-02 00:00:00+00:00     2833560015380952180
2013-01-03 00:00:00+00:00    14913883737423839247
Freq: D, dtype: uint64

In [10]: hash_pandas_object(pd.date_range('20130101', periods=3))
Out[10]: 
2013-01-01     4326795898974544501
2013-01-02     2833560015380952180
2013-01-03    14913883737423839247
Freq: D, dtype: uint64
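The root cause: all three of these share the same underlying int64 values, stored as UTC nanoseconds, with the timezone kept only as dtype metadata, and the hashing machinery sees only those integers. A quick check of that (assuming .asi8 exposes the stored integers):

In [11]: idx = pd.date_range('20130101', periods=3, tz='UTC')

In [12]: (idx.asi8 == idx.tz_convert('US/Eastern').asi8).all()
Out[12]: True

In [13]: (idx.asi8 == pd.date_range('20130101', periods=3).asi8).all()
Out[13]: True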
jreback commented 7 years ago

cc @mrocklin

@TomAugspurger @jorisvandenbossche

From a practical perspective I don't think this makes a whole lot of difference, but we should fix it to be correct.

jorisvandenbossche commented 7 years ago

What alternative do you think of to hash on? Hash the timezone separately and combine the hashes?

jreback commented 7 years ago

I think you could do something like this.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'tz': pd.date_range('20130101', periods=3, tz='UTC').tz_convert('US/Eastern'),
   ...:                    'utc': pd.date_range('20130101', periods=3, tz='UTC'),
   ...:                    'naive': pd.date_range('20130101', periods=3)})

In [3]: df
Out[3]: 
       naive                        tz                       utc
0 2013-01-01 2012-12-31 19:00:00-05:00 2013-01-01 00:00:00+00:00
1 2013-01-02 2013-01-01 19:00:00-05:00 2013-01-02 00:00:00+00:00
2 2013-01-03 2013-01-02 19:00:00-05:00 2013-01-03 00:00:00+00:00

In [4]: from pandas.util import hash_pandas_object

In [6]: hash_pandas_object(pd.DataFrame({'tz':df['tz'],'zone':df['tz'].dt.tz}), index=False)
Out[6]: 
0    11960632900184590671
1    17909201100930397932
2      244240496600445005
dtype: uint64

In [7]: hash_pandas_object(pd.DataFrame({'utc':df['utc'],'zone':df['utc'].dt.tz}), index=False)
Out[7]: 
0     557885042773898185
1    1996380570925580138
2    5435501107539799243
dtype: uint64

In [8]: hash_pandas_object(pd.DataFrame({'naive':df['naive']}), index=False)
Out[8]: 
0    14376405836841727586
1     1052390041072582175
2    12596642793234779168
dtype: uint64

IOW, hash the tz as an additional column and combine (which is what we do with a DataFrame with index=False).
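As a rough sketch of that idea (hash_tz_aware is a hypothetical helper, not pandas API):

import pandas as pd
from pandas.util import hash_pandas_object

def hash_tz_aware(ser):
    # fold the tz into the hash by adding it as a constant string
    # column; the DataFrame path then combines the column hashes
    if isinstance(ser.dtype, pd.DatetimeTZDtype):
        both = pd.DataFrame({'values': ser, 'zone': str(ser.dt.tz)})
        return hash_pandas_object(both, index=False)
    return hash_pandas_object(ser, index=False)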

This would break backward compat for tz-aware values, but (and maybe we should document this more) hashing is version-to-version; it is not (necessarily) designed to be backward compatible.

TomAugspurger commented 6 years ago

I suppose it depends on what people are using the hashing for. Suppose I have hashed values a and a pandas object x.

  1. If hash_pandas_object(x) == a, then x may be the same as the object that originally hashed to a.
  2. If hash_pandas_object(x) != a, then x is not the same as the object that originally hashed to a.

To me, the most common use case is likely storing hashed values somewhere and wanting to answer "are these new values the same as what I have hashed?", so a stronger form of 1. (is the same instead of may be the same).

So I think it's on pandas to either mix the dtype information into the hash somehow, or provide guidance that you should store the original dtype along with the hashed values.
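A minimal sketch of the second option (fingerprint / probably_same are illustrative names, not a proposed API):

from pandas.util import hash_pandas_object

def fingerprint(df):
    # keep the dtypes next to the hashes; the hash alone does not encode them
    return hash_pandas_object(df, index=False), [str(t) for t in df.dtypes]

def probably_same(df, stored_hashes, stored_dtypes):
    hashes, dtypes = fingerprint(df)
    return dtypes == stored_dtypes and hashes.equals(stored_hashes)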

IOW, hash the tz as an additional column and combine (which is what we do with a DataFrame with index=False).

Hashing an extra column seems wasteful. I'd rather have some kind of stable map of each type and do a bit-shift on each type after hashing.

type_map = {
    int: 0,
    float: 1,
    # ... one entry per dtype (sketch; real keys would be dtypes)
}

h = hash_array(obj.values, encoding, hash_key,
               categorize).astype('uint64', copy=False)
h >>= type_map[obj.dtype]  # shift by the dtype's slot so dtypes disambiguate

Building that type map is tricky (impossible?), because of parametrized types, 3rd-party extension types...
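For instance, the tz-aware datetime dtype is parameterized by its timezone, so there is no finite set of dtypes to enumerate:

In [1]: import pandas as pd

In [2]: pd.DatetimeTZDtype(tz='US/Eastern')
Out[2]: datetime64[ns, US/Eastern]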

version-to-version hashing, it is not (necessarily) designed to be backward compat.

We should explicitly state that hashing can change between versions. Maintaining backward-compatible hashes seems like it would be a nightmare.

jorisvandenbossche commented 6 years ago

To me, the most common use case is likely storing hashed values somewhere and wanting to answer "are these new values the same as what I have hashed?", so a stronger form of 1. (is the same instead of may be the same).

Yes, I agree with that (that is e.g. what joblib uses hashing for).

Building that type map is tricky (impossible?), because of parameterize types, 3rd party extension types...

I am not familiar with how the hash values are calculated, but would it be possible to somehow combine the hash of the dtype with the hash of the values?
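One way that could look, as a sketch (the XOR combine and the helper name are illustrative, not an actual proposal):

import numpy as np
import pandas as pd
from pandas.util import hash_array, hash_pandas_object

def hash_with_dtype(ser):
    # hash the values as pandas does today
    h = hash_pandas_object(ser, index=False).to_numpy()
    # derive one scalar hash from the dtype's string repr, e.g.
    # 'datetime64[ns, UTC]' vs 'datetime64[ns, US/Eastern]'
    dtype_h = hash_array(np.array([str(ser.dtype)], dtype=object))[0]
    return h ^ dtype_h  # simple XOR mix; a real combine step could be stronger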

jbrockmendel commented 2 years ago

Was a satisfactory solution ever found for this? Looks like this is about hash_pandas_object and DatetimeIndex, but I'm looking at Timestamp.__hash__ and it ignores the tz too. Current motivation is to adapt for non-nano.
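For the Timestamp side: aware Timestamps that represent the same instant compare equal, and Python requires equal objects to hash equal, so hash() there is based on the UTC value as well:

In [1]: import pandas as pd

In [2]: ts = pd.Timestamp('2013-01-01', tz='UTC')

In [3]: hash(ts) == hash(ts.tz_convert('US/Eastern'))
Out[3]: True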