jreback opened this issue 7 years ago (status: Open)
cc @mrocklin
@TomAugspurger @jorisvandenbossche
From a practical perspective I don't think this makes a whole lot of difference, but we should fix it to be correct.
What alternative do you have in mind to hash on? Hash the timezone separately and combine the hashes?
I think you could do something like this.
In [1]: df = pd.DataFrame({'tz': pd.date_range('20130101', periods=3, tz='UTC').tz_convert('US/Eastern'),
   ...:                    'utc': pd.date_range('20130101', periods=3, tz='UTC'),
   ...:                    'naive': pd.date_range('20130101', periods=3)})
In [2]: df
Out[2]:
       naive                        tz                       utc
0 2013-01-01 2012-12-31 19:00:00-05:00 2013-01-01 00:00:00+00:00
1 2013-01-02 2013-01-01 19:00:00-05:00 2013-01-02 00:00:00+00:00
2 2013-01-03 2013-01-02 19:00:00-05:00 2013-01-03 00:00:00+00:00
In [3]: from pandas.util import hash_pandas_object
In [6]: hash_pandas_object(pd.DataFrame({'tz': df['tz'], 'zone': df['tz'].dt.tz}), index=False)
Out[6]:
0 11960632900184590671
1 17909201100930397932
2 244240496600445005
dtype: uint64
In [7]: hash_pandas_object(pd.DataFrame({'utc': df['utc'], 'zone': df['utc'].dt.tz}), index=False)
Out[7]:
0 557885042773898185
1 1996380570925580138
2 5435501107539799243
dtype: uint64
In [8]: hash_pandas_object(pd.DataFrame({'naive': df['naive']}), index=False)
Out[8]:
0 14376405836841727586
1 1052390041072582175
2 12596642793234779168
dtype: uint64
IOW, hash the tz as an additional column and combine (which is what we do with a DataFrame with index=False).
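A minimal sketch of that approach as a user-side helper (hash_tz_aware is a hypothetical name, not a pandas API; it just packages the In [6]/In [7] pattern above):

import pandas as pd
from pandas.util import hash_pandas_object

def hash_tz_aware(s):
    # Hash the values together with the timezone name, and let
    # hash_pandas_object combine the two per-row column hashes,
    # as it does for any DataFrame with index=False.
    # For tz-naive input, s.dt.tz is None, which still yields a
    # distinct 'zone' value.
    return hash_pandas_object(
        pd.DataFrame({'values': s, 'zone': str(s.dt.tz)}), index=False)

With such a helper, df['tz'], df['utc'], and df['naive'] would all hash differently, since their zone values differ.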
This would break backward compat for tz-aware values, but (and maybe we should document this more clearly) this is version-to-version hashing; it is not (necessarily) designed to be backward compatible.
I suppose it depends on what people are using the hashing for. Suppose I have hashed values a and a pandas object x.

1. If hash_pandas_object(x) == a, then x may be the same as the object that originally hashed to a.
2. If hash_pandas_object(x) != a, then x is not the same as the object that originally hashed to a.

To me, the most common use case is likely storing hashed values somewhere and wanting to answer "are these new values the same as what I have hashed?", so a stronger form of 1 (is the same instead of may be the same).
So I think it's on pandas to either mix the dtype information into the hash somehow, or provide guidance that you should store the original dtype along with the hashed values.
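For concreteness, a hedged sketch of that second option (fingerprint and matches are hypothetical helper names; this stores the dtypes alongside the hashes rather than changing how pandas hashes):

import pandas as pd
from pandas.util import hash_pandas_object

def fingerprint(df):
    # Keep the per-row hashes plus the dtypes; the hashes alone cannot
    # distinguish, e.g., tz-naive from tz-aware values today.
    return (hash_pandas_object(df, index=False),
            {col: str(dtype) for col, dtype in df.dtypes.items()})

def matches(df, fp):
    hashes, dtypes = fp
    return ({col: str(dtype) for col, dtype in df.dtypes.items()} == dtypes
            and hash_pandas_object(df, index=False).equals(hashes))

Equal fingerprints now require the dtypes to match as well, not just the value hashes.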
> IOW, hash the tz as an additional column and combine (which is what we do with a DataFrame with index=False).
Hashing an extra column seems wasteful. I'd rather have some kind of stable map of each type and apply a per-type bit-shift after hashing.
# Sketch: a stable map from dtype to a shift amount.
type_map = {
    int: 0,
    float: 1,
    ...
}

h = hash_array(obj.values, encoding, hash_key, categorize).astype('uint64', copy=False)
# Mix the dtype into the result with a per-type shift.
h >>= type_map[obj.dtype]
Building that type map is tricky (impossible?), because of parametrized types, 3rd-party extension types...
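To make that concrete (these are ordinary pandas dtype constructors, shown only to illustrate that the key space is open-ended):

import pandas as pd

pd.DatetimeTZDtype(tz='UTC')          # datetime64[ns, UTC]
pd.DatetimeTZDtype(tz='US/Eastern')   # datetime64[ns, US/Eastern]
pd.CategoricalDtype(['a', 'b'])       # a different dtype for every category set

A fixed dict can never enumerate all of these, let alone dtypes registered by third-party libraries.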
> this is version-to-version hashing; it is not (necessarily) designed to be backward compatible.
We should explicitly state that hashing can change between versions. Maintaining that seems like it would be a nightmare.
> To me, the most common use case is likely storing hashed values somewhere and wanting to answer "are these new values the same as what I have hashed?", so a stronger form of 1 (is the same instead of may be the same).
Yes, I agree with that (that is, e.g., what joblib uses hashing for).
> Building that type map is tricky (impossible?), because of parametrized types, 3rd-party extension types...
I am not familiar with how hash values are calculated, but would it be possible to somehow combine the hash of the dtype with the hash of the values?
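One possible shape of that idea, sketched with the public pandas.util helpers (hash_with_dtype is a hypothetical name, and the XOR mixing is an assumption for illustration, not anything pandas does today):

import numpy as np
import pandas as pd
from pandas.util import hash_array, hash_pandas_object

def hash_with_dtype(s):
    # Per-row value hashes, as pandas computes them today.
    value_hashes = hash_pandas_object(s, index=False).to_numpy()
    # Hash the dtype's string form, e.g. 'datetime64[ns, UTC]', which
    # already encodes the timezone parameter...
    dtype_hash = hash_array(np.array([str(s.dtype)], dtype=object))[0]
    # ...and mix it into every value hash, so the same underlying
    # integers under different dtypes no longer collide.
    return value_hashes ^ dtype_hash

Because the dtype string is parametrized ('datetime64[ns, UTC]' vs 'datetime64[ns, US/Eastern]'), this would also sidestep the type-map problem above.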
Was a satisfactory solution ever found for this? This issue looks like it is about hash_pandas_object and DatetimeIndex, but I'm looking at Timestamp.__hash__ and it ignores the tz too. Current motivation is adapting this for non-nano.
These are 3 different 'views' of the same time. We DO disambiguate these in many places, so we should do so when hashing as well.
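Continuing the session from the example above, the collision this issue describes would show up as follows (assuming, per the issue, that only the underlying int64 values are hashed; exact behavior may vary by pandas version):

# All three 'views' of the same instants produce identical row hashes
# when only the underlying UTC nanosecond values are hashed:
hash_pandas_object(df['tz'], index=False).equals(
    hash_pandas_object(df['utc'], index=False))
hash_pandas_object(df['naive'], index=False).equals(
    hash_pandas_object(df['utc'], index=False))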
xref #16346