Danferno opened this issue 1 year ago
Hi @Danferno, if we set index=False then we will get the same hashes:
hash2 = pd.util.hash_pandas_object(df2,index=False)
hash1 = pd.util.hash_pandas_object(df1,index=False)
Good find @hvsesha. When the index is the same, you get the same hashes.
df1 = pd.DataFrame(data=['a', 'b', 'c'], index=[0, 1, 2])
df2 = pd.DataFrame(data=['b', 'a', 'c'], index=[1, 0, 2])
hash1 = pd.util.hash_pandas_object(df1)
hash2 = pd.util.hash_pandas_object(df2)
print(hash1)
# 0 4578374827886788867
# 1 17338122309987883691
# 2 5473791562133574857
# dtype: uint64
print(hash2)
# 1 17338122309987883691
# 0 4578374827886788867
# 2 5473791562133574857
# dtype: uint64
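For contrast with the index=True output above, here is a small sketch (not output from the thread) showing that with index=False the two frames produce the same multiset of hashes, just in a different order:

```python
import pandas as pd

df1 = pd.DataFrame(data=['a', 'b', 'c'], index=[0, 1, 2])
df2 = pd.DataFrame(data=['b', 'a', 'c'], index=[1, 0, 2])

# With index=False, each row hash depends only on the row's values,
# so reordering the rows only reorders the hashes
h1 = pd.util.hash_pandas_object(df1, index=False)
h2 = pd.util.hash_pandas_object(df2, index=False)
print(sorted(h1) == sorted(h2))  # True
```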
@hvsesha Oh, that's good to know!
I'm going to assume there's a good reason that including the index is the default? Because from my use case that's very counterintuitive. Maybe an idea to change the documentation? For example:
Return a data hash of the Index/Series/DataFrame.
to
Return a data hash of the Index/Series/DataFrame. Set index=False to ensure that the same value always returns the same hash.
and then for the index option
Include the index in the hash (if Series/DataFrame).
to
Include the index in the hash (if Series/DataFrame). The same value at a different location will then return a different hash.
Set index=False to ensure that the same value always returns the same hash.
The docstring already includes "Include the index in the hash (if Series/DataFrame)." Isn't this duplicative?
Include the index in the hash (if Series/DataFrame). The same value at a different location will then return a different hash.
This isn't true.
df = pd.DataFrame({'a': [1, 1]}, index=[1, 1])
print(pd.util.hash_pandas_object(df))
# 1 7554402398462747209
# 1 7554402398462747209
# dtype: uint64
I think the danger of getting non-deterministic hashes when you expect deterministic ones far outweighs the damage of getting deterministic hashes when you don't expect them? For example, in my use case I want to partition observations by the hash of their firm ID (modulo the number of partitions) so that downstream analyses (e.g. dropping duplicates) can operate within such a firm partition, drastically improving speed and memory requirements. That means it's crucial that the same firm ID will always be assigned to the same partition, regardless of when and where it comes in.
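That partitioning pattern can be sketched as follows (the firm IDs and partition count here are invented for illustration):

```python
import pandas as pd

firm_ids = pd.Series(["acme", "globex", "initech", "acme"])
n_partitions = 8

# index=False: the partition must depend only on the firm ID itself,
# never on where the row happens to appear
partition = pd.util.hash_pandas_object(firm_ids, index=False) % n_partitions

# The same firm ID always lands in the same partition
print(partition.iloc[0] == partition.iloc[3])  # True
```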
The way I read the documentation now is that pd.util.hash_pandas_object works that way, especially because pandas.util.hash_array, which gets called a lot in the source code, mentions "return an array of deterministic integers". But then when I checked it, I was not getting the same integers for the same values, which turned out to be because of the index option. And I agree that this is implied by saying "Include the index in the hash", but as a programming newbie those implications hadn't reached me. So then I think it is worth it to be very explicit about what you get in which situations.
Maybe the following is more accurate?
Main (same as I suggested before): Return a data hash of the Index/Series/DataFrame. Set index=False to ensure that the same value always returns the same hash.
Index option (different series bit added): Include the index in the hash (if Series/DataFrame). The same value in a different series will then return a different hash.
So then I think it is worth it to be very explicit in what you get in which situations.
I agree the documentation could be improved. How about something like this:
Return a data hash of the Index/Series/DataFrame.
For Series and DataFrame, this function defaults to hashing both the index and the values for each row. You can disable including the index using the index argument below.
I'm not sure what the pandas standards are; I personally would avoid referencing defaults outside the option itself, as I've seen that go wrong in other projects (the default changes, the documentation does not). Maybe an alternative is to add some examples? Right now the docstring just shows the bare functionality without elaborating on the options. Instead we could have something like this:
hash_pandas_object converts any Index, Series or DataFrame into a numerical representation:
>>> df1 = pd.DataFrame(data=['a', 'b', 'c'])
>>> pd.util.hash_pandas_object(df1)
0     4578374827886788867
1    17338122309987883691
2     5473791562133574857
dtype: uint64
If index=True, the hash of the Index is added to the Series or DataFrame. If your goal is to obtain hashes that are consistent across different DataFrames, you will want to set index=False instead.
>>> df2 = pd.DataFrame(data=['b', 'a', 'c'])
>>> pd.concat([pd.util.hash_pandas_object(df, index=True) for df in [df1, df2]], axis=1, ignore_index=True)
0 1
0 4578374827886788867 8168238220198793318
1 17338122309987883691 14044658390916132862
2 5473791562133574857 5473791562133574857
I think examples sound like a good idea. Instead of using pd.concat, I suggest showing pd.util.hash_pandas_object(df1, index=True) and pd.util.hash_pandas_object(df2, index=True) in two separate outputs, perhaps along with a sentence like "Note that though the values are the same, the hash values are different because the index is included." This keeps the code generating the example simple.
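The two-separate-outputs variant being suggested might look like this sketch:

```python
import pandas as pd

df1 = pd.DataFrame(data=['a', 'b', 'c'])
df2 = pd.DataFrame(data=['b', 'a', 'c'])

# Same values overall, but 'a' and 'b' sit at different positions,
# so their combined (index, value) hashes differ between the frames
h1 = pd.util.hash_pandas_object(df1, index=True)
h2 = pd.util.hash_pandas_object(df2, index=True)
print(h1)
print(h2)
```

Only the row where both index and value agree ('c' at index 2) hashes the same in both outputs.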
If your goal is to obtain hashes that are consistent across different DataFrames, you will want to set index=False instead.
I must again insist that this is incorrect. The hashes are consistent across different DataFrames with index=True as long as both the index and values are the same.
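A quick sketch of that point: two independently constructed frames with identical index and values hash identically even with index=True (the column name and labels here are made up):

```python
import pandas as pd

a = pd.DataFrame({'col': ['a', 'b', 'c']}, index=[10, 20, 30])
b = pd.DataFrame({'col': ['a', 'b', 'c']}, index=[10, 20, 30])

# index=True is still deterministic: equal (index, value) pairs hash
# equally, no matter which DataFrame object they live in
print(pd.util.hash_pandas_object(a, index=True).equals(
      pd.util.hash_pandas_object(b, index=True)))  # True
```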
The Index is an important object on a DataFrame; we can't just drop it. Doc adjustments sound good to me.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I understand deterministic integer to mean, "the same input leads to the same output". However, simply changing the order of elements within an array already leads to different hashes for the same element. Is the idea instead that the exact same array will lead to the exact same hashes? If so that should be clarified in the documentation.
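The behavior described above can be sketched like this (a minimal reconstruction, not the report's original snippet):

```python
import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['b', 'a'])  # same elements, reordered

h1 = pd.util.hash_pandas_object(s1)
h2 = pd.util.hash_pandas_object(s2)

# The results are not even a permutation of each other: each hash
# mixes in the positional index label, which changes on reordering
print(sorted(h1) == sorted(h2))  # False
```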
Expected Behavior
The same hashes appear, just in a different order
Installed Versions