pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: hash_array does not produce deterministic integers #55605

Open Danferno opened 1 year ago

Danferno commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd

df1 = pd.DataFrame(data=['a', 'b', 'c'])
df2 = pd.DataFrame(data=['b', 'a', 'c'])

hash1 = pd.util.hash_pandas_object(df1)
hash2 = pd.util.hash_pandas_object(df2)

print(f'hash1: {hash1} \n hash2: {hash2}')
# hash1: 0    14639053686158035780
# 1     3869563279212530728
# 2      393322362522515241
# dtype: uint64
#       hash2: 0    6989536449337278821
# 1    7554402398462747209
# 2     393322362522515241

hash1_alt = pd.util.hash_pandas_object(df1, categorize=False)
hash2_alt = pd.util.hash_pandas_object(df2, categorize=False)

print(f'hash1: {hash1_alt} \n hash2: {hash2_alt}')

df1_n = pd.DataFrame(data=[1,2,3])
df2_n = pd.DataFrame(data=[2,1,3])

print(f'''hash1: {pd.util.hash_pandas_object(df1_n)}
      hash2: {pd.util.hash_pandas_object(df2_n)}''')

Issue Description

I understand deterministic integer to mean, "the same input leads to the same output". However, simply changing the order of elements within an array already leads to different hashes for the same element. Is the idea instead that the exact same array will lead to the exact same hashes? If so that should be clarified in the documentation.
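For comparison, pd.util.hash_array operates on a bare NumPy array with no index involved, so reordering the input only reorders the output hashes (a minimal check, with made-up values):

```python
import numpy as np
import pandas as pd

# hash_array takes a 1-D ndarray; there is no index to mix in
a1 = np.array(['a', 'b', 'c'], dtype=object)
a2 = np.array(['b', 'a', 'c'], dtype=object)

h1 = pd.util.hash_array(a1)
h2 = pd.util.hash_array(a2)

# each value hashes to the same uint64 regardless of its position,
# so the two outputs contain the same hashes in a different order
assert set(h1) == set(h2)
```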

Expected Behavior

The same hashes appear, just in a different order

df1 = pd.DataFrame(data=['a', 'b', 'c'])
df2 = pd.DataFrame(data=['b', 'a', 'c'])

hash1 = pd.util.hash_pandas_object(df1)
hash2 = pd.util.hash_pandas_object(df2)

print(f'hash1: {hash1} \n hash2: {hash2}')
# hash1: 0    7554402398462747209
# 1     6989536449337278821
# 2      393322362522515241
# dtype: uint64
#       hash2: 0    6989536449337278821
# 1    7554402398462747209
# 2     393322362522515241

Installed Versions

INSTALLED VERSIONS ------------------ commit : e86ed377639948c64c429059127bcf5b359ab6be python : 3.11.2.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.22000 machine : AMD64 processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : English_Belgium.1252 pandas : 2.1.1 numpy : 1.25.2 pytz : 2022.7.1 dateutil : 2.8.2 setuptools : 65.5.0 pip : 22.3.1 Cython : 0.29.34 pytest : 7.4.2 hypothesis : None sphinx : None blosc : 1.11.1 feather : None xlsxwriter : None lxml.etree : 4.9.3 html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : None pandas_datareader : None bs4 : 4.12.2 bottleneck : None dataframe-api-compat: None fastparquet : None fsspec : 2023.3.0 gcsfs : None matplotlib : 3.7.2 numba : None numexpr : 2.8.5 odfpy : None openpyxl : 3.0.10 pandas_gbq : None pyarrow : 13.0.0 pyreadstat : 1.2.3 pyxlsb : None s3fs : None scipy : 1.10.1 sqlalchemy : None tables : None tabulate : None xarray : 2023.8.0 xlrd : 2.0.1 zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None
hvsesha commented 1 year ago

Hi @Danferno, if we set index=False then we get the same hashes:

hash2 = pd.util.hash_pandas_object(df2,index=False)

hash1 = pd.util.hash_pandas_object(df1,index=False)
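A quick way to verify this, comparing the two results as multisets since the output order still follows the rows:

```python
import pandas as pd

df1 = pd.DataFrame(data=['a', 'b', 'c'])
df2 = pd.DataFrame(data=['b', 'a', 'c'])

# with index=False, the same value always maps to the same hash
hash1 = pd.util.hash_pandas_object(df1, index=False)
hash2 = pd.util.hash_pandas_object(df2, index=False)

# same hashes, just in a different order
assert sorted(hash1) == sorted(hash2)
```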

rhshadrach commented 1 year ago

Good find @hvsesha. When the index is the same, you get the same hashes.

df1 = pd.DataFrame(data=['a', 'b', 'c'], index=[0, 1, 2])
df2 = pd.DataFrame(data=['b', 'a', 'c'], index=[1, 0, 2])

hash1 = pd.util.hash_pandas_object(df1)
hash2 = pd.util.hash_pandas_object(df2)

print(hash1)
# 0     4578374827886788867
# 1    17338122309987883691
# 2     5473791562133574857
# dtype: uint64

print(hash2)
# 1    17338122309987883691
# 0     4578374827886788867
# 2     5473791562133574857
# dtype: uint64
Danferno commented 1 year ago

@hvsesha Oh, that's good to know!

I'm going to assume there's a good reason that including the index is the default? Because for my use case that's very counterintuitive. Maybe an idea to change the documentation? For example:

Return a data hash of the Index/Series/DataFrame. → Return a data hash of the Index/Series/DataFrame. Set index=False to ensure that the same value always returns the same hash.

and then for the index option: Include the index in the hash (if Series/DataFrame). → Include the index in the hash (if Series/DataFrame). The same value at a different location will then return a different hash.

rhshadrach commented 1 year ago

Set index=False to ensure that the same value always returns the same hash.

The docstring already includes Include the index in the hash (if Series/DataFrame). Isn't this duplicative?

Include the index in the hash (if Series/DataFrame). The same value at a different location will then return a different hash.

This isn't true.

df = pd.DataFrame({'a': [1, 1]}, index=[1, 1])
print(pd.util.hash_pandas_object(df))
# 1    7554402398462747209
# 1    7554402398462747209
# dtype: uint64
Danferno commented 1 year ago

I think the danger of getting non-deterministic hashes when you expect deterministic ones far outweighs the damage of getting deterministic hashes when you don't expect them? For example, in my use case I want to partition observations by (the modulo of) the hash of their firm ID, so that downstream analyses (e.g. dropping duplicates) can operate within such a firm partition, drastically improving speed and memory requirements. That means it's crucial that the same firm ID is always assigned to the same partition, regardless of when and where it comes in.
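The partitioning idea described here could be sketched like this (the firm IDs and partition count are made up for illustration):

```python
import pandas as pd

N_PARTITIONS = 8  # hypothetical number of partitions

df = pd.DataFrame({'firm_id': ['acme', 'globex', 'initech', 'acme']})

# index=False so a given firm_id always lands in the same partition,
# regardless of row position or which DataFrame it appears in
hashes = pd.util.hash_pandas_object(df['firm_id'], index=False)
df['partition'] = hashes % N_PARTITIONS

# duplicate firm IDs map to the same partition
assert df.loc[df['firm_id'] == 'acme', 'partition'].nunique() == 1
```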

The way I read the documentation now is that pd.util.hash_pandas_object works that way, especially because pandas.util.hash_array, which gets called a lot in the source code, mentions "return an array of deterministic integers". But when I checked it, I was not getting the same integers for the same values, which turned out to be because of the index option. I agree that this is implied by "Include the index in the hash", but as a programming newbie those implications hadn't reached me. So I think it is worth being very explicit about what you get in which situation.

Maybe the following is more accurate?

Main (same as I suggested before): Return a data hash of the Index/Series/DataFrame. Set index=False to ensure that the same value always returns the same hash.

Index option ("different series" bit added): Include the index in the hash (if Series/DataFrame). The same value in a different series will then return a different hash.

rhshadrach commented 1 year ago

So then I think it is worth it to be very explicit in what you get in which situations.

I agree the documentation could be improved. How about something like this:

Return a data hash of the Index/Series/DataFrame.

For Series and DataFrame, this function defaults to hashing both the index and the values for each row. You can disable including the index using the argument index below.

Danferno commented 1 year ago

I'm not sure what the pandas standards are; I personally would avoid referencing defaults outside of the option itself, as I've seen that go wrong in other projects (the default changes, the documentation does not). Maybe an alternative is to add some examples? Right now the docstring just shows the bare functionality without elaborating on the options. Instead we could have something like this:

hash_pandas_object converts any Index, Series or DataFrame into a numerical representation

>>> df1 = pd.DataFrame(data=['a', 'b', 'c'])
>>> pd.util.hash_pandas_object(df1)
0     4578374827886788867
1    17338122309987883691
2     5473791562133574857

If index=True, the hash of the Index is combined into the hash of each row of the Series or DataFrame. If your goal is to obtain hashes that are consistent across different DataFrames, you will want to set index=False instead.

>>> df2 = pd.DataFrame(data=['b', 'a', 'c'])
>>> pd.concat([pd.util.hash_pandas_object(df, index=True) for df in [df1, df2]], axis=1, ignore_index=True)
                      0                     1
0   4578374827886788867   8168238220198793318
1  17338122309987883691  14044658390916132862
2   5473791562133574857   5473791562133574857
rhshadrach commented 1 year ago

I think examples sound like a good idea. Instead of using pd.concat, I suggest showing pd.util.hash_pandas_object(df1, index=True) and pd.util.hash_pandas_object(df2, index=True) in two separate outputs, perhaps along with a sentence like

Note that though the values are the same, the hash values are different because the index is included.

This keeps the code generating the example simple.

If your goal is to obtain hashes that are consistent across different DataFrames, you will want to set index=False instead.

I must again insist that this is incorrect. The hashes are consistent across different DataFrames with index=True as long as both the index and values are the same.
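A minimal check of that claim, with made-up values: when both the index and the values match, the default index=True produces identical hashes across DataFrames.

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['x', 'y']}, index=[10, 20])
df2 = pd.DataFrame({'a': ['x', 'y']}, index=[10, 20])

# same (index, value) pairs hash identically even with the
# default index=True, across two distinct DataFrame objects
h1 = pd.util.hash_pandas_object(df1)
h2 = pd.util.hash_pandas_object(df2)
assert h1.equals(h2)
```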

phofl commented 8 months ago

The Index is an important object on a DataFrame; we can't just drop it. Doc adjustments sound good to me.