pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: index.has_duplicates on a subset of a dataframe sometimes returns an incorrect outcome #60355

Closed (HungryZebra563 closed 3 days ago)

HungryZebra563 commented 3 days ago

Pandas version checks

Reproducible Example

import pandas as pd
df = pd.DataFrame([0, 0], index=['A','A'])

df.iloc[0:1, :].index.has_duplicates  # <-- this returns False (correctly)
df.index.has_duplicates               # <-- this somehow alters the outcome
df.iloc[0:1, :].index.has_duplicates  # <-- exact same statement, but this returns True (incorrectly)

Issue Description

The value of index.has_duplicates of a subset of a dataframe depends on whether we run index.has_duplicates on the full dataframe first.

In the example, df.iloc[0:1, :] has only a single row, so it cannot have duplicates. The first call to df.iloc[0:1, :].index.has_duplicates correctly returns False. However, after we query the same property on the full dataframe (df.index.has_duplicates), the exact same statement on the subset returns True.

Expected Behavior

I would expect df.iloc[0:1, :].index.has_duplicates to always return False. The subset has only 1 row, so it cannot have duplicates.
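
One possible way to check the expected value independently of any state the slice might share with its parent index is to rebuild the index from plain Python values and query the property on that copy. The thread does not identify the cause, so the idea of shared state is only an assumption here, and `fresh_index` and `fresh_result` are illustrative names rather than anything from the report.

```python
import pandas as pd

df = pd.DataFrame([0, 0], index=["A", "A"])

# Query the full index first, as in the reproducible example.
_ = df.index.has_duplicates

subset = df.iloc[0:1, :]

# Rebuild the index from a plain list so that nothing is carried over
# from the parent index (assumption: the stale answer comes from shared state).
fresh_index = pd.Index(list(subset.index))
fresh_result = fresh_index.has_duplicates

print(len(subset), fresh_result)
```

A single-row index built this way cannot report duplicates, so `fresh_result` gives a baseline to compare against `subset.index.has_duplicates` on the installed version.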

Installed Versions

INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.12.7
python-bits           : 64
OS                    : Windows
OS-release            : 11
Version               : 10.0.22631
machine               : AMD64
processor             : Intel64 Family 6 Model 154 Stepping 4, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : English_United States.1252
pandas                : 2.2.3
numpy                 : 2.1.3
pytz                  : 2024.2
dateutil              : 2.9.0.post0
pip                   : 24.2
Cython                : None
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
psycopg2              : None
pymysql               : None
pyarrow               : None
pyreadstat            : None
pytest                : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None

rhshadrach commented 3 days ago

Thanks for the report! I can reproduce this on the 2.2.x branch but not on main. You've checked the box that you've verified this bug exists on the main branch of pandas. Can you confirm if you've done that?

HungryZebra563 commented 3 days ago

Sorry, I'm an idiot. I forgot to activate the environment that had the dev install....

It is indeed fixed on the main branch, so I'll close the issue.
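
For anyone double-checking a report like this one, here is a rough sketch of how to confirm which pandas installation the active interpreter is actually using before ticking the "main branch" box. The helper name `describe_pandas_env` is hypothetical; it only reads `sys.executable`, `pd.__version__`, and `pd.__file__`.

```python
import sys

import pandas as pd


def describe_pandas_env():
    """Collect the interpreter path plus the version and location of the imported pandas."""
    return {
        "interpreter": sys.executable,
        "version": pd.__version__,
        "location": pd.__file__,
    }


env = describe_pandas_env()
for key, value in env.items():
    print(f"{key}: {value}")
```

If the printed version is a release such as 2.2.3 rather than a development version string, the dev environment is probably not the one that is active.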