pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

PERF: index.unique much slower than get_level_values.drop_duplicates #60213

Open jacek-pliszka opened 3 weeks ago

jacek-pliszka commented 3 weeks ago

Pandas version checks

Reproducible Example

It is not very important, but still quite surprising: unique should be the preferred method and the faster one, yet here it is about twice as slow.

import pandas as pd

df = pd.DataFrame({"M": ["M1", "M2"], "P": ["P1", "P2"], "V": [1.0, 2.0]})
i = df.set_index(['M', 'P']).index

In [6]: %timeit i.unique("M")
30.9 µs ± 958 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit i.get_level_values('M').drop_duplicates()
16.1 µs ± 84 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Installed Versions

INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.12.7
python-bits           : 64
OS                    : Linux
OS-release            : 6.11.5-200.fc40.x86_64
Version               : #1 SMP PREEMPT_DYNAMIC Tue Oct 22 19:13:11 UTC 2024
machine               : x86_64
processor             :
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8
pandas                : 2.2.3
numpy                 : 2.1.3
pytz                  : 2024.2
dateutil              : 2.9.0.post0
pip                   : 23.3.2
Cython                : 3.0.9
sphinx                : None
IPython               : 8.23.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2024.6.1
html5lib              : 1.1
hypothesis            : None
gcsfs                 : 2023.6.0+1.g7cc53d9
jinja2                : 3.1.4
lxml.etree            : 5.1.0
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : None
psycopg2              : 2.9.9
pymysql               : 1.4.6
pyarrow               : 17.0.0
pyreadstat            : None
pytest                : 7.4.3
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.11.3
sqlalchemy            : 2.0.36
tables                : N/A
tabulate              : 0.9.0
xarray                : N/A
xlrd                  : 2.0.1
xlsxwriter            : 3.1.9
zstandard             : 0.22.0
tzdata                : 2024.2
qtpy                  : 2.4.1
pyqt5                 : None

Prior Performance

No response

rhshadrach commented 3 weeks ago

Thanks for the report. On that size of data, you're just measuring overhead.

size = 100_000
df = pd.DataFrame({"M": ["M1", "M2"] * size, "P": ["P1", "P2"] * size, "V": [1.0, 2.0] * size})
i = df.set_index(['M', 'P']).index

%timeit i.unique("M")
# 466 μs ± 3.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit i.get_level_values('M').drop_duplicates()
# 3.43 ms ± 12.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

.unique is 7 times faster here.
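For anyone comparing the two spellings, a minimal sanity check (a sketch assuming only that pandas is importable; the `size` and column names follow the snippet above) that both calls produce the same unique level values, so the timings compare like for like:

```python
import pandas as pd

size = 100_000
df = pd.DataFrame({"M": ["M1", "M2"] * size,
                   "P": ["P1", "P2"] * size,
                   "V": [1.0, 2.0] * size})
i = df.set_index(["M", "P"]).index

# MultiIndex.unique(level) and get_level_values(level).drop_duplicates()
# both return the distinct values of that level as a flat Index.
a = i.unique("M")
b = i.get_level_values("M").drop_duplicates()
print(a.equals(b))  # True: both are Index(['M1', 'M2'])
```

The difference is only in how much intermediate work each path does, not in the result, which is why overhead dominates at tiny sizes while `.unique` wins once the index is large.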