pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

PERF: Extreme regression in 2.2 when aggregating a DataFrame with .attrs data #60070

Open buhrmann opened 1 month ago

buhrmann commented 1 month ago

Pandas version checks

Reproducible Example

Hi, it seems some interaction between copy-on-write and .attrs data leads to extremely slow performance, at least with custom aggregations. In the code below, the timed aggregations all perform identically in v2.1. In v2.2, however, the last one, with custom .attrs data and copy-on-write enabled, is about 10x slower. With my original dataset, which I cannot share but which is simply larger in both dimensions, the result was even more extreme: almost 50x slower (from less than a second to 40s).

import numpy as np
import pandas as pd
from sklearn import datasets
from pandas import options as pdopt

print(f"{pd.__version__=}")

# Fetch an example dataset and add a random grouping column with ~20,000 groups
X, y = datasets.fetch_covtype(return_X_y=True, as_frame=True)
X["group"] = np.random.choice(range(20_000), size=len(X))

print("\nExecution times with and without metadata before setting copy_on_write to 'warn'")
%timeit -n1 -r1 X.groupby("group").Elevation.apply(lambda ser: (ser >= 3000).sum() / len(ser))
X.attrs["metadata"] = {col: {"hello": {"world": "foobar"}} for col in X.columns}
%timeit -n1 -r1 X.groupby("group").Elevation.apply(lambda ser: (ser >= 3000).sum() / len(ser))

pdopt.mode.copy_on_write = True  # "warn" reproduces the same slowdown (see follow-up comment below)

# Recreate the frame and repeat the same timings with copy_on_write enabled
X, y = datasets.fetch_covtype(return_X_y=True, as_frame=True)
X["group"] = np.random.choice(range(20_000), size=len(X))

print("\nExecution times with and without metadata after setting copy_on_write to 'warn'")
%timeit -n1 -r1 X.groupby("group").Elevation.apply(lambda ser: (ser >= 3000).sum() / len(ser))
X.attrs["metadata"] = {col: {"hello": {"world": "foobar"}} for col in X.columns}
%timeit -n1 -r1 X.groupby("group").Elevation.apply(lambda ser: (ser >= 3000).sum() / len(ser))

The output:

pd.__version__='2.2.3'

Execution times with and without metadata before setting copy_on_write to 'warn'
661 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
667 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Execution times with and without metadata after setting copy_on_write to 'warn'
671 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.22 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0691c5cf90477d3503834d983f69350f250a6ff7
python : 3.10.15
python-bits : 64
OS : Darwin
OS-release : 24.0.0
Version : Darwin Kernel Version 24.0.0: Tue Sep 24 23:36:26 PDT 2024; root:xnu-11215.1.12~1/RELEASE_ARM64_T8103
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
pip : 24.2
Cython : 0.29.37
sphinx : 8.1.3
IPython : 8.28.0
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.9.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.4
lxml.etree : 5.3.0
matplotlib : 3.9.2
numba : 0.60.0
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : 0.23.0
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 16.1.0
pyreadstat : 1.2.7
pytest : 8.3.3
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 2.0.36
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2024.2
qtpy : None
pyqt5 : None

Prior Performance

pd.__version__='2.1.4'

Execution times with and without metadata before setting copy_on_write to 'warn'
703 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
695 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Execution times with and without metadata after setting copy_on_write to 'warn'
694 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
691 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
buhrmann commented 1 month ago

On a related note: reading the 2.2 docs, I was under the impression that copy_on_write = "warn" would only warn about certain cases but not actually enable copy-on-write mode, yet that seems to be what's happening here. If so, perhaps the docs could make that clearer.
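For reference, a minimal sketch of the values the option accepts in pandas 2.x (per the 2.2 docs; whether "warn" also changes runtime behavior here is exactly the question above):

import pandas as pd

# The three accepted values for the option in pandas 2.x:
pd.set_option("mode.copy_on_write", False)   # legacy copy semantics (the 2.x default)
pd.set_option("mode.copy_on_write", "warn")  # warn about operations whose behavior will change under CoW
pd.set_option("mode.copy_on_write", True)    # enable copy-on-write behavior now
print(pd.get_option("mode.copy_on_write"))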

buhrmann commented 1 month ago

Hm, the problem seems to occur even in some cases in v2.2 with copy_on_write=False, though I haven't managed to create a minimal reproducible example yet. For now, the only safe option seems to be to stay on <2.2.

rhshadrach commented 1 month ago

Thanks for the report. The issue here is that (ser >= 3000).sum() / len(ser) needs to copy the attrs data for every group. I don't think there is a way around this. The solution to the performance issue is to not use apply.

%timeit (X["Elevation"] >= 3000).groupby(X["group"]).mean()
# 7.99 ms ± 65.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
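As for the cause, a quick illustrative check (a sketch, not part of the original timings): if the diagnosis above is right, every per-group Series handed to the UDF carries the frame's attrs, so the nested metadata dict gets propagated (and, since 2.2, deep-copied) once per group, roughly 20,000 times here.

# Check that the per-group Series inherit the frame's attrs;
# expected to print True if that is indeed what drives the slowdown.
carries_attrs = X.groupby("group").Elevation.apply(lambda ser: bool(ser.attrs))
print(carries_attrs.all())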

Does this solve your issue?

timhoffm commented 1 week ago

I haven't looked into the details, but note that since 2.2 (https://github.com/pandas-dev/pandas/pull/55314) attrs are always deep-copied to prevent accidental data sharing (motivation: safety over performance). This should be fast if attrs is just a small dict with a handful of metadata. If performance is critical and you have a lot of context data, attrs is likely not suitable and you should manage that state separately.
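A possible workaround sketch along those lines (assuming the per-group copying of attrs is indeed the bottleneck; this is illustrative, not an official recommendation): detach the heavy metadata before the hot groupby and reattach it afterwards, or keep it in a plain dict next to the frame instead of in .attrs.

# Detach the nested metadata so it is not deep-copied for every group,
# then restore it once the aggregation is done.
metadata = X.attrs.pop("metadata", None)
try:
    result = X.groupby("group").Elevation.apply(
        lambda ser: (ser >= 3000).sum() / len(ser)
    )
finally:
    if metadata is not None:
        X.attrs["metadata"] = metadata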