Open buhrmann opened 1 month ago
On a related note, reading the docs for 2.2 I was under the impression that copy_on_write = "warn" would only warn about certain cases, but not actually enable copy_on_write mode, which seems to be what's happening here? If so, perhaps the docs could make that clearer...
Hm, the problem even seems to occur in some cases in v2.2 with copy-on-write=False, though I haven't managed to create a minimal reproducible example yet. But for now the only safe option seems to be to stick to <2.2.
Thanks for the report. The issue here is that (ser >= 3000).sum() / len(ser) needs to copy the attrs data for every group. I don't think there is a way around this. The solution to the performance issue is to not use apply.
%timeit (X["Elevation"] >= 3000).groupby(X["group"]).mean()
# 7.99 ms ± 65.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Does this solve your issue?
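For reference, a minimal sketch of why the vectorized form can stand in for the apply-based one. The data here is synthetic (the column names Elevation and group follow the snippet above, everything else is made up for illustration), and the apply call is only an assumption about what the original aggregation looked like:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the real dataset from the report is not available.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Elevation": rng.integers(1800, 3900, size=1_000),
    "group": rng.integers(0, 10, size=1_000),
})

# Custom aggregation via apply: the lambda runs once per group.
via_apply = X.groupby("group")["Elevation"].apply(
    lambda ser: (ser >= 3000).sum() / len(ser)
)

# Vectorized form: build the boolean mask once, then take the group mean.
vectorized = (X["Elevation"] >= 3000).groupby(X["group"]).mean()

# Both compute the share of rows per group with Elevation >= 3000.
pd.testing.assert_series_equal(via_apply, vectorized, check_names=False)
```

The mean of a boolean mask is the fraction of True values, so the per-group Python call (and whatever per-group copying comes with it) is not needed.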
I haven't looked into the details, but note that since 2.2 (https://github.com/pandas-dev/pandas/pull/55314) attrs are always deep-copied to prevent accidental data sharing (motivation: safety over performance). It should be fast if attrs is just a small dict with a handful of metadata. If performance is critical and you have a lot of context data, attrs is likely not suited and you should manage that state separately.
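One possible way to "manage that state separately", as a rough sketch only: keep small metadata in attrs and pass anything bulky around explicitly. The names bulky_context and share_above are hypothetical and not part of pandas or the original report:

```python
import pandas as pd

# Hypothetical bulky context, deliberately kept outside the DataFrame
# so that pandas never has to copy it.
bulky_context = {"threshold": 3000, "lookup": list(range(100_000))}

df = pd.DataFrame({"Elevation": [2500, 3100, 3300], "group": [0, 0, 1]})
df.attrs["source"] = "example"  # small metadata is fine in attrs

def share_above(frame, context):
    # The context is passed in explicitly rather than read from frame.attrs.
    mask = frame["Elevation"] >= context["threshold"]
    return mask.groupby(frame["group"]).mean()

result = share_above(df, bulky_context)
```

Whether this fits depends on how the context is used downstream; the point is only that large objects do not have to live in attrs.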
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this issue exists on the latest version of pandas.
[ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Hi, so it seems some interaction between copy-on-write and .attrs data leads to extremely slow performance, at least with custom aggregations. In the code below, the timed aggregations all perform identically in v2.1. But in v2.2, the last one, with custom .attrs data and copy-on-write enabled, is about 10x slower. With my original dataset, which I cannot share but which is simply larger in both dimensions, the result was even more extreme: almost 50x slower (from less than a second to 40s).
The output:
Installed Versions
Prior Performance