ENH: Make performance warnings opt-in and more noisy

rhshadrach commented 1 year ago

At times, I think our performance warnings can be too noisy. For example, if I'm doing an ad hoc analysis on a small data set, I don't care about performance. Other times, I purposefully have a MultiIndex with a particular order because it's important to the analysis I'm doing, and working with the data produces lexsort warnings.

One idea is to make performance warnings opt-in (via an option) and more noisy. The idea here would be that a user has written a piece of code they're happy with, and then they can enable performance warnings to see if pandas thinks they're being ill-performant. Doing this, we can start emitting more warnings to try to nudge users to more performant options.

Some cases where we could warn:

Using agg / apply / transform with a Python callable (a UDF)
Using merge on columns rather than indices if the index would be unique
Using .copy() with Copy-on-Write enabled (ref: https://github.com/pandas-dev/pandas/pull/51463#issuecomment-1745025139)

cc @Dr-Irv @phofl @jbrockmendel @jorisvandenbossche

Dr-Irv commented 1 year ago

Nice idea. Would we have separate options for each type of warning?

rhshadrach commented 1 year ago

Would we have separate options for each type of warning?

What do you mean by each type here? I'm proposing to only doing this for PerformanceWarning.

attack68 commented 1 year ago

I think this applicable to other sorts of warning too, to be honest.

In my own application I have tests that fail on warnings and do not want to pass warnings through to users for backend code. So I end up implementing it both ways, for example:

if version.parse(pd.__version__) >= version.parse("2.1.0"):
    # applymap issues a deprecation warning with version <2.1.0
    # TODO (low): clean this up when setting a minimum pandas version at 2.1.0
    df[df["dtype"] == "bool"] = df[df["dtype"] == "bool"].map(_map_true_false)
else:
    df[df["dtype"] == "bool"] = df[df["dtype"] == "bool"].applymap(_map_true_false)
return df

Probably there are better ways to handle this but sometimes the most obvious patches are more efficient to implement (timewise). I would prefer if I could be made aware of the warning in my backend code, but turn them off in production, and I don't really want to implements some kind of intermediate warnings catcher.

rhshadrach commented 1 year ago

I think this applicable to other sorts of warning too, to be honest.

I think I'd be against enabling users to disable FutureWarnings via an option. That sounds like a footgun.

Dr-Irv commented 1 year ago

Would we have separate options for each type of warning?

What do you mean by each type here? I'm proposing to only doing this for PerformanceWarning.

Consider the 3 examples that you list above. You could have a separate option for turning on/off the PerformanceWarning for each of those examples.

rhshadrach commented 1 year ago

It sounds like you have a use case in mind that is at odds with what I'm proposing here, but I'm not sure what that use case is. To be sure, what I'm proposing is that users don't typically run with PerformanceWarning enabled - both in the development of new code and in repeated runs of existing. Rather, a user has a piece of code they want to make more performant, they enable the PerformanceWarnings to see where pandas makes suggestions, modify the code, and then disable PerformanceWarnings as they run it in the future.

I'm curious to learn more about the use case you have in mind where (I think) every different warning has an option to control it.

Dr-Irv commented 1 year ago

I'm curious to learn more about the use case you have in mind where (I think) every different warning has an option to control it.

Let's say I only want to find where I can remove .copy(). So I don't care about other performance warnings. I just want to know where I can get rid of those extra defensive .copy() calls.

rhshadrach commented 1 year ago

I think users could accomplish this by grepping the output. But for both maintainers and users in the scenario I proposed, it seems like a headache to have a plethora of different options.

Dr-Irv commented 1 year ago

I think users could accomplish this by grepping the output. But for both maintainers and users in the scenario I proposed, it seems like a headache to have a plethora of different options.

That makes sense.

jorisvandenbossche commented 1 year ago

To have an idea about the current warnings we might be talking about, a list of PerformanceWarnings I found with a quick search:

"Adding/subtracting object-dtype array to [DatetimeLikeArray] not vectorized." (note: I think this is typically for DateOffsets)
"Non-vectorized DateOffset being applied to Series or DatetimeIndex." (similar as above)
"Falling back on a non-pyarrow code path which may decrease performance." (for some corner case features in our string methods)
For SparseArray:
- "searchsorted requires high memory usage."
- "Concatenating sparse arrays with multiple fill values"
MultiIndex:
- "dropping on a non-lexsorted multi-index without a level parameter may impact performance."
- "indexing past lexsort depth may impact performance."
"DataFrame is highly fragmented. ..."
unstack: "The following operation may generate {num_cells} cells in the resulting pandas object."
align: "Alignment difference on axis {axis} is larger than an order of magnitude on term {repr(terms[i].name)}, by more than {ordm:.4g}; performance may suffer." (never seen this one ;))

I think I am +1 on the proposal. In any case it would be nice to have an easier way to turn them off.

arnaudlegout commented 11 months ago

I would also add all pandas methods that have a pure Python implementation that is known to be slow. Even if there is no alternative in pandas, there might be an alternative with an optimized implementation in another library. Getting this warning will at least give a hint where my code could be slow.

Here are other suggestions. I do not claim they must lead to a PerformanceWarning, but that it is practical performance issues in my code that I spent time finding and fixing. Having a suggestion (with a PerformanceWarning) would have saved me a lot of time.

np.isin is faster than Series.isin
df.groupby(group_col)[col].unique().apply(len) is faster than df.groupby(group_col)[col].nunique()
array[row, col] (array being a numpy array) is much faster (250x) than df.at[row, col] (assuming you can convert you df to a numpy array, basically, all columns are of the same numpy dtype)

phofl commented 11 months ago

@arnaudlegout These examples aren't black and white, the array calls will be slower depending on what dtypes you have, for example the last one would trigger a copy if your dataframe has multiple dtypes and you convert to a numpy array

arnaudlegout commented 11 months ago

@phofl right. I did not intend to say they are clear perf warning candidates. When you implement a method, you know you have certain code paths that are suboptimal, in that case, it could be nice to add a PerformanceWarning to make clear to the user you are in a slow code path execution.

As another perf warning candidate, there is a groupby on a non-sorted index vs. a sorted index.

jorisvandenbossche commented 11 months ago

df.groupby(group_col)[col].unique().apply(len) is faster than df.groupby(group_col)[col].nunique()

@arnaudlegout if you have a specific example that shows this, I think it would be worth opening an issue for this. Because to me it seems the purpose of nunique() to be more efficient (in addition to more convenient to use). So if that is not that the case, IMO we shouldn't warn about that but rather fix that ;) (for example by simply using the .unique().apply(len) under the hood)

arnaudlegout commented 11 months ago

df.groupby(group_col)[col].unique().apply(len) is faster than df.groupby(group_col)[col].nunique()

@arnaudlegout if you have a specific example that shows this, I think it would be worth opening an issue for this. Because to me it seems the purpose of nunique() to be more efficient (in addition to more convenient to use). So if that is not that the case, IMO we shouldn't warn about that but rather fix that ;) (for example by simply using the .unique().apply(len) under the hood)

I opened issue #55972 In my reproducible example, unique.apply(len) is 3x faster than nunique on a Series.groupby

bionicles commented 9 months ago

Just throwing it out there, detecting testing time, e.g.:

if "PYTEST_CURRENT_TEST" in os.environ:

Would be a wonderful time to turn on these warnings.

Second, the regex capturing groups warning counts because it compiles a regex pattern every call to .str.contains to warn us about unused capturing groups. I use named groups to organize complicated regex spaghetti, so this warning just gets ignored. Compiling a pattern every time just to throw it away is a waste of time.

IMHO it could be completely removed. The issue is here: https://github.com/pandas-dev/pandas/issues/56798 and my PR to delete it didn't get merged but it's an easy fix.

pandas-dev / pandas

ENH: Make performance warnings opt-in and more noisy #55385