pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `df.agg` raises Exception on valid function in `df.apply` #45800

Open attack68 opened 2 years ago

attack68 commented 2 years ago

Issue Description

The following `agg` call currently works but emits a warning:

import numpy as np
import pandas as pd
from pandas import Series

df = pd.DataFrame({
    "A": Series((1000, 2000), dtype=int),
    "B": Series((1000, 2000), dtype=np.int64),
    "C": Series(["a", "b"]),
})

df.agg(["mean", "sum"])
           A       B    C
mean  1500.0  1500.0  NaN
sum   3000.0  3000.0   ab

FutureWarning: ['C'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
  print(df.agg(["mean", "sum"]))

However, I do not want to do that.

Instead, I tried to design a function that would trap the error:

def mean2(s: Series):
    # Fall back to pd.NA when the mean cannot be computed (e.g. for strings).
    try:
        ret = s.mean()
    except Exception:
        ret = pd.NA
    return ret

df.agg([mean2, "sum"])
<ValueError: cannot combine transform and aggregation operations>

Oddly, the same function works with `apply`, which is what the `agg` docs point to for guidance:

df.apply(mean2, axis=0)
A    1500.0
B    1500.0
C      <NA>
dtype: object
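
Since `apply` hands each column to the function as a whole Series, one possible way to get several aggregates per column without the list form of `agg` is sketched below. The helper name `summarize` is made up for illustration, and column `C` is given an explicit `object` dtype here (an assumption) so that the string handling does not depend on the pandas version.

```python
import numpy as np
import pandas as pd


def mean2(s):
    # Fall back to pd.NA when the mean cannot be computed (e.g. for strings).
    try:
        return s.mean()
    except Exception:
        return pd.NA


def summarize(col):
    # Hypothetical helper: one labelled value per aggregate for a single column.
    return pd.Series({"mean2": mean2(col), "sum": col.sum()})


df = pd.DataFrame({
    "A": pd.Series((1000, 2000), dtype=int),
    "B": pd.Series((1000, 2000), dtype=np.int64),
    "C": pd.Series(["a", "b"], dtype=object),  # assumption: plain object strings
})

# apply calls summarize once per column; the returned Series become the
# columns of the result, indexed by aggregate name.
result = df.apply(summarize)
print(result)
```

This sidesteps the list handling in `agg` entirely, at the cost of spelling out each aggregate by hand.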

So what's the solution here?

Jaafarben2 commented 2 years ago

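It seems that when a list of functions is passed to `pandas.DataFrame.aggregate`, pandas first tries to apply each function to every individual data point. Only when that fails does it fall back to the correct behaviour, which is calling the function with the whole Series to be aggregated. Because `mean2` swallows every exception, the element-wise call never fails and the result is treated as a transform.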

Source: https://stackoverflow.com/questions/54890646/pandas-fails-to-aggregate-with-a-list-of-aggregation-functions
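
To check this behaviour on a given installation, a small probe can record the type of every argument it receives. This is only a sketch: `probe` and `seen` are hypothetical names, and what ends up in `seen` depends on the pandas version (newer releases may pass the Series straight away, without the element-wise attempt).

```python
import pandas as pd

seen = []  # type names of the arguments pandas passes to probe


def probe(x):
    seen.append(type(x).__name__)
    if not isinstance(x, pd.Series):
        # Refuse scalars so pandas has to retry with the whole Series.
        raise ValueError("need Series argument")
    return x.sum()


s = pd.Series([1000, 2000], name="A")
result = s.agg([probe])

# The last entry is "Series"; on versions that try element-wise first,
# a scalar type name appears before it.
print(seen)
print(result)
```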

A workaround is to force Series arguments by raising whenever the function receives anything else:

def mean2(s: Series):
    # Reject scalars so pandas falls back to passing the whole Series.
    if not isinstance(s, Series):
        raise ValueError('need Series argument')
    try:
        ret = s.mean()
    except Exception:
        ret = pd.NA
    return ret

df.agg([mean2, "sum"])
            A       B     C
mean2  1500.0  1500.0  <NA>
sum    3000.0  3000.0    ab
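
If several error-trapping reducers need the same guard, it could be factored into a decorator. The sketch below is one possible way to do that, not something pandas provides: `require_series` is a hypothetical name, and column `C` is given an explicit `object` dtype (an assumption) so that `"sum"` concatenates the strings regardless of pandas version.

```python
import functools

import numpy as np
import pandas as pd


def require_series(func):
    """Hypothetical decorator: raise unless the first argument is a Series."""
    @functools.wraps(func)  # keeps func.__name__, which agg uses as the row label
    def wrapper(s, *args, **kwargs):
        if not isinstance(s, pd.Series):
            raise ValueError("need Series argument")
        return func(s, *args, **kwargs)
    return wrapper


@require_series
def mean2(s):
    # Fall back to pd.NA when the mean cannot be computed (e.g. for strings).
    try:
        return s.mean()
    except Exception:
        return pd.NA


df = pd.DataFrame({
    "A": pd.Series((1000, 2000), dtype=int),
    "B": pd.Series((1000, 2000), dtype=np.int64),
    "C": pd.Series(["a", "b"], dtype=object),  # assumption: plain object strings
})

result = df.agg([mean2, "sum"])
print(result)
```

The guard has to sit outside the `try`/`except`, otherwise the broad `except Exception` would swallow the `ValueError` as well.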