narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License

[Enh]: `mode` in grouped context #981

Open FBruzzesi opened 1 week ago

FBruzzesi commented 1 week ago

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

The mode operation in a group-by context behaves very differently across pandas, Polars and PyArrow.

As when mode was first introduced, I am looking at this with the skrub use case in mind.

Please describe the purpose of the new feature or describe the problem to solve.

Consider the following snippet:

import pandas as pd
import polars as pl
import pyarrow as pa

data = {
    "g1": [1, 1, 1, 1],
    "x1": [1, 1, 2, 3],
    "x2": [1, 1, 2, 2],
}

Polars

Polars behaves consistently: even if the column is unimodal, it returns a list[T].

grp_pl = pl.DataFrame(data).group_by("g1")

print(grp_pl.agg(pl.col("x1").mode()))
print(grp_pl.agg(pl.col("x2").mode()))
shape: (1, 2)
┌─────┬───────────┐
│ g1  ┆ x1        │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1   ┆ [1]       │
└─────┴───────────┘
shape: (1, 2)
┌─────┬───────────┐
│ g1  ┆ x2        │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1   ┆ [1, 2]    │
└─────┴───────────┘

pandas

pandas fails for multi-modal results here, since the aggregation would not return a scalar.

grp_pd = pd.DataFrame(data).groupby("g1")

print(grp_pd.agg(x1 = ("x1", pd.Series.mode)))  # uni-modal
    x1
g1    
1    1
grp_pd.agg(x2 = ("x2", pd.Series.mode))  # multi-modal

ValueError: Must produce aggregated value
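
For comparison, here is a minimal sketch of one possible pandas workaround that mimics the Polars list[T] output. The helper name mode_as_list is hypothetical (not a pandas or Narwhals API), and the sketch assumes that pandas accepts a plain Python list as the aggregated value of a group:

```python
import pandas as pd

data = {
    "g1": [1, 1, 1, 1],
    "x1": [1, 1, 2, 3],
    "x2": [1, 1, 2, 2],
}


def mode_as_list(s: pd.Series) -> list:
    # Hypothetical helper: Series.mode returns a Series holding every mode,
    # so converting it to a plain list gives one Python object per group
    # instead of a numeric ndarray.
    return s.mode().tolist()


result = pd.DataFrame(data).groupby("g1").agg(
    x1=("x1", mode_as_list),  # uni-modal group
    x2=("x2", mode_as_list),  # multi-modal group
)
print(result)
```

This still goes through a custom Python function, so it only illustrates the target output shape and is not something Narwhals could translate as-is.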

pyarrow

I was not able to find a way to get the mode value directly in a grouped context.

Suggest a solution if possible.

Unsure how to proceed.

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response

MarcoGorelli commented 1 week ago

Interesting, thanks - how does skrub use mode?

FBruzzesi commented 1 week ago

Interesting, thanks - how does skrub use mode?

It's the default operation for aggregating non-numeric columns (in both the polars and pandas implementations).

FBruzzesi commented 1 week ago

Ok, I don't know if this is good, or bad, or what... but... if the values are not numeric and the group is multi-modal, then we get a list in pandas as well:

data = {
    "g1": [1, 1, 1, 1],
    "x1": ["x", "x", "y", "z"],
    "x2": ["a", "a", "b", "b"],
}

(pd.DataFrame(data)
.groupby("g1")
.agg(
    x1=("x1", pd.Series.mode),
    x2=("x2", pd.Series.mode)
    )
)
   x1      x2
g1           
1   x  [a, b]

But this would still not be supported in Narwhals: either it has to use pd.Series.mode, or we need to return a scalar value from Expr.mode.
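
To make the scalar option concrete, here is a rough sketch of what "one scalar per group" could look like on the pandas side. It assumes ties are broken by taking the first mode, and relies on Series.mode returning its values sorted; first_mode is a hypothetical helper, not a proposed Narwhals API:

```python
import pandas as pd

data = {
    "g1": [1, 1, 1, 1],
    "x1": ["x", "x", "y", "z"],
    "x2": ["a", "a", "b", "b"],
}


def first_mode(s: pd.Series):
    # Hypothetical tie-break: keep only the first of the modes, so every
    # group yields exactly one scalar even when it is multi-modal.
    return s.mode().iloc[0]


result = pd.DataFrame(data).groupby("g1").agg(
    x1=("x1", first_mode),
    x2=("x2", first_mode),
)
print(result)
```

Whether silently dropping the other modes is acceptable is a design question in itself, since it hides the multi-modal case from the caller.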