[Enh]: support fast, non-simple pandas grouped operations

machow commented 2 weeks ago

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

This isn't a feature needed by a specific tool, but one provided by tools like siuba and ibis.

Please describe the purpose of the new feature or describe the problem to solve.

Suppose we've converted a simple dataset to narwhals:

import narwhals as nw
from siuba.data import mtcars

nw_cars = nw.from_native(mtcars)
nw_g_cyl = nw_cars.group_by("cyl")

Currently, narwhals only supports fast execution of simple expressions over pandas. For example:

fast: nw_g_cyl.agg(nw.col("hp").mean())
slow (via apply): nw_g_cyl.agg(nw.col("hp").mean() - 1)
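To make the gap concrete, here is a minimal plain-pandas sketch of the two execution strategies. It is not narwhals internals, and the small `df` is made-up data standing in for mtcars. The point is only that `mean() - 1` does not need a per-group Python call: it can be one vectorized aggregation followed by one vectorized subtraction.

```python
import pandas as pd

# Toy stand-in for mtcars (values are invented for illustration).
df = pd.DataFrame({"cyl": [4, 4, 6, 6, 8], "hp": [90, 110, 100, 120, 200]})
g = df.groupby("cyl")

# Slow path: a Python function is called once per group.
slow = g["hp"].apply(lambda s: s.mean() - 1)

# Fast path: one built-in aggregation, then a vectorized subtraction
# on the small per-group result.
fast = g["hp"].mean() - 1

print(fast.tolist())
```

Both paths give one value per `cyl` group; the fast path just avoids the per-group Python overhead.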

Something really powerful about polars, IMO, is that complex aggregations are easy to express and fast to execute. It would be really handy to bring this via polars expressions to execution over pandas DataFrames.

Suggest a solution if possible.

Libraries like siuba have experimental machinery to speed up these operations:

from siuba.data import mtcars
from siuba import _
from siuba.experimental.pd_groups import fast_summarize, fast_filter

# aggregate to mean of hp - 1
fast_summarize(mtcars.groupby("cyl"), res = _["hp"].mean() - 1)

# filter to keep rows that are greater than the mean within their group
fast_filter(mtcars.groupby("cyl"), _["hp"] > _["hp"].mean())
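For comparison, a hand-written pandas version of the grouped filter might look like the sketch below. This is an assumption about the general shape of a fast path (broadcast the group mean with `transform`, then compare once), not a description of what `fast_filter` does internally. The data is invented.

```python
import pandas as pd

# Toy stand-in for mtcars (values are invented for illustration).
df = pd.DataFrame({"cyl": [4, 4, 6, 6, 8], "hp": [90, 110, 100, 120, 200]})
g = df.groupby("cyl")

# Broadcast each group's mean back to one value per row,
# then do a single vectorized comparison over the whole column.
mask = df["hp"] > g["hp"].transform("mean")
kept = df[mask]

print(kept)
```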

Importantly, the bulk of the code that makes this happen depends only on pandas. The key is adding a subclass of the pandas SeriesGroupBy, called GroupByAgg, which represents the result of a grouped aggregation.

This means that each grouped operation has a well-defined result type:

_["hp"].mean() - 1 is a GroupByAgg minus a scalar, so it should return a GroupByAgg
_["hp"].mean() / _["cyl"].mean() is a GroupByAgg divided by a GroupByAgg, so it should return a GroupByAgg
_["hp"] > _["hp"].mean() is a SeriesGroupBy > GroupByAgg, so it should return a SeriesGroupBy

Essentially, this allows composing grouped operations by representing types of grouped results, as opposed to the pandas behavior of returning Series for everything.
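The typing rules above can be sketched with two tiny wrapper classes. This is a simplified illustration, not siuba's implementation: `Grouped` and `Agg` are hypothetical names, and they wrap plain Series rather than subclassing SeriesGroupBy. `Agg` plays the role of GroupByAgg (one value per group), and `Grouped` plays the role of SeriesGroupBy (one value per row).

```python
import pandas as pd


class Grouped:
    """Hypothetical: row-level values plus the group key of each row."""

    def __init__(self, values: pd.Series, keys: pd.Series):
        self.values = values  # one value per original row
        self.keys = keys      # group key for each original row

    def mean(self) -> "Agg":
        # Aggregating a row-level result yields a per-group result.
        return Agg(self.values.groupby(self.keys).mean(), self.keys)

    def __gt__(self, other):
        # Row-level > per-group: broadcast the aggregate, stay row-level.
        rhs = other.broadcast() if isinstance(other, Agg) else other
        return Grouped(self.values > rhs, self.keys)


class Agg:
    """Hypothetical stand-in for GroupByAgg: one value per group."""

    def __init__(self, result: pd.Series, keys: pd.Series):
        self.result = result  # indexed by group key
        self.keys = keys

    def broadcast(self) -> pd.Series:
        # Stretch the per-group values back to one value per row.
        return self.keys.map(self.result)

    def __sub__(self, other):
        rhs = other.result if isinstance(other, Agg) else other
        return Agg(self.result - rhs, self.keys)

    def __truediv__(self, other):
        rhs = other.result if isinstance(other, Agg) else other
        return Agg(self.result / rhs, self.keys)


# Toy stand-in for mtcars (values are invented for illustration).
df = pd.DataFrame({"cyl": [4, 4, 6, 6, 8], "hp": [90, 110, 100, 120, 200]})
hp = Grouped(df["hp"], df["cyl"])
cyl = Grouped(df["cyl"], df["cyl"])

agg_minus = hp.mean() - 1            # Agg - scalar   -> Agg
agg_ratio = hp.mean() / cyl.mean()   # Agg / Agg      -> Agg
row_mask = hp > hp.mean()            # Grouped > Agg  -> Grouped
```

Because every operation returns either `Grouped` or `Agg`, the result of one step always carries enough information (values plus group keys) to feed the next, which is what a bare pandas Series loses.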

Here's an example of manually triggering two of the complex expressions above:

from siuba import _
from siuba.data import mtcars
from siuba.experimental.pd_groups.translate import method_agg_op, method_el_op2
from siuba.experimental.pd_groups.groupby import broadcast_agg

g_cyl = mtcars.groupby("cyl")

f_mean = method_agg_op("mean", is_property=False, accessor=None)
f_sub = method_el_op2("__sub__", is_property=False, accessor=None)

# Example 1: _["hp"].mean() - 1
g_res = f_sub(f_mean(g_cyl["hp"]), 1)     # GroupByAgg (subclass of SeriesGroupBy)
broadcast_agg(g_res)                      # pandas Series

# Example 2: _["hp"] - _["hp"].mean()
f_sub(g_cyl["hp"], f_mean(g_cyl["hp"]))    # pandas SeriesGroupBy

For more information, see these resources:

If you have tried alternatives, please describe them below.

No response

Would you want to open a pull request?

yes

Additional information that may help us understand your needs.

I've never really loved maintaining the fast operations in siuba. I think this is because siuba's lazy expression API currently uses pandas Series methods, so the fast operations feel like a patch. However, I'm a giant fan of Polars, and use it a lot nowadays, so love the idea of executing the Polars API over a pandas DataFrame in fast ways.

I'm happy to upstream a lot of the logic for fast DataFrame operations into a separate library, or explore putting it inside narwhals. This could also be a good chance for me to get more familiar with how narwhals works. (No worries if it's out of scope! Also happy to spec / prototype more & flesh out a proposal)

MarcoGorelli commented 2 weeks ago

> Libraries like siuba have experimental machinery to speed up these operations:

wow, that's amazing!

I'm definitely open to this (so long as we can keep issuing a warning for cases when we aren't able to convert to an efficient implementation), I just thought it would be too hard. But if you've already thought about it and have a solution...let's do it! 🚀

super-keen to take a look at a PR if you open one!

machow commented 2 weeks ago

Sounds good. I should be able to take a pass at a PR next week!
