Open machow opened 2 weeks ago
Libraries like siuba have experimental machinery to speed up these operations:
wow, that's amazing!
I'm definitely open to this (so long as we can keep issuing a warning for cases when we aren't able to convert to an efficient implementation), I just thought it would be too hard. But if you've already thought about it and have a solution...let's do it! 🚀
super-keen to take a look at a PR if you open one!
Sounds good--I should be able to take a pass at a PR next week!
We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?
This isn't a feature needed by a specific tool, but one provided by tools like siuba and ibis.
Please describe the purpose of the new feature or describe the problem to solve.
Suppose we've converted a simple dataset to narwhals:
Currently, narwhals only supports fast execution of simple expressions over pandas. For example, compare a simple aggregation with a more complex one:
nw_g_cyl.select(nw.col("hp").mean())
nw_g_cyl.select(nw.col("hp").mean() - 1)
Something really powerful about polars, IMO, is that complex aggregations are easy to express and fast to execute. It would be really handy to bring this via polars expressions to execution over pandas DataFrames.
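To make the gap concrete, here is a minimal pandas-only sketch of the two ways the `mean() - 1` aggregation could be evaluated. It is an illustration under assumptions, not a description of what narwhals does internally: the `cars` data is made up, and the per-group `apply` stands in for any fallback that runs Python code once per group.

```python
import pandas as pd

# Small stand-in for the grouped dataset; the values are made up for illustration.
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "hp": [93.0, 66.0, 110.0, 105.0, 175.0, 245.0],
})
g_cyl = cars.groupby("cyl")

# Slow path: a Python lambda is called once per group.
slow = g_cyl["hp"].apply(lambda s: s.mean() - 1)

# Fast path: one vectorized grouped mean, then ordinary Series arithmetic
# on the per-group result.
fast = g_cyl["hp"].mean() - 1

print(fast)
```

Both paths give the same per-group values; the difference is only in how much work happens in Python versus in pandas' grouped routines.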
Suggest a solution if possible.
Libraries like siuba have experimental machinery to speed up these operations:
Importantly, the bulk of the code to make this happen only depends on pandas. The key is adding a subclass of the pandas SeriesGroupBy, called GroupByAgg, which represents the result of a grouped aggregation.
This means that grouped operations like...
_["hp"].mean() - 1
is a GroupByAgg minus a scalar, so should return a GroupByAgg
_["hp"].mean() / _["cyl"].mean()
is a GroupByAgg divided by a GroupByAgg, so should return a GroupByAgg
_["hp"] > _["hp"].mean()
is a SeriesGroupBy > GroupByAgg, so should return a SeriesGroupBy

Essentially, this allows composing grouped operations by representing types of grouped results, as opposed to the pandas behavior of returning Series for everything.
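The typing rules above can be sketched with a small pandas-only wrapper. This is not siuba's implementation: `GroupedAgg`, `from_agg`, and `broadcast` are hypothetical names, the class is a plain wrapper rather than a SeriesGroupBy subclass, and the row-level comparison returns an ordinary row-length Series instead of a SeriesGroupBy. The data is made up.

```python
import operator
import pandas as pd


class GroupedAgg:
    """Hypothetical stand-in for a grouped-aggregation result.

    Holds one value per group (a Series indexed by group key) plus the
    per-row group keys needed to broadcast back to row length.
    """

    def __init__(self, per_group: pd.Series, row_keys: pd.Series):
        self.per_group = per_group
        self.row_keys = row_keys

    @classmethod
    def from_agg(cls, frame, by, col, func):
        # Use pandas' built-in grouped aggregation (e.g. "mean").
        return cls(frame.groupby(by)[col].agg(func), frame[by])

    def _binary(self, other, op):
        # agg (op) agg and agg (op) scalar both stay one value per group.
        rhs = other.per_group if isinstance(other, GroupedAgg) else other
        return GroupedAgg(op(self.per_group, rhs), self.row_keys)

    def __sub__(self, other):
        return self._binary(other, operator.sub)

    def __truediv__(self, other):
        return self._binary(other, operator.truediv)

    def broadcast(self) -> pd.Series:
        # Look up each row's group value; keeps the original row index.
        return self.row_keys.map(self.per_group)


cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "hp": [93.0, 66.0, 110.0, 105.0, 175.0, 245.0],
})
mean_hp = GroupedAgg.from_agg(cars, "cyl", "hp", "mean")
mean_cyl = GroupedAgg.from_agg(cars, "cyl", "cyl", "mean")

minus_one = mean_hp - 1                        # agg - scalar -> agg
ratio = mean_hp / mean_cyl                     # agg / agg -> agg
above_mean = cars["hp"] > mean_hp.broadcast()  # rows > agg -> row-length result
```

Because each intermediate result records whether it is per-group or per-row, every step can stay a vectorized pandas call, and broadcasting happens only when a per-row value meets a per-group one.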
Here's an example of manually triggering two of the complex expressions above:
For more information, see these resources:
If you have tried alternatives, please describe them below.
No response
Would you want to open a pull request?
yes
Additional information that may help us understand your needs.
I've never really loved maintaining the fast operations in siuba. I think this is because siuba's lazy expression API currently uses pandas Series methods, so the fast operations feel like a patch. However, I'm a giant fan of Polars, and use it a lot nowadays, so love the idea of executing the Polars API over a pandas DataFrame in fast ways.
I'm happy to upstream a lot of the logic for fast DataFrame operations into a separate library, or explore putting it inside narwhals. This could also be a good chance for me to get more familiar with how narwhals works. (No worries if it's out of scope! Also happy to spec / prototype more & flesh out a proposal)