narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License

[Enh]: cumulative features #1371

Closed FBruzzesi closed 6 days ago

FBruzzesi commented 1 week ago

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

Cumulative features, together with forward fill and some other checks/hacks, would most likely be enough to enable the equivalent of pandas expanding operations. This is a requirement to complete https://github.com/plotly/plotly.py/issues/4834.

Please describe the purpose of the new feature or describe the problem to solve.

List of cumulative expressions supported by polars:

With these, we would enable the following additional univariate expanding operations: mean, var, std, skew, kurt.
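To illustrate how cumulative expressions give expanding statistics, here is a minimal plain-Python sketch (not Narwhals or polars code). The function names `expanding_mean` and `expanding_var` are hypothetical, and `itertools.accumulate` stands in for a cumulative sum. The expanding mean is the cumulative sum divided by the cumulative count, and the expanding variance can be derived from the cumulative sums of `x` and `x**2`.

```python
from itertools import accumulate


def expanding_mean(values):
    """Expanding mean: cumulative sum divided by cumulative count."""
    return [s / n for n, s in enumerate(accumulate(values), start=1)]


def expanding_var(values, ddof=1):
    """Expanding variance from cumulative sums of x and x**2.

    Positions with too few observations (n <= ddof) yield None.
    """
    cum_sum = accumulate(values)
    cum_sq = accumulate(v * v for v in values)
    out = []
    for n, (s, sq) in enumerate(zip(cum_sum, cum_sq), start=1):
        if n <= ddof:
            out.append(None)
        else:
            # sum((x - mean)**2) == sum(x**2) - sum(x)**2 / n
            out.append((sq - s * s / n) / (n - ddof))
    return out


values = [1.0, 2.0, 3.0, 4.0]
means = expanding_mean(values)
variances = expanding_var(values)
print(means)
print(variances)
```

The expanding standard deviation would be the square root of each variance. Note that this sum-of-squares formula can lose precision through cancellation on large or nearly constant inputs, so a real implementation might prefer a more numerically stable approach.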

What's left out is median, quantile and rank - I don't think we would be able to implement those 🥲 (see the entire pandas expanding window function list).

Group by context

Edit: Additionally, we should support these expressions in a group-by context. This is partially possible:

For the moment I would keep these out of the PRs that introduce the methods in the first place. Thanks @AlessandroMiola for pointing that out in #1384.

FBruzzesi commented 6 days ago

I am closing this issue as completed for now, although these expressions won't be available in a group_by context. I think for now it would be a bit too hard to support them, although it would definitely be nice to have for the over use case.

Even in pandas, although DataFrameGroupBy has cumsum and other cumulative operations, the behaviour seems a bit unexpected because the group keys are not kept in the output. Example from the docs:

>>> data = [[1, 8, 2], [1, 2, 5], [2, 6, 9]]
>>> df = pd.DataFrame(data, columns=["a", "b", "c"],
...                   index=["fox", "gorilla", "lion"])
>>> df
          a   b   c
fox       1   8   2
gorilla   1   2   5
lion      2   6   9

>>> df.groupby("a").cumsum()
          b   c
fox       8   2
gorilla  10   7
lion      6   9

As you can see, the output has no column "a", not even in the index.
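For comparison, here is a rough plain-Python sketch of what a group-wise cumulative sum looks like when the key is kept next to each result. The helper `grouped_cumsum` is hypothetical and is not part of pandas or Narwhals. It only mirrors the "a" and "b" columns from the example above.

```python
from collections import defaultdict


def grouped_cumsum(rows, key, value):
    """Cumulative sum of `value` within each `key` group.

    Row order is preserved and the group key is kept in every output row.
    """
    running = defaultdict(int)  # running total per group
    out = []
    for row in rows:
        running[row[key]] += row[value]
        out.append({key: row[key], value: running[row[key]]})
    return out


rows = [
    {"a": 1, "b": 8},  # fox
    {"a": 1, "b": 2},  # gorilla
    {"a": 2, "b": 6},  # lion
]
result = grouped_cumsum(rows, "a", "b")
for row in result:
    print(row)
```

The "b" values match the pandas output above (8, 10, 6), but each row still carries its group key "a".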