narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License
465 stars 81 forks source link

[Bug]: cum_sum expression raises an exception with pandas when 'over' is used #1177

Open IsaiasGutierrezCruz opened 2 hours ago

IsaiasGutierrezCruz commented 2 hours ago

Describe the bug

The cum_sum expression raises an exception when used with the over expression.

I did a quick review of the code and I believe this issue can be resolved by simply adding the cum_sum expression name to the dictionary that maps pandas method names to polars method names. I would like to open a pull request to solve this c:

Steps or code to reproduce the bug

def replicating_error_cum_sum() -> None:
    data = {
        "cat": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
        'values': [1, 2, 3, 4, 5, 6, 7, 8, None]
    }
    original_data = pd.DataFrame(data)
    df_nw = nw.from_native(original_data)
    df_nw = (
        df_nw
        .with_columns(
            nw.col("values").cum_sum().over("cat")
        )
    )
    print(nw.to_native(df_nw))

Expected results

┌─────┬────────┐
│ cat ┆ values │
│ --- ┆ ---    │
│ str ┆ i64    │
╞═════╪════════╡
│ a   ┆ 1      │
│ a   ┆ 3      │
│ a   ┆ 6      │
│ b   ┆ 4      │
│ b   ┆ 9      │
│ b   ┆ 15     │
│ c   ┆ 7      │
│ c   ┆ 15     │
│ c   ┆ null   │
└─────┴────────┘

Actual results

RuntimeError: Failed to aggregated - does your aggregation function return a scalar?

Please run narwhals.show_version() and enter the output below.

System:
    python: 3.12.5 (main, Aug 14 2024, 04:32:18) [Clang 18.1.8 ]
executable: /Users/abelisaiasgutierrezcruz/Documents/Proyects/test_narwhals/.venv/bin/python
   machine: macOS-15.0.1-arm64-arm-64bit

Python dependencies:
     narwhals: 1.9.3
       pandas: 2.2.3
       polars: 1.8.2
         cudf: 
        modin: 
      pyarrow: 17.0.0
        numpy: 2.1.1

Relevant log output

File "/Users/abelisaiasgutierrezcruz/Documents/Proyects/test_narwhals/.venv/lib/python3.12/site-packages/narwhals/_expression_parsing.py", line 98, in <genexpr>
    for sublist in (evaluate_into_expr(df, into_expr) for into_expr in exprs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/abelisaiasgutierrezcruz/Documents/Proyects/test_narwhals/.venv/lib/python3.12/site-packages/narwhals/_expression_parsing.py", line 63, in evaluate_into_expr
    return expr._call(df)  # type: ignore[arg-type]
           ^^^^^^^^^^^^^^
  File "/Users/abelisaiasgutierrezcruz/Documents/Proyects/test_narwhals/.venv/lib/python3.12/site-packages/narwhals/_pandas_like/expr.py", line 334, in func
    tmp = df.group_by(*keys).agg(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/abelisaiasgutierrezcruz/Documents/Proyects/test_narwhals/.venv/lib/python3.12/site-packages/narwhals/_pandas_like/group_by.py", line 76, in agg
    return agg_pandas(
           ^^^^^^^^^^^
  File "/Users/abelisaiasgutierrezcruz/Documents/Proyects/test_narwhals/.venv/lib/python3.12/site-packages/narwhals/_pandas_like/group_by.py", line 184, in agg_pandas
    raise RuntimeError(msg) from exc
RuntimeError: Failed to aggregated - does your aggregation function return a scalar?
MarcoGorelli commented 1 hour ago

thanks for the report!

I did a quick review of the code and I believe this issue can be resolved by simply adding the cum_sum expression name to the dictionary that maps pandas method names to polars method names. I would like to open a pull request to solve this c:

sounds great, thanks! 🙏