narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License
552 stars 88 forks source link

[Bug]: pandas-wrapped narwhals .over() leads to inconsistent result #1296

Open mscolnick opened 6 days ago

mscolnick commented 6 days ago

Describe the bug

import marimo as mo
import narwhals as nw
import polars as pl
import pandas as pd

data = {"item": ["a", "a", "a", "b", "b", "b"], "start": [6, 7, 8, 14, 15, 16]}

df_pandas = pd.DataFrame(data)
df_polars = pl.DataFrame(data)
nw_df_pandas = nw.from_native(df_pandas)
nw_df_polars = nw.from_native(df_polars)

# Correct, creates a new column with the shifted values
nw_df_polars.with_columns(
    (nw.col("start") * 1 - nw.min("start")).over("item").alias("shifted")
).to_native()

# Incorrect, all starts at 0
nw_df_pandas.with_columns(
    (nw.col("start") * 1 - nw.min("start")).over("item").alias("shifted")
).to_native()
image

Steps or code to reproduce the bug

https://static.marimo.app/static/narwhals-bug-bo1z

Expected results

Expected the results of polars. Values are incrementing.

Actual results

Values are the same

Please run narwhals.show_version() and enter the output below.

System:
    python: 3.12.6 (main, Sep  6 2024, 19:03:47) [Clang 15.0.0 (clang-1500.3.9.4)]
executable: /Users/myles/.cache/uv/archive-v0/sNIzHlLTqqBXxH79AubYO/bin/python
   machine: macOS-14.6-arm64-arm-64bit

Python dependencies:
     narwhals: 1.12.1
       pandas: 2.2.3
       polars: 1.12.0
         cudf: 
        modin: 
      pyarrow: 17.0.0
        numpy: 2.1.2


### Relevant log output

_No response_
FBruzzesi commented 6 days ago

Hey @mscolnick , thanks for reporting the issue. I have some mixed feelings about this, as I would not be sure what to expect as output.

What I mean by that: over does end up applying only to the reduction nw.min/pl.min and not the entire expression (nor I would know what to expect by that). By changing the code to:

df.with_columns(
    (nw.col("start") - nw.min("start").over("item")).alias("shifted")
)

the result is consistent across backends.

I apologize if I am missing something 🙈

mscolnick commented 5 days ago

Ah, I got it. I am probably using the API wrong. I was doing:

(
        pl.col("year") * 12
        + pl.col("month")
        - pl.min("year") * 12
        - pl.min("month")
        + 1
    ).over("project")

but since this worked (in polars), I didn't think twice about doing

(
        pl.col("year") * 12
        + pl.col("month")
        - pl.min("year").over("project") * 12
        - pl.min("month").over("project")
        + 1
    )

Makes sense, and fine to close!

MarcoGorelli commented 5 days ago

thanks for the report!

if we only support reductions with over, then I think this should raise

FBruzzesi commented 4 days ago

Our current Expr API for pandas-like and pyarrow does not distinguish between aggregation or not, while for Dask it does. Standardizing it would definitly help in the long term, and yield more control in each operation