pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Getting different result after same operations on dataframe and column #18685

Open pozitiff4ikk opened 1 week ago

pozitiff4ikk commented 1 week ago

Reproducible example

import polars as pl

ss = pl.DataFrame(
    {
        "c": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
        "a": [0.1, 0.3, 0.10, 0.8, 0.5, 0.9, 0.7, 0.4, 0.6, 0.2],
        "b": [0.11, 0.13, 0.16, 0.17, 0.19, 0.20, 0.18, 0.14, 0.12, 0.15],
    }
)
# Frame version: sort the whole frame by "b", keep rows where c == 3, sum "a".
df = ss.sort("b", descending=True).filter(pl.col("c").is_in([3]))
res1 = df.select(pl.col("a").sum())["a"][0]
# Expression version: sort only "a" by "b", filter on "c", sum.
col = pl.col("a").sort_by("b", descending=True).filter(pl.col("c").is_in([3])).sum()
res2 = ss.select(col)["a"][0]

Log output

res1
2.8000000000000003
res2
1.6

Issue description

The same sort, filter, and sum operations give different results when applied to the DataFrame and when applied as a column expression.

Expected behavior

res1 and res2 should be equal

Installed versions

```
--------Version info---------
Polars:              1.7.0
Index type:          UInt32
Platform:            Linux-5.15.0-52-generic-x86_64-with-glibc2.35
Python:              3.12.4 (main, Jul 22 2024, 09:21:14) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio
numpy                2.1.1
openpyxl
pandas               2.2.2
pyarrow              17.0.0
pydantic             2.9.1
pyiceberg
sqlalchemy           2.0.34
torch
xlsx2csv
xlsxwriter
```

cmdlineluser commented 1 week ago

It's a little confusing at first - but I think this may be the correct behaviour?

In the frame version, the column in the filter has also been sorted.

df = pl.DataFrame({"a": [1, 3, 2, 3, 4], "b": [10, 11, 12, 13, 14]})

df.sort("a", descending=True).filter(pl.col.b == 12) # b = [14, 11, 13, 12, 10]
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 12  │
# └─────┴─────┘

But in the expression version, it is still the original order.

df.select(
   pl.col("a", "b").sort_by("a", descending=True) 
     .filter(pl.col.b == 12) # b = [10, 11, 12, 13, 14]
)
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 3   ┆ 13  │
# └─────┴─────┘

The column in the filter would also need to be sorted to be equivalent?

df.select(
   pl.col("a", "b").sort_by("a", descending=True)
     .filter(pl.col.b.sort_by("a", descending=True) == 12)
)

# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 12  │
# └─────┴─────┘

pozitiff4ikk commented 1 week ago

@cmdlineluser thanks for your reply. It does seem to work this way, but my code has multiple expressions like this and they used to give the result I expected; I only noticed this behavior after updating Polars from 1.5 to 1.6/1.7. I can't reproduce the earlier behavior with a small example. I've made it work with this fix for now; maybe this behavior will change in the future.

deanm0000 commented 1 week ago

I'm on mobile so I can't test this myself, but try making the frame lazy and turning off optimizations in the collect.

pozitiff4ikk commented 1 week ago

@deanm0000 still getting the same result