pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Empty dataframe with "cross-row" operation contaminates sum and product #18689

Open markxwang opened 2 weeks ago

markxwang commented 2 weeks ago

Reproducible example

import polars as pl

pl.DataFrame({"data": []}, schema={"data": pl.Float64}).select(
    r1=pl.col("data").product(),
    r2=pl.col("data").sum(),
    r3=pl.col("data").diff(),  # same with pct_change, shift, cum_sum
)

# shape: (0, 3)
# r1    r2  r3
# f64   f64 f64

Log output

No response

Issue description

Currently, calling sum/product on an empty dataframe yields 0/1 respectively:

pl.DataFrame({"data": []}, schema={"data": pl.Float64}).select(
    r1=pl.col("data").product(),
    r2=pl.col("data").sum(),
)

# r1    r2
# f64   f64
# 1.0   0.0

However, the result is wiped out when a "cross-row" operation such as shift/cum_sum/pct_change/diff is used alongside product/sum in the same select. It then returns an empty dataframe.

Expected behavior

Not entirely sure what the expected behaviour should be.

markxwang commented 2 weeks ago

A similar behaviour:

The following returns an empty dataframe:

pl.Series(name="data", values=[], dtype=pl.Float64).to_frame().select(
    pl.col("data").cum_prod()
)

# data
# --
# f64

Adding last() returns a non-empty dataframe:

pl.Series(name="data", values=[], dtype=pl.Float64).to_frame().select(
    pl.col("data").cum_prod().last()
)

# data
# --
# f64
# null
AlexeyDmitriev commented 2 weeks ago

Similar confusion to what I had in https://github.com/pola-rs/polars/issues/18404

I think the current behaviour kinda makes sense. If all the expressions return 1 value, then you get 1 row with these values. If at least one of them returns a column, then your result has exactly the same number of rows as the initial dataframe (0 in your cases), and the constants are broadcast to that length.
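
To make that rule concrete, here is a toy model in plain Python. It is only a sketch of the behaviour described in this comment, not how Polars implements select; `broadcast_select` is a hypothetical helper, a plain list stands in for a column, and any non-list value stands in for a single aggregated value.

```python
def broadcast_select(results):
    """Toy model of the rule above (hypothetical helper, not a Polars API).

    `results` maps an output name to either a scalar (an aggregated value)
    or a list (a column-valued expression result).
    """
    # Collect the distinct lengths of all column-valued results.
    column_lengths = {len(v) for v in results.values() if isinstance(v, list)}

    if not column_lengths:
        # Every expression returned a single value: one output row.
        height = 1
    elif len(column_lengths) == 1:
        # At least one column: its length decides the output height.
        height = column_lengths.pop()
    else:
        raise ValueError("column-valued results have mismatched lengths")

    # Scalars are repeated to the output height; columns pass through.
    return {
        name: value if isinstance(value, list) else [value] * height
        for name, value in results.items()
    }


# product/sum of an empty column, on their own: one row survives.
only_aggs = broadcast_select({"r1": 1.0, "r2": 0.0})

# Same aggregates next to a diff() over zero rows: broadcast to height 0.
with_diff = broadcast_select({"r1": 1.0, "r2": 0.0, "r3": []})
```

In this model the 1.0 and 0.0 are not lost because of anything special about sum or product; they are simply repeated zero times once a zero-length column sets the output height.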

Maybe one way is to have separate select functions for this as I mentioned in https://github.com/pola-rs/polars/issues/18404#issuecomment-2315687383

I mean it's not exactly the same, but you still have sizes 1, 1, and 0.