pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.13k stars 1.83k forks source link

Inconsistent results with expression arithmetic involving `pl.len` in `shift` #16321

Closed CameronBieganek closed 3 months ago

CameronBieganek commented 3 months ago

Checks

Reproducible example

In [1]: import polars as pl

In [2]: df = pl.DataFrame(dict(a=range(1, 10)))

In [3]: df.with_columns(
   ...:     b_shift_good = pl.col("a").shift(-1 * (pl.len() - 3)),
   ...:     b_shift_bad = pl.col("a").shift(3 - pl.len())
   ...: )
Out[3]:
shape: (9, 3)
┌─────┬──────────────┬─────────────┐
│ a   ┆ b_shift_good ┆ b_shift_bad │
│ --- ┆ ---          ┆ ---         │
│ i64 ┆ i64          ┆ i64         │
╞═════╪══════════════╪═════════════╡
│ 1   ┆ 7            ┆ null        │
│ 2   ┆ 8            ┆ null        │
│ 3   ┆ 9            ┆ null        │
│ 4   ┆ null         ┆ null        │
│ 5   ┆ null         ┆ null        │
│ 6   ┆ null         ┆ null        │
│ 7   ┆ null         ┆ null        │
│ 8   ┆ null         ┆ null        │
│ 9   ┆ null         ┆ null        │
└─────┴──────────────┴─────────────┘

Log output

No response

Issue description

-1 * (pl.len() - 3) works as expected. 3 - pl.len() should work the same, but it does not.

Expected behavior

The b_shift_bad column should be the same as the b_shift_good column.

Installed versions

``` In [4]: pl.show_versions() --------Version info--------- Polars: 0.20.26 Index type: UInt32 Platform: Windows-10-10.0.19045-SP0 Python: 3.11.4 (tags/v3.11.4:d2340ef, Jun 7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fastexcel: fsspec: gevent: hvplot: matplotlib: nest_asyncio: numpy: openpyxl: pandas: pyarrow: pydantic: pyiceberg: pyxlsb: sqlalchemy: torch: xlsx2csv: xlsxwriter: ```
cmdlineluser commented 3 months ago

It's because pl.len() is unsigned.

>>> pl.select(pl.len(), overflow = pl.len() - 1)
shape: (1, 2)
┌─────┬────────────┐
│ len ┆ overflow   │
│ --- ┆ ---        │
│ u32 ┆ u32        │
╞═════╪════════════╡
│ 0   ┆ 4294967295 │
└─────┴────────────┘

Some previous regarding the general topic: https://github.com/pola-rs/polars/issues/11465

CameronBieganek commented 3 months ago

Ok, it's not clear to me if you are saying that 3 - pl.len() is intentionally unsupported. If that's the case, I would say it is unexpected behavior for the average user, and definitely unergonomic. It can also lead to silently incorrect results, like the b_shift_bad column above.

ritchie46 commented 3 months ago

We are aware of the problem, there isn't a good solution yet as we cannot change types on runtime. We are aware of it and try to come up with a solution.

For now you can upcast your len to i64.

pl.len().cast(pl.Int64).

I will close in favor of #11465