pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Inconsistent behavior between slice() as an expression on a column and slice() on a Series/DataFrame #18553

Open JSteilberg opened 3 weeks ago

JSteilberg commented 3 weeks ago

Reproducible example

>>> import polars as pl
>>> pl.DataFrame({"a": [1,2,3,4,5]}).select(pl.col("a").slice(-6))
shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
│ 4   │
│ 5   │
└─────┘
>>> pl.DataFrame({"a": [1,2,3,4,5]}).slice(-6)
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
│ 4   │
└─────┘
>>> pl.Series([1,2,3,4,5]).slice(-6)
shape: (4,)
Series: '' [i64]
[
    1
    2
    3
    4
]

I would strongly prefer the first behavior, since it matches how Python handles a negative start index that is out of bounds:

>>> [1,2,3,4,5][-6:]
[1, 2, 3, 4, 5]

Log output

No response

Issue description

Capping a list at some maximum length is a common need. I know there are other ways to achieve what I want, but the behavior above should be consistent, since the docs do not specify any functional difference between slice() on a DataFrame, a Series, and an Expr.

Expected behavior

I would expect all three examples to return [1,2,3,4,5] (in the applicable data structure), since slicing past the beginning of a list in Python clamps to the beginning rather than wrapping around to the end.
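
The expected behavior can be written down as a small pure-Python sketch. The helper name `expected_slice` is hypothetical and not part of Polars; it only mirrors Python list slicing, where a negative start beyond the beginning clamps to index 0.

```python
def expected_slice(values, offset):
    """Hypothetical reference: slice(offset) with Python-style clamping."""
    n = len(values)
    # Resolve a negative offset relative to the end of the sequence.
    start = offset + n if offset < 0 else offset
    # Clamp the start into [0, n] instead of letting it wrap around.
    start = max(0, min(start, n))
    return values[start:]


data = [1, 2, 3, 4, 5]
print(expected_slice(data, -6))   # same as data[-6:]
print(expected_slice(data, -10))  # same as data[-10:]
print(expected_slice(data, -2))   # same as data[-2:]
```

Under this model, every out-of-bounds negative offset returns the full input, which is what the expression version already does for slice(-6).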

Installed versions

```
--------Version info---------
Polars:               1.6.0
Index type:           UInt64
Platform:             Linux-6.8.0-40-generic-x86_64-with-glibc2.35
Python:               3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec                2024.6.1
gevent
great_tables
matplotlib            3.8.4
nest_asyncio
numpy                 1.26.4
openpyxl
pandas                2.2.2
pyarrow               16.1.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

lmocsi commented 3 weeks ago

In addition, pl.DataFrame({"a": [1,2,3,4,5]}).slice(-10) and pl.Series([1,2,3,4,5]).slice(-10) return an empty DataFrame / Series. With slice(-9), they return only the first value: 1.
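
The DataFrame/Series results reported in this thread (4 rows for -6, 1 row for -9, 0 rows for -10) are all consistent with one simple arithmetic pattern: a window of `len` rows is placed at `len + offset`, and only afterwards clipped to the valid range. The pure-Python sketch below is a guess at that pattern, not Polars' actual implementation, and `windowed_slice` is a hypothetical name.

```python
def windowed_slice(values, offset):
    """Hypothetical model of the reported DataFrame/Series results."""
    n = len(values)
    # Place a window of n rows at the (possibly negative) resolved start...
    start = offset + n if offset < 0 else offset
    stop = start + n
    # ...and only afterwards clip both ends into [0, n].
    start = max(0, min(start, n))
    stop = max(0, min(stop, n))
    return values[start:stop]


data = [1, 2, 3, 4, 5]
for offset in (-6, -9, -10):
    print(offset, windowed_slice(data, offset))
```

If this model is right, the rows are lost because the end of the window is computed from the unclamped start, so each step past the beginning drops one row from the tail.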