pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Multiple identical sorts not elided by optimizer #15447

Open AlecZorab opened 5 months ago

AlecZorab commented 5 months ago

Reproducible example

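import polars as pl

# Build a lazy query that sorts by the same column twice, then print the optimized plan.
print(
    pl.DataFrame({"symbol": ["a", "a", "b", "b"]})
    .lazy()
    .sort("symbol")
    .sort("symbol")
    .explain(optimized=True)
)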

Log output

SORT BY [col("symbol")]
  SORT BY [col("symbol")]
    DF ["symbol"]; PROJECT */1 COLUMNS; SELECTION: "None"

Issue description

When the same sort is performed twice in a row, one of them could be dropped by the optimizer.

Expected behavior

I expect the optimized plan to look like:

  SORT BY [col("symbol")]
    DF ["symbol"]; PROJECT */1 COLUMNS; SELECTION: "None"
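
The rewrite being asked for can be sketched outside Polars as a small plan pass. This is only an illustration under stated assumptions, not the Polars optimizer: the `Scan` and `Sort` classes and the `elide_duplicate_sorts` function are hypothetical names, and the sketch assumes both sorts use the same keys and the same options, so that re-sorting an already sorted frame changes nothing that the sort guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """Hypothetical leaf node: an in-memory frame with the given columns."""
    columns: tuple


@dataclass(frozen=True)
class Sort:
    """Hypothetical sort node: sort `input` by the columns in `by`."""
    by: tuple
    input: object


def elide_duplicate_sorts(plan):
    """Collapse Sort(by, Sort(by, x)) into Sort(by, x), working bottom-up."""
    if isinstance(plan, Sort):
        child = elide_duplicate_sorts(plan.input)
        # The child already produces rows sorted by the same keys,
        # so the outer sort is redundant and the child can stand in for it.
        if isinstance(child, Sort) and child.by == plan.by:
            return child
        return Sort(plan.by, child)
    return plan


# Toy version of the plan from the reproducible example.
plan = Sort(("symbol",), Sort(("symbol",), Scan(("symbol",))))
optimized = elide_duplicate_sorts(plan)
```

Sorts on different keys are left alone, since this sketch only matches exact key equality.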

Installed versions

```
--------Version info---------
Polars:               0.20.18
Index type:           UInt32
Platform:             macOS-14.2.1-arm64-arm-64bit
Python:               3.12.2 (main, Feb 6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

AlecZorab commented 5 months ago

stretch goal:

(
    pl.DataFrame({"symbol": ["a", "a", "b", "b"], "Timestamp": [1, 2, 1, 2], "val": [1, 2, 3, 4]})
    .lazy()
    .sort("symbol")
    .select(["symbol", "val"])
    .sort("symbol")
    .explain(optimized=True)
)

SORT BY [col("symbol")]
  SELECT [col("symbol"), col("val")] FROM
    SORT BY [col("symbol")]
      DF ["symbol", "Timestamp", "val"]; PROJECT 2/3 COLUMNS; SELECTION: "None"

I think one of these sorts can be skipped too, since the select in between only picks columns and does not change the row order.
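
One way to reason about the stretch goal is to track which keys a plan's output is already known to be sorted by, and to let that property pass through a column-only select. The sketch below is a toy model, not how Polars represents or optimizes plans: `Scan`, `Select`, `Sort`, `known_sort_keys`, and `elide_redundant_sorts` are all hypothetical names, and it assumes the select only projects existing columns without reordering rows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scan:
    """Hypothetical leaf node: an in-memory frame with the given columns."""
    columns: tuple


@dataclass(frozen=True)
class Select:
    """Hypothetical projection that keeps `columns` and preserves row order."""
    columns: tuple
    input: object


@dataclass(frozen=True)
class Sort:
    """Hypothetical sort node: sort `input` by the columns in `by`."""
    by: tuple
    input: object


def known_sort_keys(plan) -> Optional[tuple]:
    """Return the keys the plan's output is known to be sorted by, if any."""
    if isinstance(plan, Sort):
        return plan.by
    if isinstance(plan, Select):
        keys = known_sort_keys(plan.input)
        # Conservative: only report the ordering if every key column survives.
        if keys is not None and all(k in plan.columns for k in keys):
            return keys
    return None


def elide_redundant_sorts(plan):
    """Drop any Sort whose input is already known to be sorted by the same keys."""
    if isinstance(plan, Sort):
        child = elide_redundant_sorts(plan.input)
        if known_sort_keys(child) == plan.by:
            return child
        return Sort(plan.by, child)
    if isinstance(plan, Select):
        return Select(plan.columns, elide_redundant_sorts(plan.input))
    return plan


# Toy version of the sort -> select -> sort plan from this comment.
plan = Sort(
    ("symbol",),
    Select(("symbol", "val"), Sort(("symbol",), Scan(("symbol", "Timestamp", "val")))),
)
optimized = elide_redundant_sorts(plan)
```

This sketch drops the outer sort and keeps the inner one. Dropping the inner sort instead would also be valid here and would sort fewer columns; which one a real optimizer should remove is a design choice this example does not settle.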