pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Cannot use regular expressions in `pl.col` within an `over` #12858

Open avimallu opened 12 months ago

avimallu commented 12 months ago

Reproducible example

import polars as pl

(
    pl.DataFrame({
        "grp_1": [1, 1, 2, 2, 2, 3, 3, 3, 3],
        "grp_2": [2, 1, 2, 1, 2, 2, 1, 1, 2],
        "value": [1, 2, 3, 1, 3, 1, 4, 2, 1]})
    .with_columns(
        pl.col("value").sum().over(pl.col("^grp.*$"))
    )
)

Log output

ComputeError: The name: 'value' passed to `LazyFrame.with_columns` is duplicate

Error originated just after this operation:
DF ["grp_1", "grp_2", "value"]; PROJECT */3 COLUMNS; SELECTION: "None"

Issue description

I cannot use a regular expression inside an `over` call. It looks like the pattern gets expanded into multiple columns, each producing an output named `value`, which results in duplicate output column names.

Expected behavior

Identical behaviour to:

(
    pl.DataFrame({
        "grp_1": [1, 1, 2, 2, 2, 3, 3, 3, 3],
        "grp_2": [2, 1, 2, 1, 2, 2, 1, 1, 2],
        "value": [1, 2, 3, 1, 3, 1, 4, 2, 1]})
    .with_columns(
        pl.col("value").sum().over("grp_1", "grp_2")
    )
)

Installed versions

```
--------Version info---------
Polars:              0.19.19
Index type:          UInt32
Platform:            macOS-14.1.2-arm64-arm-64bit
Python:              3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:13) [Clang 14.0.6 ]

----Optional dependencies----
adbc_driver_manager: 0.5.1
cloudpickle:         2.2.1
connectorx:          0.3.1
deltalake:           0.10.0
fsspec:              2023.9.2
gevent:
matplotlib:          3.7.2
numpy:               1.24.3
openpyxl:            3.0.10
pandas:              2.1.1
pyarrow:             11.0.0
pydantic:            2.0.1
pyiceberg:
pyxlsb:
sqlalchemy:          2.0.21
xlsx2csv:            0.8.1
xlsxwriter:          3.0.9
```

cmdlineluser commented 12 months ago

Using a struct works: pl.struct("^grp.*$")

avimallu commented 12 months ago

Thanks! I wasn't aware that regular expressions could be used in structs as well.

cmdlineluser commented 12 months ago

Yeah, it almost seems like whatever is passed to over should be implicitly wrapped in a struct?

e.g. I would have thought this would work:

df.with_columns(
   pl.col("value").sum().over("^grp.*$")
)
# ComputeError: The name: 'value' passed to `LazyFrame.with_columns` is duplicate

datenzauberai commented 5 months ago

Using pl.exclude also seems to be affected:

# over with multiple explicit columns: ok
df.with_columns(
    pl.col("value").min().over(["grp_1", "grp_2"]).alias("minimum")
)

# over with single column via exclude: ok
df.drop("grp_2").with_columns(
    pl.col("value").min().over(pl.exclude("value")).alias("minimum")
)

# over with multiple columns via exclude: "the name: 'minimum' passed to `LazyFrame.with_columns` is duplicate"
df.with_columns(
    pl.col("value").min().over((pl.exclude("value"))).alias("minimum")
)

ritchie46 commented 2 weeks ago

This is expected: #19681