pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

A column expression with multiple columns passed to 'over' causes a name duplication error #19681

Open tanhevg opened 2 weeks ago

tanhevg commented 2 weeks ago

Reproducible example

Setup:

import polars as pl
df = pl.DataFrame({
    'i':list(range(8)),
    'a':[1,1,1,1,2,2,2,2],
    'b':[1,1,2,2,1,1,2,2],
}).lazy()

This works:

df = df.with_columns(x=pl.col.i.first().over('a', 'b').rank('dense'))
df.collect()

This crashes (pl.col('a', 'b') instead of 'a', 'b'):

df = df.with_columns(x=pl.col.i.first().over(pl.col('a', 'b')).rank('dense'))
df.collect()

Error:

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[3], line 2
      1 df = df.with_columns(x=pl.col.i.first().over(pl.col('a', 'b')).rank('dense'))
----> 2 df.collect()

File ~/micromamba/envs/ai/lib/python3.12/site-packages/polars/lazyframe/frame.py:2055, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2053 # Only for testing purposes
   2054 callback = _kwargs.get("post_opt_callback", callback)
-> 2055 return wrap_df(ldf.collect(callback))

ComputeError: the name 'x' passed to `LazyFrame.with_columns` is duplicate

It's possible that multiple expressions are returning the same default column name. If this is the case, try renaming the columns with `.alias("new_name")` to avoid duplicate column names.

Resolved plan until failure:

    ---> FAILED HERE RESOLVING 'with_columns' <---
DF ["i", "a", "b"]; PROJECT */3 COLUMNS; SELECTION: None

Log output

No response

Issue description

A pl.col() expression with multiple columns passed to 'over' causes a name duplication error. Passing the same columns as string arguments works.

Expected behavior

Expecting a pl.col() expression to yield identical results to passing columns as strings.

Installed versions

```
--------Version info---------
Polars:              1.12.0
Index type:          UInt32
Platform:            macOS-14.5-arm64-arm-64bit
Python:              3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:13:44) [Clang 16.0.6 ]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec               2023.10.0
gevent
great_tables
matplotlib           3.9.1
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.4
pandas               2.2.2
pyarrow              17.0.0
pydantic
pyiceberg
sqlalchemy
torch                2.4.0
xlsx2csv
xlsxwriter
```

cmdlineluser commented 2 weeks ago

You can wrap expressions with pl.struct() to prevent this.

.over(pl.struct(pl.col('a', 'b'))) # or pl.struct('a', 'b') 

There was an issue about .over(regex) having the same problem https://github.com/pola-rs/polars/issues/12858

But there was no official answer as to whether or not it is a bug.

ritchie46 commented 1 week ago

This isn't a bug. This expands to multiple expressions:

[
    pl.col("i").first().over(pl.col('a')),
    pl.col("i").first().over(pl.col('b')),
]
tanhevg commented 1 week ago

Why does it work then when passing columns as strings? For group_by, both ...('a','b') and ...(pl.col('a','b')) work, and return identical results. This is very confusing. The expansion given above will always throw.

Whether it is a bug or not depends on the definition of a 'bug'. IMHO this is a sort of API inconsistency that either needs to be corrected, or must be explained at length in the docs and people will still keep stumbling upon it anyway.

ritchie46 commented 1 week ago

In group_by it also expands. In with_columns, an expansion is not allowed to produce duplicate column names.

We have documented expression expansion in our user guide.