pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Numpy functions with more than one argument cannot have same column in multiple arguments #17472

Closed wolfgang-noichl closed 3 months ago

wolfgang-noichl commented 3 months ago

Reproducible example

import polars as pl
import numpy as np

pl.DataFrame({'a': 5}).with_columns(b=np.divide(pl.col('a'), pl.col('a')))

Log output

--------------------------------------------------------------------------
DuplicateError                            Traceback (most recent call last)
<ipython-input-7-4a2c3a4a088d> in <module>
----> 1 pl.DataFrame({'a': 5}).with_columns(b=np.divide(pl.col('a'), pl.col('a')))

~/.local/lib/python3.10/site-packages/polars/dataframe/frame.py in with_columns(self, *exprs, **named_exprs)
   8761         └─────┴──────┴─────────────┘
   8762         """
-> 8763         return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
   8764 
   8765     def with_columns_seq(

~/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1940         callback = _kwargs.get("post_opt_callback")
   1941 
-> 1942         return wrap_df(ldf.collect(callback))
   1943 
   1944     @overload

DuplicateError: multiple fields with name 'a' found

Issue description

This seems to affect any NumPy function that takes two arguments (e.g. np.arctan2) when the same column is passed for more than one of them.

Expected behavior

The same as

pl.DataFrame({'a': 5}).with_columns(b=pl.col('a') / pl.col('a'))

Installed versions

```
-------Version info---------
Polars: 1.0.0
Index type: UInt32
Platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.35
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib: 3.9.1
nest_asyncio: 1.5.4
numpy: 2.0.0
openpyxl:
pandas: 2.2.2
pyarrow:
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

itamarst commented 3 months ago

The problem is with __array_ufunc__ in polars/expr.py, which constructs a struct with duplicate fields. After playing with it for a bit, deduplicating on the struct construction side doesn't quite make sense semantically, so I will try to submit a PR that deduplicates inside __array_ufunc__.

itamarst commented 3 months ago

Or, easier than deduplicating: aliasing... assuming I can figure out why there's an undo_aliases() and can remove it.
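
To make the two comments above concrete, here is a minimal pure-Python sketch of the idea. It is not the polars implementation: `DuplicateError`, `make_struct_fields`, and `positional_aliases` are hypothetical stand-ins, and the `argument_<i>` naming scheme is an assumption chosen only for illustration. It models a struct whose field names must be unique, shows why passing column 'a' twice fails, and shows how aliasing each input by position would avoid the collision.

```python
from typing import List


class DuplicateError(ValueError):
    """Stand-in for the error in the traceback; not the polars class."""


def make_struct_fields(names: List[str]) -> List[str]:
    """Toy model of struct construction: field names must be unique."""
    seen = set()
    for name in names:
        if name in seen:
            raise DuplicateError(f"multiple fields with name '{name}' found")
        seen.add(name)
    return list(names)


def positional_aliases(names: List[str]) -> List[str]:
    """Hypothetical fix: name each ufunc input by its position, so repeats cannot collide."""
    return [f"argument_{i}" for i, _ in enumerate(names)]


# np.divide(pl.col('a'), pl.col('a')) passes column 'a' as both inputs.
inputs = ["a", "a"]

try:
    make_struct_fields(inputs)
except DuplicateError as exc:
    print(f"without aliasing: {exc}")

# With positional aliases the struct can be built, and the ufunc can read
# its operands back by position rather than by column name.
fields = make_struct_fields(positional_aliases(inputs))
print(f"with aliasing: {fields}")
```

Positional names sidestep the question of which duplicate to keep, which may be why aliasing looks easier than deduplicating here. Whatever name the final expression should carry would still have to be restored afterwards.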

deanm0000 commented 3 months ago

> Or, easier than deduplicating: aliasing... assuming I can figure out why there's an undo_aliases() and can remove it.

I can't remember ;(