pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Pivot with multiple columns, null column value causes column named 'null' and can cause duplicated columns #14445

Open ChristopherRussell opened 8 months ago

ChristopherRussell commented 8 months ago

Reproducible example

Another pivot one :)

import polars as pl

df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': ['a', None, None], 'd': [7, 8, 9]})
piv = df.pivot(index='a', columns=['c', 'd'], values='d')
piv.columns
# ['a', '{"a",7}', 'null', 'null']

Log output

No response

Issue description

Column names should be ['a', '{"a",7}', '{null,8}', '{null,9}'], and duplicate columns should not be allowed.

Expected behavior

Column names should be ['a', '{"a",7}', '{null,8}', '{null,9}'], i.e. one distinct name per (c, d) combination.
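
To illustrate the expected naming, here is a minimal pure-Python sketch of how one name per (c, d) key could be built. `pivot_column_name` is a hypothetical helper, not a Polars function, and the exact spelling of the null part (unquoted `null`) is an assumption. The point is only that the null key keeps its second component, so the names stay distinct.

```python
from typing import Any, Sequence


def pivot_column_name(key: Sequence[Any]) -> str:
    """Hypothetical helper: build one column name from a multi-column pivot key."""

    def fmt(value: Any) -> str:
        if value is None:
            return "null"        # assumption: missing values render as unquoted null
        if isinstance(value, str):
            return f'"{value}"'  # strings are quoted, as in '{"a",7}'
        return str(value)

    return "{" + ",".join(fmt(v) for v in key) + "}"


# The (c, d) pairs from the reproducible example above.
keys = [("a", 7), (None, 8), (None, 9)]
names = [pivot_column_name(k) for k in keys]
print(names)  # ['{"a",7}', '{null,8}', '{null,9}']
```

With this scheme the two null rows map to different names, whereas collapsing the whole key to 'null' produces the duplicate seen in 0.20.7.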

Installed versions

```
--------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             macOS-12.3.1-arm64-arm-64bit
Python:               3.12.0 | packaged by conda-forge | (main, Oct 3 2023, 08:36:57) [Clang 15.0.7 ]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:
numpy:                1.26.3
openpyxl:
pandas:               2.1.4
pyarrow:              15.0.0
pydantic:             2.6.1
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

MarcoGorelli commented 8 months ago

thanks for the report

this currently raises on the latest commit to main

In [1]: df = pl.DataFrame({'a': [1,2,3], 'b':[4,5,6], 'c': ['a', None, None], 'd':[7,8,9]})
   ...: piv = df.pivot(index='a', columns=['c', 'd'], values='d')
---------------------------------------------------------------------------
DuplicateError                            Traceback (most recent call last)
<ipython-input-1-67b010500cd5> in ?()
      1 df = pl.DataFrame({'a': [1,2,3], 'b':[4,5,6], 'c': ['a', None, None], 'd':[7,8,9]})
----> 2 piv = df.pivot(index='a', columns=['c', 'd'], values='d')

~/polars-dev/py-polars/polars/dataframe/frame.py in ?(self, values, index, columns, aggregate_function, maintain_order, sort_columns, separator)
   7431         else:
   7432             aggregate_expr = aggregate_function._pyexpr
   7433 
   7434         return self._from_pydf(
-> 7435             self._df.pivot_expr(
   7436                 values,
   7437                 index,
   7438                 columns,

DuplicateError: column with name 'null' has more than one occurrences

but I think your expected output looks right
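
For anyone hitting this on an older version where no error is raised, the same condition can be checked by hand. This is a small pure-Python sketch, not how Polars implements `DuplicateError`; `find_duplicates` is a hypothetical helper name.

```python
from collections import Counter


def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


# Column names reported for 0.20.7 in the original post.
columns = ['a', '{"a",7}', 'null', 'null']
dupes = find_duplicates(columns)
if dupes:
    print(f"duplicate column names: {dupes}")
```

Running this on `piv.columns` after a pivot flags the repeated 'null' name before it causes trouble downstream.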