pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
27.24k stars 1.67k forks source link

`separator` parameter in `pivot()` has no effect #15775

Closed jsarbach closed 1 month ago

jsarbach commented 2 months ago

Checks

Reproducible example

pl.DataFrame({'ix': [0, 1], 'col1': ['a', 'b'], 'col2': [1, 2], 'val': [1, 2]}).pivot(index='ix', columns=['col1', 'col2'], values='val')
pl.DataFrame({'ix': [0, 1], 'col1': ['a', 'b'], 'col2': [1, 2], 'val': [1, 2]}).pivot(index='ix', columns=['col1', 'col2'], values='val', separator='_')
pl.DataFrame({'ix': [0, 1], 'col1': ['a', 'b'], 'col2': [1, 2], 'val': [1, 2]}).pivot(index='ix', columns=['col1', 'col2'], values='val', separator='letsseparate')

Log output

shape: (2, 3)
┌─────┬─────────┬─────────┐
│ ix  ┆ {"a",1} ┆ {"b",2} │
│ --- ┆ ---     ┆ ---     │
│ i64 ┆ i64     ┆ i64     │
╞═════╪═════════╪═════════╡
│ 0   ┆ 1       ┆ null    │
│ 1   ┆ null    ┆ 2       │
└─────┴─────────┴─────────┘

Issue description

According to the documentation, separator is used as separator/delimiter in generated column names. However, it seems to be ignored (incl. the default _).

Expected behavior

>>> pl.DataFrame({'ix': [0, 1], 'col1': ['a', 'b'], 'col2': [1, 2], 'val': [1, 2]}).pivot(index='ix', columns=['col1', 'col2'], values='val')
shape: (2, 3)
┌─────┬─────────┬─────────┐
│ ix  ┆ a_1     ┆ b_2     │
│ --- ┆ ---     ┆ ---     │
│ i64 ┆ i64     ┆ i64     │
╞═════╪═════════╪═════════╡
│ 0   ┆ 1       ┆ null    │
│ 1   ┆ null    ┆ 2       │
└─────┴─────────┴─────────┘

Installed versions

``` --------Version info--------- Polars: 0.20.21 Index type: UInt32 Platform: Linux-6.5.0-27-generic-x86_64-with-glibc2.38 Python: 3.10.6 (main, Oct 7 2022, 20:19:58) [GCC 11.2.0] ----Optional dependencies---- adbc_driver_manager: cloudpickle: 2.2.0 connectorx: deltalake: fastexcel: 0.10.3 fsspec: 2022.8.2 gevent: hvplot: matplotlib: nest_asyncio: numpy: 1.23.4 openpyxl: 3.0.10 pandas: 2.0.3 pyarrow: 15.0.0 pydantic: 1.10.2 pyiceberg: pyxlsb: sqlalchemy: xlsx2csv: 0.8.2 xlsxwriter: ```
MarcoGorelli commented 1 month ago

hey - the separator is used if there are multiple values

jsarbach commented 1 month ago

Ah I see, thanks for clarifying.

reswqa commented 1 month ago

@jsarbach fancy updating the document about this?

cmdlineluser commented 1 month ago

Are there any plans to allow expressions?

i.e.

df.pivot(
    index="ix"
    columns=pl.format("{}_{}", "col1", "col2"),
    values="val"
)   

In order to:

(df.with_columns(columns=pl.format("{}_{}", "col1", "col2"))
   .pivot(index="ix", columns="columns", values="val")
)

# shape: (2, 3)
# ┌─────┬──────┬──────┐
# │ ix  ┆ a_1  ┆ b_2  │
# │ --- ┆ ---  ┆ ---  │
# │ i64 ┆ i64  ┆ i64  │
# ╞═════╪══════╪══════╡
# │ 0   ┆ 1    ┆ null │
# │ 1   ┆ null ┆ 2    │
# └─────┴──────┴──────┘
jsarbach commented 1 month ago

@jsarbach fancy updating the document about this?

Ok :-)