pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Nested struct column is null after pivoting DataFrame #17065

Open edm1 opened 3 months ago

edm1 commented 3 months ago

Reproducible example


import polars as pl

df = pl.DataFrame(
    {
        "foo": ["one", "two", "one", "two"],
        "bar": ["x", "x", "y", "y"],
        "baz": [
            {"a": 1, "b": {"c": 2}},
            {"a": 3, "b": {"c": 4}},
            {"a": 5, "b": {"c": 6}},
            {"a": 7, "b": {"c": 8}},
        ]
    }
)

piv = df.pivot(index="foo", columns="bar", values="baz")

print(piv)

Log output

shape: (2, 3)
┌─────┬────────────┬────────────┐
│ foo ┆ x          ┆ y          │
│ --- ┆ ---        ┆ ---        │
│ str ┆ struct[2]  ┆ struct[2]  │
╞═════╪════════════╪════════════╡
│ one ┆ {1,{null}} ┆ {5,{null}} │
│ two ┆ {3,{null}} ┆ {7,{null}} │
└─────┴────────────┴────────────┘

Issue description

Pivoting a values column that contains a nested struct causes the fields of the inner struct to become null. The top-level fields of the outer struct are preserved.

Expected behavior

Expected output

shape: (2, 3)
┌─────┬────────────┬────────────┐
│ foo ┆ x          ┆ y          │
│ --- ┆ ---        ┆ ---        │
│ str ┆ struct[2]  ┆ struct[2]  │
╞═════╪════════════╪════════════╡
│ one ┆ {1,{2}}    ┆ {5,{6}}    │
│ two ┆ {3,{4}}    ┆ {7,{8}}    │
└─────┴────────────┴────────────┘
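To make the expected semantics concrete, here is a minimal plain-Python sketch of the same pivot over the rows from the reproducible example. It does not use Polars at all; `pivot_rows` is a hypothetical helper written only for this illustration, and it assumes each (`foo`, `bar`) pair occurs once, as in the example data. The point is simply that the nested `b.c` values should be carried through unchanged.

```python
# Plain-Python reference for the expected pivot result (no Polars involved).
# `pivot_rows` is a hypothetical helper for illustration, not a Polars API.
rows = [
    {"foo": "one", "bar": "x", "baz": {"a": 1, "b": {"c": 2}}},
    {"foo": "two", "bar": "x", "baz": {"a": 3, "b": {"c": 4}}},
    {"foo": "one", "bar": "y", "baz": {"a": 5, "b": {"c": 6}}},
    {"foo": "two", "bar": "y", "baz": {"a": 7, "b": {"c": 8}}},
]


def pivot_rows(rows, index, columns, values):
    """Group rows by `index` and add one key per distinct `columns` value."""
    out = {}
    for row in rows:
        # One output record per index value; nested values are stored as-is.
        record = out.setdefault(row[index], {index: row[index]})
        record[row[columns]] = row[values]
    return list(out.values())


expected = pivot_rows(rows, index="foo", columns="bar", values="baz")
for record in expected:
    print(record)
```

Each output record keeps the full nested struct under `x` and `y`, which matches the expected table above rather than the `{null}` inner values in the log output.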

Installed versions

```
--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-5.15.0-1064-azure-x86_64-with-glibc2.36
Python:               3.12.3 (main, May 14 2024, 07:23:41) [GCC 12.2.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.6.0
gevent:
hvplot:
matplotlib:           3.9.0
nest_asyncio:
numpy:                1.26.4
openpyxl:
pandas:
pyarrow:              16.1.0
pydantic:             2.7.3
pyiceberg:
pyxlsb:
sqlalchemy:
torch:                2.3.0+cu121
xlsx2csv:             0.8.2
xlsxwriter:
```

cmdlineluser commented 3 months ago

Can reproduce.

In case it is a useful data point for debugging, transpose also seems to lose the inner values.

>>> df.drop('foo').transpose(column_names='bar')
# shape: (1, 4)
# ┌────────────┬────────────┬────────────┬────────────┐
# │ x          ┆ x          ┆ y          ┆ y          │
# │ ---        ┆ ---        ┆ ---        ┆ ---        │
# │ struct[2]  ┆ struct[2]  ┆ struct[2]  ┆ struct[2]  │
# ╞════════════╪════════════╪════════════╪════════════╡
# │ {1,{null}} ┆ {3,{null}} ┆ {5,{null}} ┆ {7,{null}} │
# └────────────┴────────────┴────────────┴────────────┘