pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`pl.concat_list()` can cause `explode()` to throw a shape error #16215

Open jaschn opened 1 month ago

jaschn commented 1 month ago

Reproducible example

import polars as pl

df = pl.DataFrame(
    {
        "cl1": [[0], [0]],
        "cl2": [[0], [0]],
    }
)

df_row_1 = df[1]  # df[0] works; any other row index fails
df_row_1 = df_row_1.select(
    pl.col("cl1"),
    pl.concat_list(pl.col("cl2")),  # without pl.concat_list it works as well
)

df_row_1.explode(pl.all())

Log output

---------------------------------------------------------------------------
ShapeError                                Traceback (most recent call last)
Cell In[6], line 16
     10 df_row_1 = df[1]
     11 df_row_1 = df_row_1.select(
     12         pl.col("cl1"),
     13         pl.concat_list(pl.col("cl2"))
     14     )
---> 16 df_row_1.explode(pl.all())

File ~/miniconda3/envs/x/lib/python3.12/site-packages/polars/dataframe/frame.py:7193, in DataFrame.explode(self, columns, *more_columns)
   7136 def explode(
   7137     self,
   7138     columns: str | Expr | Sequence[str | Expr],
   7139     *more_columns: str | Expr,
   7140 ) -> DataFrame:
   7141     """
   7142     Explode the dataframe to long format by exploding the given columns.
   7143 
   (...)
   7191     └─────────┴─────────┘
   7192     """
-> 7193     return self.lazy().explode(columns, *more_columns).collect(_eager=True)

File ~/miniconda3/envs/x/lib/python3.12/site-packages/polars/lazyframe/frame.py:1816, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager, **_kwargs)
   1813 # Only for testing purposes atm.
   1814 callback = _kwargs.get("post_opt_callback")
-> 1816 return wrap_df(ldf.collect(callback))

ShapeError: exploded columns must have matching element counts

Issue description

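When a row other than the first is selected from a DataFrame with list columns, and one of those list columns is then passed through `pl.concat_list()`, a subsequent `explode()` can fail with a `ShapeError`, even though both columns hold the same number of elements.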

Expected behavior

import polars as pl

df = pl.DataFrame(
    {
        "cl1": [[0], [0]],
        "cl2": [[0], [0]],
    }
)

df_row_1 = df[1]
df_row_1 = df_row_1.select(
        pl.col("cl1"),
        pl.col("cl2")
    )

print(df_row_1.explode(pl.all()))

df_row_0 = df[0]
df_row_0 = df_row_0.select(
        pl.col("cl1"),
        pl.concat_list(pl.col("cl2"))
    )

print(df_row_0.explode(pl.all()))
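
shape: (1, 2)
┌─────┬─────┐
│ cl1 ┆ cl2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 0   │
└─────┴─────┘
shape: (1, 2)
┌─────┬─────┐
│ cl1 ┆ cl2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 0   │
└─────┴─────┘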

Installed versions

```
--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:               2024.3.1
gevent:
hvplot:
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:                2.2.2+cu121
xlsx2csv:
xlsxwriter:
```