pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

polars List type columns accessors not returning the right dtype #17361

Open atigbadr opened 4 months ago

atigbadr commented 4 months ago

Checks

Reproducible example

df = pl.DataFrame([None, None, None], schema=[("a", str)])
df.select(pl.col("a").str.split("|")) # Yields list[str]
df.select(c1=pl.col("a").str.split("|").list.max(),
         c2=pl.col("a").str.split("|").list.min(),
         c3=pl.col("a").str.split("|").list.get(0).schema
# Schema([('c1', Null), ('c2', Null), ('c3', String)])
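
For contrast, here is a minimal sketch (not part of the original report; `df2` and its values are made up for illustration) showing that the same expressions on a column with non-null data resolve to `String` for all three outputs, which is what makes the all-null case inconsistent:

```
import polars as pl

# Illustrative contrast: with non-null data, all three list accessors
# resolve to the inner String dtype.
df2 = pl.DataFrame({"a": ["x|y", "z"]})
print(
    df2.select(
        c1=pl.col("a").str.split("|").list.max(),
        c2=pl.col("a").str.split("|").list.min(),
        c3=pl.col("a").str.split("|").list.get(0),
    ).schema
)
# Schema([('c1', String), ('c2', String), ('c3', String)])
```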

Log output

No response

Issue description

This is problematic in streaming mode, or even when partitioning a DataFrame and processing each partition in batch mode: the same query would yield two different schemas whenever the column in partition X happens to contain only null values.
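
A minimal sketch of that partition scenario (the partition names and data below are assumptions for illustration, not from the report): the same expression produces different schemas depending on whether a partition is all-null.

```
import polars as pl

# Hypothetical partitions of the same column: one all-null, one with data.
part_x = pl.DataFrame({"a": [None, None]}, schema={"a": pl.String})
part_y = pl.DataFrame({"a": ["x|y", None]})

expr = pl.col("a").str.split("|").list.max().alias("m")
print(part_x.select(expr).schema)  # Schema([('m', Null)])   <- reported behaviour
print(part_y.select(expr).schema)  # Schema([('m', String)])
```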

Expected behavior

I think list.max() and the other accessors should always return the list's inner dtype, e.g. for a list[str] column they should always return the str dtype.

The behaviour is not consistent.

Installed versions

```
--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             Linux-3.10.0-1160.114.2.el7.x86_64-x86_64-with-glibc2.28
Python:               3.11.8 (main, Apr 12 2024, 16:17:28) [GCC 8.5.0 20210514 (Red Hat 8.5.0-20)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:           0.3.2
deltalake:            0.18.1
fastexcel:            0.10.4
fsspec:               2024.6.1
gevent:
great_tables:
hvplot:               0.10.0
matplotlib:
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.5
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.7.4
pyiceberg:
sqlalchemy:           2.0.31
torch:
xlsx2csv:             0.8.2
xlsxwriter:
```
stinodego commented 4 months ago

Thanks for the report! Indeed, min/max should return a string type in this case.
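
Until this is addressed, one possible workaround (my own suggestion, not something confirmed in this thread) is to cast the result explicitly so the output schema stays stable regardless of the data:

```
import polars as pl

df = pl.DataFrame([None, None, None], schema=[("a", str)])
# An explicit cast keeps the output dtype String even when every value is null.
df.select(
    c1=pl.col("a").str.split("|").list.max().cast(pl.String),
    c2=pl.col("a").str.split("|").list.min().cast(pl.String),
).schema
# Schema([('c1', String), ('c2', String)])
```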