pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Panic: when/then/otherwise not implemented with lit() #16243

Closed marton78 closed 3 weeks ago

marton78 commented 1 month ago

Reproducible example

import polars as pl

df = pl.DataFrame({"place": ["Mars", "Earth", "Saturn"]}).with_row_index()

expr = pl.struct(pl.col("place"), pl.lit(42))

df.select(
    pl.when(pl.col('index') == 1).then(expr).otherwise(None)
)

Log output

thread '<unnamed>' panicked at crates/polars-core/src/series/ops/null.rs:78:17:
not implemented for dtype Unknown(Int(42))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/jovyan/test.py", line 7, in <module>
    df.select(
  File "/opt/conda/lib/python3.11/site-packages/polars/dataframe/frame.py", line 8069, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1816, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: not implemented for dtype Unknown(Int(42))

Issue description

This works:

df.select(
    #pl.when(pl.col('index') == 1).then(expr).otherwise(None)
    expr
)

Expected behavior

It should return:

shape: (3, 1)
┌───────────────┐
│ place         │
│ ---           │
│ struct[2]     │
╞═══════════════╡
│ null          │
│ {"Earth",42}  │
│ null          │
└───────────────┘

Installed versions

```
--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-6.5.0-26-generic-aarch64-with-glibc2.35
Python:               3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:25:01) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:               2024.3.1
gevent:
hvplot:
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              15.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:           2.0.30
torch:
xlsx2csv:
xlsxwriter:
```

cmdlineluser commented 1 month ago

Can reproduce.

It seems to be an issue with null + structs?

pl.select(pl.lit(None).fill_null(pl.struct(42)))
# PanicException: not implemented for dtype Unknown(Int(42))

It does seem possible to force the type in a couple of ways, but it's not ideal:

df.select(
    pl.when(pl.col.index == 1)
      .then(pl.struct("place", 42))
      .otherwise(
          pl.struct(place=pl.lit(None, dtype=pl.String), literal=pl.lit(None, dtype=pl.Int32))
      )
)

df.select(
    (pl.col('index') == 1).replace(
        old=False, new=None, default=pl.struct("place", 42)
    )
)

# shape: (3, 1)
# ┌──────────────┐
# │ place        │
# │ ---          │
# │ struct[2]    │
# ╞══════════════╡
# │ {null,null}  │
# │ {"Earth",42} │
# │ {null,null}  │
# └──────────────┘
marton78 commented 1 month ago

Thanks for the proposed workarounds, @cmdlineluser, those will keep me afloat in the meantime!