pola-rs / polars

`fill_null(pl.lit([]))` on data with enforced schema but no entries makes `to_arrow()` panic #19738

Open moritzwilksch opened 6 days ago

moritzwilksch commented 6 days ago

Reproducible example

# %%
import json
from io import StringIO
import polars as pl
import os

os.environ["RUST_BACKTRACE"] = "1"
os.environ["POLARS_VERBOSE"] = "1"

pl.show_versions()

# %%

data = {
    "container": [
        {
            "items": [
                {
                    "item_id": 12,
                    "flags": [
                        {"filter": True, "value": 1},
                    ],
                    "line_items": [
                        {
                            "li_id": 1,
                            # vvvvvvvvvvvvvv    adding this will make it work!!
                            # "flags": [
                            #     {"filter": True, "value": 12},
                            # ],
                        },
                        {
                            "li_id": 2,
                            # "flags": [
                            #     {"filter": True, "value": 13},
                            # ],
                        },
                    ],
                },
            ]
        }
    ]
}

schema = pl.Schema(
    {
        "container": pl.List(
            pl.Struct(
                {
                    "items": pl.List(
                        pl.Struct(
                            {
                                "item_id": pl.Int32(),
                                "line_items": pl.List(
                                    pl.Struct(
                                        {
                                            "li_id": pl.Int32(),
                                            "flags": pl.List(
                                                pl.Struct(
                                                    {
                                                        "filter": pl.Boolean(),
                                                        "value": pl.Int32(),
                                                    }
                                                )
                                            ),
                                        }
                                    )
                                ),
                                "flags": pl.List(
                                    pl.Struct(
                                        {"filter": pl.Boolean(), "value": pl.Int32()}
                                    )
                                ),
                            }
                        )
                    )
                }
            )
        )
    }
)

# %%
df = pl.read_json(StringIO(json.dumps(data)), schema_overrides=schema)
container = df.explode("container").with_columns(
    items=pl.col("container").struct.field("items")
)
print(container)
# shape: (1, 2)
# ┌─────────────────────────────────┬─────────────────────────────────┐
# │ container                       ┆ items                           │
# │ ---                             ┆ ---                             │
# │ struct[1]                       ┆ list[struct[3]]                 │
# ╞═════════════════════════════════╪═════════════════════════════════╡
# │ {[{12,[{1,null}, {2,null}],[{t… ┆ [{12,[{1,null}, {2,null}],[{tr… │
# └─────────────────────────────────┴─────────────────────────────────┘

items = container.explode("items")
print(items)
# shape: (1, 2)
# ┌─────────────────────────────────┬─────────────────────────────────┐
# │ container                       ┆ items                           │
# │ ---                             ┆ ---                             │
# │ struct[1]                       ┆ struct[3]                       │
# ╞═════════════════════════════════╪═════════════════════════════════╡
# │ {[{12,[{1,null}, {2,null}],[{t… ┆ {12,[{1,null}, {2,null}],[{tru… │
# └─────────────────────────────────┴─────────────────────────────────┘

line_items = items.select(
    pl.col("items").struct.field("line_items").alias("line_items")
).explode("line_items")
print(line_items)
# shape: (2, 1)
# ┌────────────┐
# │ line_items │
# │ ---        │
# │ struct[2]  │
# ╞════════════╡
# │ {1,null}   │
# │ {2,null}   │
# └────────────┘

res = line_items.select(
    pl.col("line_items")
    .struct.field("flags")
    .list.eval(pl.element().filter(pl.element().struct.field("filter")))
    .list.eval(pl.element().struct.field("value"))
    .list.drop_nulls()
    .list.sort()
    .fill_null(pl.lit([]))  # removing this will make it work!!
    .cast(pl.List(pl.Int32()))
)

print(res)  # works, dtypes looking good
# shape: (2, 1)
# ┌───────────┐
# │ flags     │
# │ ---       │
# │ list[i32] │
# ╞═══════════╡
# │ []        │
# │ []        │
# └───────────┘

res.to_arrow()  # panic
# PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("ListArray's child's DataType must match. However, the expected DataType is Int32 while it got Null."))

Log output

thread '<unnamed>' panicked at /home/conda/feedstock_root/build_artifacts/polars_1730209296944/work/crates/polars-arrow/src/array/list/mod.rs:81:57:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("ListArray's child's DataType must match. However, the expected DataType is Int32 while it got Null."))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: polars_core::series::into::<impl polars_core::series::Series>::to_arrow
   4: <polars_core::frame::RecordBatchIter as core::iter::traits::iterator::Iterator>::next
   5: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
   6: polars_python::dataframe::export::<impl polars_python::dataframe::PyDataFrame>::__pymethod_to_arrow__
   7: pyo3::impl_::trampoline::trampoline
   8: polars_python::dataframe::export::_::__INVENTORY::trampoline
   9: method_vectorcall_VARARGS_KEYWORDS
             at /usr/local/src/conda/python-3.12.7/Objects/descrobject.c:365:14
  10: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_call.h:92:11
  11: PyObject_Vectorcall
             at /usr/local/src/conda/python-3.12.7/Objects/call.c:325:12
  12: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1728056517259/work/build-static/Python/bytecodes.c:2715:19
  13: PyEval_EvalCode
             at /usr/local/src/conda/python-3.12.7/Python/ceval.c:578:21
  14: builtin_exec_impl
             at /usr/local/src/conda/python-3.12.7/Python/bltinmodule.c:1096
  15: builtin_exec
             at /usr/local/src/conda/python-3.12.7/Python/clinic/bltinmodule.c.h:586
  16: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1728056517259/work/build-static/Python/bytecodes.c:2975:19
  17: _PyEval_EvalFrame
             at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_ceval.h:89:16
  18: gen_send_ex2
             at /usr/local/src/conda/python-3.12.7/Objects/genobject.c:230:14
  19: gen_send_ex
             at /usr/local/src/conda/python-3.12.7/Objects/genobject.c:274
  20: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1728056517259/work/build-static/Python/bytecodes.c:3094:19
  21: _PyEval_EvalFrame
             at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_ceval.h:89:16
  22: _PyEval_Vector
             at /usr/local/src/conda/python-3.12.7/Python/ceval.c:1683:12
  23: _PyFunction_Vectorcall
             at /usr/local/src/conda/python-3.12.7/Objects/call.c:419:16
  24: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_call.h:92:11
  25: method_vectorcall
             at /usr/local/src/conda/python-3.12.7/Objects/classobject.c:61:18
  26: _PyVectorcall_Call
             at /usr/local/src/conda/python-3.12.7/Objects/call.c:283:24
  27: _PyObject_Call
             at /usr/local/src/conda/python-3.12.7/Objects/call.c:354:16
  28: PyCFunction_Call
             at /usr/local/src/conda/python-3.12.7/Objects/call.c:387:12
  29: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1728056517259/work/build-static/Python/bytecodes.c:3263:26
  30: _PyEval_EvalFrame
             at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_ceval.h:89:16
  31: gen_send_ex2
             at /usr/local/src/conda/python-3.12.7/Objects/genobject.c:230:14
  32: task_step_impl
             at /usr/local/src/conda/python-3.12.7/Modules/_asynciomodule.c:2859:22
  33: task_step
             at /usr/local/src/conda/python-3.12.7/Modules/_asynciomodule.c:3160:11
  34: cfunction_vectorcall_O
             at /usr/local/src/conda/python-3.12.7/Objects/methodobject.c:509:24
  35: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_call.h:92:11
  36: context_run
             at /usr/local/src/conda/python-3.12.7/Python/context.c:668
  37: cfunction_vectorcall_FASTCALL_KEYWORDS
             at /usr/local/src/conda/python-3.12.7/Objects/methodobject.c:438:24
  38: PyCFunction_Call
             at /usr/local/src/conda/python-3.12.7/Objects/call.c:387:12
  39: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1728056517259/work/build-static/Python/bytecodes.c:3263:26
  40: PyEval_EvalCode
             at /usr/local/src/conda/python-3.12.7/Python/ceval.c:578:21
  41: builtin_exec_impl
             at /usr/local/src/conda/python-3.12.7/Python/bltinmodule.c:1096
  42: builtin_exec
             at /usr/local/src/conda/python-3.12.7/Python/clinic/bltinmodule.c.h:586
  43: cfunction_vectorcall_FASTCALL_KEYWORDS
             at /usr/local/src/conda/python-3.12.7/Objects/methodobject.c:438:24
  44: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_call.h:92:11
  45: PyObject_Vectorcall
             at /usr/local/src/conda/python-3.12.7/Objects/call.c:325:12
  46: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1728056517259/work/build-static/Python/bytecodes.c:2715:19
  47: pymain_run_module
             at /usr/local/src/conda/python-3.12.7/Modules/main.c:300
  48: pymain_run_python
             at /usr/local/src/conda/python-3.12.7/Modules/main.c:627
  49: Py_RunMain
             at /usr/local/src/conda/python-3.12.7/Modules/main.c:713
  50: Py_BytesMain
             at /usr/local/src/conda/python-3.12.7/Modules/main.c:767:12
  51: <unknown>
  52: __libc_start_main
  53: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Issue description

When using `.fill_null(pl.lit([]))` on a dataframe with an enforced schema that has been created by exploding and unnesting structs, the result looks fine, but calling `to_arrow()` on it panics. As marked in the repro comments, the issue does not occur if the line items carry their own `flags` entries or if the `fill_null` call is removed.

Sorry for the big repro, I haven't been able to distill it down any further so far. This also affects dataframe comparisons and `assert_frame_equal`: they fail with the same panic.

Panic looks similar to #17479, although no enums are involved here.

Expected behavior

No panic on the `to_arrow()` call. The result looks fine (i.e. no `List[Null]` or similar), so I'd expect it to be fine internally as well.

Installed versions

```
--------Version info---------
Polars:              1.12.0
Index type:          UInt32
Platform:            Linux-6.8.0-47-generic-x86_64-with-glibc2.39
Python:              3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 16:05:46) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio         1.6.0
numpy                2.1.2
openpyxl
pandas
pyarrow              18.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

coastalwhite commented 4 days ago

I have done some digging and the issue is that `fill_null` does not grab the supertype of both inputs. I could theoretically solve this in the type coercion, but we have decided that we would instead like to move to dyn list, dyn struct, dyn enum and dyn categorical. That refactor would solve this issue in a wider context.