pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

PanicException when combining very specific operations #19873

Open rhshadrach-8451 opened 1 week ago

rhshadrach-8451 commented 1 week ago

Reproducible example

import polars as pl

df1 = pl.DataFrame({"A": ["x", "x"]})
df2 = pl.DataFrame({"A": ["y"]})
df = pl.concat([df1, df2]).with_columns(
    B=pl.col("A").min().over("A"),
    C=0,
)
df.filter(~pl.col("A").eq("y") | ~pl.col("A").is_in(['y'])).filter(pl.col("A").gt(pl.lit("0")))

Log output

dataframe filtered
thread '<unnamed>' panicked at crates/polars-core/src/series/mod.rs:226:34:
index out of bounds: the len is 1 but the index is 1
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic_bounds_check
   3: polars_core::series::Series::select_chunk
   4: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
   5: <polars_mem_engine::executors::filter::FilterExec as polars_mem_engine::executors::executor::Executor>::execute::{{closure}}
   6: <polars_mem_engine::executors::filter::FilterExec as polars_mem_engine::executors::executor::Executor>::execute
   7: polars_lazy::frame::LazyFrame::collect
   8: polars_python::lazyframe::general::<impl polars_python::lazyframe::PyLazyFrame>::__pymethod_collect__
   9: pyo3::impl_::trampoline::trampoline
  10: polars_python::lazyframe::general::_::__INVENTORY::trampoline
  11: _method_vectorcall_VARARGS_KEYWORDS
  12: _call_function
  13: __PyEval_EvalFrameDefault
  14: __PyEval_Vector
  15: _method_vectorcall
  16: _call_function
  17: __PyEval_EvalFrameDefault
  18: __PyEval_Vector
  19: _call_function
  20: __PyEval_EvalFrameDefault
  21: __PyEval_Vector
  22: _PyEval_EvalCode
  23: _run_eval_code_obj
  24: _run_mod
  25: _pyrun_file
  26: __PyRun_SimpleFileObject
  27: __PyRun_AnyFileObject
  28: _pymain_run_file_obj
  29: _pymain_run_file
  30: _Py_RunMain
  31: _Py_BytesMain
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "[snip]/run.py", line 9, in <module>
    print(df.filter(~pl.col("A").eq("y") | ~pl.col("A").is_in(['y'])).filter(pl.col("A").gt(pl.lit("0"))).with_columns(C=0))
  File "[snip]/lib/python3.10/site-packages/polars/dataframe/frame.py", line 4774, in filter
    return self.lazy().filter(*predicates, **constraints).collect(_eager=True)
  File "[snip]/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 2029, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: index out of bounds: the len is 1 but the index is 1

Issue description

This first appeared in 1.13.0. 1.12.0 gives the expected result.

The above code comes from a heavily reduced computation, so it appears nonsensical. I believe it is minimal in the sense that the correct result appears if I make any one of the following changes:

  1. Remove one row from df1.
  2. Remove the concat by using pl.DataFrame({"A": ["x", "x", "y"]}) instead of df1 and df2.
  3. Remove the addition of column B.
  4. Remove the addition of column C.
  5. Replace .is_in(['y']) with .eq('y').
  6. Remove either condition in the first call to filter.
  7. Remove the negation of either condition in the first call to filter.
  8. Combine the two calls to filter into a single call.

The nature of this looks similar to https://github.com/pola-rs/polars/issues/16830

Expected behavior

shape: (2, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 │
╞═════╪═════╪═════╡
│ x   ┆ x   ┆ 0   │
│ x   ┆ x   ┆ 0   │
└─────┴─────┴─────┘

Installed versions

```
--------Version info---------
Polars:               1.14.0
Index type:           UInt32
Platform:             macOS-14.7-arm64-arm-64bit
Python:               3.10.15 (main, Sep 7 2024, 00:20:06) [Clang 15.0.0 (clang-1500.3.9.4)]
LTS CPU:              False

----Optional dependencies----
adbc_driver_manager   <not installed>
altair                <not installed>
boto3                 <not installed>
cloudpickle           <not installed>
connectorx            <not installed>
deltalake             <not installed>
fastexcel             0.12.0
fsspec                2024.9.0
gevent                <not installed>
google.auth           2.35.0
great_tables          <not installed>
matplotlib            3.9.2
nest_asyncio          1.6.0
numpy                 1.26.4
openpyxl              3.1.5
pandas                2.2.3
pyarrow               18.0.0
pydantic              2.9.2
pyiceberg             <not installed>
sqlalchemy            <not installed>
torch                 <not installed>
xlsx2csv              <not installed>
xlsxwriter            3.2.0
```
cmdlineluser commented 1 week ago

Can reproduce.

This may be rechunk-related: forcing a rechunk in the concat (or calling .rechunk() before .filter()) avoids the panic.

import polars as pl

df1 = pl.DataFrame({"A": ["x", "x"]})
df2 = pl.DataFrame({"A": ["y"]})
df = pl.concat([df1, df2], rechunk=True).with_columns(
    B=pl.col("A").min().over("A"),
    C=0,
)
df.filter(~pl.col("A").eq("y") | ~pl.col("A").is_in(['y'])).filter(pl.col("A").gt(pl.lit("0")))

# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ A   ┆ B   ┆ C   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ i32 │
# ╞═════╪═════╪═════╡
# │ x   ┆ x   ┆ 0   │
# │ x   ┆ x   ┆ 0   │
# └─────┴─────┴─────┘
rhshadrach-8451 commented 6 days ago

I forgot to mention this worked fine in 1.12.0; I've updated the OP.