pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Panic when filtering a dataframe with an object field #18665

Open fedyakov opened 1 month ago

fedyakov commented 1 month ago

Reproducible example

import polars as pl

df = pl.DataFrame([{f"c{i}": 0 for i in range(11)} | {"c": object()}] * 12)
df.filter((pl.col("c0") == 0) & (pl.col("c1") == 0))

Log output

thread '<unnamed>' panicked at /Users/runner/work/polars/polars/crates/polars-core/src/chunked_array/ops/chunkops.rs:146:17:
implementation error
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: polars_core::chunked_array::ops::chunkops::<impl polars_core::chunked_array::ChunkedArray<T>>::rechunk
   3: polars_core::chunked_array::ops::gather::<impl polars_core::chunked_array::ops::ChunkTakeUnchecked<polars_core::chunked_array::ChunkedArray<polars_core::datatypes::UInt32Type>> for polars_core::chunked_array::ChunkedArray<T>>::take_unchecked
   4: polars_core::series::implementations::object::<impl polars_core::series::series_trait::SeriesTrait for polars_core::series::implementations::SeriesWrap<polars_core::chunked_array::ChunkedArray<polars_core::datatypes::ObjectType<T>>>>::take
   5: polars_core::series::Series::clear
   6: polars_core::series::Series::select_chunk
   7: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
   8: <polars_mem_engine::executors::filter::FilterExec as polars_mem_engine::executors::executor::Executor>::execute::{{closure}}
   9: <polars_mem_engine::executors::filter::FilterExec as polars_mem_engine::executors::executor::Executor>::execute
  10: polars_lazy::frame::LazyFrame::collect
  11: polars_python::lazyframe::general::<impl polars_python::lazyframe::PyLazyFrame>::__pymethod_collect__
  12: pyo3::impl_::trampoline::trampoline
  13: polars_python::lazyframe::general::_::__INVENTORY::trampoline
  14: _method_vectorcall_VARARGS_KEYWORDS
  15: _call_function
  16: __PyEval_EvalFrameDefault
  17: __PyEval_Vector
  18: _method_vectorcall
  19: _call_function
  20: __PyEval_EvalFrameDefault
  21: __PyEval_Vector
  22: _call_function
  23: __PyEval_EvalFrameDefault
  24: __PyEval_Vector
  25: _PyEval_EvalCode
  26: _run_eval_code_obj
  27: _run_mod
  28: _pyrun_file
  29: __PyRun_SimpleFileObject
  30: __PyRun_AnyFileObject
  31: _pymain_run_file_obj
  32: _pymain_run_file
  33: _Py_RunMain
  34: _Py_BytesMain
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "/Users/fedyakov/GitHub/apple/neutron/web-scraping-hours/apps/streamlits/golden_set/polars_bug.py", line 4, in <module>
    df.filter((pl.col("c0") == 0) & (pl.col("c1") == 0))
  File "/Users/fedyakov/GitHub/apple/neutron/web-scraping-hours/.venv/lib/python3.10/site-packages/polars/dataframe/frame.py", line 4701, in filter
    return self.lazy().filter(*predicates, **constraints).collect(_eager=True)
  File "/Users/fedyakov/GitHub/apple/neutron/web-scraping-hours/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 2034, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: implementation error
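
The note in the log above suggests rerunning with `RUST_BACKTRACE=full` to get a verbose backtrace. A minimal sketch of doing that from Python, assuming the reproduction is saved as a standalone script; the helper name `run_with_full_backtrace` is hypothetical and not part of Polars:

```python
import os
import subprocess
import sys


def run_with_full_backtrace(args):
    """Run a child Python process with RUST_BACKTRACE=full and capture its output.

    `args` are the arguments passed to the interpreter, e.g. ["polars_bug.py"].
    """
    # Copy the current environment so that only RUST_BACKTRACE changes for the child.
    env = dict(os.environ, RUST_BACKTRACE="full")
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output as str
    )
```

For example, `run_with_full_backtrace(["polars_bug.py"]).stderr` would hold the panic message and the verbose backtrace, which can then be pasted into the issue.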

Issue description

Expected behavior

The filter should return the filtered dataframe without panicking.

Installed versions

```
--------Version info---------
Polars:              1.6.0
Index type:          UInt32
Platform:            macOS-14.6.1-arm64-arm-64bit
Python:              3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager
altair               4.2.2
cloudpickle          3.0.0
connectorx
deltalake
fastexcel
fsspec               2024.3.1
gevent
great_tables
matplotlib           3.9.0
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl
pandas               2.2.2
pyarrow              15.0.2
pydantic             2.7.3
pyiceberg
sqlalchemy
torch                2.2.2
xlsx2csv
xlsxwriter
None
```

egaban commented 1 month ago

Can confirm this issue on a 14-column dataframe WITHOUT any object column. The only dtypes are Float64, Int64, and String. Dropping any 3 columns makes it work.

Edit: adding more details, the dataframe was created with two cross joins.

egaban commented 1 month ago

Also, somehow printing works: print(df.filter(...)) prints the filtered dataframe as expected. Assigning the result to a variable is what breaks.

cmdlineluser commented 1 month ago

@egaban If you can make a reproducible example (with synthetic data if needed) - you should open a new issue.

The object dtype has very limited support, meaning your case would have much higher priority for the devs.

egaban commented 1 month ago

Just tried the exact same code with the same inputs on another machine and, I don't know how, but it worked 🤯

Same Python/Polars versions, but the first machine was running Linux and the second macOS. I'll try to create a simple reproduction of the problem on the Linux machine and post it here.

NXP-KetelsJ commented 4 weeks ago

I see the same behavior with Polars 1.9.0 and Python 3.10 on a Windows machine.