pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

PanicException when slicing a LazyFrame streaming from globbed CSV #16163

Closed riley-harper closed 5 months ago

riley-harper commented 5 months ago

Reproducible example

from pathlib import Path
import polars as pl

csv_dir = Path("./test.csv")
csv_dir.mkdir()
df = pl.DataFrame({"A": [1, 2, 3]})
df.write_csv(csv_dir / "test-1.csv")

lf = pl.scan_csv(csv_dir / "*.csv")

# Setting streaming=True causes a panic here
lf.slice(0, 4).collect(streaming=True)

The exception traceback (with RUST_BACKTRACE=1) is

thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-utils/src/arena.rs:82:31:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic
   3: core::option::unwrap_failed
   4: polars_pipe::pipeline::convert::get_sink
   5: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
   6: polars_lazy::physical_plan::streaming::convert_alp::insert_streaming_nodes
   7: polars_lazy::frame::LazyFrame::optimize_with_scratch
   8: polars_lazy::frame::LazyFrame::collect
   9: polars::lazyframe::PyLazyFrame::__pymethod_collect__
  10: pyo3::impl_::trampoline::trampoline
  11: polars::lazyframe::_::__INVENTORY::trampoline
  12: method_vectorcall_VARARGS_KEYWORDS
             at /usr/local/src/conda/python-3.12.3/Objects/descrobject.c:365:14
  13: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.12.3/Include/internal/pycore_call.h:92:11
  14: PyObject_Vectorcall
             at /usr/local/src/conda/python-3.12.3/Objects/call.c:325:12
  15: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1713204800955/work/build-static/Python/bytecodes.c:2706:19
  16: PyEval_EvalCode
             at /usr/local/src/conda/python-3.12.3/Python/ceval.c:578:21
  17: run_eval_code_obj
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:1722
  18: run_mod
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:1743
  19: PyRun_InteractiveOneObjectEx
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:260
  20: _PyRun_InteractiveLoopObject
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:137
  21: _PyRun_AnyFileObject
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:72
  22: PyRun_AnyFileExFlags
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:104
  23: pymain_run_stdin
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:520
  24: pymain_run_python
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:632
  25: Py_RunMain
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:709
  26: Py_BytesMain
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:763:12
  27: __libc_start_call_main
             at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  28: __libc_start_main_impl
             at ./csu/../csu/libc-start.c:392:3
  29: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rileyh/micromamba/envs/rileyh_test/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 1816, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

Log output

thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-utils/src/arena.rs:82:31:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rileyh/micromamba/envs/rileyh_test/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 1816, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

Issue description

In the specific case where I use scan_csv() with a *.csv glob, LazyFrame.slice(), and collect() with streaming set to True, I get a PanicException. If I set streaming to False, or don't call LazyFrame.slice() before collecting, I get the result I expect, not a panic.

Expected behavior

I would expect that the result with streaming=True would be the same as with streaming=False, which is a DataFrame that looks like

shape: (3, 1)
┌─────┐
│ A   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

Installed versions

```
--------Version info---------
Polars: 0.20.25
Index type: UInt32
Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
Python: 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:
nest_asyncio:
numpy:
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

cmdlineluser commented 5 months ago

Can reproduce.

It seems something is up with slice_pushdown on the streaming engine.

>>> lf.slice(0, 4).collect(streaming=True, slice_pushdown=False)
shape: (3, 1)
┌─────┐
│ A   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

Update: It seems to be specific to the frame method; Expr.slice is ok:

>>> lf.select(pl.all().slice(0, 4)).collect(streaming=True)
shape: (3, 1)
┌─────┐
│ A   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘