pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`.sink_parquet()` sometimes panics when `statistics` has `"null_count": False` #17306

Open etiennebacher opened 3 months ago

etiennebacher commented 3 months ago

Reproducible example

import os
os.environ["POLARS_VERBOSE"] = "1"
import polars as pl

test = pl.LazyFrame({"a": [1]})

test.sink_parquet("foo.parquet", statistics={"null_count": False, "min": True, "max": False, "distinct_count": True})

Log output

RUN STREAMING PIPELINE
[df -> parquet_sink]
thread '<unnamed>' panicked at crates/polars-pipe/src/executors/sinks/output/parquet.rs:47:33:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("parquet: File out of specification: null count of a page is required"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at crates/polars-pipe/src/executors/sinks/output/parquet.rs:127:14:
called `Result::unwrap()` on an `Err` value: Any { .. }
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/etienne/.local/lib/python3.10/site-packages/polars/_utils/unstable.py", line 58, in wrapper
    return function(*args, **kwargs)
  File "/home/etienne/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 2233, in sink_parquet
    return lf.sink_parquet(
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Any { .. }

Issue description

`sink_parquet()` panics when `"null_count"` is `False` in the `statistics` argument, but only when other keys of `statistics` are also provided. For example, this works:

test.sink_parquet("foo.parquet", statistics={"null_count": False})

but this panics:

test.sink_parquet("foo.parquet", statistics={"null_count": False, "min": True, "max": False, "distinct_count": True})

Expected behavior

`sink_parquet()` should either work or raise a proper error instead of panicking.
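Until this is fixed, a caller-side guard could turn the panic into an ordinary Python exception. The following is only a minimal sketch: `check_parquet_statistics` is a hypothetical helper, not part of the Polars API, and it assumes (based solely on the two calls reported above) that the problem appears when `"null_count"` is explicitly `False` while at least one other statistic is enabled.

```python
from typing import Mapping


def check_parquet_statistics(statistics: Mapping[str, bool]) -> None:
    """Hypothetical guard, not a Polars function.

    Raises ValueError for the combination reported in this issue:
    "null_count" explicitly set to False while another statistic is enabled.
    This rule is an assumption inferred from the two examples above.
    """
    # Only the explicit False case was reported, so a missing key is ignored.
    null_count_disabled = statistics.get("null_count") is False

    # Collect every other statistic that is switched on.
    others_enabled = sorted(
        key
        for key, enabled in statistics.items()
        if key != "null_count" and enabled
    )

    if null_count_disabled and others_enabled:
        raise ValueError(
            "statistics has null_count=False together with enabled "
            f"statistics {others_enabled}; this combination is reported to fail"
        )
```

Calling `check_parquet_statistics(stats)` right before `test.sink_parquet("foo.parquet", statistics=stats)` would then raise a `ValueError` for the failing dictionary above and pass silently for `{"null_count": False}`.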

Installed versions

```
--------Version info---------
Polars:               1.0.0-rc.2
Index type:           UInt32
Platform:             Linux-6.5.0-41-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:
nest_asyncio:
numpy:                1.21.5
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

etiennebacher commented 3 months ago

The same `statistics` argument produces a proper error with `write_parquet()`:

import os
os.environ["POLARS_VERBOSE"] = "1"
import polars as pl

test = pl.DataFrame({"a": [1]})

test.write_parquet("foo.parquet", statistics={"null_count": False, "min": True, "max": False, "distinct_count": True})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/etienne/.local/lib/python3.10/site-packages/polars/dataframe/frame.py", line 3554, in write_parquet
    self._df.write_parquet(
polars.exceptions.ComputeError: parquet: File out of specification: null count of a page is required