pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars: panic, found compressed page in the middle of the pages #18085

Closed kmcentush closed 3 months ago

kmcentush commented 3 months ago

Reproducible example

This is a re-reporting of a bug originally found here: https://github.com/posit-dev/positron/issues/4218. The parquet file in question can be downloaded from: https://github.com/posit-dev/qa-example-content/blob/main/data-files/100x100/100x100.parquet

import polars as pl

pl.read_parquet("100x100.parquet")
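
Because the file is linked through a GitHub `blob` page, it may be worth ruling out a bad download (for example, a saved HTML page) before treating the panic as a reader bug. The sketch below is only a sanity check of the Parquet outer framing: a Parquet file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The helper name `check_parquet_framing` is hypothetical and not part of Polars; it says nothing about whether individual pages decode correctly.

```python
import struct
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def check_parquet_framing(path):
    """Return (ok, detail) for the outer framing of the file at `path`.

    Hypothetical helper for triage only: it checks the leading and trailing
    magic and that the declared footer length fits inside the file. It does
    not inspect row groups or pages.
    """
    size = Path(path).stat().st_size
    # Smallest possible layout: magic (4) + footer length (4) + magic (4).
    if size < 12:
        return False, f"file is only {size} bytes"

    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, 2)  # 8 bytes before end: footer length, then magic
        footer_len_bytes = f.read(4)
        tail = f.read(4)

    if head != PARQUET_MAGIC:
        return False, f"unexpected leading bytes {head!r}"
    if tail != PARQUET_MAGIC:
        return False, f"unexpected trailing bytes {tail!r}"

    (footer_len,) = struct.unpack("<I", footer_len_bytes)
    if footer_len + 12 > size:
        return False, f"footer length {footer_len} does not fit in {size} bytes"
    return True, f"framing looks valid, footer is {footer_len} bytes"


if __name__ == "__main__":
    # Assumes the file from the link above was saved next to this script.
    print(check_parquet_framing("100x100.parquet"))
```

If the framing looks valid, reading the same file through pyarrow with `pl.read_parquet("100x100.parquet", use_pyarrow=True)` may help separate a problem in the file from a regression in the native reader. The outcome of that cross-check for this file is not verified here.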

Log output

thread 'polars-4' panicked at crates/polars-parquet/src/parquet/read/compression.rs:222:17:
Found compressed page in the middle of the pages
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'polars-9' panicked at crates/polars-parquet/src/parquet/read/compression.rs:222:17:
Found compressed page in the middle of the pages
thread 'polars-3' panicked at crates/polars-parquet/src/parquet/read/compression.rs:222:17:
Found compressed page in the middle of the pages
thread 'polars-0' panicked at crates/polars-parquet/src/parquet/read/compression.rs:222:17:
Found compressed page in the middle of the pages
thread 'polars-5' panicked at crates/polars-parquet/src/parquet/read/compression.rs:222:17:
Found compressed page in the middle of the pages
Traceback (most recent call last):
  File "/Users/kylemcentush/Documents/code/observables/.venv/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-5520d71e7571>", line 1, in <module>
    a = pl.read_parquet("/Users/kylemcentush/Downloads/100x100.parquet")
  File "/Users/kylemcentush/Documents/code/observables/.venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
  File "/Users/kylemcentush/Documents/code/observables/.venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
  File "/Users/kylemcentush/Documents/code/observables/.venv/lib/python3.10/site-packages/polars/io/parquet/functions.py", line 208, in read_parquet
    return lf.collect()
  File "/Users/kylemcentush/Documents/code/observables/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: Found compressed page in the middle of the pages

Issue description

Reading the file with `pl.read_parquet` panics with "Found compressed page in the middle of the pages".

Expected behavior

The file should be read without error. It reads correctly with Polars 1.3.0, so I believe this is a regression.

Installed versions

```
--------Version info---------
Polars: 1.4.1
Index type: UInt32
Platform: macOS-14.4.1-arm64-arm-64bit
Python: 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle: 3.0.0
connectorx:
deltalake:
fastexcel:
fsspec: 2024.6.0
gevent:
great_tables:
hvplot:
matplotlib: 3.9.0
nest_asyncio: 1.6.0
numpy: 1.26.4
openpyxl:
pandas: 2.2.2
pyarrow: 16.1.0
pydantic: 2.7.3
pyiceberg:
sqlalchemy: 2.0.30
torch:
xlsx2csv:
xlsxwriter:
```

ritchie46 commented 3 months ago

@coastalwhite can you take a look?