pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars cannot read DeltaBinaryPacked encoded files #15214

Open Steiniche opened 7 months ago

Steiniche commented 7 months ago


Reproducible example

```
import polars as pl

filepath = "data/1.parquet"
# scan_parquet is lazy; the decode error surfaces when the frame is collected.
df = pl.scan_parquet(filepath, n_rows=200)
print(df.collect())
```

Log output

File "/main.py", line 17, in <module>
    .collect()
  File "/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1943, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ComputeError: Decoding Int64 "DeltaBinaryPacked"-encoded required  parquet pages not yet implemented

Issue description

Polars cannot read columns encoded as Delta Binary Packed, as described here: https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-encoding-delta_binary_packed--5
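For context, the encoding stores a first value and then bit-packs the deltas between consecutive values, which is why it compresses slowly varying integers so well. A rough sketch of the idea (illustrative only; the real page format uses blocks and miniblocks with a per-block minimum delta):

```
# Illustrative only: the core idea behind DELTA_BINARY_PACKED.
# Store the first value, then the deltas between consecutive values,
# which can be bit-packed far more tightly than raw 64-bit integers.
values = [1000, 1003, 1005, 1010, 1012]
first = values[0]
deltas = [b - a for a, b in zip(values, values[1:])]  # [3, 2, 5, 2]
# Each delta here fits in 3 bits, versus 64 bits per plain Int64 value.
print(first, deltas)
```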

Expected behavior

Polars should be able to read Parquet files with Delta Binary Packed encoded columns.

Installed versions

```
--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-6.6.10-76060610-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.3.1
gevent:
hvplot:
matplotlib:
numpy:                1.26.4
openpyxl:
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
ritchie46 commented 7 months ago

Could you share an example file that contains these?

trueb2 commented 7 months ago

The Parquet writers and readers have hardcoded encodings right now. Related to https://github.com/pola-rs/polars/issues/10680#issuecomment-1693058954

I have some example files that demonstrate how some data can be significantly smaller with alternative encodings. I have integer data sampled continuously over the course of about a day (pictured below). With the default zstd compression, the delta binary packed encoding produced files about half the size of the plain encoding. The schema in these files was a datetime[µs] column and an i64 column.

[Two screenshots: the sampled data over the day, and the resulting file-size comparison]
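For anyone wanting to reproduce the size difference without the original files, a minimal sketch (the synthetic data and file names here are made up, not the files from the screenshots):

```
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Monotonically increasing Int64 data, similar in shape to timestamps
# sampled over a day.
n = 1_000_000
table = pa.table({"ts": np.cumsum(np.random.randint(1, 100, n)).astype(np.int64)})

# Write the same table twice, plain vs. delta binary packed, zstd in both cases.
pq.write_table(table, "plain.parquet", use_dictionary=False,
               column_encoding="PLAIN", compression="zstd")
pq.write_table(table, "delta.parquet", use_dictionary=False,
               column_encoding="DELTA_BINARY_PACKED", compression="zstd")

print(os.path.getsize("plain.parquet"), os.path.getsize("delta.parquet"))
```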

Steiniche commented 7 months ago

Unfortunately, I cannot share the files we are working on as they contain sensitive information.

I have tried to create a minimal example showing the problem. However, to my surprise, Polars does read Delta Binary Packed files written by the following example:

```
import pyarrow as pa
import pyarrow.parquet as pq
import polars as pl
import numpy
import random

num_rows = 1_000_000

# Two Int64 columns of random values.
data = {
    'id': [numpy.int64(random.randint(1, 10_000)) for _ in range(num_rows)],
    'value': [numpy.int64(random.randint(1, 10_000)) for _ in range(num_rows)],
}

table = pa.Table.from_pydict(data)

datafile = "data.parquet"

# Force DELTA_BINARY_PACKED for all columns; dictionary encoding must be
# disabled for column_encoding to take effect.
pq.write_table(table=table, where=datafile, column_encoding="DELTA_BINARY_PACKED", use_dictionary=False)

df = pl.scan_parquet(datafile)
print(df.collect())
```

My current hypothesis is that a combination of factors triggers the error. I will keep investigating on my end and see if I can come up with an example that makes Polars throw the error.
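One way to narrow this down would be to inspect the metadata of a failing file and compare it against the working example above; a sketch (the path is the one from the original report):

```
import pyarrow.parquet as pq

# Print the encodings and compression recorded for every column chunk.
meta = pq.ParquetFile("data/1.parquet").metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        print(rg, chunk.path_in_schema, chunk.encodings, chunk.compression)
```

Since the error message mentions "required" pages, it might also be worth rerunning the example with non-nullable columns (e.g. a schema built from pa.field("id", pa.int64(), nullable=False)), as pa.Table.from_pydict produces nullable columns by default.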