Steiniche opened 7 months ago
Could you share an example file that contains these?
The Parquet writers and readers have hardcoded encodings right now. Related to https://github.com/pola-rs/polars/issues/10680#issuecomment-1693058954
I have some example files that demonstrate how some data can be significantly smaller with alternative encodings. I have integral data sampled continuously over the course of about a day (pictured below). With the default zstd compression, the delta-binary-packed file was about half the size of the plain-encoded one. The schema in these files was a datetime[µs] column and an i64 column.
Unfortunately, I cannot share the files we are working on as they contain sensitive information.
I have tried to create a working example showing the problem. However, to my surprise, it seems like Polars will read delta-binary-packed files, based on the following example:
```python
import random

import numpy
import pyarrow as pa
import pyarrow.parquet as pq
import polars as pl

num_rows = 1_000_000
data = {
    "id": [numpy.int64(random.randint(1, 10_000)) for _ in range(num_rows)],
    "value": [numpy.int64(random.randint(1, 10_000)) for _ in range(num_rows)],
}
table = pa.Table.from_pydict(data)

datafile = "data.parquet"
pq.write_table(table=table, where=datafile,
               column_encoding="DELTA_BINARY_PACKED", use_dictionary=False)

df = pl.scan_parquet(datafile)
print(df.collect())
```
My current hypothesis is that it takes a combination of factors for the error to occur. I will keep investigating on my end and see if I can come up with an example that makes Polars throw the error.
Issue description
Polars cannot read values which are Delta Binary Packed as described here: https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-encoding-delta_binary_packed--5
Expected behavior
Polars should be able to read Parquet files with Delta Binary Packed encoded columns.