Use parquet statistics when collecting column statistics from scanned parquet

borchero commented 1 week ago

Description

When saving a dataframe via write_parquet("...", use_statistics=True), I would expect

lf = pl.scan_parquet("...")
lf.select(pl.all().null_count()).collect()

to read only the column statistics from the parquet file. However, judging from execution time and memory consumption, all of the data is read.

Interestingly, this issue even applies to simpler properties that are available in the parquet metadata, e.g.

lf.select(pl.all().len()).collect()

Would it be possible to push down relevant operations when statistics are available?

deanm0000 commented 1 week ago

similar to this one https://github.com/pola-rs/polars/issues/14936.

the tldr of that one is if min=max and null_count=0 then don't read any data and just propagate the one known value.

borchero commented 1 week ago

@stinodego happy to try contributing this if you can point me to some documentation on where to touch code when augmenting the projection pushdown logic 🫣

pola-rs / polars

Use parquet statistics when collecting column statistics from scanned parquet #16051

Description