pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
27.98k stars 1.72k forks source link

Utilize parquet's PageIndex for more efficient IO/faster queries #12752

Open kszlim opened 7 months ago

kszlim commented 7 months ago

Description

The parquet format has support for a metadata item called the PageIndex that provides information that allows more efficient IO. I believe current minimum unit of IO within polars is a column. Ie. you're forced to download all pages within a column (within a rowgroup) when doing a query that utilizes that column (though i believe decoding, but not downloading of the pages can be pushed down/elided based off page statistics if available).

Utilizing the page index will be highly advantages in the cloud scenario where it will allow the minimum unit of IO to be the pages you need to answer your query (ie. you can download and decode precisely the pages you need as opposed to the entire column (within the rowgroup)). From the cloudera link, you can see this is especially advantages for somewhat sorted data.

Utilization of the PageIndex makes statistics inferior (though likely unless users explicitly write page index data, it won't be available within their parquet files, hence making support for both statistics and page index useful).

This is likely a pretty big/complicated feature, so it might take a while to implement, just making an issue for tracking purposes.

I believe other query engines that support utilizing the PageIndex are: Apache Impala Apache Datafusion Apache Spark

I know pyarrow now supports writing parquet files with the PageIndex as well (as of pyarrow 13).

Relevant links: https://github.com/apache/parquet-format/blob/master/PageIndex.md https://github.com/apache/arrow-rs/issues/1705 https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/

kszlim commented 7 months ago

Seems like this exists in polar's parquet implementation already. Means that you could write the query optimizer to utilize it in an alternative pathway to statistics.

deanm0000 commented 3 months ago

That's pretty interesting. When I write files I use the pyarrow context writer to customize my row groups like this

with pq.ParquetWriter("somefile.parquet") as writer:
    for myrowgroup in myrowgroups:
        writer.write_table(monthly.filter(pl.col('node').is_in(myrowgroup)).to_arrow())

It seems the pageindex would obviate that so I could just sort it before saving.