pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Hive partition ranges #15446

Open deanm0000 opened 7 months ago

deanm0000 commented 7 months ago

Description

For data where a high-cardinality column is the best candidate to partition by, it'd be nice if we could use ranges instead of strict equality in Hive partitions. I don't think pyarrow or DuckDB do this, so it'd be a trailblazing feature.

Suppose you have data that looks like

{'fruit': String [tens of thousands of unique values],
 'timestamp': Datetime,
 'somedata': Float64}

Most of the time when I query the data I only want a small subset of the fruit values, but I can't have a Hive partition with tens of thousands of folders, or file discovery takes forever (using Azure Data Lake Gen2). So it'd be nice if the Hive layout could look like

/fruit=(apple-cherry)/file-0.parquet
/fruit=(date-fig)/file-0.parquet
/fruit=(grape-kiwi)/file-0.parquet
/fruit=(lime-raspberry)/file-0.parquet
/fruit=(tangerine-ziziphus)/file-0.parquet

and then have the Hive reader set the min and max statistics for each file according to what's in the parentheses.

Writing these partitions automatically would obviously be desirable too, but the ability to query data laid out like this is 95% of the value of the feature, and writing the partitions manually is not too bad of a burden.
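
To make the reading side concrete, here is a minimal sketch of what pruning against such range directories could look like today, done by hand outside the engine. The directory layout, the `paths_for_fruit` helper, and the regex are hypothetical illustrations, not a Polars feature; only `pl.scan_parquet` on an explicit list of paths is existing API.

```python
# Manual "range partition" pruning, assuming a local layout like
# /data/fruit=(apple-cherry)/file-0.parquet (hypothetical, not a Polars feature).
import glob
import re

import polars as pl

RANGE_RE = re.compile(r"fruit=\((?P<lo>[^-]+)-(?P<hi>[^)]+)\)")


def paths_for_fruit(root: str, fruit: str) -> list[str]:
    """Keep only partition directories whose (lo-hi) range could contain `fruit`."""
    keep = []
    for part_dir in glob.glob(f"{root}/fruit=(*)"):
        m = RANGE_RE.search(part_dir)
        if m and m.group("lo") <= fruit <= m.group("hi"):
            keep.extend(glob.glob(f"{part_dir}/*.parquet"))
    return keep


# Only the pruned subset of files is handed to the scan, so file discovery
# and statistics reads are limited to a handful of directories.
lf = pl.scan_parquet(paths_for_fruit("/data", "banana")).filter(
    pl.col("fruit") == "banana"
)
```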

stinodego commented 7 months ago

I don't think this is part of the official Hive spec, although I cannot seem to find an official spec anywhere.

We may support different partitioning strategies in the future, and I can see the strategy you're describing being useful. But I don't think this would fall under Hive partitioning.

ion-elgreco commented 7 months ago

@deanm0000 Delta Lake solves this by keeping track of the stats in each file's add action. If Polars can add support for supplying statistics manually per file, then you get the same thing.

kszlim commented 7 months ago

> @deanm0000 Delta Lake solves this by keeping track of the stats in each file's add action. If Polars can add support for supplying statistics manually per file, then you get the same thing.

Parquet statistics already achieve this to an extent, but the main problem is that with enough files you end up having to do a lot of work just grabbing and parsing the statistics afaict.

ion-elgreco commented 7 months ago

@kszlim those file stats are grabbed from the parquet file during write
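
For illustration, a stripped-down version of that idea without Delta could be a small manifest maintained at write time, recording per-file min/max for the partition column, which the reader filters before handing a path list to the scan. The manifest layout and both helpers are hypothetical sketches; `pl.read_parquet`, `pl.scan_parquet`, and the expressions used are existing Polars API.

```python
# Hypothetical sketch: collect min/max for the partition column once at write
# time into a manifest, then prune files against it before scanning.
import polars as pl


def append_to_manifest(manifest_path: str, file_path: str, df: pl.DataFrame) -> None:
    """Record the per-file min/max of `fruit` in a small manifest parquet."""
    entry = df.select(
        pl.lit(file_path).alias("path"),
        pl.col("fruit").min().alias("fruit_min"),
        pl.col("fruit").max().alias("fruit_max"),
    )
    try:
        manifest = pl.read_parquet(manifest_path)
        entry = pl.concat([manifest, entry])
    except FileNotFoundError:
        # First write: no manifest exists yet.
        pass
    entry.write_parquet(manifest_path)


def scan_fruit(manifest_path: str, fruit: str) -> pl.LazyFrame:
    """Prune files via the manifest, then scan only the survivors."""
    manifest = pl.read_parquet(manifest_path)
    paths = manifest.filter(
        (pl.col("fruit_min") <= fruit) & (pl.col("fruit_max") >= fruit)
    )["path"].to_list()
    return pl.scan_parquet(paths).filter(pl.col("fruit") == fruit)
```

This only opens the footers of files whose recorded range can actually contain the requested value, which sidesteps the cost kszlim mentions of reading statistics from every file.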

deanm0000 commented 7 months ago

@ion-elgreco That is pretty compelling. I have resisted going Delta because I already have my scrapers set up to write plain Parquet files, and because I'm uncertain what hurdles there will be due to the lack of full Polars integration.

@stinodego The feature will smell equally sweet by any other name so I'm more than happy to not call it a "hive partition".