Closed by kszlim 4 months ago
This would be super helpful. As it is, I can't use polars to load the hive partitioned files I work with and have to fall back to duckdb. I lose the benefit of lazy loading for the files that would most benefit from it.
Not sure about the proposed API, but this feature would definitely be nice to have.
I work with hive partitioned data that has the partition key in the file, which I suspect is done to avoid the casting/type issue from https://github.com/pola-rs/polars/issues/13892 and https://github.com/pola-rs/polars/issues/14838 while taking ~0 space (since, by definition, the key is the same for all rows in each partition file). I'm indifferent about the API implementation details.
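To make the casting concern concrete, here is a minimal sketch of why a partition key read from the path needs a type decision while the copy stored in the file does not. The helper `parse_hive_keys` and the example path are hypothetical and only illustrate the `key=value` directory convention; this is not how polars parses paths, and it does not reproduce the linked issues.

```python
from pathlib import PurePosixPath


def parse_hive_keys(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a hive-style path.

    Hypothetical helper for illustration only; not a polars function.
    """
    keys: dict[str, str] = {}
    # Only directory segments carry partition keys, so skip the file name.
    for segment in PurePosixPath(path).parent.parts:
        name, sep, value = segment.partition("=")
        if sep and name:
            keys[name] = value
    return keys


keys = parse_hive_keys("data/year=2024/month=03/part-0.parquet")
print(keys)  # {'year': '2024', 'month': '03'}
```

Every value comes back as a string, so a reader has to guess whether `month` is the string "03" or the integer 3. A copy of the key stored inside the parquet file already carries its type, which may be why some writers keep it there.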
If that is done consistently, you can just set hive_partitioning=False and you're good to go.
@nameexhaustion this one might also be a good data point for the hive partition redesign.
I don't believe this should be fully closed. If you have a hive partition column that conflicts with a parquet column, especially if the data differs between the two, you have no workaround besides rewriting the data or the partitions.
Right. Can we make a new feature request for the reduced scope? Then we can make a decision about that.
Sounds good, I opened https://github.com/pola-rs/polars/issues/12041
I still get duplicated columns on polars 1.6.0 when the column exists in both the hive path and the parquet file.
How can I send you the parquet file to reproduce this?
edit: There are 5 cols in the path, and only 1 of the 5 is in the parquet file. Does that matter?
edit: Well, I just took a look at the implementation. It seems like it depends on the first col in the schema? https://github.com/pola-rs/polars/pull/17203/files#diff-5fbba3b3c960c05ee1ff71769819814cb53e44b17e2004e33b42a301aa91eb57R166-R170
Description
For context see: https://github.com/pola-rs/polars/issues/12036
I propose that we either deprecate hive_partitioning in favor of hive_partitioning_strategy in scan_parquet, or repurpose the old name (though that might be more confusing). The parameter would take "favor_partition" (means drop the physical column/don't read it), "favor_physical" (idk about this name, but means ignore the partition key), "favor_none" (which should throw an error if there are conflicts, should be the default, and maps to the old parameter set to True) and None for no hive partitioning (equivalent of hive_partitioning=False).
I think it's a pretty important feature, as the person querying the data often has no control over how it's written.
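To spell out how the proposed values could behave, here is a rough sketch in plain Python. Nothing here is polars API: `resolve_columns` is a hypothetical helper, the strategy names are taken from the proposal, and I am assuming that "favor_physical" ignores only the conflicting partition key rather than all of them.

```python
from typing import Optional

# Strategy names from the proposal; hypothetical, not an existing polars option.
STRATEGIES = {"favor_partition", "favor_physical", "favor_none"}


def resolve_columns(
    partition_cols: list[str],
    physical_cols: list[str],
    strategy: Optional[str] = "favor_none",
) -> dict[str, str]:
    """Map each output column to its source: "partition" or "physical"."""
    if strategy is None:
        # No hive partitioning: only what is stored in the file is read.
        return {name: "physical" for name in physical_cols}
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")

    conflicts = [name for name in partition_cols if name in physical_cols]
    if conflicts and strategy == "favor_none":
        raise ValueError(f"columns in both hive path and file: {conflicts}")

    sources = {name: "physical" for name in physical_cols}
    for name in partition_cols:
        if name in sources and strategy == "favor_physical":
            continue  # assumption: keep the file's copy, ignore the path key
        sources[name] = "partition"  # favor_partition, or no conflict at all
    return sources


print(resolve_columns(["year"], ["year", "value"], "favor_partition"))
# {'year': 'partition', 'value': 'physical'}
```

With the default "favor_none", the same call raises a ValueError, which matches the proposal's idea that conflicts should be an error unless the user picks a side.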
Created a feature request as per @ritchie46's request. Thanks!