pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Provide a way to de-conflict columns that come from hive partitioning vs what's in a physical file #12041

Closed kszlim closed 4 months ago

kszlim commented 1 year ago

Description

For context see: https://github.com/pola-rs/polars/issues/12036

I propose that we deprecate hive_partitioning in scan_parquet in favor of a new hive_partitioning_strategy parameter (alternatively we could repurpose the old name, but that might be more confusing). It would take one of:

- "favor_partition": drop the physical column (don't read it)
- "favor_physical" (not sure about this name): ignore the partition key
- "favor_none": throw an error if there are conflicts; this should be the default and maps to the old parameter set to True
- None: no hive partitioning (the equivalent of hive_partitioning=False)
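To make the proposed semantics concrete, here is a minimal pure-Python sketch of how the four values could resolve a conflict. It is only an illustration of the proposal: `hive_partitioning_strategy` is not an existing polars parameter, and `resolve_columns` and the column names are hypothetical.

```python
from typing import List, Optional, Tuple


def resolve_columns(
    physical_cols: List[str],
    partition_cols: List[str],
    strategy: Optional[str] = "favor_none",
) -> Tuple[List[str], List[str]]:
    """Return (columns to read from the file, partition keys to attach from the path).

    Hypothetical helper that mirrors the proposed strategy values.
    """
    if strategy is None:
        # No hive partitioning at all: only the file's own columns are used.
        return list(physical_cols), []

    conflicts = [c for c in partition_cols if c in physical_cols]

    if strategy == "favor_none":
        # Proposed default: refuse to guess when both sources define a column.
        if conflicts:
            raise ValueError(f"columns in both path and file: {conflicts}")
        return list(physical_cols), list(partition_cols)
    if strategy == "favor_partition":
        # Skip the physical column and take the value from the path.
        kept = [c for c in physical_cols if c not in conflicts]
        return kept, list(partition_cols)
    if strategy == "favor_physical":
        # Ignore the partition key and keep what is stored in the file.
        kept = [c for c in partition_cols if c not in conflicts]
        return list(physical_cols), kept
    raise ValueError(f"unknown strategy: {strategy!r}")


# A file storing "year" inside a directory tree partitioned by year and month.
print(resolve_columns(["year", "value"], ["year", "month"], "favor_partition"))
print(resolve_columns(["year", "value"], ["year", "month"], "favor_physical"))
```

With "favor_none" the same call raises, which is the behavior the proposal suggests as the default.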

I think it's a pretty important feature, as the person querying the data often has no control over how it's written.

Created a feature request as per @ritchie46's request. Thanks!

jrothbaum commented 11 months ago

This would be super helpful. As it is, I can't use polars to load the hive partitioned files I work with and have to fall back to duckdb. I lose the benefit of lazy loading for the files that would most benefit from it.

stinodego commented 7 months ago

Not sure about the proposed API, but this feature would definitely be nice to have.

jrothbaum commented 7 months ago

I work with hive partitioned data that has the partition key in the file, which I suspect is done to avoid the casting/type issue from https://github.com/pola-rs/polars/issues/13892 and https://github.com/pola-rs/polars/issues/14838 while taking ~0 space (since, by definition, the key is the same for all rows in each partition file). I'm indifferent about the API implementation details.

stinodego commented 7 months ago

> I work with hive partitioned data that has the partition key in the file, which I suspect is done to avoid the casting/type issue from #13892 and #14838 while taking ~0 space (since, by definition, the key is the same for all rows in each partition file). I'm indifferent about the API implementation details.

If that is done consistently, you can just set hive_partitioning=False and you're good to go.

ritchie46 commented 4 months ago

@nameexhaustion this one might also be a good data point for the hive partition redesign.

kszlim commented 4 months ago

I don't believe this should be fully closed. If you have a hive partition column that conflicts with a parquet column, especially if the data is different, there is no workaround besides rewriting the data or the partitions.

ritchie46 commented 4 months ago

Right. Can we make a new feature request for the reduced scope? Then we can make a decision about that.

kszlim commented 4 months ago

Sounds good, I opened https://github.com/pola-rs/polars/issues/12041

Veiasai commented 2 months ago

I still ran into duplicated columns on polars 1.6.0 when a column exists in both the hive path and the parquet file.

How can I send you the parquet file to reproduce this?

edit: There are 5 columns in the path, and only 1 of those 5 is in the parquet file. Does that matter?

edit: Well, I just took a look at the implementation. It seems like it depends on whether the first column is in the schema? https://github.com/pola-rs/polars/pull/17203/files#diff-5fbba3b3c960c05ee1ff71769819814cb53e44b17e2004e33b42a301aa91eb57R166-R170
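For anyone trying to build a reproduction, the partial-overlap case above (several keys in the path, only one of them also stored in the file) can be detected without reading any data. This is a rough pure-Python sketch; the path, the column names, and both helper functions are hypothetical and say nothing about how polars itself resolves the overlap.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def hive_keys(path: str) -> Dict[str, str]:
    """Parse key=value directory segments from a hive-style path."""
    keys = {}
    for part in PurePosixPath(path).parent.parts:
        if "=" in part:
            name, _, value = part.partition("=")
            keys[name] = value
    return keys


def overlapping_keys(path: str, file_columns: List[str]) -> List[str]:
    """Return every partition key that also appears as a column in the file."""
    file_cols = set(file_columns)
    # Check all keys, not just the first one in the path.
    return [k for k in hive_keys(path) if k in file_cols]


# Five keys in the path, one of which ("c") is also stored in the file.
example_path = "data/a=1/b=2/c=3/d=4/e=5/part-0.parquet"
example_file_columns = ["c", "value"]
print(overlapping_keys(example_path, example_file_columns))  # ['c']
```

The file's column names could come from the parquet metadata; here they are passed in as a plain list to keep the sketch dependency-free.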