pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Projection pushdown with hive partitions may not be respected #17104

Closed nameexhaustion closed 2 weeks ago

nameexhaustion commented 3 weeks ago

Reproducible example

import polars as pl
from pathlib import Path

root = Path(".env/data2")

dfs = [
    pl.DataFrame({"x": 1}),
    pl.DataFrame({"x": 2}),
]

paths = [
    root / "a=1/b=1/data.bin",
    root / "a=2/b=2/data.bin",
]

# Create each hive-style directory, then write the frame into it.
for df, path in zip(dfs, paths):
    path.parent.mkdir(exist_ok=True, parents=True)
    df.write_parquet(path)

lf = pl.scan_parquet(root / "**/*.bin")
lf = lf.select("x", "a")

print(lf.collect())

Observe that the select was not respected; the hive partition column b is still present in the output:

shape: (2, 3)
┌─────┬─────┬─────┐
│ x   ┆ a   ┆ b   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 1   │
│ 2   ┆ 2   ┆ 2   │
└─────┴─────┴─────┘

Log output

No response

Issue description

https://github.com/pola-rs/polars/pull/15573 removes the projection node after projection pushdown into the parquet reader, which reveals that the parquet reader was not applying the projection properly on the hive partition columns.
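
To illustrate the behavior described above, here is a minimal sketch in plain Python (not Polars internals) of a reader that applies the projection to both the file columns and the columns parsed from the hive path. The helper names `hive_columns` and `read_with_projection` are hypothetical and exist only for this illustration; the integer parsing of partition values is an assumption that matches the example data.

```python
from pathlib import PurePosixPath


def hive_columns(path: str) -> dict[str, int]:
    """Parse key=value directory segments of a path into partition columns."""
    parts = {}
    for segment in PurePosixPath(path).parent.parts:
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = int(value)  # assumption: partition values are integers
    return parts


def read_with_projection(path: str, file_row: dict, projection: list[str]) -> dict:
    """Merge file columns with hive columns, then keep only the projected names."""
    row = {**file_row, **hive_columns(path)}
    return {name: row[name] for name in projection}


# Stand-ins for the two parquet files in the reproducible example.
files = {
    ".env/data2/a=1/b=1/data.bin": {"x": 1},
    ".env/data2/a=2/b=2/data.bin": {"x": 2},
}
rows = [read_with_projection(p, r, ["x", "a"]) for p, r in files.items()]
print(rows)
```

The point of the sketch is that once no projection node runs after the reader, dropping `b` has to happen inside the reader itself; skipping that final filtering step reproduces the three-column output shown in this issue.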

Expected behavior

shape: (2, 2)
┌─────┬─────┐
│ x   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 2   │
└─────┴─────┘

Installed versions

main @ 8a6bf4bc58e7fed9b6728bad66e0590fccb11f0e

deanm0000 commented 3 weeks ago

Duplicate of this. I put some notes in there that might be helpful.

nameexhaustion commented 2 weeks ago

> duplicate of this I put some notes in there that might be helpful.

Thanks for the triage!