Closed nameexhaustion closed 2 weeks ago
Reproducible example
import polars as pl
from pathlib import Path

root = Path(".env/data2")

dfs = [
    pl.DataFrame({"x": 1}),
    pl.DataFrame({"x": 2}),
]

paths = [
    root / "a=1/b=1/data.bin",
    root / "a=2/b=2/data.bin",
]

[
    [paths[i].parent.mkdir(exist_ok=True, parents=True), dfs[i].write_parquet(paths[i])]
    for i in range(len(dfs))
]

lf = pl.scan_parquet(root / "**/*.bin")
lf = lf.select("x", "a")

print(lf.collect())
Observe the select was not respected:
shape: (2, 3)
┌─────┬─────┬─────┐
│ x   ┆ a   ┆ b   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 1   │
│ 2   ┆ 2   ┆ 2   │
└─────┴─────┴─────┘
Log output
No response
Issue description
https://github.com/pola-rs/polars/pull/15573 removes the projection node after projection pushdown into the parquet reader, which reveals that the parquet reader was not applying the projection properly on the hive partition columns.
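To make the failure mode concrete, here is a minimal pure-Python sketch of what a reader has to do once no projection node sits above it: parse the hive partition columns from the path, then apply the projection to them just like to the columns stored in the file. The helpers `hive_columns` and `read_with_projection` are hypothetical names used for illustration only; this is not Polars' implementation, and it assumes integer partition values.

```python
from pathlib import PurePosixPath


def hive_columns(path: str) -> dict[str, int]:
    """Parse key=value directory segments into partition columns.

    Hypothetical helper; assumes every partition value is an integer.
    """
    parts = PurePosixPath(path).parent.parts
    return {k: int(v) for k, v in (p.split("=", 1) for p in parts if "=" in p)}


def read_with_projection(path: str, file_row: dict, projection: list[str]) -> dict:
    """Combine file columns with hive columns, then keep only projected names.

    Hypothetical helper; the projection covers hive columns as well,
    which is the step the issue reports as missing.
    """
    row = {**file_row, **hive_columns(path)}
    return {name: row[name] for name in projection}


rows = [
    read_with_projection("a=1/b=1/data.bin", {"x": 1}, ["x", "a"]),
    read_with_projection("a=2/b=2/data.bin", {"x": 2}, ["x", "a"]),
]
print(rows)  # [{'x': 1, 'a': 1}, {'x': 2, 'a': 2}]
```

If the projection were applied only to `file_row` and the hive columns were appended afterwards, `b` would reappear in the result, which matches the observed three-column output.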
Expected behavior
shape: (2, 2)
┌─────┬─────┐
│ x   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 2   │
└─────┴─────┘
Installed versions
main @ 8a6bf4bc58e7fed9b6728bad66e0590fccb11f0e
Duplicate of this. I put some notes in there that might be helpful.
Thanks for the triage!