pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Scan_parquet no longer returns the hive partition column of parquet files in polars==1.0.0 #17360

Closed by lmocsi 4 months ago

lmocsi commented 4 months ago

Reproducible example

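import polars as pl

# parq_path is the base path of the dataset (NFS or AWS), defined elsewhere
tr = pl.scan_parquet(parq_path + "my_transaction/**/*.parquet")
print(tr.columns)  # the hive partition column is missing from this list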

Log output

The hive partitioning column is missing both on NFS and on AWS.

Issue description

The partitioning column is missing both on NFS and on AWS. This worked fine in polars==0.20.31.

Expected behavior

Return the partitioning column as well.

Installed versions

```
--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             Linux-4.18.0-372.76.1.el8_6.x86_64-x86_64-with-glibc2.28
Python:               3.9.13 (main, Oct 13 2022, 21:15:33) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          2.0.0
connectorx:
deltalake:
fastexcel:
fsspec:               2022.02.0
gevent:
great_tables:
hvplot:
matplotlib:           3.8.4
nest_asyncio:         1.5.5
numpy:                1.23.5
openpyxl:             3.0.9
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:
pyiceberg:
sqlalchemy:           1.4.27
torch:                1.10.2
xlsx2csv:
xlsxwriter:           3.2.0
```

stinodego commented 4 months ago

Thanks for opening an issue. This is an intentional change introduced with https://github.com/pola-rs/polars/pull/17106

Please check the description of that PR for directions.