pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`scan_parquet` expects columns to be in the same order iff `POLARS_FORCE_ASYNC=1` or scanning against cloud, local works regardless of ordering #17254

Open kszlim opened 3 days ago

kszlim commented 3 days ago

Reproducible example

import os
import polars as pl
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Two files with the same columns and dtypes, written in a different column order
    pl.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]}).write_parquet(f"{tmpdir}/0.parquet")
    pl.DataFrame({"b": [1, 2, 3], "a": [3, 2, 1]}).write_parquet(f"{tmpdir}/1.parquet")
    ldf = pl.scan_parquet(f"{tmpdir}/*.parquet")
    os.environ["POLARS_FORCE_ASYNC"] = "0"
    ldf.collect()  # This works
    os.environ["POLARS_FORCE_ASYNC"] = "1"
    ldf.collect()  # This fails

Log output

polars.exceptions.ComputeError: schema of all files in a single scan_parquet must be equal

Expected: Schema:
name: a, data type: Int64
name: b, data type: Int64

Got: Schema:
name: b, data type: Int64
name: a, data type: Int64

Issue description

`scan_parquet` is sensitive to column ordering in a cloud (async reader) context, but works fine when reading locally.

I found that this is a regression from 0.19.11 -> 0.19.12, and it has failed ever since.

Expected behavior

It should work regardless of column ordering, as long as the schemas otherwise match (same column names and data types).
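To make the distinction concrete, here is a minimal sketch of an order-sensitive versus an order-insensitive schema comparison, using the two schemas from the log output. The function names are hypothetical and plain dicts stand in for schemas; this is not how Polars implements the check internally.

```python
# Schemas from the error message, modelled as plain name -> dtype mappings.
# Python dicts preserve insertion order, so column order is retained here.
expected = {"a": "Int64", "b": "Int64"}
got = {"b": "Int64", "a": "Int64"}


def schemas_match_ordered(left, right):
    """Hypothetical strict check: names, dtypes, and column order must agree."""
    return list(left.items()) == list(right.items())


def schemas_match_unordered(left, right):
    """Hypothetical relaxed check: names and dtypes must agree, order is ignored."""
    # Dict equality in Python does not take insertion order into account.
    return dict(left) == dict(right)


print(schemas_match_ordered(expected, got))
print(schemas_match_unordered(expected, got))
```

The strict check rejects the pair, which is what the async path appears to do, while the relaxed check accepts it, which is the behavior I would expect in both paths.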

Installed versions

```
--------Version info---------
Polars: 1.0.0-rc.2
Index type: UInt32
Platform: macOS-14.3.1-arm64-arm-64bit
Python: 3.9.6 (default, Feb 3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager: 0.5.1
cloudpickle:
connectorx: 0.3.1
deltalake: 0.10.0
fastexcel:
fsspec: 2023.6.0
gevent:
great_tables:
hvplot: 0.9.2
matplotlib: 3.7.2
nest_asyncio: 1.5.6
numpy: 1.26.4
openpyxl:
pandas: 2.2.1
pyarrow: 15.0.0
pydantic: 2.0.2
pyiceberg:
sqlalchemy: 2.0.18
torch:
xlsx2csv: 0.8.1
xlsxwriter: 3.1.2
```

stinodego commented 3 days ago

Thanks for the report - and thanks for the repro with tempfile, my filesystem is grateful!