pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars scan_parquet with wildcard fails where schema column index positions don't align #18568

Open ShahBinoy opened 1 week ago

ShahBinoy commented 1 week ago

Reproducible example

import polars as pl

fitbit_coalesced_paths_ = 'blah-blah-path/*.parquet'

fb_steps_data = pl.scan_parquet(fitbit_coalesced_paths_, include_file_paths='file_name')
fb_steps_data.collect()

The same files are read correctly by DuckDB:

import duckdb
sql_load = "select * from read_parquet('/path-location/polars-issue-schema-mismatch/*.parquet')"
duckdb.sql(sql_load).pl()
Out[9]: 
shape: (2, 5)
┌───────────┬────────────────────────────┬─────────┬───────────────────┬───────────────┐
│ vibrentId ┆ date                       ┆ payload ┆ payload_type      ┆ json_src_file │
│ ---       ┆ ---                        ┆ ---     ┆ ---               ┆ ---           │
│ str       ┆ datetime[μs]               ┆ str     ┆ str               ┆ str           │
╞═══════════╪════════════════════════════╪═════════╪═══════════════════╪═══════════════╡
│ null      ┆ 2024-07-02 13:51:00.698826 ┆ null    ┆ NO_DATA_AVAILABLE ┆ null          │
│ null      ┆ 2024-07-05 00:00:00        ┆ null    ┆ NO_DATA_AVAILABLE ┆ null          │
└───────────┴────────────────────────────┴─────────┴───────────────────┴───────────────┘

Reproduced with the attached files:

polars-issue-schema-mismatch.zip

Log output

in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2032 # Only for testing purposes
   2033 callback = _kwargs.get("post_opt_callback", callback)
-> 2034 return wrap_df(ldf.collect(callback))

SchemaError: schema names differ at index 0: payload != vibId

Issue description

Some files in the folder have the schema payload, date, vibId. Another set of files in the same folder has the schema vibId, payload, date.

Wildcard matching compares the file schemas by column position only, not by column name. Parquet is a columnar format, so columns should be matched and fetched by name as well, not just by positional index.

Expected behavior

The same records are processed correctly by DuckDB with this query:

select * from read_parquet('blah-blah-path/src=Fitbit/year=2024/mon=7/day=5/**/*.parquet', filename = true)

Installed versions

```
Polars Version 1.6.0
DuckDb Version 0.9.2
--------Version info---------
Polars: 1.6.0
Index type: UInt32
Platform: macOS-14.4.1-arm64-arm-64bit
Python: 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)]
----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx 0.3.3
deltalake
fastexcel
fsspec 2024.6.1
gevent
great_tables
matplotlib
nest_asyncio 1.6.0
numpy 1.26.4
openpyxl
pandas 1.5.3
pyarrow 17.0.0
pydantic 1.10.18
pyiceberg
sqlalchemy 1.4.53
torch
xlsx2csv
xlsxwriter 3.2.0
```

coastalwhite commented 1 week ago

I don't necessarily see this as a bug. Are there partitioned writers that mix up the columns like this?

ShahBinoy commented 1 week ago

> I don't necessarily see this as a bug. Are there partitioned writers that mix up the columns like this?

This does not involve partitions at all. The failure is on the columns of the Parquet files themselves; I am not using partitioning in the scan_parquet call.

It also fails when the path is local, e.g. ~/polars-issue-schema-mismatch/*.parquet.

Reading columnar records should not depend on the order of the column indices.

ritchie46 commented 1 week ago

> Reading columnar records should not depend on the order of the column indices.

In Polars, schemas must align. This isn't a bug, but we are investigating support for unaligned reads.

GregAru commented 1 week ago

If schemas must align, how is schema evolution handled? This is a huge limitation in my case as well.

lmocsi commented 1 week ago

> If schemas must align, how is schema evolution handled? This is a huge limitation in my case as well.

I agree: schema evolution should be handled. The sooner, the better.