pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Panic when glob scanning with two files with different schemas #17067

Closed coastalwhite closed 1 week ago

coastalwhite commented 3 weeks ago

Reproducible example

import polars as pl

pl.DataFrame({ 'a': [1, 2, 3], 'b': [4, 5, 6] }).write_parquet('1.parquet')
pl.DataFrame({ 'c': [1, 2, 3], 'd': [4, 5, 6] }).write_parquet('2.parquet')
pl.scan_parquet('*.parquet').collect()

Log output

thread 'polars-0' panicked at /home/johndoe/Projects/polars/crates/polars-parquet/src/arrow/read/deserialize/mod.rs:144:31:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/johndoe/Projects/polars/x.py", line 7, in <module>
    pl.scan_parquet('*.parquet', row_index_name='idx').filter(pl.col.idx > 0).collect()
  File "/home/johndoe/Projects/polars/py-polars/polars/lazyframe/frame.py", line 1896, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

Issue description

If two files matched by a glob have mismatched schemas, the scan panics instead of raising a proper error.

Expected behavior

No panic.
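
Until the scan itself reports a proper error, one way to get a readable failure is to compare the schemas of the globbed files before scanning. The sketch below is a hypothetical pre-check, not part of Polars: `diff_schemas` and `checked_scan_parquet` are illustrative names, and it assumes Python's `glob` matches the same files as the pattern passed to `pl.scan_parquet`.

```python
from glob import glob


def diff_schemas(schemas):
    """Return {path: schema} for files whose schema differs from the first file's.

    `schemas` maps each path to a {column name: dtype} dict. The comparison
    covers column names and dtypes, not column order.
    Hypothetical helper for illustration; not a Polars API.
    """
    paths = sorted(schemas)
    if not paths:
        return {}
    reference = schemas[paths[0]]
    return {p: schemas[p] for p in paths[1:] if schemas[p] != reference}


def checked_scan_parquet(pattern):
    """Scan a glob pattern, raising ValueError on a schema mismatch."""
    import polars as pl  # imported here so diff_schemas works without polars

    # Assumption: glob() expands the pattern the same way the scan does.
    schemas = {p: dict(pl.read_parquet_schema(p)) for p in glob(pattern)}
    mismatched = diff_schemas(schemas)
    if mismatched:
        raise ValueError(f"schema mismatch in files: {sorted(mismatched)}")
    return pl.scan_parquet(pattern)
```

With the two files from the reproducible example, `checked_scan_parquet('*.parquet')` should raise the `ValueError` before the scan is ever collected.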

Installed versions

```
--------Version info---------
Polars:               1.0.0-beta.1
Index type:           UInt32
Platform:             Linux-6.6.32-x86_64-with-glibc2.39
Python:               3.11.9 (main, Apr 2 2024, 08:25:04) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:           0.3.3
deltalake:            0.17.4
fastexcel:
fsspec:               2024.3.0
gevent:               24.2.1
great_tables:
hvplot:               0.9.2
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.1
pyarrow:              16.0.0
pydantic:             2.6.3
pyiceberg:
sqlalchemy:           2.0.30
torch:
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0
```

coastalwhite commented 3 weeks ago

It seems in general that projection pushdown with globs is broken. You don't even have to filter.

import polars as pl

pl.DataFrame({ 'a': [1, 2, 3], 'b': [4, 5, 6] }).write_parquet('1.parquet')
pl.DataFrame({ 'a': [1, 2, 3], 'b': [4, 5, 6] }).write_parquet('2.parquet')
pl.scan_parquet('*.parquet', row_index_name='idx').select(pl.col.a, pl.col.idx).collect()

This also panics.

coastalwhite commented 3 weeks ago

Never mind, the problem was that another file with a different schema was also matched by the glob. Changing the title.
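
A stray file like this is easy to spot by listing what the pattern matches before scanning. Here is a minimal sketch using only the standard library; `unexpected_matches` is a hypothetical helper, and it assumes Python's `glob` expands the pattern the same way the scan does.

```python
import os
from glob import glob


def unexpected_matches(pattern, expected):
    """Return files matched by `pattern` whose base names are not in `expected`."""
    expected = set(expected)
    return sorted(
        path for path in glob(pattern)
        if os.path.basename(path) not in expected
    )
```

For the example above, `unexpected_matches('*.parquet', ['1.parquet', '2.parquet'])` would list any other Parquet file in the working directory that the glob also picks up.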

nameexhaustion commented 3 weeks ago

This could be the same as https://github.com/pola-rs/polars/issues/13436