pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars 1.3.0 fails to collect scanned Parquet containing struct columns #17933

Closed danielgafni closed 1 month ago

danielgafni commented 1 month ago

Reproducible example

import polars as pl
from hypothesis import given, settings
from polars.testing.parametric import dataframes

# this test fails 

@given(df=dataframes(min_size=5))
@settings(max_examples=100, deadline=None)
def test_polars_parquet_write_read_with_structs(
    df: pl.DataFrame, tmp_path_factory
):
    path = tmp_path_factory.mktemp("test1") / "df.parquet"

    df.write_parquet(path)
    pl.read_parquet(path)  # this works

    ldf = pl.scan_parquet(path)
    ldf.collect()  # this fails!

# this test passes because it's not running over `pl.Struct` types

@given(df=dataframes(excluded_dtypes=[pl.Struct], min_size=5))
@settings(max_examples=100, deadline=None)
def test_polars_parquet_write_read_without_structs(
    df: pl.DataFrame, tmp_path_factory
):
    path = tmp_path_factory.mktemp("test2") / "df.parquet"

    df.write_parquet(path)
    pl.read_parquet(path)  # this works

    ldf = pl.scan_parquet(path)
    ldf.collect()  # this passes!

Log output

>       return wrap_df(ldf.collect(callback))
E       pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("validity mask length must match the number of values"))

Issue description

Might be related:

Expected behavior

Writing a DataFrame to a Parquet file and reading it back should always work, whether through `pl.read_parquet` or `pl.scan_parquet(...).collect()`.

Installed versions

```
--------Version info---------
Polars:               1.3.0
Index type:           UInt32
Platform:             Linux-6.7.6-arch1-1-x86_64-with-glibc2.39
Python:               3.10.12 (main, Jul 26 2023, 13:14:21) [Clang 16.0.3 ]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:
nest_asyncio:
numpy:                2.0.1
openpyxl:
pandas:               2.2.2
pyarrow:
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

danielgafni commented 1 month ago

Oh, I found the issue: `pl.scan_parquet(...).collect()` doesn't work with `pl.Struct` columns anymore. Excluding this type from the `dataframes` strategy makes the test pass.