pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Non-deterministic error writing and then reading parquet files with a column of type List(Categorical) where the Categorical is nullable #19867

Open mkleinbort-ic opened 1 week ago

mkleinbort-ic commented 1 week ago

Checks

Reproducible example

I have a large dataframe that I can't write to parquet and load again. I narrowed it down to an 8800-row region of a single column.

The issue appears when there are nulls in a column with schema List(Categorical(ordering='physical')), more specifically a cell of type [null].
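
For illustration, a column with that schema and a [null] cell can be built like this (a made-up stand-in, not my actual data):

import polars as pl

# Hypothetical stand-in for the real column: a normal list, a null row,
# and a [null] cell (a one-element list holding a null).
example = pl.DataFrame({"foo": [["a", "b"], None, [None]]}).with_columns(
    pl.col("foo").cast(pl.List(pl.Categorical))
)
print(example.schema)  # foo: List(Categorical(ordering='physical'))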

The thing is, I can't replicate the error in a minimal example.

Let's call my 1-column, 8800-row dataframe df.

This:

df.with_row_index().write_parquet('test.test')
pl.read_parquet('test.test')

Results in:

PanicException                            Traceback (most recent call last)
Cell In[117], line 2
      1 df.with_row_index().write_parquet('test.test')
----> 2 pl.read_parquet('test.test')

File c:\Users\MycchakaKleinbort\AppData\Local\Programs\Python\Python311\Lib\site-packages\polars\_utils\deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File c:\Users\MycchakaKleinbort\AppData\Local\Programs\Python\Python311\Lib\site-packages\polars\_utils\deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File c:\Users\MycchakaKleinbort\AppData\Local\Programs\Python\Python311\Lib\site-packages\polars\io\parquet\functions.py:241, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, schema, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, credential_provider, retries, use_pyarrow, pyarrow_options, memory_map, include_file_paths, allow_missing_columns)
    238     else:
    239         lf = lf.select(columns)
--> 241 return lf.collect()

File c:\Users\MycchakaKleinbort\AppData\Local\Programs\Python\Python311\Lib\site-packages\polars\lazyframe\frame.py:2029, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2027 # Only for testing purposes
   2028 callback = _kwargs.get("post_opt_callback", callback)
-> 2029 return wrap_df(ldf.collect(callback))

PanicException: called `Option::unwrap()` on a `None` value

But this:

df.with_row_index().filter(pl.col('index')%2 == 0).write_parquet('test.test')
pl.read_parquet('test.test')

df.with_row_index().filter(pl.col('index')%2 == 1).write_parquet('test.test')
pl.read_parquet('test.test')

runs without issue. So does this:

df.with_row_index().slice(0,8000).write_parquet('test.test')
pl.read_parquet('test.test')

df.with_row_index().slice(8000, len(df)-8000).write_parquet('test.test')
pl.read_parquet('test.test')

I tried building a minimal example, but all my attempts run without error.

If it helps, the original df (the full one) is ~30m rows.
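
In case it helps others hit by this: a region like the 8800-row one above can be narrowed down by bisecting slices and re-testing the round trip, roughly as in the sketch below. This is only a sketch (the helper names are made up), and since the failure is flaky each slice may need several attempts.

import polars as pl

def roundtrip_fails(frame: pl.DataFrame, path: str = "bisect.parquet", attempts: int = 5) -> bool:
    # The failure is flaky, so retry the round trip a few times per slice.
    for _ in range(attempts):
        frame.write_parquet(path)
        try:
            pl.read_parquet(path)
        except BaseException:  # the panic surfaces as a BaseException, not a plain Exception
            return True
    return False

def narrow_failing_region(frame: pl.DataFrame, min_rows: int = 10_000) -> pl.DataFrame:
    # Repeatedly keep whichever half of the frame still fails the round trip.
    while len(frame) > min_rows:
        half = len(frame) // 2
        top, bottom = frame.slice(0, half), frame.slice(half)
        if roundtrip_fails(top):
            frame = top
        elif roundtrip_fails(bottom):
            frame = bottom
        else:
            break  # neither half fails on its own; stop narrowing
    return frame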

Log output

No response

Issue description

Some tables can't be written to parquet and then read back. More importantly, I'm unable to reproduce the error with a minimal example.

Expected behavior

This round trip should work (it also worked in earlier versions of Polars).
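
Concretely, a round trip like the one below should succeed and give back an identical frame (a minimal sketch of the expectation, using a made-up frame rather than my actual data):

import polars as pl
from polars.testing import assert_frame_equal

# Made-up frame with the problematic shape: List(Categorical) with a [null] cell.
expected = pl.DataFrame({"foo": [["a", "b"], None, [None]]}).with_columns(
    pl.col("foo").cast(pl.List(pl.Categorical))
)
expected.write_parquet("roundtrip.parquet")

# Compare categories by their string values to sidestep string-cache
# mismatches between the original and the read-back frame.
assert_frame_equal(pl.read_parquet("roundtrip.parquet"), expected, categorical_as_str=True)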

Installed versions

```
--------Version info---------
Polars:               1.14.0
Index type:           UInt32
Platform:             Windows-10-10.0.26100-SP0
Python:               3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
LTS CPU:              False

----Optional dependencies----
adbc_driver_manager   1.1.0
altair                4.2.2
boto3
cloudpickle           3.1.0
connectorx            0.3.3
deltalake
fastexcel             0.10.4
fsspec                2023.12.2
gevent
google.auth           2.35.0
great_tables
matplotlib            3.9.2
nest_asyncio          1.6.0
numpy                 1.26.4
openpyxl
pandas                2.2.2
pyarrow               15.0.0
pydantic              2.9.2
pyiceberg             0.6.1
sqlalchemy            2.0.35
torch                 2.3.0+cpu
xlsx2csv              0.8.2
xlsxwriter            3.2.0
```
cmdlineluser commented 1 week ago

I can replicate the error.

It seems to be non-deterministic, so it can take several runs to trigger.

import polars as pl

rows = 1_000_000

for _ in range(10):
    (
        pl.DataFrame({
            "foo": [["a", "b"], None, [None]]
        })
        .with_columns(pl.col("foo").cast(pl.List(pl.Categorical)))
        .sample(n=rows, with_replacement=True)
        .write_parquet("19867.parquet")
    )

    pl.read_parquet("19867.parquet")

# thread 'polars-2' panicked at crates/polars-parquet/src/arrow/read/deserialize/dictionary.rs:101:41:
# called `Option::unwrap()` on a `None` value
# thread 'polars-4' panicked at crates/polars-parquet/src/arrow/read/deserialize/dictionary.rs:101:41:
# called `Option::unwrap()` on a `None` value
# PanicException: called `Option::unwrap()` on a `None` value

EDIT: This also triggers for me on 1.13.0; a few runs on 1.12.0 all completed without error.

mkleinbort-ic commented 1 week ago

Slightly more minimal reproduction

import polars as pl

rows = 1_000_000

for _ in range(100):
    (
        pl.DataFrame({
            "foo": [["a", "b"], [None]]
        })
        .with_columns(pl.col("foo").cast(pl.List(pl.Categorical)))
        .sample(n=rows, with_replacement=True)
        .write_parquet("19867.parquet")
    )

    pl.read_parquet("19867.parquet")
aberres commented 3 days ago

We're also running into this one.