mkleinbort-ic opened 1 week ago
I can replicate the error.
It seems to be non-deterministic, so it can take several runs to trigger.
```python
import polars as pl

rows = 1_000_000
for _ in range(10):
    (
        pl.DataFrame({
            "foo": [["a", "b"], None, [None]]
        })
        .with_columns(pl.col("foo").cast(pl.List(pl.Categorical)))
        .sample(n=rows, with_replacement=True)
        .write_parquet("19867.parquet")
    )
    pl.read_parquet("19867.parquet")

# thread 'polars-2' panicked at crates/polars-parquet/src/arrow/read/deserialize/dictionary.rs:101:41:
# called `Option::unwrap()` on a `None` value
# thread 'polars-4' panicked at crates/polars-parquet/src/arrow/read/deserialize/dictionary.rs:101:41:
# called `Option::unwrap()` on a `None` value
# PanicException: called `Option::unwrap()` on a `None` value
```
EDIT: This also triggers for me on 1.13.0 - tried a few runs on 1.12.0 and they all ran without error.
Slightly more minimal reproduction
```python
import polars as pl

rows = 1_000_000
for _ in range(100):
    (
        pl.DataFrame({
            "foo": [["a", "b"], [None]]
        })
        .with_columns(pl.col("foo").cast(pl.List(pl.Categorical)))
        .sample(n=rows, with_replacement=True)
        .write_parquet("19867.parquet")
    )
    pl.read_parquet("19867.parquet")
```
We're also running into this one.
### Checks

### Reproducible example
I have a large dataframe that I can't write to parquet and read back. I narrowed it down to an 8,800-row region of a single column.
The issue appears when there are nulls in a column with schema `List(Categorical(ordering='physical'))`, more specifically a cell of type `[null]`.
Thing is, I can't replicate the error.
Let's call my 1-column, 8,800-row dataframe `df`. This:
Results in:
But
Runs without issue. So does
I tried rebuilding a minimal example, but they all work.
If it helps, the original df (the full one) is ~30m rows.
### Log output

No response
### Issue description
Some tables can't be written to parquet and then read back. But more importantly, I'm unable to replicate the error in a smaller example.
### Expected behavior
This should work (also, this worked in earlier versions of polars).
### Installed versions