theelderbeever opened this issue 3 months ago
I'm seeing the same issue. My list isn't overly large; it's a `list[struct]` field (but the structure is slightly complicated). I've not managed to create a repro at this stage - I've found one row (a list of 5 structs) which causes the problem, but when I write it to JSON and load it back it no longer throws. This seems somewhat similar to the previous issue I raised that came down to chunking, but this time `rechunk` doesn't help.

You should be able to repro with

```python
pl.read_json("crash.json").explode("change_history")
```

and the attached file.
edit: I've replaced the file with a smaller version that still reproduces.
I suspect this is caused in my case by the nested `author` element in my dataset being either an entity (`"author": {...}`) or `"author": null`.
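Something like the following shape, if that suspicion is right (a constructed example, not the actual dataset - it may or may not trigger the same panic):

```python
import polars as pl

# "author" is either a struct or null inside the list elements,
# mirroring the suspected shape of the real data (hypothetical).
ndjson = b"""
{"change_history": [{"author": {"name": "a"}}]}
{"change_history": [{"author": null}]}
{"change_history": []}
"""
pl.read_ndjson(ndjson).explode("change_history")
```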
@david-waterworth I can reproduce your example.
It seems like it reduces down to:

```python
pl.read_ndjson(b"""
{"A":[{"B":1}]}
{"A":[null]}
{"A":[]}
""").explode("A")
# PanicException: validity must be equal to the array's length
```
@cmdlineluser that's kind of what I thought, but my attempt to reduce it failed - I'm not sure if it's because you used `read_ndjson`? I was trying to construct something very similar using `read_dicts` and didn't quite get there.

In the end, as a workaround, I pre-filtered the JSON before loading it into polars.
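A minimal sketch of that kind of pre-filtering, assuming the problematic field is the `change_history` list and that nulling out empty lists is acceptable (both assumptions on my part):

```python
import json

import polars as pl

# Hypothetical cleanup: load the raw JSON and null out any empty
# "change_history" lists so polars never sees an empty list-of-structs.
with open("crash.json") as f:
    rows = json.load(f)

for row in rows:
    if row.get("change_history") == []:
        row["change_history"] = None

df = pl.from_dicts(rows).explode("change_history")
```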
Adding `[]` for `read_json`:

```python
pl.read_json(b"""[{"A":[null]},{"A":[{"B":1}]},{"A":[]}]""").explode("A")
# PanicException: validity must be equal to the array's length
```
This also happens with the `write_parquet` function.

```
thread 'polars-1' panicked at crates/polars-arrow/src/array/primitive/mod.rs:261:5:
validity must be equal to the array's length
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/workspaces/cloud-prices/azure.py", line 51, in <module>
    df.write_parquet(cache_file)
  File "/workspaces/cloud-prices/.venv/lib/python3.10/site-packages/polars/dataframe/frame.py", line 3630, in write_parquet
    self._df.write_parquet(
pyo3_runtime.PanicException: validity must be equal to the array's length
```
@DanielHabenicht Perhaps you can provide a repro and file a new issue.
If you are seeing it in the parquet writer - that would suggest it is a separate problem.
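For what it's worth, a minimal sketch of what such a parquet repro might look like, assuming the same list-of-structs shape is what trips the writer (unconfirmed):

```python
import polars as pl

# Same shape as the explode repro; whether this also panics in
# the parquet writer is an assumption that needs confirming.
df = pl.DataFrame({"A": [[{"B": 1}], [None], []]})
df.write_parquet("repro.parquet")
```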
Hello, I've encountered the same issue. I have a reproducible example as follows:

```python
import polars as pl
import datetime

df = pl.DataFrame([
    {
        'ntpTime': datetime.datetime(2024, 1, 24, 10, 10, 13, 218167),
        'relevantFlight': None,
    },
    {
        'ntpTime': datetime.datetime(2024, 1, 24, 12, 34, 20, 501001),
        'relevantFlight': {'flight': 'A', 'relevancy': True},
    },
])
df.filter(pl.col('relevantFlight').shift(1).ne_missing(pl.col('relevantFlight')))
```

The above raises

```
PanicException: validity must be equal to the array's length
```

on polars 1.2.1 but works okay on polars 1.1.0, which is the other version I have available.
Hi, are there any updates or a fix? Thanks!
I'm running into this issue more and more frequently. It's becoming a major problem because I'm seeing pipelines randomly fail in production due to it, and I'm having to add workarounds. Any chance of a fix soon?
Yes - can still reproduce this on main:

```python
df = pl.DataFrame({"A": [[{"B": 1}], [None], []]})
df.explode("A")
# PanicException: validity must be equal to the array's length
```
(Could probably have done with mentioning `.explode()` in the issue title to draw better attention.)
Yeah, I think only the person who raised it originally can do that though?
My workaround is to replace any empty array of structs with null, i.e.

```python
df = pl.DataFrame({"A": [[{"B": 1}], [None], []]})
df = df.with_columns(
    A=pl.when(pl.col("A").list.len() == 0).then(None).otherwise(pl.col("A"))
)
df.explode("A")
```

But I'm not sure if this works in general.
@cmdlineluser @david-waterworth updated the title.
1.6 was just released with a fix for the minimal repro. If you can test whether it was in fact the same underlying issue, then this could be closed.

@DarkAmoeba's repro runs without error in 1.5, so I think that one was already fixed.
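As a quick sketch, testing would amount to running the minimal repro against the new release:

```python
import polars as pl

print(pl.__version__)  # expecting 1.6.0 or later here

# Minimal repro from above; on affected versions this panicked with
# "validity must be equal to the array's length".
df = pl.DataFrame({"A": [[{"B": 1}], [None], []]})
print(df.explode("A"))
```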
### Checks

### Reproducible example

### Log output

### Issue description
I have a collection of ~400 JSON files containing ~500 objects each, with further nested data types. For starters, I can't even read the files because polars can't handle `{}` empty structs, so I have to read them with duckdb and convert to polars. As a result of the reading there is a single column that is a `list<struct<52>>`. When I attempt to explode the list I receive the aforementioned error.

FWIW, if I unnest the list and struct using duckdb, it can be converted to polars.

If we need to figure out an exemplar dataset to use, I can try to do that. I can't send the data as-is right now.
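A minimal sketch of that duckdb detour, with a hypothetical file name:

```python
import duckdb

# read_json_auto tolerates the {} empty structs that the polars reader
# rejects; .pl() converts the DuckDB relation to a polars DataFrame.
df = duckdb.sql("SELECT * FROM read_json_auto('data.json')").pl()
```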
### Expected behavior

polars should be able to explode/read any of the files that duckdb can.

### Installed versions