pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

PanicException: validity must be equal to the array's length during `.explode` #17745

Open theelderbeever opened 1 month ago

theelderbeever commented 1 month ago


Reproducible example

# placeholder column name; the real column (called "json" in the
# traceback below) has dtype list<struct<52>>
df.explode("really big list<struct<52>>")

Log output

thread 'polars-0' panicked at /Users/runner/work/polars/polars/crates/polars-arrow/src/array/struct_/mod.rs:214:5:
validity must be equal to the array's length
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[18], line 36
     32 if not subscriptions_parquet.exists():
     33     print("Collecting json files")
     34     duckdb.sql(
     35         f"SELECT * FROM read_json('{subscriptions_dir}/*.json', union_by_name = true)"
---> 36     ).pl().explode("json").unnest("json").write_parquet(subscriptions_parquet)
     38 subscriptions = pl.read_parquet(subscriptions_parquet)
     39 subscriptions[0]

File ~/.pyenv/versions/3.11.8/envs/billing-platform-pipelines/lib/python3.11/site-packages/polars/dataframe/frame.py:7709, in DataFrame.explode(self, columns, *more_columns)
   7652 def explode(
   7653     self,
   7654     columns: str | Expr | Sequence[str | Expr],
   7655     *more_columns: str | Expr,
   7656 ) -> DataFrame:
   7657     """
   7658     Explode the dataframe to long format by exploding the given columns.
   7659 
   (...)
   7707     └─────────┴─────────┘
   7708     """
-> 7709     return self.lazy().explode(columns, *more_columns).collect(_eager=True)
...
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

PanicException: validity must be equal to the array's length

Issue description

I have a collection of ~400 json files, each containing ~500 objects with further nested data types. For starters, I can't even read the files directly because polars can't handle empty {} structs, so I have to read them with duckdb and convert to polars. The result of that read is a single column with dtype list<struct<52>>. When I attempt to explode the list, I receive the aforementioned error.
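For context, the pipeline looks roughly like this (condensed from the traceback above; the path variable is a placeholder from my setup):

import duckdb

# read the json files with duckdb (polars can't parse the empty {} structs),
# convert to polars, then explode the list<struct<52>> column
df = duckdb.sql(
    f"SELECT * FROM read_json('{subscriptions_dir}/*.json', union_by_name = true)"
).pl()

df.explode("json")  # PanicException: validity must be equal to the array's length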

FWIW, if I unnest the list and struct using duckdb first, the result converts to polars without issue:

duckdb.sql(
    f"SELECT UNNEST(json, recursive := true) FROM read_json('{subscriptions_dir}/*.json', union_by_name = true)"
).pl()

If we need to figure out an exemplar dataset to use, I can try to do that; I can't share the data as-is right now.

Expected behavior

polars should be able to read and explode any file that duckdb can.

Installed versions

```
--------Version info---------
Polars:               1.2.1
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.11.8 (main, Apr 27 2024, 07:50:56) [Clang 15.0.0 (clang-1500.3.9.4)]
----Optional dependencies----
adbc_driver_manager:  1.1.0
cloudpickle:          2.2.1
connectorx:           0.3.3
deltalake:
fastexcel:
fsspec:               2023.12.2
gevent:
great_tables:
hvplot:
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             2.5.3
pyiceberg:
sqlalchemy:           2.0.31
torch:
xlsx2csv:
xlsxwriter:
```
david-waterworth commented 1 month ago

I'm seeing the same issue. My list isn't overly large; it's a list[struct] field (but the structure is slightly complicated). I haven't managed to create a repro at this stage. I've found one row (a list of 5 structs) which causes the problem, but when I write it to json and load it back, it no longer throws. This seems somewhat similar to a previous issue I raised that came down to chunking, but this time rechunk doesn't help.

david-waterworth commented 1 month ago

You should be able to repro with

pl.read_json("crash.json").explode("change_history")

and the attached file.

edit: I've replaced the file with a smaller version that still reproduces the issue.

crash.json

david-waterworth commented 1 month ago

I suspect this is caused, in my case, by the nested author element in my dataset being either an object, "author": {...}, or "author": null.
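Roughly the shape I mean (the field names are illustrative, and I haven't verified that this exact snippet panics):

import polars as pl

# "author" alternates between a struct and null inside a list column;
# field names are made up to illustrate the shape of my data
pl.read_ndjson(b"""
{"change_history":[{"author":{"name":"x"}}]}
{"change_history":[{"author":null}]}
""").explode("change_history")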

cmdlineluser commented 1 month ago

@david-waterworth I can reproduce your example.

It seems like it reduces down to:

pl.read_ndjson(b"""
{"A":[{"B":1}]}
{"A":[null]}
{"A":[]}
""").explode("A")
# PanicException: validity must be equal to the array's length
david-waterworth commented 1 month ago

@cmdlineluser that's kind of what I thought, but my attempt to reduce it failed. Not sure if it's because you used read_ndjson? I was trying to construct something very similar using from_dicts and didn't quite get there.

In the end, as a work-around, I pre-filtered the json before loading it into polars; the sketch below shows the idea.
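A sketch of that pre-filtering (hypothetical; the field name and file layout stand in for my actual case):

import json

import polars as pl

# drop records whose list field is empty before handing them to polars
# ("change_history" and "crash.json" are illustrative)
with open("crash.json") as f:
    records = [r for r in json.load(f) if r.get("change_history")]

pl.from_dicts(records).explode("change_history")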

cmdlineluser commented 1 month ago

Wrapping the objects in [] reproduces the same panic with read_json:

pl.read_json(b"""[{"A":[null]},{"A":[{"B":1}]},{"A":[]}]""").explode("A")
# PanicException: validity must be equal to the array's length
DanielHabenicht commented 1 month ago

This also happens with the write_parquet function.

thread 'polars-1' panicked at crates/polars-arrow/src/array/primitive/mod.rs:261:5:
validity must be equal to the array's length
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/workspaces/cloud-prices/azure.py", line 51, in <module>
    df.write_parquet(cache_file)
  File "/workspaces/cloud-prices/.venv/lib/python3.10/site-packages/polars/dataframe/frame.py", line 3630, in write_parquet
    self._df.write_parquet(
pyo3_runtime.PanicException: validity must be equal to the array's length
cmdlineluser commented 1 month ago

@DanielHabenicht Perhaps you can provide a repro and file a new issue.

If you are seeing it in the parquet writer - that would suggest it is a separate problem.

DarkAmoeba commented 4 weeks ago

Hello, I've encountered the same issue. Here's a reproducible example:

import datetime

import polars as pl

df = pl.DataFrame([
    {
        'ntpTime': datetime.datetime(2024, 1, 24, 10, 10, 13, 218167),
        'relevantFlight': None,
    },
    {
        'ntpTime': datetime.datetime(2024, 1, 24, 12, 34, 20, 501001),
        'relevantFlight': {'flight': 'A', 'relevancy': True},
    },
])

df.filter(pl.col('relevantFlight').shift(1).ne_missing(pl.col('relevantFlight')))

The above raises PanicException: validity must be equal to the array's length on polars 1.2.1, but works okay on polars 1.1.0, which is the other version I have available.

fzyzcjy commented 2 weeks ago

Hi, are there any updates or a fix? Thanks!

david-waterworth commented 1 week ago

I'm running into this more and more frequently, and it's becoming a major issue: pipelines are randomly failing in production because of it, and I'm having to add work-arounds. Any chance of a fix soon?

cmdlineluser commented 1 week ago

Yes - can still reproduce this on main:

df = pl.DataFrame({"A": [[{"B": 1}], [None], []]})

df.explode("A")
# PanicException: validity must be equal to the array's length

(Could probably have done with mentioning .explode() in the issue title to draw better attention.)

david-waterworth commented 1 week ago

Yeah, I think only the person who raised it originally can do that though?

My work-around is to replace any empty array of structs with null, i.e.

df = pl.DataFrame({"A": [[{"B": 1}], [None], []]})

df = df.with_columns(A=pl.when(pl.col("A").list.len() == 0).then(None).otherwise(pl.col("A")))

df.explode("A")

But I'm not sure if this works in general.
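If it does hold up, a sketch generalising it to every List column (continuing from the df above; untested beyond that toy example) might look like:

# find the List columns and apply the same empty-list -> null replacement
list_cols = [name for name, dtype in df.schema.items() if isinstance(dtype, pl.List)]

df = df.with_columns(
    pl.when(pl.col(c).list.len() == 0).then(None).otherwise(pl.col(c)).alias(c)
    for c in list_cols
)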

theelderbeever commented 1 week ago

@cmdlineluser @david-waterworth updated the name.

cmdlineluser commented 1 week ago

1.6 was just released with a fix for the minimal repro.

If you can test that it was in fact the same underlying issue, then this could be closed.

@DarkAmoeba's repro runs without error in 1.5 - so I think that one was already fixed.
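For anyone verifying, re-running the minimal repros from this thread on 1.6 is a quick sanity check (assuming it's the same underlying fix):

import polars as pl

# both of these panicked on 1.2.x; on 1.6 they should return DataFrames
pl.DataFrame({"A": [[{"B": 1}], [None], []]}).explode("A")
pl.read_json(b"""[{"A":[null]},{"A":[{"B":1}]},{"A":[]}]""").explode("A")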