pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Stated limit is 4.3 billion rows, but really it's 4.3 billion elements before needing polars-u64-idx #19203

Closed J-Meyers closed 2 days ago

J-Meyers commented 4 days ago

Description

The documentation states that the limit applies to rows, but from the response to #15534 it seems to be tied to the number of elements, in a way not reflected in the documentation. From #15672 I'm not sure whether the limit applies to a single list's size or to the sum of all lists in a series.

I have a large DataFrame with ~5,000,000 rows, each holding a list of length ~2,000, and only when converting to Arrow do I hit the issue; working with the same data in Arrow-based tools, including DuckDB and PyArrow, works fine.

I did not expect to need the u64 index since I'm well under the row limit, so I initially thought there was a bug, but it seems the limit is just improperly documented.
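
To make this concrete, a back-of-envelope check (my own arithmetic, not polars output): the row count is well within the documented limit, but the flattened element count of the list column is not.

n_rows = 5_000_000
list_len = 2_000
idx_max = 2**32 - 1                    # default IdxSize is u32, per the big-index docs
print(n_rows <= idx_max)               # True:  the row count is far below the limit
print(n_rows * list_len <= idx_max)    # False: 10_000_000_000 flattened elements exceed it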

Relevant Logs:

thread '<unnamed>' panicked at crates/polars-core/src/chunked_array/ops/chunkops.rs:133:9:
polars' maximum length reached. Consider installing 'polars-u64-idx'.
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
/tmp/ipykernel_147/3621493996.py in ?()
----> 1 exploded_arrow = df_exploded.to_arrow()

~/environment/post_processing/lib/python3.11/site-packages/polars/_utils/deprecation.py in ?(*args, **kwargs)
     88         def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89             _rename_keyword_argument(
     90                 old_name, new_name, kwargs, function.__qualname__, version
     91             )
---> 92             return function(*args, **kwargs)

~/environment/post_processing/lib/python3.11/site-packages/polars/dataframe/frame.py in ?(self, compat_level)
   1596             compat_level = False  # type: ignore[assignment]
   1597         elif isinstance(compat_level, CompatLevel):
   1598             compat_level = compat_level._version  # type: ignore[attr-defined]
   1599 
-> 1600         record_batches = self._df.to_arrow(compat_level)
   1601         return pa.Table.from_batches(record_batches)

PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.

Sample Code:

import polars as pl

df = pl.read_parquet(path)  # ~42,500 rows
df = df.select("col").explode("col")  # ~5,700,000 rows; each row is a struct with 6 fields, one of which is a list of length ~2000
# df = df.rechunk()  # Rechunking neither causes the issue nor helps
# df = df.unnest("col")  # Unnesting seems to have no effect
# df = df.drop("list_col")  # Dropping the list column allows conversion to Arrow
df_arrow = df.to_arrow()
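
As the panic message suggests, the big-index build should avoid this; a minimal sketch, assuming polars-u64-idx is the drop-in replacement described in the installation docs linked below:

# pip install polars-u64-idx   # replaces the 'polars' package; the import name stays the same
import polars as pl

print(pl.get_index_type())  # UInt64 with the big-index build, UInt32 by default (if I read the API right)
df = pl.read_parquet(path)  # same frame as above
df_arrow = df.select("col").explode("col").to_arrow()  # should succeed with 64-bit indexes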

Link

https://docs.pola.rs/user-guide/installation/#big-index

ritchie46 commented 3 days ago

It's not elements. It's rows.

There might be a counting bug. Or you explode the DataFrame to more rows than anticipated.

Got a repro?

J-Meyers commented 3 days ago

@ritchie46

> It's not elements. It's rows.
>
> There might be a counting bug. Or you explode the DataFrame to more rows than anticipated.
>
> Got a repro?

This is the simplest way I could get it to complain with something that has only 5,000,000 rows:

import polars as pl
import numpy as np
# This crashes on construction with the same error
df = pl.DataFrame(
    {
        "a": [np.random.randint(-128, 127, size=2000, dtype=np.int8) for _ in range(5_000_000)],
    },
    schema={
        "a": pl.List(pl.Int8),
    }
)
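
Note the footprint here (my own estimate): the flattened Int8 buffer alone is ~9.3 GiB, which is why the panic fires during construction rather than during a later query.

elements = 5_000_000 * 2_000    # 10_000_000_000 Int8 values
print(elements / 2**30)         # ~9.31 -> roughly 9.3 GiB for the values buffer alone
print(elements > 2**32 - 1)     # True: already past the default index limit at build time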

Another example that also crashes, with even fewer rows but the same number of elements:

# This doesn't crash on construction, but crashes when trying to work with it
df = pl.DataFrame(
    {
        "a": [[np.random.randint(-128, 127, size=2000, dtype=np.int8) for _ in range(1_000)] for _ in range(5_000)],
    },
    schema={
        "a": pl.List(pl.List(pl.Int8)),
    }
)
print(df) # this crashes
print(df.select(pl.col("a").list.get(0))) # this doesn't
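
My guess as to why the last two lines differ (an assumption based purely on the element counts): the flattened inner Int8 buffer of the full column crosses the 32-bit limit, while the result of list.get(0) does not.

idx_max = 2**32 - 1
print(5_000 * 1_000 * 2_000 > idx_max)  # True:  10 billion flattened Int8 values in the full column
print(5_000 * 2_000 > idx_max)          # False: only 10 million values remain after list.get(0)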