pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Memory allocation fails for flattening lists in streaming lazy pipeline #9698

Open AroneyS opened 1 year ago

AroneyS commented 1 year ago

Issue description

A lazy streaming pipeline aborts with "memory allocation of 15277786 bytes failed", even though the machine has far more RAM available (1TB total) than the roughly 15 MB requested.

I came across this bug while trying to group by a column and aggregate a list[str] column. It failed with a similar error when I did groupby/agg with pl.col("b").flatten(). If I instead aggregate with pl.col("b"), it works fine but produces a list[list[str]] column. I'm not sure how to flatten that to list[str] other than with the expression in the example below, which also gives the memory error.
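
To make the target shape concrete, here is a minimal plain-Python sketch (no Polars involved) of what flattening one row of the list[list[str]] column into list[str] should produce. The helper name `flatten_row` is hypothetical and only for illustration; it says nothing about how Polars implements flattening internally.

```python
from itertools import chain


def flatten_row(nested: list[list[str]]) -> list[str]:
    """Concatenate the inner lists of one row, preserving their order."""
    return list(chain.from_iterable(nested))


# Same shape as column "b" in the reproducible example, for n = 0..2.
rows = [[[str(n)], [str(n + 1)]] for n in range(3)]
flat = [flatten_row(row) for row in rows]
print(flat)  # [['0', '1'], ['1', '2'], ['2', '3']]
```

Each row keeps its own list, so the row count is unchanged; only the inner nesting level is removed.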

Reproducible example

import polars as pl

df = pl.DataFrame({
    "a": [str(n) for n in range(10**7)],
    "b": [[[str(n)], [str(n+1)]] for n in range(10**7)],
})

(
    df
    .lazy()
    .with_columns(pl.col("b").list.eval(pl.element().flatten()))
    .collect(streaming=True)
)

Expected behavior

Memory allocation succeeds. The example above works on my machine with 10**6 rows.

Installed versions

```
--------Version info---------
Polars: 0.18.4
Index type: UInt32
Platform: Linux-4.12.14-122.159-default-x86_64-with-glibc2.22
Python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]

----Optional dependencies----
numpy: 1.25.0
pandas: 2.0.3
pyarrow: 12.0.1
connectorx:
deltalake:
fsspec:
matplotlib:
xlsx2csv:
xlsxwriter:
```

ritchie46 commented 1 year ago

And if you replace collect with sink_parquet?

AroneyS commented 1 year ago

I still get memory allocation errors with sink_parquet, but it looks like there are now two different requests that fail: one for 937,500 bytes and one for 1,093,750 bytes.

Actually, I now get those same failed requests with streaming collect as well, so I guess it depends on machine state?

memory allocation of memory allocation of memory allocation of memory allocation of memory allocation of 937500 bytes failed
1093750 bytes failed
memory allocation of 1093750 bytes failed
1093750 bytes failed
9375001093750 bytes failed
 bytes failed
memory allocation of 1093750 bytes failed
Aborted (core dumped)
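
Since the failing requests are tiny compared with the 1TB of RAM, one thing that might be worth ruling out is a per-process limit rather than physical memory exhaustion. The following is a minimal diagnostic sketch, assuming a Unix-like system and Python's standard `resource` module; the helper name `describe_limit` is hypothetical, and a process limit is only a guess at the cause, not something confirmed in this thread.

```python
import resource


def describe_limit(name: str) -> str:
    """Return the soft and hard values of one resource limit as text."""
    soft, hard = resource.getrlimit(getattr(resource, name))

    def fmt(value: int) -> str:
        # RLIM_INFINITY means no limit is set for this resource.
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return f"{name}: soft={fmt(soft)} hard={fmt(hard)}"


# Address-space and data-segment limits can make allocations fail
# even when plenty of physical memory is free.
for limit_name in ("RLIMIT_AS", "RLIMIT_DATA"):
    print(describe_limit(limit_name))
```

If both report "unlimited", a process limit of this kind is unlikely to explain the failures.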