pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Explode/flatten on `list(str)` uses excessive memory in `group_by` #12659

Open ion-elgreco opened 11 months ago

ion-elgreco commented 11 months ago


Reproducible example

```python
import polars as pl
import numpy as np

df = pl.DataFrame({
    "id": [*np.arange(0, 200000)] * 3,
    "type": [1, 2, 3] * 200000,
    "list_col": [np.arange(1, 25), np.arange(1, 40), np.arange(1, 10)] * 200000
})

## This is fast and does not use much memory
df.explode('list_col').group_by('id').agg(
    pl.col('list_col').filter(pl.col('type') == 1).alias('type_1'),
    pl.col('list_col').filter(pl.col('type') == 2).alias('type_2'),
    pl.col('list_col').filter(pl.col('type') == 3).alias('type_3'),
)

## This however uses an extreme amount of memory
df.group_by('id').agg(
    pl.col('list_col').filter(pl.col('type') == 1).flatten().alias('type_1'),
    pl.col('list_col').filter(pl.col('type') == 2).flatten().alias('type_2'),
    pl.col('list_col').filter(pl.col('type') == 3).flatten().alias('type_3'),
)
```

Log output

No response

Issue description

When you aggregate a list column in a group_by, the result is a list of lists, so I wanted to explode/flatten them within the group_by, but this uses excessive amounts of memory. We're talking 10x+ more memory usage.

In my dataset at work, an explode before the group_by consumes only 40-50 GB of memory, while an explode within the group_by consumes well over 500 GB of memory (which causes OOM even in streaming).
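For illustration, a minimal sketch (with made-up column names, not from this issue) of why the flatten is wanted inside the aggregation in the first place: aggregating a list column in a group_by nests it one level deeper, and `.flatten()` collapses it back to a single list per group.

```python
import polars as pl

small = pl.DataFrame({"id": [1, 1, 2], "vals": [[1, 2], [3], [4, 5]]})

# Plain aggregation nests the existing lists one level deeper.
nested = small.group_by("id").agg(pl.col("vals"))
print(nested.schema)  # vals becomes List(List(Int64))

# Flattening inside the aggregation yields a single list per group,
# which is the result this issue is trying to compute efficiently.
flat = small.group_by("id").agg(pl.col("vals").flatten())
print(flat.schema)    # vals is List(Int64)
```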

Expected behavior

Don't use excessive memory.

Installed versions

```
--------Version info---------
Polars:              0.19.15
Index type:          UInt32
Platform:            Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:              3.10.12 (main, Aug 9 2023, 14:47:34) [GCC 9.4.0]
----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:         2.2.1
connectorx:
deltalake:           0.13.0
fsspec:              2023.10.0
gevent:
matplotlib:          3.8.1
numpy:               1.23.5
openpyxl:
pandas:              1.5.3
pyarrow:             8.0.0
pydantic:            1.10.13
pyiceberg:
pyxlsb:
sqlalchemy:          2.0.22
xlsx2csv:            0.8.1
xlsxwriter:          3.1.9
```
ritchie46 commented 11 months ago

I cannot reproduce?

ion-elgreco commented 11 months ago

@ritchie46 I think the sample data doesn't mimic my real data that well.

My real-world data has list(str) columns containing quite lengthy strings.

ion-elgreco commented 11 months ago

@ritchie46 Ah, it only happens when it's list(str). Can you try with this dataframe:

```python
df = pl.DataFrame({
    "id": [*np.arange(0, 200000)] * 3,
    "type": [1, 2, 3] * 200000,
    "list_col": [np.arange(1, 25), np.arange(1, 40), np.arange(1, 10)] * 200000
}).with_columns(
    pl.col('id', 'type').cast(pl.Utf8),
    pl.col('list_col').list.eval(pl.element().hash().cast(pl.Utf8))
)
```
ion-elgreco commented 11 months ago

@ritchie46 this is what happens to memory when the explode on list(str) is inside the group_by:

https://github.com/pola-rs/polars/assets/15728914/ad4975c7-8f44-4e64-a625-f3ec186621ca

If I explode first, you see no significant change in memory usage.
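As a rough way to quantify this outside of a screen recording, one could watch the process's peak RSS around each query. A minimal sketch, assuming Linux (where `ru_maxrss` is reported in KiB) and the list(str) data shape from the comments above, but keeping `type` numeric so the original filters still apply; the last query is the one reported to balloon and may OOM on a small machine:

```python
import resource

import numpy as np
import polars as pl

def peak_rss_mib() -> float:
    # Peak resident set size of this process so far; ru_maxrss is KiB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

df = pl.DataFrame({
    "id": [*np.arange(0, 200000)] * 3,
    "type": [1, 2, 3] * 200000,
    "list_col": [np.arange(1, 25), np.arange(1, 40), np.arange(1, 10)] * 200000,
}).with_columns(pl.col("list_col").list.eval(pl.element().hash().cast(pl.Utf8)))
print(f"after building df:         ~{peak_rss_mib():.0f} MiB peak RSS")

# Explode first, then group_by: peak RSS should barely move.
df.explode("list_col").group_by("id").agg(
    pl.col("list_col").filter(pl.col("type") == 1).alias("type_1"),
)
print(f"after explode -> group_by:  ~{peak_rss_mib():.0f} MiB peak RSS")

# Flatten inside the aggregation: the path reported to use 10x+ more memory.
df.group_by("id").agg(
    pl.col("list_col").filter(pl.col("type") == 1).flatten().alias("type_1"),
)
print(f"after group_by -> flatten:  ~{peak_rss_mib():.0f} MiB peak RSS")
```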

cmdlineluser commented 11 months ago

> can you try with this dataframe

@ion-elgreco With your updated dataframe example, the original MRE no longer runs - perhaps you can fix it up.

```
ComputeError: cannot compare utf-8 with numeric data
```

I guess it's due to type being cast to a string.

```python
    pl.col('id', 'type').cast(pl.Utf8),
```

Changing this, I can reproduce.

The .agg + .flatten approach goes OOM for me whereas the .explode first executes immediately.
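For reference, a minimal sketch of one way the MRE could be adjusted so it runs against the string-typed frame: compare `type` against string literals. This is one possible fix, not necessarily the exact change made above.

```python
# `df` is the updated frame from the earlier comment, where `id`, `type`
# and the list elements have all been cast to Utf8.
df.group_by('id').agg(
    pl.col('list_col').filter(pl.col('type') == '1').flatten().alias('type_1'),
    pl.col('list_col').filter(pl.col('type') == '2').flatten().alias('type_2'),
    pl.col('list_col').filter(pl.col('type') == '3').flatten().alias('type_3'),
)
```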

ion-elgreco commented 7 months ago

@ritchie46 this is still causing excessive memory usage, even after all the changes around string types.