Open ion-elgreco opened 11 months ago
I cannot reproduce?
@ritchie46 I think the sample data doesn't mimic my real data that well. My real-world data has list(str) columns with quite lengthy strings in the lists.
@ritchie46 Ah, it's only when it's list(str). Can you try with this dataframe:
```python
import numpy as np
import polars as pl

df = pl.DataFrame({
    "id": [*np.arange(0, 200000)] * 3,
    "type": [1, 2, 3] * 200000,
    "list_col": [np.arange(1, 25), np.arange(1, 40), np.arange(1, 10)] * 200000,
}).with_columns(
    pl.col("id", "type").cast(pl.Utf8),
    pl.col("list_col").list.eval(pl.element().hash().cast(pl.Utf8)),
)
```
@ritchie46 This is what happens to memory when the explode of a list(str) column is inside the group_by:
https://github.com/pola-rs/polars/assets/15728914/ad4975c7-8f44-4e64-a625-f3ec186621ca
If I explode first, there is no significant change in memory usage.
> can you try with this dataframe
@ion-elgreco With your updated dataframe example, the original MRE no longer runs; perhaps you can fix it up:

```
ComputeError: cannot compare utf-8 with numeric data
```

I guess it's due to `type` being cast to a string here:

```python
pl.col('id','type').cast(pl.Utf8),
```

After changing this, I can reproduce.
The `.agg` + `.flatten` approach goes OOM for me, whereas the `.explode`-first approach executes immediately.
@ritchie46 This is still causing excessive memory usage, even after all the changes around string types.
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
No response
Issue description
When you aggregate a list column in a group_by, you get a list of lists as the result, so I wanted to explode them within the group_by, but this used excessive amounts of memory: 10x+ more. In my dataset at work, an explode before the group_by consumes only 40-50 GB of memory, while an explode within the group_by consumes well over 500 GB (which causes OOM even in streaming).
Expected behavior
Don't use excessive memory.
Installed versions