Open anordertoreclaim opened 2 days ago
What happens if you use scan_parquet and a streaming collect?
import polars as pl

agg_df = (
    pl.scan_parquet('test/*')
    .group_by('col1', 'col2')
    .agg(pl.col('col3').sum())
    .collect(streaming=True)
)
Due to our filesystem's specifics, we cannot use streaming. However, I've tried scan_parquet and collect without streaming and it worked! Thank you.
Do you know why there might be such a difference in RAM usage between newer versions and 0.18, though?
It would need one of the devs to take a closer look.
(You would likely need to provide a full repro for them - i.e. code that generates a dummy dataset with the same schema / size)
can I provide an actual (but obfuscated) dataset and describe a way to subsample it? how can I ask the devs to take a look at it?
hey, can someone get back to me here?
As I understand it, priority support is offered as a paid service: https://pola.rs/our-services/
Other than that, you just have to wait for the devs to respond to you here.
Checks
Reproducible example
Log output
No response
Issue description
Hey! A big fan of polars here:)
I ran into an issue while working on a task at my job. The code above reads a bunch of dataframes (60 in my case) with 3 columns, concatenates them, and then performs a group-by sum aggregation on one of the columns. To lower RAM usage, I've coded it so that it pre-aggregates the dataframes in windows of 10 and then concatenates and aggregates the intermediate results. The problem is that when I bumped polars from version 0.18.2 to 1.12, this code started crashing with an OOM error. It turns out that the problem starts appearing around version 0.20. The code is run on a pod with 80 GB of RAM. Each dataframe has around 8 million rows.
What bothers me is that I tried to reproduce the behaviour locally on macOS, on a subset of the data; however, the code run on 1.12 had consistently lower RAM usage than the same code on 0.18.2. Below are the screenshots for 1.12 and 0.18.2.
[Screenshot: 1.12]
[Screenshot: 0.18.2]
Do you have any thoughts on what might've changed since 0.18.2 and how I can make improvements to my code to handle it? This really stops me from using polars' newer functionality.
Expected behavior
Code is supposed to work fine on 1.12 as well, but I run into OOM.
Installed versions