pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

OOM on newer versions of polars, works fine on 0.18.2 #19636

Open anordertoreclaim opened 2 days ago

anordertoreclaim commented 2 days ago

Checks

Reproducible example

from memory_profiler import profile
import polars as pl
from glob import glob

@profile
def main():
    df_paths = glob('test/*')  # parquet files, ~8 million rows each, columns col1/col2/col3

    dfs = []
    dfs_preaggd = []

    for ind, df_path in enumerate(df_paths):
        dfs.append(pl.read_parquet(df_path))

        # Pre-aggregate every 10 dataframes to keep peak RAM usage down.
        if (ind + 1) % 10 == 0:
            dfs_preaggd.append(
                pl.concat(dfs)
                    .group_by(['col1', 'col2'])  # `.groupby(...)` on polars 0.18.x
                    .agg(pl.col('col3').sum())
            )
            dfs = []

    # Aggregate any remaining dataframes from the last (partial) window.
    if len(dfs) > 0:
        dfs_preaggd.append(
            pl.concat(dfs)
                .group_by(['col1', 'col2'])
                .agg(pl.col('col3').sum())
        )

    # Combine the pre-aggregated results and aggregate once more.
    agg_df = (
        pl.concat(dfs_preaggd)
        .group_by(['col1', 'col2'])
        .agg(pl.col('col3').sum())
    )

    return agg_df

main()

Log output

No response

Issue description

Hey! A big fan of polars here:)

I ran into an issue while working on a task at my job. The code above reads a bunch of dataframes (60 in my case) with 3 columns, concatenates them, and then performs a groupby-sum aggregation on one of the columns. To lower RAM usage, I've coded it so that it pre-aggregates dataframes in windows of size 10, and then concatenates and aggregates the intermediate results. The problem is that when I bumped polars from version 0.18.2 to 1.12, this code started crashing with an OOM error. It turns out the problem starts appearing around version 0.20. The code is run on a pod with 80 GB of RAM, and each dataframe has around 8 million rows.

What bothers me is that when I tried to reproduce the behaviour locally on macOS, on a subset of the data, the code run on 1.12 had consistently lower RAM usage than the same code on 0.18.2. Below are the screenshots for 1.12 and 0.18.2.

1.12:

[screenshot: memory_profiler output on 1.12]

0.18.2:

[screenshot: memory_profiler output on 0.18.2]

Do you have any thoughts on what might've changed since 0.18.2 and how I can adapt my code to handle it? This is really stopping me from using polars' newer functionality.

Expected behavior

The code is supposed to work fine on 1.12 as well, but instead I run into an OOM.

Installed versions

```
--------Version info---------
Polars:               1.12.0
Index type:           UInt32
Platform:             macOS-12.5.1-arm64-arm-64bit
Python:               3.11.3 (main, Apr 19 2023, 18:49:55) [Clang 14.0.6 ]
LTS CPU:              False
----Optional dependencies----
adbc_driver_manager
altair
cloudpickle           2.2.1
connectorx
deltalake
fastexcel
fsspec                2023.4.0
gevent
great_tables
matplotlib            3.7.1
nest_asyncio          1.5.6
numpy                 1.24.3
openpyxl              3.0.10
pandas                1.5.3
pyarrow               11.0.0
pydantic
pyiceberg
sqlalchemy            1.4.39
torch                 2.0.1
xlsx2csv
xlsxwriter
```
cmdlineluser commented 2 days ago

What happens if you use scan_parquet and a streaming collect?

agg_df = (
    pl.scan_parquet('test/*')
      .group_by('col1', 'col2') 
      .agg(pl.col('col3').sum())
      .collect(streaming=True)
)
anordertoreclaim commented 2 days ago

Due to our filesystem's specifics, we cannot use streaming. However, I've tried scan_parquet and collect without streaming and it worked! Thank you.
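For reference, the non-streaming variant that worked looks roughly like this (a minimal sketch, assuming the same `test/*` files and column names as in the repro above):

```python
import polars as pl

# Lazy scan over the same parquet files, aggregated with the default
# in-memory collect (no streaming engine involved).
agg_df = (
    pl.scan_parquet('test/*')
      .group_by('col1', 'col2')
      .agg(pl.col('col3').sum())
      .collect()
)
```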

anordertoreclaim commented 2 days ago

Do you know why there might be such a difference in RAM usage between newer versions and 0.18, though?

cmdlineluser commented 2 days ago

It would need one of the devs to take a closer look.

(You would likely need to provide a full repro for them - i.e. code that generates a dummy dataset with the same schema / size)
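Something along these lines could serve as a starting point (the dtypes, key cardinalities, and file count here are guesses and would need to match the real data):

```python
import numpy as np
import polars as pl
from pathlib import Path

# Generate 60 dummy parquet files of ~8 million rows each with the
# repro's schema (col1, col2, col3). All values are random placeholders.
out_dir = Path('test')
out_dir.mkdir(exist_ok=True)

rng = np.random.default_rng(0)
n_rows = 8_000_000

for i in range(60):
    pl.DataFrame({
        'col1': rng.integers(0, 1_000, n_rows),
        'col2': rng.integers(0, 1_000, n_rows),
        'col3': rng.random(n_rows),
    }).write_parquet(out_dir / f'part_{i}.parquet')
```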

anordertoreclaim commented 1 day ago

Can I provide an actual (but obfuscated) dataset and describe a way to subsample it? How can I ask the devs to take a look at it?

anordertoreclaim commented 10 hours ago

hey, can someone get back to me here?

cmdlineluser commented 7 hours ago

As I understand it, priority support is offered as a paid service: https://pola.rs/our-services/

Other than that, you just have to wait for the devs to respond to you here.