pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

OOM issue when running `drop_nulls()` on aggregated results. #19961

Open dalejung opened 1 week ago

dalejung commented 1 week ago

Checks

Reproducible example

    import polars as pl

    df = pl.DataFrame({
        'timestamp': pl.datetime_range(
            pl.datetime(2016, 1, 1),
            pl.datetime(2024, 1, 1),
            interval='1s',
            eager=True,
        ),
    }).with_columns(
        price=1,
    )

    # OOM ERROR
    res = df.group_by_dynamic('timestamp', every='2s').agg(
        pl.col.price.drop_nulls().first().alias('open'),
    )

    # RUNS FINE
    res = df.group_by_dynamic('timestamp', every='2s').agg(
        pl.col.price.first().alias('open'),
    )

Log output

No response

Issue description

I noticed that I'm getting an OOM when using `drop_nulls()` on a large amount of data. Memory usage is as expected when `drop_nulls()` is not used.

Expected behavior

No memory issues. I can't think of why `drop_nulls()` should increase memory usage, especially since each interval bar contains only 2 data points.

Installed versions

```
--------Version info---------
Polars:              1.14.0
Index type:          UInt32
Platform:            Linux-6.9.6-arch1-1-x86_64-with-glibc2.40
Python:              3.12.5 | packaged by conda-forge | (main, Aug 8 2024, 18:36:51) [GCC 12.4.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair               5.4.1
boto3                1.34.59
cloudpickle          3.0.0
connectorx
deltalake
fastexcel
fsspec               2024.5.0
gevent               24.2.1
google.auth          2.28.2
great_tables
matplotlib           3.8.4
nest_asyncio         1.6.0
numpy                1.26.3
openpyxl             3.1.2
pandas               3.0.0.dev0+756.ge8e6be071c
pyarrow              18.0.0.dev487+g3130bb1ce
pydantic             2.7.2
pyiceberg
sqlalchemy           2.0.30
torch                2.4.1+cu121
xlsx2csv
xlsxwriter
```
ritchie46 commented 1 week ago

drop_nulls materializes a new column without nulls. This requires allocating memory.

dalejung commented 1 week ago

@ritchie46

But each subgroup contains only 2 rows. Why would the `drop_nulls` version cause an OOM?

    df['price'].drop_nulls()

doesn't eat up memory.

The version without `drop_nulls` uses ~5 GB of memory; the `drop_nulls` version eats up 64 GB before OOMing.