pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Possible memory leak on Linux #15725

Open ngandrewhh opened 6 months ago

ngandrewhh commented 6 months ago

Checks

Reproducible example

import os
os.environ['POLARS_MAX_THREADS'] = '1'

from functools import wraps

import numpy as np
import pandas as pd
import polars as pl
import psutil

def trace(f):
    @wraps(f)
    def wrap(*args, **kwargs):
        print(f"pid #{psutil.Process().pid}: {psutil.Process().memory_info().rss / 1024 / 1024:,.1f} MB")
        result = f(*args, **kwargs)
        print(f"pid #{psutil.Process().pid}: {psutil.Process().memory_info().rss / 1024 / 1024:,.1f} MB")
        return result
    return wrap

@trace
def transform_data(pl_data, i):
    out = {}
    expr = [pl.col(col).rolling_mean(window_size=f"{i}i").over('ID') for col in list(set(pl_data.columns) - {'TS', 'ID'})]
    pl_data = pl_data.with_columns(expr)

    for col in pl_data.columns:
        # leaks
        out[col] = pl_data.select(col)
        # out[col] = pl_data.get_column(col).to_frame()

        # no leak
        # out[col] = pl_data.select(col).to_pandas()
        # out[col] = pl_data.get_column(col).to_pandas()

    return out

@trace
def do_something(data):
    pl_data = pl.from_pandas(data, include_index=True)

    out = {}
    for i in range(5, 10):
        out |= transform_data(pl_data, i)

    # leaks
    return pl.concat(list(out.values()), how='horizontal').to_pandas().set_index(['TS', 'ID'])
    # no leak
    # return pd.concat(out.values(), axis=1).set_index(['TS', 'ID'])

@trace
def main():
    data = pd.DataFrame(columns=['TS', 'ID', *list(range(30))])
    np.random.seed(1234)
    data['TS'] = pd.date_range("2020-01-01 00:00", "2020-06-30 23:59", freq='5min')[:50000]
    data['ID'] = np.random.randint(5, size=50000)
    data = data.set_index(['TS', 'ID'])
    data = data.map(lambda e: np.random.rand())

    for _ in range(10000):
        data = do_something(data)

if __name__ == "__main__":
    main()

Log output

No response

Issue description

Memory usage plateaus on Windows but appears to grow without bound on Linux.

Interestingly, when converting back to a pandas DataFrame, no memory increase is observed. The increase is seen when performing pl.concat(..., how='horizontal') on the results of pl_data.select(...) or pl_data.get_column(...).to_frame(). I have read some discussion of memory allocators in related threads, but I am not sure it is relevant here.

Observed RSS:

- Windows: hovers around 140 MB when converting to and from pandas; hovers around 180 MB when staying in polars.
- Linux: hovers around 140 MB when converting to and from pandas; climbs to 500-600 MB when staying in polars and keeps gradually increasing.
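As a cross-check on the psutil numbers above, current RSS can also be read directly from procfs. A minimal stdlib-only sketch (Linux-specific; assumes the usual /proc/self/statm layout, where the second field is resident pages):

```python
import os

def rss_mb() -> float:
    """Current resident set size in MB, read from /proc/self/statm (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # field 2: resident pages
    return resident_pages * os.sysconf("SC_PAGESIZE") / 1024 / 1024
```

Sampling this before and after each do_something call gives the same trend as psutil.Process().memory_info().rss without the extra dependency.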

Expected behavior

Polars should take less memory than Pandas at idle, and memory footprint should not be increasing.

Installed versions

Windows

```
--------Version info---------
Polars:              0.20.10
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:
numpy:               1.26.4
openpyxl:
pandas:              2.2.1
pyarrow:             15.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

Linux

```
--------Version info---------
Polars:              0.20.15
Index type:          UInt32
Platform:            Linux-5.14.0-362.24.1.el9_3.x86_64-x86_64-with-glibc2.34
Python:              3.11.8 (main, Feb 26 2024, 21:34:05) [GCC 11.2.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:              2024.3.1
gevent:
hvplot:
matplotlib:
numpy:               1.26.4
openpyxl:            3.1.2
pandas:              2.1.4
pyarrow:             14.0.2
pydantic:            2.7.0
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
orlp commented 6 months ago

Can you please provide a minimal working example? The given example doesn't run. Preferably one without pandas.

ngandrewhh commented 6 months ago

Hi @orlp, can you uncomment the lines below? The ones marked # leaks reproduce the problem.

        # leaks
        # out[col] = pl_data.select(col)
        # out[col] = pl_data.get_column(col).to_frame()

        # no leak
        # out[col] = pl_data.select(col).to_pandas()
        # out[col] = pl_data.get_column(col).to_pandas()

    # leaks
    # return pl.concat(out.values(), how='horizontal').to_pandas().set_index(['TS', 'ID'])
    # no leak
    # return pd.concat(out.values(), axis=1).set_index(['TS', 'ID'])

Converting to pandas is my attempt to tell the system that we are done with the polars objects so they can be deallocated. If there is a better way to do this, please let me know.
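For reference, the usual way to signal this in pure Python is to drop the references explicitly and force a collection, rather than round-tripping through pandas. A hedged stdlib-only sketch (the dict of lists stands in for the dict of DataFrames; note that even after a successful collection, the allocator may keep freed pages cached rather than returning them to the OS, so RSS alone cannot distinguish a true leak from allocator caching):

```python
import gc

def release(results: dict) -> None:
    """Drop all references held in a dict of intermediate results, then collect.

    Clearing the dict removes the last Python references to the values;
    gc.collect() breaks any reference cycles so the buffers become reclaimable.
    """
    results.clear()
    gc.collect()

out = {"a": list(range(1_000_000))}  # stand-in for a dict of DataFrames
release(out)
```

If RSS still climbs after release() is called on every intermediate dict, that points at the allocator or at references held on the Rust side rather than at Python-level retention.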