pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Possible memory leak on Linux #15725

Open ngandrewhh opened 6 months ago

ngandrewhh commented 6 months ago

Checks

Reproducible example

import os
os.environ['POLARS_MAX_THREADS'] = '1'

from functools import wraps

import numpy as np
import pandas as pd
import polars as pl
import psutil

def trace(f):
    @wraps(f)
    def wrap(*args, **kwargs):
        print(f"pid #{psutil.Process().pid}: {psutil.Process().memory_info().rss / 1024 / 1024:,.1f} MB")
        result = f(*args, **kwargs)
        print(f"pid #{psutil.Process().pid}: {psutil.Process().memory_info().rss / 1024 / 1024:,.1f} MB")
        return result
    return wrap

@trace
def transform_data(pl_data, i):
    out = {}
    expr = [pl.col(col).rolling_mean(window_size=f"{i}i").over('ID') for col in list(set(pl_data.columns) - {'TS', 'ID'})]
    pl_data = pl_data.with_columns(expr)

    for col in pl_data.columns:
        # leaks
        out[col] = pl_data.select(col)
        # out[col] = pl_data.get_column(col).to_frame()

        # no leak
        # out[col] = pl_data.select(col).to_pandas()
        # out[col] = pl_data.get_column(col).to_pandas()

    return out

@trace
def do_something(data):
    pl_data = pl.from_pandas(data, include_index=True)

    out = {}
    for i in range(5, 10):
        out |= transform_data(pl_data, i)

    # leaks
    return pl.concat(list(out.values()), how='horizontal').to_pandas().set_index(['TS', 'ID'])
    # no leak
    # return pd.concat(out.values(), axis=1).set_index(['TS', 'ID'])

@trace
def main():
    data = pd.DataFrame(columns=['TS', 'ID', *list(range(30))])
    np.random.seed(1234)
    data['TS'] = pd.date_range("2020-01-01 00:00", "2020-06-30 23:59", freq='5min')[:50000]
    data['ID'] = np.random.randint(5, size=50000)
    data = data.set_index(['TS', 'ID'])
    data = data.map(lambda e: np.random.rand())

    for _ in range(10000):
        data = do_something(data)

if __name__ == "__main__":
    main()

Log output

No response

Issue description

Memory usage plateaus on Windows but appears to grow without bound on Linux.

Interestingly, when converting back to a pandas DataFrame, no memory increase is observed. The increase is seen when performing pl.concat(..., how='horizontal') on the results of pl_data.select(...) or pl_data.get_column(...).to_frame(). I have read some discussion of memory allocators in related threads, but I am not sure it is relevant here.

Observed RSS:

- Windows: hovers around 140 MB when converting to and from pandas; hovers around 180 MB when staying in polars.
- Linux: hovers around 140 MB when converting to and from pandas; climbs to 500-600 MB when staying in polars and keeps gradually increasing.
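As a cross-check on the psutil numbers above, current RSS can also be read directly from procfs. A minimal stdlib-only sketch (Linux-specific; assumes the usual /proc/self/statm layout, where the second field is resident pages):

```python
import os

def rss_mb() -> float:
    """Current resident set size in MB, read from /proc/self/statm (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # field 2: resident pages
    return resident_pages * os.sysconf("SC_PAGESIZE") / 1024 / 1024
```

Sampling this before and after each do_something call gives the same trend as psutil.Process().memory_info().rss without the extra dependency.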

Expected behavior

Polars should take less memory than Pandas at idle, and memory footprint should not be increasing.

Installed versions

Windows

```
--------Version info---------
Polars:              0.20.10
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:
numpy:               1.26.4
openpyxl:
pandas:              2.2.1
pyarrow:             15.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

Linux

```
--------Version info---------
Polars:              0.20.15
Index type:          UInt32
Platform:            Linux-5.14.0-362.24.1.el9_3.x86_64-x86_64-with-glibc2.34
Python:              3.11.8 (main, Feb 26 2024, 21:34:05) [GCC 11.2.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:              2024.3.1
gevent:
hvplot:
matplotlib:
numpy:               1.26.4
openpyxl:            3.1.2
pandas:              2.1.4
pyarrow:             14.0.2
pydantic:            2.7.0
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
orlp commented 6 months ago

Can you please provide a minimal working example? The given example doesn't run. Preferably one without pandas.

ngandrewhh commented 6 months ago

Hi @orlp, can you uncomment the lines below? The ones marked # leaks reproduce the problem.

        # leaks
        # out[col] = pl_data.select(col)
        # out[col] = pl_data.get_column(col).to_frame()

        # no leak
        # out[col] = pl_data.select(col).to_pandas()
        # out[col] = pl_data.get_column(col).to_pandas()

    # leaks
    # return pl.concat(out.values(), how='horizontal').to_pandas().set_index(['TS', 'ID'])
    # no leak
    # return pd.concat(out.values(), axis=1).set_index(['TS', 'ID'])

Converting to pandas is my attempt to tell the system that we are done with the polars objects so they can be deallocated. If there is a better way to do this, please let me know.
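For reference, the usual way to signal this in pure Python is to drop the references explicitly and force a collection, rather than round-tripping through pandas. A hedged stdlib-only sketch (the dict of lists stands in for the dict of DataFrames; note that even after a successful collection, the allocator may keep freed pages cached rather than returning them to the OS, so RSS alone cannot distinguish a true leak from allocator caching):

```python
import gc

def release(results: dict) -> None:
    """Drop all references held in a dict of intermediate results, then collect.

    Clearing the dict removes the last Python references to the values;
    gc.collect() breaks any reference cycles so the buffers become reclaimable.
    """
    results.clear()
    gc.collect()

out = {"a": list(range(1_000_000))}  # stand-in for a dict of DataFrames
release(out)
```

If RSS still climbs after release() is called on every intermediate dict, that points at the allocator or at references held on the Rust side rather than at Python-level retention.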