pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

PERF: Memory leak when returning subset of DataFrame and deleting the rest #49582

Open mar-ses opened 1 year ago

mar-ses commented 1 year ago

Pandas version checks

Reproducible Example

I think the simplest thing is to just post this bare example, which shows the memory leak (I had a similar use case in my own code, with data I was pulling from a database):

import os
import gc
import psutil

import numpy as np
import pandas as pd

print("Pandas version:", pd.__version__)

process = psutil.Process(os.getpid())

def get_df(N):
    df = pd.DataFrame({"a": np.linspace(0, 1e6, N), "b": None, "c": None})
    df["b"] = "blabla"

    for i in df.index:
        df.at[i, "c"] = {f"blabla_{j}": j for j in range(i)}

    return df

def get_filtered_df(N):
    df = get_df(N=N)
    df["d"] = 2*df.a
    out = df[df.d.between(1e5, 5e5)].copy()
    # I added these to try to force garbage collection of any part of the df I wasn't returning:
    del df
    gc.collect()
    return out

print(f"Initial memory usage: {process.memory_info().rss / 1e6} MB")

df = []

for i in range(10):
    df.append(get_filtered_df(10000))
    print(f"Memory usage after iteration: {process.memory_info().rss / 1e6} MB")

df = pd.concat(df, ignore_index=True)
gc.collect()
print("Memory usage of df:", df.memory_usage(deep=True).sum() / 1e6)
print(f"Final memory usage: {process.memory_info().rss / 1e6} MB")

The output of this is:

Pandas version: 1.1.5
Initial memory usage: 92.499968 MB
Memory usage after iteration: 2263.015424 MB
Memory usage after iteration: 2538.786816 MB
Memory usage after iteration: 2814.967808 MB
Memory usage after iteration: 3090.624512 MB
Memory usage after iteration: 3366.793216 MB
Memory usage after iteration: 3642.445824 MB
Memory usage after iteration: 3918.639104 MB
Memory usage after iteration: 4194.566144 MB
Memory usage after iteration: 4470.468608 MB
Memory usage after iteration: 4746.375168 MB
Memory usage of df: 1124.821968
Final memory usage: 4746.088448 MB

To me, this looks like a memory leak. The extra 3.5 GB or so that the process is using cannot be accounted for. I tried to look into it further by counting the sizes of everything in globals() with this hacky idea (taken and modified from this page):

import copy
from sys import getsizeof
from collections.abc import Mapping, Container

def deep_getsizeof(o, ids):
    """Find the memory footprint of a Python object

    This is a recursive function that drills down a Python object graph
    like a dictionary holding nested dictionaries with lists of lists
    and tuples and sets.

    sys.getsizeof only gives a shallow size: it counts each object inside a
    container as a pointer, regardless of how big it really is.

    :param o: the object
    :param ids: set of object ids to ignore
    :return:
    """
    d = deep_getsizeof
    if id(o) in ids:
        return 0

    ids.add(id(o))

    if isinstance(o, pd.DataFrame):
        return o.memory_usage(deep=True).sum()
    if isinstance(o, pd.Series):
        return o.memory_usage(deep=True)

    r = getsizeof(o)

    if isinstance(o, (str, bytes)):
        return r

    if isinstance(o, dict):
        return r + sum(d(k, ids) + d(v, ids) for k, v in o.items())
    if isinstance(o, Mapping):
        return r + sum(d(k, ids) + d(v, ids) for k, v in o.items())

    if isinstance(o, Container):
        return r + sum(d(x, ids) for x in o)

    return r

print("Globals:", deep_getsizeof(globals(), set()) / 1e6)

for var_name, var in copy.copy(globals()).items():
    print(f"{var_name}: {deep_getsizeof(var, set()) / 1e6}")

and I get the same picture:

Globals: 1124.833909
__name__: 5.7e-05
__doc__: 0.000113
__package__: 1.6e-05
__loader__: 1.6e-05
__spec__: 1.6e-05
__builtin__: 8e-05
__builtins__: 8e-05
_ih: 0.002586
_oh: 0.00024
_dh: 0.000213
In: 0.002586
Out: 0.00024
get_ipython: 6.4e-05
exit: 5.6e-05
quit: 5.6e-05
_: 4.9e-05
__: 4.9e-05
___: 4.9e-05
_i: 0.000993
_ii: 4.9e-05
_iii: 4.9e-05
_i1: 0.000993
os: 8e-05
gc: 8e-05
psutil: 8e-05
np: 8e-05
pd: 8e-05
process: 5.6e-05
get_df: 0.000136
get_filtered_df: 0.000136
df: 1124.821968
i: 2.8e-05
_i2: 0.001448
copy: 8e-05
getsizeof: 7.2e-05
Mapping: 0.000888
Container: 0.000888
deep_getsizeof: 0.000136

Am I doing something wrong, or is this a true memory leak? Could it be related to the fact that I have all these dicts in the DataFrame? Does the space for their hash tables not get deallocated or something? I don't really know how to debug this further.

I know some might say it's bad practice to have dicts in a DataFrame, but in my real-life case I get them because I query this data from a database, and some of the fields are JSON records and the like.
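
One way to dig further might be tracemalloc: a minimal sketch (reusing process and get_filtered_df from the snippet above) that compares the Python-level allocations it can see against the process RSS. If tracemalloc accounts for far less than the RSS growth, the "missing" memory is probably being held by the C allocator rather than by live Python objects.

import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

part = get_filtered_df(10000)  # one iteration of the loop above

after = tracemalloc.take_snapshot()
top = after.compare_to(before, "lineno")

# Largest Python-level allocation sites since the first snapshot
for stat in top[:10]:
    print(stat)

# Compare what tracemalloc can account for with the RSS psutil reports
print("tracemalloc total:", sum(s.size_diff for s in top) / 1e6, "MB")
print("process RSS:", process.memory_info().rss / 1e6, "MB")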

Installed Versions

INSTALLED VERSIONS
------------------
commit : b5958ee1999e9aead1938c0bba2b674378807b3d
python : 3.6.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.111-1.el7.centos.x86_64
Version : #1 SMP Wed Apr 17 17:45:41 CEST 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.5
numpy : 1.19.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.2.0
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : 4.6.4
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : 3.0.2
IPython : 7.16.2
pandas_datareader : None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.8.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.4
sqlalchemy : 1.4.27
tables : 3.6.1
tabulate : None
xarray : 0.16.2
xlrd : 1.2.0
xlwt : None
numba : 0.53.1

Prior Performance

No response

jbrockmendel commented 1 year ago

Can you give this a try on pandas 1.5.1 or main?

mar-ses commented 1 year ago

Yes, the values are quite different, but the issue is the same. Part of the reason for the different values could be that I ran the main example in Jupyter, while this one I ran directly as a script. I never fully understood why I get that big ~2 GB offset in the main example.

Here is the output:

Pandas version: 1.5.1
Initial memory usage: 62.513152 MB
Memory usage after iteration: 478.826496 MB
Memory usage after iteration: 1270.603776 MB
Memory usage after iteration: 1547.554816 MB
Memory usage after iteration: 1823.973376 MB
Memory usage after iteration: 2101.18656 MB
Memory usage after iteration: 2424.328192 MB
Memory usage after iteration: 3216.216064 MB
Memory usage after iteration: 3605.475328 MB
Memory usage after iteration: 3882.16832 MB
Memory usage after iteration: 4383.531008 MB
Memory usage of df: 1124.821968
Final memory usage: 4154.92096 MB

Overall, when constructing and testing the example I posted, as well as dealing with the real use case I had, I got the impression that the exact memory usage was very unstable and hard to explain.

E.g., I am sure that sometimes one of the iterations wouldn't increase the memory at all, and other times it would increase it by double the normal step.

In the real example, sometimes the memory goes down, sometimes it goes up. I've been doing a deep dive, and the jumps in memory usage are sometimes larger than any single DataFrame that I query or create anywhere. Other times usage goes down (but overall it's slowly inflating).

I don't know if this is a common problem when profiling memory usage in Python.

mar-ses commented 1 year ago

Also, isn't it a bit weird how the increments in memory usage are not regular? Sometimes it's 600 MB, sometimes 800 MB, sometimes 200 MB, even though each step should be identical. Do you know why this could be?

Could it just be random timing of the garbage collector?

Also, if you run this on your system, what values do you get?
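
One thing worth ruling out before blaming the garbage collector is the allocator itself: with this many small dict allocations, glibc's malloc tends to keep freed memory in its arenas instead of returning it to the OS, so RSS can stay high even after Python has released the objects. A rough sketch for testing this on Linux (glibc-only, reusing the process handle from the example above):

import ctypes
import gc

def trim_and_report(process):
    """Force a GC pass, then ask glibc to hand free arena pages back to the OS."""
    gc.collect()
    before = process.memory_info().rss / 1e6
    ctypes.CDLL("libc.so.6").malloc_trim(0)  # glibc-specific; fails on other libcs
    after = process.memory_info().rss / 1e6
    print(f"RSS before trim: {before:.1f} MB, after trim: {after:.1f} MB")

Calling trim_and_report(process) after each get_filtered_df() iteration should show whether the growth is allocator fragmentation (RSS drops sharply after the trim) or memory that really cannot be reclaimed.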

somurzakov commented 1 year ago

@mar-ses the memory leak happens because you are doing one of the most common pandas bad practices: continuously mutating a DataFrame in a loop, one cell at a time.

If you rewrite these lines in get_df(N):

    for i in df.index:
        df.at[i, "c"] = {f"blabla_{j}": j for j in range(i)}

into the following, the memory leak will disappear:

    df["c"] = pd.Series([[{f"blabla_{j}": j for j in range(i)}] for i in df.index])

I am not sure how the pandas team can fix this issue, but the underlying problem is that users treat DataFrames as mutable variables, just like any other Python variable. In reality, the best practice is to treat a DataFrame as an immutable object: chain your transformations, copying into new DataFrames, and let the GC collect the old ones.
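
To make that concrete, a minimal sketch of get_df rewritten in that style (get_df_chained is just an illustrative name), building each column once and assigning it whole instead of mutating cells:

def get_df_chained(N):
    # Build each column up front; no per-cell mutation of an existing frame.
    base = pd.DataFrame({"a": np.linspace(0, 1e6, N)})
    return base.assign(
        b="blabla",
        c=[{f"blabla_{j}": j for j in range(i)} for i in range(N)],
        d=lambda frame: 2 * frame["a"],
    )

assign returns a new DataFrame, so the intermediate frames become garbage as soon as the chain moves on.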

mar-ses commented 1 year ago

I'm very surprised to see that, actually; it was my impression that the recommendation was to create the DataFrame first, with its full size allocated but empty, and then modify its elements, rather than adding/appending rows, because appending rows results in constant array creation and is less efficient.

At least that's what I thought I had picked up from places like Stack Overflow; perhaps I was mistaken.

somurzakov commented 1 year ago

@mar-ses modifying a DataFrame in place is definitely not recommended, because when you modify one cell, it invalidates the entire memory block behind it, which stores the values of other nearby cells. Do that in a loop for every cell and you can see how much waste is created by allocating and invalidating memory blocks at each iteration.

For details about the BlockManager you can read this blog post: https://uwekorn.com/2020/05/24/the-one-pandas-internal.html

This blog post also contains a few other really good recommendations about common pandas anti-patterns: https://www.aidancooper.co.uk/pandas-anti-patterns/
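
To see the blocks being described, you can peek at the internal block manager; a rough sketch against pandas 1.5 (df._mgr is private API and may change or behave differently between versions):

df = get_df(1000)

# Each block is a contiguous array holding all columns of one dtype,
# e.g. a float64 block for "a" and object block(s) for "b" and "c".
print("number of blocks:", df._mgr.nblocks)
for block in df._mgr.blocks:
    print(block.dtype, block.shape)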

jbrockmendel commented 1 year ago

I'm now seeing

Initial memory usage: 79.9744 MB

Memory usage after iteration: 442.945536 MB
Memory usage after iteration: 247.074816 MB
Memory usage after iteration: 292.450304 MB
Memory usage after iteration: 345.153536 MB
Memory usage after iteration: 343.887872 MB
Memory usage after iteration: 369.041408 MB
Memory usage after iteration: 362.8032 MB
Memory usage after iteration: 409.530368 MB
Memory usage after iteration: 554.610688 MB
Memory usage after iteration: 431.632384 MB

@somurzakov is right that setting df.at[i, "c"] is not encouraged, but that's mostly for speed reasons, not memory usage. In fact, replacing the .at[i, "c"] loop with df["c"] = pd.Series([[{f"blabla_{j}": j for j in range(i)}] for i in df.index]) increases the memory footprint:

Memory usage after iteration: 445.48096 MB
Memory usage after iteration: 637.48096 MB
Memory usage after iteration: 807.624704 MB
Memory usage after iteration: 815.869952 MB
Memory usage after iteration: 846.483456 MB
Memory usage after iteration: 876.25728 MB
Memory usage after iteration: 974.516224 MB
Memory usage after iteration: 1005.826048 MB
Memory usage after iteration: 908.234752 MB
Memory usage after iteration: 946.753536 MB

@mar-ses can you confirm either of these results on main?

Also, any chance there is a typo in what you're trying to set? Each entry in df["c"] is a single-element list containing a decent-sized dict. Nested data is not encouraged.
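
If flattening ever becomes an option, pd.json_normalize can expand nested records into flat columns; a minimal sketch on made-up records (the real payloads aren't shown in this thread):

records = [
    {"id": 1, "meta": {"source": "db", "tags": {"x": 1, "y": 2}}},
    {"id": 2, "meta": {"source": "api", "tags": {"x": 3, "y": 4}}},
]

# Nested dicts become dotted columns: id, meta.source, meta.tags.x, meta.tags.y
flat = pd.json_normalize(records)
print(flat.columns.tolist())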

mar-ses commented 1 year ago

Regarding the first question, I don't remember anymore, and I don't think I still have the example that was causing this at hand. I did try other ways of doing this besides the example I gave, but I don't think I tried creating the series with such an inner list comprehension.

Regarding the second point, it's no typo, though I know it's discouraged. In this case, I was dealing with data from a database that includes a lot of "metadata" stored as JSON, and I actually need almost all of its contents. The JSON payloads are quite large and mostly of a fixed structure, but not exactly, so it would have been very awkward to try to expand them out first.