pandas-dev / pandas


Memory stays around after pickle cycle #43156

Open · mrocklin opened 3 years ago

mrocklin commented 3 years ago

Hi Folks,

Related to #43155, I'm running into memory issues when pickling many small pandas DataFrames. The following script creates a DataFrame, splits it into many small groups, pickles each group, and then unpickles them all again. It then deletes every object, but something is still sticking around in memory. Here is the script, followed by the output from memory_profiler:

import numpy as np
import pandas as pd
import pickle

@profile  # provided by memory_profiler when run via `python -m memory_profiler`
def test():
    # Build a ~150 MiB frame and split it into thousands of small groups.
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    # Round-trip every group through pickle.
    groups = [pickle.dumps(group) for group in groups]
    groups = [pickle.loads(group) for group in groups]

    # Drop the last references; memory should return to baseline, but doesn't.
    del groups

if __name__ == "__main__":
    test()

python -m memory_profiler memory_issue.py
Filename: memory_issue.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     7   76.574 MiB   76.574 MiB           1   @profile
     8                                         def test():
     9  229.445 MiB  152.871 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    10  230.738 MiB    1.293 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    11  398.453 MiB  167.715 MiB           1       _, groups = zip(*df.groupby("partitions"))
    12  245.633 MiB -152.820 MiB           1       del df
    13                                         
    14  445.688 MiB   47.273 MiB        8631       groups = [pickle.dumps(group) for group in groups]
    15  712.285 MiB  266.598 MiB        8631       groups = [pickle.loads(group) for group in groups]
    16                                         
    17  557.488 MiB -154.797 MiB           1       del groups

As you can see, we start at around 77 MiB in memory and end at around 557 MiB, despite all relevant objects being released. The leak grows with the number of groups (scale the 10000 multiplier up or down to move the leak accordingly). Any help or pointers on how to track this down would be welcome.
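
For reference, a minimal sketch of how that growth can be measured at different group counts (assuming psutil is installed for reading RSS; leaked_mib and n_groups are illustrative names, with n_groups standing in for the hard-coded 10000 above):

import gc
import pickle

import numpy as np
import pandas as pd
import psutil  # assumption: psutil is available; any RSS reader would do


def leaked_mib(n_groups):
    """Rough RSS growth (MiB) left behind by one pickle round-trip."""
    proc = psutil.Process()
    gc.collect()
    before = proc.memory_info().rss

    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * n_groups).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df
    groups = [pickle.loads(pickle.dumps(g)) for g in groups]
    del groups

    gc.collect()
    return (proc.memory_info().rss - before) / 2**20


for n in (100, 1000, 10000):
    print(f"{n:>6} groups: {leaked_mib(n):8.1f} MiB retained")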

jbrockmendel commented 3 years ago

Another data point: running your script on OSX, I'm seeing a lot more released at the end:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     7   68.484 MiB   68.484 MiB           1   @profile
     8                                         def test():
     9  221.121 MiB  152.637 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    10  221.828 MiB    0.707 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    11  395.141 MiB  173.312 MiB           1       _, groups = zip(*df.groupby("partitions"))
    12  242.551 MiB -152.590 MiB           1       del df
    13                                         
    14  499.613 MiB  104.137 MiB        8684       groups = [pickle.dumps(group) for group in groups]
    15  915.664 MiB  284.641 MiB        8684       groups = [pickle.loads(group) for group in groups]
    16                                         
    17  286.395 MiB -629.270 MiB           1       del groups

Also, if I add a gc.collect() after del groups, I get another ~40 MB back.

mrocklin commented 3 years ago

Same result with gc.collect():

Filename: memory_issue.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     8   98.180 MiB   98.180 MiB           1   @profile
     9                                         def test():
    10  250.863 MiB  152.684 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    11  252.039 MiB    1.176 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    12  420.848 MiB  168.809 MiB           1       _, groups = zip(*df.groupby("partitions"))
    13  267.980 MiB -152.867 MiB           1       del df
    14                                         
    15  468.211 MiB   47.391 MiB        8643       groups = [pickle.dumps(group) for group in groups]
    16  738.316 MiB  270.105 MiB        8643       groups = [pickle.loads(group) for group in groups]
    17                                         
    18  579.688 MiB -158.629 MiB           1       del groups
    19  528.438 MiB  -51.250 MiB           1       gc.collect()

jbrockmendel commented 3 years ago

Going through gc.get_objects() I don't see any big objects left behind.
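
(A sketch of that kind of scan, using sys.getsizeof as a rough size proxy; it's shallow, so it undercounts containers and only catches obviously large survivors:)

import gc
import sys


def safe_size(obj):
    """sys.getsizeof can raise for exotic objects; treat those as size 0."""
    try:
        return sys.getsizeof(obj)
    except Exception:
        return 0


gc.collect()
# List the (shallowly) largest live objects the collector still tracks.
for obj in sorted(gc.get_objects(), key=safe_size, reverse=True)[:20]:
    print(type(obj).__name__, safe_size(obj))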

gjoseph92 commented 3 years ago

FYI, you should probably run this with MALLOC_TRIM_THRESHOLD_=0 python -m memory_profiler memory_issue.py on Linux, or DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python -m memory_profiler memory_issue.py on macOS, to encourage the allocator to release pages back to the OS. memory_profiler just tracks the RSS of the process, nothing fancier, so it's possible the memory has been freed as far as pandas can see, just not fully returned to the OS.
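
On Linux you can also ask glibc's allocator to release its free pages directly from Python, which makes the "freed but not returned" case easy to distinguish from a real leak (a sketch; malloc_trim is glibc-specific, so this won't work on macOS or with other allocators):

import ctypes

# glibc only: force malloc to return free heap pages to the OS.
libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)

If RSS drops sharply after this call, the memory was sitting in the allocator's free lists rather than being held by pandas.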

mrocklin commented 3 years ago

I believe I have already run this with MALLOC_TRIM_THRESHOLD_=0 and saw the same results, but I should verify.


mrocklin commented 3 years ago

Yes, same result on my Linux/Ubuntu machine running Mambaforge.

jbrockmendel commented 3 years ago

Is there a viable non-pickle alternative?

When I change pickle.dumps(group) to pickle.dumps(group.values) to pickle the underlying ndarrays, I end up with 50-60 MB less than I do when pickling the DataFrames (and the gc.collect() no longer recovers anything), but that's still 2-3 times the original footprint.
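
(A sketch of that variant, for reference. It is not an exact equivalent: going through .values drops the index and column labels and coerces everything to a single dtype, which happens to be harmless for the uniform frames in this script:)

import pickle

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((20000, 1000)))
df["partitions"] = (df[0] * 10000).astype(int)
_, groups = zip(*df.groupby("partitions"))
del df

# Pickle only the underlying ndarrays instead of the DataFrames...
payloads = [pickle.dumps(g.values) for g in groups]
# ...and rebuild plain RangeIndex frames on the way back.
groups = [pd.DataFrame(pickle.loads(p)) for p in payloads]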