mrocklin opened this issue 3 years ago
Another datapoint: running your script on macOS, I'm seeing a lot more memory being released at the end:
Line # Mem usage Increment Occurences Line Contents
============================================================
7 68.484 MiB 68.484 MiB 1 @profile
8 def test():
9 221.121 MiB 152.637 MiB 1 df = pd.DataFrame(np.random.random((20000, 1000)))
10 221.828 MiB 0.707 MiB 1 df["partitions"] = (df[0] * 10000).astype(int)
11 395.141 MiB 173.312 MiB 1 _, groups = zip(*df.groupby("partitions"))
12 242.551 MiB -152.590 MiB 1 del df
13
14 499.613 MiB 104.137 MiB 8684 groups = [pickle.dumps(group) for group in groups]
15 915.664 MiB 284.641 MiB 8684 groups = [pickle.loads(group) for group in groups]
16
17 286.395 MiB -629.270 MiB 1 del groups
Also, if I add a gc.collect() after del groups, I get another ~40 MB back. Here is the same run with gc.collect() included:
Filename: memory_issue.py
Line # Mem usage Increment Occurences Line Contents
============================================================
8 98.180 MiB 98.180 MiB 1 @profile
9 def test():
10 250.863 MiB 152.684 MiB 1 df = pd.DataFrame(np.random.random((20000, 1000)))
11 252.039 MiB 1.176 MiB 1 df["partitions"] = (df[0] * 10000).astype(int)
12 420.848 MiB 168.809 MiB 1 _, groups = zip(*df.groupby("partitions"))
13 267.980 MiB -152.867 MiB 1 del df
14
15 468.211 MiB 47.391 MiB 8643 groups = [pickle.dumps(group) for group in groups]
16 738.316 MiB 270.105 MiB 8643 groups = [pickle.loads(group) for group in groups]
17
18 579.688 MiB -158.629 MiB 1 del groups
19 528.438 MiB -51.250 MiB 1 gc.collect()
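Since memory_profiler is only sampling the process's RSS, it can help to cross-check its numbers independently. A small sketch (Linux-specific: it reads /proc/self/statm, falling back to the peak RSS from getrusage elsewhere; note ru_maxrss units differ across platforms):

```python
import os
import resource


def rss_mib() -> float:
    """Return the process's current resident set size in MiB.

    Reads /proc/self/statm on Linux; falls back to the peak RSS
    reported by getrusage on other POSIX systems.
    """
    try:
        with open("/proc/self/statm") as f:
            # Second field is the number of resident pages.
            pages = int(f.read().split()[1])
        return pages * os.sysconf("SC_PAGE_SIZE") / 2**20
    except (FileNotFoundError, ValueError):
        # ru_maxrss is KiB on Linux but bytes on macOS; KiB assumed here.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
```

Calling rss_mib() before and after the del/gc.collect() steps gives the same kind of numbers the profiler table shows, without the per-line instrumentation.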
Going through gc.get_objects(), I don't see any big objects left behind.
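For anyone repeating that survey, a minimal sketch of scanning gc.get_objects() for large survivors (note sys.getsizeof is shallow, so a DataFrame or list reports only its own header, not the buffers it references):

```python
import gc
import sys


def biggest_objects(n=10):
    """Return (size, type name) for the n largest gc-tracked objects.

    sys.getsizeof is shallow, so this is only a rough survey: containers
    report their header size, not the memory of what they reference.
    """
    sized = []
    for obj in gc.get_objects():
        try:
            sized.append((sys.getsizeof(obj), type(obj).__name__))
        except TypeError:
            # Some extension objects don't support getsizeof.
            pass
    sized.sort(reverse=True)
    return sized[:n]
```

A shallow scan coming back empty is consistent with the leak living in allocator-held pages rather than in live Python objects.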
FYI, you should probably run this with MALLOC_TRIM_THRESHOLD_=0 python -m memory_profiler memory_issue.py
on Linux, or DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python -m memory_profiler memory_issue.py
on macOS, to encourage the allocator to release pages back to the OS. memory_profiler
is just tracking the RSS of the process, nothing fancier, so it's possible the memory has been freed as far as pandas is concerned, just not fully released to the OS.
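Besides the environment variable, glibc's malloc_trim can be invoked directly from Python to ask the allocator to hand free heap pages back to the OS. A Linux/glibc-only sketch via ctypes (macOS and musl-based systems don't expose this symbol, so it degrades to a no-op):

```python
import ctypes
import ctypes.util


def trim_heap() -> bool:
    """Ask glibc to return free heap pages to the OS (Linux/glibc only).

    Returns True if malloc_trim reports that memory was released,
    False on platforms without malloc_trim.
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    libc = ctypes.CDLL(libc_name)
    try:
        # malloc_trim(0): trim as much as possible from the top of the heap.
        return bool(libc.malloc_trim(0))
    except AttributeError:
        # Non-glibc C libraries don't provide malloc_trim.
        return False
```

Calling trim_heap() right after del groups / gc.collect() is a quick way to test whether the lingering RSS is allocator-held free memory rather than a true leak.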
I believe that I have already run this with MALLOC_TRIM_THRESHOLD_=0 and saw the same results, but I should verify.
Yes, same result on my Linux/Ubuntu machine running Mambaforge.
Is there a viable non-pickle alternative?
When I change pickle.dumps(group) to pickle.dumps(group.values) to pickle the underlying ndarrays, I end up with 50-60 MB less than I do when pickling the DataFrames (and the gc.collect() no longer reclaims anything), but that's still 2-3 times the original footprint.
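That variant can be sketched as a pair of helpers that pickle the ndarray and carry the index/columns alongside it so the frame can be rebuilt. This is only valid for frames with a single homogeneous dtype; mixed-dtype frames (like the groups here, which carry an int "partitions" column next to floats) would need per-column handling:

```python
import pickle

import numpy as np
import pandas as pd


def dumps_frame(df: pd.DataFrame) -> tuple:
    """Pickle a homogeneous DataFrame via its underlying ndarray,
    keeping the index and columns separately for reconstruction."""
    return (pickle.dumps(df.to_numpy()), df.index, df.columns)


def loads_frame(payload: tuple) -> pd.DataFrame:
    """Rebuild the DataFrame from the pickled ndarray plus axes."""
    values, index, columns = payload
    return pd.DataFrame(pickle.loads(values), index=index, columns=columns)
```

This sidesteps whatever per-frame overhead DataFrame pickling adds, which is presumably why it recovers some (but not all) of the memory.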
Hi Folks,
Related to #43155, I'm running into memory issues when pickling many small pandas DataFrames. The following script creates a DataFrame, splits it up, pickles each little split, and brings the pieces back again. It then deletes all objects from memory, but something still sticks around. Here is the script, followed by the output of memory_profiler.
As you can see, we start with 70 MiB in memory and end with 550 MiB, despite all relevant objects being released. This leak increases with the number of groups (scale the 10000 number to move the leak up or down). Any help or pointers on how to track this down would be welcome.
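For completeness, the script implied by the profiler line contents above can be reconstructed roughly as follows. The memory_profiler decorator is made optional so the sketch runs standalone, and the sizes are parametrized (the issue uses nrows=20000, ncols=1000, ngroups=10000):

```python
import gc
import pickle

import numpy as np
import pandas as pd

try:
    from memory_profiler import profile
except ImportError:  # profiling is optional for this sketch
    def profile(func):
        return func


@profile
def test(nrows=20000, ncols=1000, ngroups=10000):
    # Build one big frame, then split it into many small groups.
    df = pd.DataFrame(np.random.random((nrows, ncols)))
    df["partitions"] = (df[0] * ngroups).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    # Round-trip every group through pickle.
    groups = [pickle.dumps(group) for group in groups]
    groups = [pickle.loads(group) for group in groups]

    n = len(groups)
    del groups
    gc.collect()
    return n


if __name__ == "__main__":
    test()
```

Run it under python -m memory_profiler (optionally with the allocator settings discussed above) to reproduce the tables in this thread.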