greenelab / connectivity-search-analyses

hetnet connectivity search research notebooks (previously hetmech)
BSD 3-Clause "New" or "Revised" License

Memory leak in bulk computation of permuted DWPCs #141

Closed dhimmel closed 6 years ago

dhimmel commented 6 years ago

In https://github.com/greenelab/hetmech/pull/140 / https://github.com/greenelab/hetmech/pull/140/commits/b882476bc4f1807033843e22392dacfdeaaec598, we specified computing degree-grouped permutation stats for 200 permuted hetnets. However, the computation died on the 99th iteration without any error message, so I suspected the process was killed due to excessive memory consumption. I reran the bulk notebook while supervising memory usage, and within a day or two the process was consuming 50 GB of RAM and counting.
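
For reference, a minimal way to supervise the notebook's memory usage from Python is to poll the process's resident set size, e.g. with psutil (a sketch, not necessarily the monitoring used here):

import time
import psutil

# Poll the resident set size (RSS) of the current process once a minute.
process = psutil.Process()
for _ in range(60):
    rss_gb = process.memory_info().rss / 1e9
    print(f"RSS: {rss_gb:.1f} GB")
    time.sleep(60)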

Our cache sizes for path count matrices are set at 16 GB, so max memory usage shouldn't exceed 20 GB (4 GB is a generous estimate for the other objects that must be stored). Hence, it seems that the garbage collection is not working as expected, or that we are not properly clearing references to discarded files.
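
For illustration, the pattern that should keep per-iteration memory bounded looks roughly like the sketch below, where the matrix and the stats computation are stand-ins rather than actual hetmech code:

import gc
import numpy as np

def bulk_permutation_stats(n_permutations=200):
    """Sketch: process one permuted hetnet's matrix at a time."""
    for i in range(n_permutations):
        # stand-in for a large path count / DWPC matrix (~200 MB)
        matrix = np.random.rand(5000, 5000)
        # stand-in for computing degree-grouped permutation stats
        stat = matrix.mean()
        print(f"permutation {i}: {stat:.4f}")
        # Drop the only reference so the array is eligible for collection,
        # then run a collection pass to clear any reference cycles.
        del matrix
        gc.collect()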

I stopped the notebook with the growing leak, with its objects still in memory, and then ran:

# enumerate all objects currently alive in the interpreter
from pympler import muppy, summary
all_objects = muppy.get_objects()

Running these commands caused memory consumption to drop:

[screenshot: pympler-mem-decrease, showing memory consumption dropping after running the commands above]

Still not sure what to make of this clue.
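
For completeness, the summary module imported above can also aggregate those objects by type to show which types dominate memory; a minimal sketch of that usage (not part of the original session):

from pympler import muppy, summary

all_objects = muppy.get_objects()
# Group live objects by type and print the biggest memory consumers.
object_summary = summary.summarize(all_objects)
summary.print_(object_summary, limit=15)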

dhimmel commented 6 years ago

In https://github.com/greenelab/hetmech/pull/142/commits/484f36caa67909d21f6db41c9d68479181e25e06 from https://github.com/greenelab/hetmech/pull/142, I stopped the computation mid-leak, with Python consuming 48.9 GB of memory. Comparing tracemalloc snapshots using the lineno statistic gives the following top two statistics (the rest are small :fish:):

/home/dhimmel/anaconda3/envs/hetmech/lib/python3.6/site-packages/pandas/core/indexes/multi.py:2688: size=28.7 GiB (+28.7 GiB), count=1994546 (+1994546), average=15.1 KiB
/home/dhimmel/anaconda3/envs/hetmech/lib/python3.6/site-packages/pandas/core/indexes/multi.py:2683: size=28.7 GiB (+28.7 GiB), count=1992955 (+1992955), average=15.1 KiB
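
(These statistics come from the standard tracemalloc snapshot comparison; a minimal sketch of that workflow, not the exact notebook code, is below.)

import tracemalloc

tracemalloc.start()
snapshot_before = tracemalloc.take_snapshot()
# ... run a chunk of the bulk permuted-DWPC computation here ...
snapshot_after = tracemalloc.take_snapshot()

# Attribute the memory growth to the source lines that allocated it.
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:2]:
    print(stat)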

Hence, it looks like many pandas MultiIndex instances are getting created (and presumably not destroyed), causing the leak. Here is where those line numbers point:

# pandas/core/indexes/multi.py:2683
slabels = slabels[slabels != -1]
# pandas/core/indexes/multi.py:2688
olabels = olabels[olabels != -1]

Still not fully sure how to interpret this, besides that these are perhaps the lines where the leaking memory is allocated.
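
One way to probe that interpretation would be to count live MultiIndex instances among the objects tracked by the garbage collector (a hypothetical check, not something run above):

import gc
import pandas as pd

# If MultiIndex objects themselves are accumulating, this count should grow
# across iterations; if it stays flat, the leaked memory is more likely
# intermediate arrays allocated inside the MultiIndex join code.
n_multiindex = sum(isinstance(obj, pd.MultiIndex) for obj in gc.get_objects())
print(f"live MultiIndex instances: {n_multiindex}")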

Update: opened https://github.com/pandas-dev/pandas/issues/23047