sweeneyde / multimerge

A faster multi-way merge algorithm interchangeable with heapq.merge
MIT License
23 stars 0 forks source link

Memory leak in long running process #6

Open scryclip opened 1 year ago

scryclip commented 1 year ago

I swapped out heapq.merge for this on a continuously running process. A new call to mulitmerge.merge would be made often it was passed a *list of generators with a key. The instance of the multimerge.merge iterator would not be exhausted it would be used in a loop until a break and just be garbage collected and created anew in a future loop. This would slowly leak memory. Switching back to heapq.merge resolved it. Either something is not deleted In the C or the python gc is not able to fully clean up the multimerge instance when it removes it.

sweeneyde commented 1 year ago

Can you provide a minimal reproducible example? Maybe use psutil to show memory usage over time.

I tested with the script below (key/no-key,forward/reverse,exhausting/stopping,various lengths,various numbers of iterables) and could not find good evidence of a leak. Since dynamic memory allocation is used, occasionally there is fragmentation that requires allocating new pages, but not in any consistent way: memory usage fluctuates, but is ultimately bounded. It may just be that multimerge.merge uses more dynamic memory allocation, so there is naturally more fragmentation and slight variations in memory usage over time.

I suppose another possibility is that multimerge.merge is holding onto some references slightly longer that heapq.merge does. For example, I believe the computed key(x) values might be preserved for one more iteration, which is within the realm of reasonable implementation details.

On the other hand, if you can demonstrate that something is systematically and permanently leaking, that would certainly be a bug worth fixing, but I can't know without a reproducer.

measure_memory.py ```python import random from itertools import product from operator import itemgetter if 1: print("using multimerge.merge") from multimerge import merge if 0: print("using heapq.merge") from heapq import merge lengths = range(20) keys = [None, itemgetter(0), itemgetter(1), itemgetter(1, 0)] reverses = [False, True] repeats = range(10) pool = list(product('ABC', (-500, 500))) def make_data(): data = [] for n, key, reverse, _ in product(lengths, keys, reverses, repeats): inputs = [] for _ in range(n): it = random.choices(pool, k=random.randrange(20)) it.sort(key=key, reverse=reverse) inputs.append(it) data.append((inputs, key, reverse)) return data def check_all(data): for inputs, key, reverse in data: for take in [1, 2, 5, 10, 20, 50, 100]: m = merge(*inputs, key=key, reverse=reverse) for _ in range(take): next(m, None) m = merge(*inputs, key=key, reverse=reverse) list(m) list(m) del m def main(): if 1: print("measuring memory usage") import psutil process = psutil.Process() get_stat = lambda: process.memory_info().rss if 0: print("measuring refcounts") # needs debug build of python from sys import gettotalrefcount get_stat = gettotalrefcount prev_stat = 0 for _ in range(1000): check_all(make_data()) stat = get_stat() if stat == prev_stat: print(".", end='', flush=True) else: print(f"\n{stat} (+{stat-prev_stat})") prev_stat = stat print("\ndone") if __name__ == "__main__": main() ```
output.txt ``` using multimerge.merge measuring memory usage 23158784 (+23158784) ......... 23969792 (+811008) .................................................................................................................................................... 22974464 (+-995328) .... 23781376 (+806912) ............................................................................................................................................................... 22974464 (+-806912) ..... 23781376 (+806912) ..................................... 22974464 (+-806912) 23240704 (+266240) ... 23781376 (+540672) .................................................................................. 22974464 (+-806912) .... 23781376 (+806912) ....................................................................... 22974464 (+-806912) 23240704 (+266240) ... 23781376 (+540672) ........................... 22974464 (+-806912) .... 23781376 (+806912) ................ 22974464 (+-806912) .... 23781376 (+806912) ..... 22974464 (+-806912) .... 23781376 (+806912) ..... 22974464 (+-806912) .... 23781376 (+806912) .................................................................................. 22974464 (+-806912) .... 23781376 (+806912) ....................................................................... 22974464 (+-806912) .... 23781376 (+806912) ........................................................................................................................................................................................................................... done ```