scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License

Memory leak when iterating over array #1280

Closed jpearkes closed 1 year ago

jpearkes commented 2 years ago

Version of Awkward Array

1.7.0

Description and code to reproduce

Hi,

A student I am working with noticed what appears to be a memory leak when iterating over an array. I've replicated the issue with the following snippet.

import awkward as ak
import numpy as np
import psutil

A = ak.Array([[0, 1], [0], [1, 0]] * 10000000)
my_sum = np.zeros(len(A))
i = 0
for event in A:
    my_sum[i] = np.sum(event)
    i += 1
    if i % 10000 == 0:
        print('RAM memory % used:', psutil.virtual_memory()[2])

A (much faster) solution without the leak is to do the sum on the awkward array directly:

my_sum = np.sum(A, axis=1)

But I figured I should report the issue here in case it is helpful.

jpivarski commented 2 years ago

I'm on a phone right now, so I can't test this directly yet, but Python will use all the memory it has available until it reaches a limit before calling the garbage collector. The above code should show a linear increase in memory consumption until it gets to that limit, then I think it plateaus (instead of a sawtooth shape, which is the other conceptual possibility). If it eventually stops increasing in memory use, even if that's at the limit of your computer's resources or at the process's ulimit, then that is correct behavior for a garbage-collected language.
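As a side note, CPython's cyclic collector can be probed directly from the standard library. A minimal sketch (not from the thread) showing that collection is driven by allocation-count thresholds and can be forced explicitly, which is useful when checking whether memory is actually reclaimable:

```python
import gc

# The three generation thresholds that trigger automatic collection
# (typically (700, 10, 10) in CPython, though defaults can vary by version)
thresholds = gc.get_threshold()
print(thresholds)

# Force a full collection; returns the number of unreachable objects found
unreachable = gc.collect()
assert unreachable >= 0
```

If memory use keeps growing even after an explicit gc.collect(), that points to a genuine leak rather than lazy collection.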

What it comes down to, though, is that this is an antipattern: you don't want to iterate over all elements of a large array with a Python for loop, since that creates Python objects for each element in the array (here, ak.Arrays of length 1 or 2). You want to do:

my_sum = ak.sum(A, axis=-1)

and possibly send that through ak.to_numpy if you need that to be NumPy, rather than Awkward. That does the sum entirely in compiled code—no Python objects for each short list, and no waiting on the garbage collector to bring memory use under control.

These should be thought of as techniques to avoid using Python for anything large (memory or time). You get the computational result, but without representing all the intermediate steps in Python objects. (Same philosophy as NumPy.)
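The same philosophy can be sketched with plain NumPy (using a regular 2-D array as a stand-in for the ragged Awkward array): both approaches give identical results, but the Python loop materializes a Python object per row while the vectorized reduction stays in compiled code.

```python
import numpy as np

A = np.array([[0, 1], [1, 0], [2, 3]] * 1000)  # shape (3000, 2)

# Antipattern: a Python-level loop, creating one Python object per row
loop_sum = np.zeros(len(A))
for i, row in enumerate(A):
    loop_sum[i] = np.sum(row)

# Idiomatic: a single vectorized reduction in compiled code
vec_sum = A.sum(axis=-1)

assert np.array_equal(loop_sum, vec_sum)
```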

jpearkes commented 2 years ago

Thanks for the detailed explanation Jim! We were seeing it use up all the memory available and then promptly crash the lxplus/SWAN node it was running on.

jpivarski commented 2 years ago

I didn't do this question justice at all. I missed this, for instance:

A (much faster) solution without the leak is to do the sum on the awkward array directly:

my_sum = np.sum(A, axis=1)

where you were pretty clear that you know how it's supposed to be done.

I tested the sample code and the extra memory used after the loop doesn't seem to go away with gc.collect(), though it's hard to set up a clean experiment on this computer that also has Zoom running (since psutil.virtual_memory() returns total RAM usage). I think it could really be a memory leak, most likely in the "handle" objects that point to the arrays, rather than the arrays themselves.
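For a cleaner experiment, per-process memory can be measured instead of system-wide memory. A small sketch (assuming psutil is installed) using Process.memory_info().rss, which reports only this process's resident set size and so is unaffected by other programs running on the machine:

```python
import psutil

proc = psutil.Process()  # the current process
rss_bytes = proc.memory_info().rss  # resident set size of this process only
print(f"RSS: {rss_bytes / 1e6:.1f} MB")
assert rss_bytes > 0
```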

If we weren't porting all of these "handle" objects from C++ to Python, that would be something we'd have to fix. As it is, the new Python version of this should be automatically memory-clean. I just tried it by replacing ak.Array with ak._v2.Array and np.sum with ak._v2.sum and the memory use vs i was a lot flatter. So, just because it seems to be a v1-only thing and we'll be replacing that, I'll be labeling this as "won't fix." Thanks for reporting it, though!

jpivarski commented 1 year ago

Good news! PR #2311 apparently fixes this memory leak, too! I'm reopening this issue just so that it can be closed by the PR, for record-keeping.