readbeyond / aeneas

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
http://www.readbeyond.it/aeneas/
GNU Affero General Public License v3.0
2.49k stars 228 forks source link

ndarrays allocated in C extensions are leaking memory #205

Closed rotemdan closed 6 years ago

rotemdan commented 6 years ago

The following causes the process' memory usage to grow indefinitely (calls to del and gc.collect() have no effect):

import numpy as np
import aeneas.cmfcc.cmfcc
import gc
import time

for i in range (0, 1000):
    print(i)
    audioFloats = np.frombuffer(bytes(int(10000000)), dtype='int16').astype('float64') / 32768
    audioMFCCs, _, _ = aeneas.cmfcc.cmfcc.compute_from_data(audioFloats, 16000, 40, 13, 512, 133.3333, 6855.4976, 0.97, 0.025, 0.010)
    del audioFloats
    del audioMFCCs
    gc.collect()
    time.sleep(0.1)

I believe the reason is that the underlying c arrays are not freed as the OWNDATA flag is not explicitly enabled on the returned ndarray objects:

Based on what I read, applying:

PyArray_ENABLEFLAGS(arr, NPY_ARRAY_OWNDATA);

on each of the returned ndarrays should fix the issue. Based on this stackoverflow question it seems like it is not possible to do it from python (the flag is not writable), it has to be done within the C code itself (also: second stackoverflow question).

More info from the official numpy documentation:

You should ensure that the provided memory is not freed while the returned array is in existence. The easiest way to handle this is if data comes from another reference-counted Python object. The reference count on this object should be increased after the pointer is passed in, and the base member of the returned ndarray should point to the Python object that owns the data. Then, when the ndarray is deallocated, the base-member will be DECREF’d appropriately. If you want the memory to be freed as soon as the ndarray is deallocated then simply set the OWNDATA flag on the returned ndarray.

rotemdan commented 6 years ago

So far (in my own build) I've applied the flag to the outputs of the methods:

And it seems to help. (These are all the calls to PyArray_SimpleNewFromData that I could find in the codebase)

There are also various calls to methods like:

Which I'm not sure exactly if require this modification.

This means I can only give a pull request for the ones I've found so far. I don't think I can cover everything without being more familiar with the code.

readbeyond commented 6 years ago

Thank you for your feedback and contribution.

I think to remember that, when I developed the C extensions, I ran them through Valgrind, so I am a bit surprised by the appearance of this issue. On the other hand, I think almost all applications using aeneas just run a few executions before the host process dies, so it might also be the case that the memory leak never surfaced because of it.

Before merging your patch, I need to read a bit more about numpy memory management. I plan to do that in the upcoming weekend.

Best regards,

Alberto Pettarin

On 05/03/2018 05:38 PM, Rotem Dan wrote:

So far (in my own build) I've applied the flag on out outputs of the methods:

  • |compute_from_file| and |compute_from_data| in |cmfcc_py.c| (|cmfcc| submodule)
  • |read_audio_data| in |cwave_py.c| (|cwave| sobmodule)

And it seems to help. (These are all the calls to |PyArray_SimpleNewFromData| that I could find in the codebase)

There also various calls to methods like:

  • |PyArray_SimpleNew|
  • |PyArray_ContiguousFromAny|
  • ..

Which I'm not sure exactly if require this modification.

This means I can only give a pull request for the ones I've found so far. I don't think I can cover everything without being more familiar with the code.

readbeyond commented 6 years ago

I merged your patch, thank you for your contribution.

Now I remember that I run under Valgrind the pure C code, not the Python wrappers, because of the many false positives that CPython generates when run under Valgrind (and the lack of time to actually craft a suitable white/black list).