robbmcleod / pyfastnoisesimd

Python module wrapping C++ FastNoiseSIMD
BSD 3-Clause "New" or "Revised" License

Noise().genAsGrid() Memory Leak #18

Closed Birdulon closed 5 years ago

Birdulon commented 5 years ago

Possibly an upstream problem but I'll leave that for others to work out.

System:

Linux x86_64

What is done:

Generating many arrays of noise without assigning them lasting identifiers (e.g. as frames of an animation)

Expectation:

Memory footprint of the Python process does not grow without bound

Result:

Memory footprint of the Python process grows with each call until it is closed or the system enters an OOM state. Manual gc.collect() calls do nothing to alleviate it.

Reproduction:

import pyfastnoisesimd as fns

n = fns.Noise()
for i in range(1000):  # Repeat as necessary, but be careful not to OOM yourself
    n.genAsGrid((1024, 1024, 1))
robbmcleod commented 5 years ago

It seems the returned array from genAsGrid has 2 ref counts.

I'm calling PyArray_SimpleNewFromData, which apparently doesn't take ownership of the data. But even if I set

noiseArray = PyArray_SimpleNewFromData(1, dims, NPY_FLOAT32, data);
PyArray_UpdateFlags((PyArrayObject *)noiseArray, NPY_ARRAY_ALIGNED|NPY_ARRAY_CARRAY|NPY_ARRAY_OWNDATA);
return noiseArray;

it still leaks. Here is the test script I'm using:

import numpy as np
import pyfastnoisesimd as fns
import psutil, gc, sys

gc.set_debug(gc.DEBUG_UNCOLLECTABLE)

n = fns.Noise()
n.maxWorkers = 1
n_cycles = 128
mem = np.zeros(n_cycles)
for i in range(n_cycles):
    # Repeat as necessary, but be careful not to OOM yourself
    result = n.genAsGrid((1024, 1024, 1))
    mem[i] = psutil.virtual_memory().available

print('Result has {} ref counts'.format(sys.getrefcount(result)))
print('Leaking {} kB per call'.format((mem[0] - mem[-1])/n_cycles/1024))
print('Garbage: ', gc.garbage)

So I'm not really sure how to fix this. I could use PyArray_SimpleNew instead, but then I'd have to make sure the arrays are aligned correctly for the SIMD instruction sets.
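(For reference, a minimal sketch of that alternative, not the module's actual code: let NumPy own the allocation via PyArray_SimpleNew and check at runtime whether the data pointer happens to meet the SIMD alignment. The 64-byte requirement and the fallback comment are assumptions.)

#include <Python.h>
#include <stdint.h>
#include <numpy/arrayobject.h>

/* NumPy allocates and owns the buffer, so nothing leaks, but the pointer
   is only as aligned as the default allocator makes it (often 16 bytes). */
static PyObject *
makeOwnedNoiseArray(npy_intp *dims)
{
    PyObject *noiseArray = PyArray_SimpleNew(3, dims, NPY_FLOAT32);
    if (noiseArray == NULL) {
        return NULL;
    }
    void *data = PyArray_DATA((PyArrayObject *)noiseArray);
    if (((uintptr_t)data % 64) != 0) {
        /* Not aligned for the widest SIMD sets; a fallback (copying into an
           aligned scratch buffer, or a non-SIMD fill path) would be needed. */
    }
    return noiseArray;
}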

robbmcleod commented 5 years ago
PyArray_UpdateFlags((PyArrayObject *)noiseArray, NPY_ARRAY_ALIGNED|NPY_ARRAY_CARRAY|NPY_ARRAY_OWNDATA);

doesn't actually manage to set NPY_ARRAY_OWNDATA. You have to use

PyArray_ENABLEFLAGS((PyArrayObject *)noiseArray, NPY_ARRAY_OWNDATA);

However, this crashes on iteration #2 when the object is freed. It seems that Cython has a callback that can be used to free such data, but I can't find an equivalent in the C-API.
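(For reference, a sketch of the closest C-API equivalent to that Cython callback: wrap the buffer in a PyCapsule with a destructor and make the capsule the array's base object, so the buffer is released exactly when the array is collected. The capsule name and the assumption that the buffer should go back through FastNoiseSIMD::FreeNoiseSet are illustrative, not the module's current code.)

#include <Python.h>
#include <numpy/arrayobject.h>
#include "FastNoiseSIMD.h"

/* Called when the capsule (and therefore the array holding it as its base)
   is garbage collected. */
static void
freeNoiseCapsule(PyObject *capsule)
{
    void *data = PyCapsule_GetPointer(capsule, "noise_data");
    if (data != NULL) {
        FastNoiseSIMD::FreeNoiseSet((float *)data);
    }
}

static PyObject *
wrapNoiseSet(float *data, npy_intp *dims)
{
    PyObject *noiseArray = PyArray_SimpleNewFromData(3, dims, NPY_FLOAT32, data);
    if (noiseArray == NULL) {
        return NULL;
    }
    PyObject *capsule = PyCapsule_New(data, "noise_data", freeNoiseCapsule);
    if (capsule == NULL) {
        Py_DECREF(noiseArray);
        return NULL;
    }
    /* Steals the capsule reference; the capsule now lives exactly as long
       as the array does, and its destructor frees the noise set. */
    if (PyArray_SetBaseObject((PyArrayObject *)noiseArray, capsule) < 0) {
        Py_DECREF(noiseArray);
        return NULL;
    }
    return noiseArray;
}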

Birdulon commented 5 years ago

From quick tests with sys.getrefcount on random tuples, it seems like a result of 2 is to be expected for something with 1 reference, which is in line with the documentation on it:

The count returned is generally one higher than you might expect, because it includes the (temporary) reference as an argument to getrefcount().

My initial thought was to subclass numpy.ndarray for the result of PyFNS_FillNoiseSet and add FastNoiseSIMD::FreeNoiseSet to the destructor, but I can't see anything special about FastNoiseSIMD::FreeNoiseSet.

Birdulon commented 5 years ago

In PyFNS_GetNoiseSet, replacing return PyArray_SimpleNewFromData(3, dims, NPY_FLOAT32, data); with

PyObject * noiseArray = PyArray_SimpleNewFromData(3, dims, NPY_FLOAT32, data);
PyArray_ENABLEFLAGS((PyArrayObject *)noiseArray, NPY_ARRAY_OWNDATA);
return noiseArray;

appears to fix the issue entirely in single-threaded mode on my setup.

Birdulon commented 5 years ago

The multi-threaded function has a similar solution; the only difference is that I had to make a separate function to be called on the final array after all the threads finish. Setting NPY_ARRAY_OWNDATA on the individual chunks naturally caused segmentation faults, as they were garbage collected while the master array was still instanced. There's probably a nicer/cleaner way of solving this, and it might not work on Windows, but I'll be using this on my end for the time being as it seems to work flawlessly.

static PyObject *
PyFNS_ClaimNoiseSet(FNSObject *self, PyObject *args)
{
    PyObject* noiseObj;
    const char *format = "O";

    if (!PyArg_ParseTuple(args, format, &noiseObj))
    {
        return NULL;
    }
    PyArray_ENABLEFLAGS((PyArrayObject *)noiseObj, NPY_ARRAY_OWNDATA);

    Py_RETURN_NONE;
}
...
for peon in workers:
  peon.result()
self._fns.ClaimNoiseSet(noise)
return noise
robbmcleod commented 5 years ago

That's a good idea for collecting the data. I'll probably use that, thanks.

I wonder if Windows is having trouble due to the call to _aligned_malloc which in FastNoiseSIMD has a corresponding _aligned_free call. The Microsoft docs aren't very illuminating:

https://msdn.microsoft.com/en-us/library/8z34s9c6.aspx?f=255&MSPPError=-2147217396

https://msdn.microsoft.com/en-us/library/17b5h8td.aspx

robbmcleod commented 5 years ago

Past discussion by Python-dev and NumPy on the subject of aligned memory:

https://github.com/numpy/numpy/issues/5312

https://bugs.python.org/issue18835

It looks like the problem has not been resolved.

Apparently the Windows API forbids calling free() on an _aligned_malloc region, so that's likely why it faults.

Perhaps we could switch the C++ lib to use std::aligned_alloc and see if that works. That would limit us to MSVC 2015, which is Python 3.5, which is fine.
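(For reference, a sketch of what a symmetric allocate/free pair could look like, assuming MSVC still has to go through _aligned_malloc/_aligned_free while other platforms use std::aligned_alloc, whose blocks may be released with plain free(). The AlignedNew/AlignedDelete names are illustrative, not the library's.)

#include <cstdlib>
#ifdef _MSC_VER
#include <malloc.h>
#endif

static float *AlignedNew(size_t count, size_t alignment)
{
    size_t bytes = count * sizeof(float);
#ifdef _MSC_VER
    return static_cast<float *>(_aligned_malloc(bytes, alignment));
#else
    /* aligned_alloc requires the size to be a multiple of the alignment. */
    bytes = (bytes + alignment - 1) / alignment * alignment;
    return static_cast<float *>(std::aligned_alloc(alignment, bytes));
#endif
}

static void AlignedDelete(float *ptr)
{
#ifdef _MSC_VER
    _aligned_free(ptr);  /* _aligned_malloc blocks must not go through free() */
#else
    std::free(ptr);      /* aligned_alloc blocks can be released with free()  */
#endif
}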

robbmcleod commented 5 years ago

For the moment I've simply regressed to using standard allocators for Windows. I'll possibly email the NumPy dev list and ask for advice, as this is something that should be in NumExpr as well. Thanks for the feedback.

Birdulon commented 5 years ago

If the existing destructor is compatible with std::aligned_alloc, then that'd be a good solution; otherwise, subclassing numpy.ndarray to override the constructor and destructor is probably the only way forward. As for the memory ownership, it may also be solved by calling Py_DECREF rather than setting the NPY_ARRAY_OWNDATA flag - I'll test that later, as there may be invalid references very slowly piling up somewhere.

robbmcleod commented 5 years ago

Yeah, I'll probably have to look around for a C-extension subclass of numpy.ndarray. I think astropy does that...