Open mshafiei opened 2 months ago
Hi,
I'm observing that the GPU runs out of memory when I call the local laplacian filter in a loop; it's reproducible with a short snippet that does just that. When I enable only Mullapudi2016 and disable the Manual schedule, I no longer observe this issue.
@shoaibkamil
I can repro this behavior running on macOS with Metal. Investigating.
It's also happening for the blur app and the bilateral grid. Is the root cause in the generator compilation step?
Other pieces of information that might be helpful: I'm using the host-cuda-profile argument in add_halide_library to enable GPU scheduling, on an RTX 3070 with NVIDIA driver 535.183.01 and CUDA 12.1.
It looks like the generated extension code makes no attempt to free any GPU allocations made by the pipeline. It does set host dirty and copy back to host, though, so I'm not sure what the intention was here. @steven-johnson, is this just an oversight? Should the PyHalideBuffer destructor be calling device_free?
namespace {

template<int dimensions>
struct PyHalideBuffer {
    // Must allocate at least 1, even if d=0
    static constexpr int dims_to_allocate = (dimensions < 1) ? 1 : dimensions;

    Py_buffer py_buf;
    halide_dimension_t halide_dim[dims_to_allocate];
    halide_buffer_t halide_buf;
    bool py_buf_needs_release = false;

    // Fills halide_buf/halide_dim from a Python object via the buffer protocol.
    bool unpack(PyObject *py_obj, int py_getbuffer_flags, const char *name) {
        return Halide::PythonRuntime::unpack_buffer(py_obj, py_getbuffer_flags, name, dimensions, py_buf, halide_dim, halide_buf, py_buf_needs_release);
    }

    // Note: this releases the Python buffer, but never frees any device
    // allocation the pipeline may have attached to halide_buf.
    ~PyHalideBuffer() {
        if (py_buf_needs_release) {
            PyBuffer_Release(&py_buf);
        }
    }

    PyHalideBuffer() = default;
    PyHalideBuffer(const PyHalideBuffer &other) = delete;
    PyHalideBuffer &operator=(const PyHalideBuffer &other) = delete;
    PyHalideBuffer(PyHalideBuffer &&other) = delete;
    PyHalideBuffer &operator=(PyHalideBuffer &&other) = delete;
};
} // namespace
namespace Halide::PythonExtensions {
namespace {
const char* const local_laplacian_kwlist[] = {
    "input",
    "levels",
    "alpha",
    "beta",
    "output",
    nullptr
};
} // namespace
// local_laplacian
PyObject *local_laplacian(PyObject *module, PyObject *args, PyObject *kwargs) {
    PyObject* py_input;
    int py_levels;
    float py_alpha;
    float py_beta;
    PyObject* py_output;
    if (!PyArg_ParseTupleAndKeywords(args, kwargs, "OiffO", (char**)local_laplacian_kwlist,
                                     &py_input,
                                     &py_levels,
                                     &py_alpha,
                                     &py_beta,
                                     &py_output)) {
        PyErr_Format(PyExc_ValueError, "Internal error");
        return nullptr;
    }
    PyHalideBuffer<3> b_input;
    PyHalideBuffer<3> b_output;
    if (!b_input.unpack(py_input, 0, local_laplacian_kwlist[0])) return nullptr;
    if (!b_output.unpack(py_output, PyBUF_WRITABLE, local_laplacian_kwlist[4])) return nullptr;

    // Mark the input dirty on the host so the pipeline copies it to the device.
    b_input.halide_buf.set_host_dirty();

    int result;
    Py_BEGIN_ALLOW_THREADS
    result = local_laplacian(
        &b_input.halide_buf,
        py_levels,
        py_alpha,
        py_beta,
        &b_output.halide_buf);
    Py_END_ALLOW_THREADS

    // Copy the result back to the host; the device allocations themselves
    // are never freed before the PyHalideBuffers are destroyed.
    if (result == 0) result = halide_copy_to_host(nullptr, &b_output.halide_buf);
    if (result != 0) {
#ifndef HALIDE_PYTHON_EXTENSION_OMIT_ERROR_AND_PRINT_HANDLERS
        PyErr_Format(PyExc_RuntimeError, "Halide Runtime Error: %d", result);
#else
        PyErr_Format(PyExc_ValueError, "Halide error %d", result);
#endif  // HALIDE_PYTHON_EXTENSION_OMIT_ERROR_AND_PRINT_HANDLERS
        return nullptr;
    }
    Py_INCREF(Py_None);
    return Py_None;
}
Should the PyHalideBuffer destructor be calling device_free?
If we do that, don't we risk freeing a device allocation that might be in use by a shared buffer allocation (e.g. one created via device_crop or similar)? Or is it possible that we're just not freeing all of the PyHalideBuffers?
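For concreteness, here is a minimal sketch of the aliasing hazard in question. The names and sizes are illustrative, and it assumes a CUDA-enabled Halide runtime with HalideBuffer.h and HalideRuntimeCuda.h available:

#include "HalideBuffer.h"
#include "HalideRuntimeCuda.h"

void aliasing_hazard_sketch() {
    // A parent buffer that owns a device allocation.
    Halide::Runtime::Buffer<float> parent(256, 256);
    parent.device_malloc(halide_cuda_device_interface());

    // A crop shares the parent's storage; if it also gets a device handle
    // (via halide_device_crop under the hood), that handle points into the
    // parent's device allocation.
    Halide::Runtime::Buffer<float> crop = parent.cropped(0, 64, 128);

    // Unconditionally calling halide_device_free() on the parent's
    // halide_buffer_t here would leave the crop's device handle dangling.
}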
It looks like the halide_buffer_t is being created right there from a NumPy array, so I don't think anything can alias with it. Or is it possible to pass some sort of wrapper of a Halide::Runtime::Buffer?
OK, I will take a look
OK, yeah, I think an explicit call to halide_device_free() is likely needed in the dtor of PyHalideBuffer; let me do some testing first.
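To make the idea concrete, a minimal sketch of what such a destructor change could look like (an illustration of the proposal, not necessarily the exact change that will land):

~PyHalideBuffer() {
    // Sketch: free any device allocation the pipeline attached to this
    // buffer before releasing the Python buffer. halide_device_free() is
    // assumed here to be callable with a null user_context, and the guard
    // on halide_buf.device skips buffers with no device allocation.
    if (halide_buf.device) {
        (void)halide_device_free(nullptr, &halide_buf);
    }
    if (py_buf_needs_release) {
        PyBuffer_Release(&py_buf);
    }
}

Since each PyHalideBuffer is built fresh from a NumPy array for a single call, freeing its device allocation in the destructor should not affect any shared buffer, per the aliasing discussion above.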
I think https://github.com/halide/Halide/pull/8439 is what we need; please give it a try.