Open mshafiei opened 2 months ago
Hi,
I'm observing that the GPU runs out of memory when I call the local laplacian filter in a loop; it's reproducible with a short snippet that does just that. When I enable only Mullapudi2016 and disable the Manual schedule, I no longer observe this issue.
@shoaibkamil
I can repro this behavior running on macOS with Metal. Investigating.
It's also happening for the blur app and the bilateral grid. Is the root cause in the generator compilation step?
Other pieces of information that might be helpful: I'm using the host-cuda-profile argument in add_halide_library to enable GPU scheduling, on an RTX 3070 with NVIDIA driver 535.183.01 and CUDA 12.1.
It looks like the generated extension code makes no attempt to free any GPU allocations made by the pipeline. It does set host dirty and copy back to host, though, so I'm not sure what the intention was here. @steven-johnson, is this just an oversight? Should the PyHalideBuffer destructor be calling device_free?
namespace {

template<int dimensions>
struct PyHalideBuffer {
    // Must allocate at least 1, even if d=0
    static constexpr int dims_to_allocate = (dimensions < 1) ? 1 : dimensions;

    Py_buffer py_buf;
    halide_dimension_t halide_dim[dims_to_allocate];
    halide_buffer_t halide_buf;
    bool py_buf_needs_release = false;

    // Fills halide_buf/halide_dim from a Python object via the buffer protocol.
    bool unpack(PyObject *py_obj, int py_getbuffer_flags, const char *name) {
        return Halide::PythonRuntime::unpack_buffer(py_obj, py_getbuffer_flags, name, dimensions, py_buf, halide_dim, halide_buf, py_buf_needs_release);
    }

    // Note: this releases the Python buffer, but never frees any device
    // allocation the pipeline may have attached to halide_buf.
    ~PyHalideBuffer() {
        if (py_buf_needs_release) {
            PyBuffer_Release(&py_buf);
        }
    }

    PyHalideBuffer() = default;
    PyHalideBuffer(const PyHalideBuffer &other) = delete;
    PyHalideBuffer &operator=(const PyHalideBuffer &other) = delete;
    PyHalideBuffer(PyHalideBuffer &&other) = delete;
    PyHalideBuffer &operator=(PyHalideBuffer &&other) = delete;
};
} // namespace
namespace Halide::PythonExtensions {
namespace {
const char* const local_laplacian_kwlist[] = {
    "input",
    "levels",
    "alpha",
    "beta",
    "output",
    nullptr
};
} // namespace
// local_laplacian
PyObject *local_laplacian(PyObject *module, PyObject *args, PyObject *kwargs) {
    PyObject* py_input;
    int py_levels;
    float py_alpha;
    float py_beta;
    PyObject* py_output;
    if (!PyArg_ParseTupleAndKeywords(args, kwargs, "OiffO", (char**)local_laplacian_kwlist,
                                     &py_input,
                                     &py_levels,
                                     &py_alpha,
                                     &py_beta,
                                     &py_output)) {
        PyErr_Format(PyExc_ValueError, "Internal error");
        return nullptr;
    }
    PyHalideBuffer<3> b_input;
    PyHalideBuffer<3> b_output;
    if (!b_input.unpack(py_input, 0, local_laplacian_kwlist[0])) return nullptr;
    if (!b_output.unpack(py_output, PyBUF_WRITABLE, local_laplacian_kwlist[4])) return nullptr;

    // Mark the input dirty on the host so the pipeline copies it to the device.
    b_input.halide_buf.set_host_dirty();

    int result;
    Py_BEGIN_ALLOW_THREADS
    result = local_laplacian(
        &b_input.halide_buf,
        py_levels,
        py_alpha,
        py_beta,
        &b_output.halide_buf);
    Py_END_ALLOW_THREADS

    // Copy the result back to the host; the device allocations themselves
    // are never freed before the PyHalideBuffers are destroyed.
    if (result == 0) result = halide_copy_to_host(nullptr, &b_output.halide_buf);
    if (result != 0) {
#ifndef HALIDE_PYTHON_EXTENSION_OMIT_ERROR_AND_PRINT_HANDLERS
        PyErr_Format(PyExc_RuntimeError, "Halide Runtime Error: %d", result);
#else
        PyErr_Format(PyExc_ValueError, "Halide error %d", result);
#endif  // HALIDE_PYTHON_EXTENSION_OMIT_ERROR_AND_PRINT_HANDLERS
        return nullptr;
    }
    Py_INCREF(Py_None);
    return Py_None;
}
Should the PyHalideBuffer destructor be calling device_free?
If we do that, don't we risk freeing a device allocation that might be in use by a shared buffer allocation (e.g. one created via device_crop or similar)? Or is it possible that we're just not freeing all of the PyHalideBuffers?
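For concreteness, here is a minimal sketch of the aliasing hazard in question. The names and sizes are illustrative, and it assumes a CUDA-enabled Halide runtime with HalideBuffer.h and HalideRuntimeCuda.h available:

#include "HalideBuffer.h"
#include "HalideRuntimeCuda.h"

void aliasing_hazard_sketch() {
    // A parent buffer that owns a device allocation.
    Halide::Runtime::Buffer<float> parent(256, 256);
    parent.device_malloc(halide_cuda_device_interface());

    // A crop shares the parent's storage; if it also gets a device handle
    // (via halide_device_crop under the hood), that handle points into the
    // parent's device allocation.
    Halide::Runtime::Buffer<float> crop = parent.cropped(0, 64, 128);

    // Unconditionally calling halide_device_free() on the parent's
    // halide_buffer_t here would leave the crop's device handle dangling.
}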
It looks like the halide_buffer_t is being created right there from a NumPy array, so I don't think anything can alias with it. Or is it possible to pass some sort of wrapper of a Halide::Runtime::Buffer?
OK, I will take a look
OK, yeah, I think an explicit call to halide_device_free() is likely needed in the dtor of PyHalideBuffer; let me do some testing first.
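To make the idea concrete, a minimal sketch of what such a destructor change could look like (an illustration of the proposal, not necessarily the exact change that will land):

~PyHalideBuffer() {
    // Sketch: free any device allocation the pipeline attached to this
    // buffer before releasing the Python buffer. halide_device_free() is
    // assumed here to be callable with a null user_context, and the guard
    // on halide_buf.device skips buffers with no device allocation.
    if (halide_buf.device) {
        (void)halide_device_free(nullptr, &halide_buf);
    }
    if (py_buf_needs_release) {
        PyBuffer_Release(&py_buf);
    }
}

Since each PyHalideBuffer is built fresh from a NumPy array for a single call, freeing its device allocation in the destructor should not affect any shared buffer, per the aliasing discussion above.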
I think https://github.com/halide/Halide/pull/8439 is what we need; please give it a try.