inducer / pyopencl

OpenCL integration for Python, plus shiny features
http://mathema.tician.de/software/pyopencl
Other
1.04k stars 237 forks source link

Nanobind leak warnings #758

Open matthiasdiener opened 1 month ago

matthiasdiener commented 1 month ago

Describe the bug At the end of execution, what appears to be SVM-related pyopencl code causes nanobind leak warnings to appear:

nanobind: leaked 3 instances!
 - leaked instance 0x1010fdf88 of type "SVMAllocator"
 - leaked instance 0x1076bd3c8 of type "SVMPool"
 - leaked instance 0x101459258 of type "Context"
nanobind: leaked 3 types!
 - leaked type "pyopencl._cl.SVMAllocator"
 - leaked type "pyopencl._cl.Context"
 - leaked type "pyopencl._cl.SVMPool"
nanobind: leaked 20 functions!
 - leaked function ""
 - leaked function ""
 - leaked function ""
 - leaked function "__eq__"
 - leaked function ""
 - leaked function "alloc_size"
 - leaked function "free_held"
 - leaked function "__init__"
 - leaked function ""
 - leaked function "__call__"
 - leaked function "__init__"
 - ... skipped remainder
nanobind: this is likely caused by a reference counting issue in the binding code.

To Reproduce

  1. $ cd pyopencl
  2. $ python examples/demo_array_svm.py (causes leak warnings)
  3. $ python examples/demo_array.py (does not cause leak warnings)

Edit: Easiest way to reproduce:

$ python -c 'import pyopencl as cl; ctx = cl.create_some_context(); f=cl.SVMAllocation(ctx, 10, 0, 0)'
nanobind: leaked 1 instances!
 - leaked instance 0x106203858 of type "Context"
nanobind: leaked 1 types!
 - leaked type "pyopencl._cl.Context"
nanobind: leaked 7 functions!
 - leaked function "__eq__"
 - leaked function "from_int_ptr"
 - leaked function "set_default_device_command_queue"
 - leaked function "__hash__"
 - leaked function ""
 - leaked function "get_info"
 - leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.

(when not assigning to f, the warnings do not appear)

Expected behavior Don't show the warnings.

Environment (please complete the following information):

Additional context Note that #755 does not address this issue.

inducer commented 1 month ago

Thanks for the small reproducer, that helps! I think part of the problem here is that SVMAllocation holds a reference to the CL context in a shared_ptr:

https://github.com/inducer/pyopencl/blob/f5e1b7b7acdf5beaf3d9435163ad4d099e36dd7a/src/wrap_cl.hpp#L3700

It needs to hold a reference, because it needs the context to free the SVM:

https://github.com/inducer/pyopencl/blob/f5e1b7b7acdf5beaf3d9435163ad4d099e36dd7a/src/wrap_cl.hpp#L3786

Nanobind's support for shared_ptr increases the context's reference count, and the behavior is overall quite complex (see part 1, part 2). While I feel that the context should still get GC'd (especially after the SVMAllocation is gone, which it must be, because it isn't getting leaked), that somehow doesn't seem to be happening. I'm not sure I have a theory on the precise mechanism.

As an alternative to the complexity of shared ownership between C++ and Python, nanobind also has intrusive reference counting. Maybe that's worth a try, for all the cases where currently shared_ptrs are being used.

inducer commented 1 month ago

Actually, we may be dealing with what's described as the "uncollectable inter-language reference cycle" in nanobind's docs for intrusive RC.

inducer commented 1 month ago

I feel like we should move to intrusive refcounting for context, queue, memory pools, and allocators. I'm out of time for today, but if you have time to get this started, I'd be open to looking at it.

inducer commented 1 month ago

759 is a start that already addresses your small reproducer.

matthiasdiener commented 1 month ago

There still appear to be leak warnings (see e.g. https://github.com/inducer/pyopencl/actions/runs/9154012159/job/25163854069), but I can't reproduce them locally at the moment.

inducer commented 1 month ago

Thanks for spotting that! For posterity:

 nanobind: leaked 7 instances!
 - leaked instance 0x55eb67f303f0 of type "pyopencl._cl.Device"
 - leaked instance 0x55eb67f697e0 of type "pyopencl._cl.Device"
 - leaked instance 0x55eb680ec850 of type "pyopencl._cl.Device"
 - leaked instance 0x55eb67edd4b0 of type "pyopencl._cl.Device"
 - leaked instance 0x55eb67929de0 of type "pyopencl._cl.Platform"
 - leaked instance 0x55eb67c9e200 of type "pyopencl._cl.Device"
 - leaked instance 0x55eb6c9cbe50 of type "pyopencl._cl.Platform"
nanobind: leaked 18 types!
 - leaked type "pyopencl._cl.Device"
 - leaked type "pyopencl._cl.SVMAllocator"
 - leaked type "pyopencl._cl.SVMPointer"
 - leaked type "pyopencl._cl.CommandQueue"
 - leaked type "pyopencl._cl.PooledBuffer"
 - leaked type "pyopencl._cl.ImmediateAllocator"
 - leaked type "pyopencl._cl.AllocatorBase"
 - leaked type "pyopencl._cl.Event"
 - leaked type "pyopencl._cl.Buffer"
 - leaked type "pyopencl._cl.Kernel"
 - leaked type "pyopencl._cl.SVMPool"
 - ... skipped remainder
nanobind: leaked 109 functions!
 - leaked function "__init__"
 - leaked function ""
 - leaked function "__call__"
 - leaked function "__call__"
 - leaked function "unbind_from_queue"
 - leaked function "__hash__"
 - leaked function "_set_arg_svm"
 - leaked function "from_int_ptr"
 - leaked function ""
 - leaked function "enqueue_release"
 - leaked function "_set_arg_buf_pack_multi"
 - ... skipped remainder
nanobind: this is likely caused by a reference counting issue in the binding code.

We should track those down sometime.

vincefn commented 1 month ago

Could we disable these messages, e.g. by giving access to [nb::set_leak_warnings()] ?(https://nanobind.readthedocs.io/en/latest/faq.html#why-am-i-getting-errors-about-leaked-functions-and-types)

inducer commented 1 month ago

@vincefn The incidence of these should be substantially reduced post b659cad93a6f97535ea7fdd1c0c7c7e9b71f7a3e (unreleased). It might be good to disable these for release builds by default, because they don't have much utility from the user's end.

matthiasdiener commented 1 month ago

One way to reproduce the warnings after b659cad is:

$ python examples/demo-struct-reduce.py
nanobind: leaked 1 instances!
 - leaked instance 0xaaaadcaedf00 of type "pyopencl._cl.Device"
nanobind: leaked 17 types!
 - leaked type "pyopencl._cl.Device"
 - leaked type "pyopencl._cl.SVMAllocator"
 - leaked type "pyopencl._cl.SVMPointer"
 - leaked type "pyopencl._cl.CommandQueue"
 - leaked type "pyopencl._cl.PooledBuffer"
 - leaked type "pyopencl._cl.ImmediateAllocator"
 - leaked type "pyopencl._cl.AllocatorBase"
 - leaked type "pyopencl._cl.Event"
 - leaked type "pyopencl._cl.Buffer"
 - leaked type "pyopencl._cl.Kernel"
 - leaked type "pyopencl._cl.SVMPool"
 - ... skipped remainder
nanobind: leaked 103 functions!
 - leaked function "device_and_host_timer"
 - leaked function "get_cl_header_version"
 - leaked function "from_int_ptr"
 - leaked function "_set_arg_multi"
 - leaked function ""
 - leaked function ""
 - leaked function "set_arg"
 - leaked function "__init__"
 - leaked function "from_int_ptr"
 - leaked function "get_host_array"
 - leaked function "get_work_group_info"
 - ... skipped remainder
nanobind: this is likely caused by a reference counting issue in the binding code.

For some reason, these do not show up on my Mac, just on Linux.