ufo-kit / ufo-core

GLib-based framework for GPU-based data processing
GNU Lesser General Public License v3.0

Memory leak #191

Open tfarago opened 1 year ago

tfarago commented 1 year ago

When I use the same Ufo.Resources throughout my Python program, all is fine. However, if new resources are created every time a Ufo.Scheduler is created, not everything is freed. At first glance the resources code shouldn't leak, and yet it does. @gabs1234 we should look into this.

tfarago commented 1 year ago

I wrote a small script that demonstrates the issue. Running watch -n 1 nvidia-smi alongside it makes the leak easy to track.

gabs1234 commented 1 year ago

Here is a version of your Python code in C.

Here is an excerpt of what valgrind reports for this code (full output here):

==1293489== 32 bytes in 1 blocks are definitely lost in loss record 869 of 3,760
==1293489==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1293489==    by 0x5C456E7: ??? (in /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.105.01)
==1293489==    by 0x5C46013: ??? (in /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.105.01)
==1293489==    by 0x5C46537: ??? (in /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.105.01)
==1293489==    by 0x48DE7F3: alloc_device_array (ufo-buffer.c:177)
==1293489==    by 0x48E0CFE: ufo_buffer_get_device_array (ufo-buffer.c:1008)
==1293489==    by 0x57E39015: ufo_polar_coordinates_task_process (ufo-polar-coordinates-task.c:208)
==1293489==    by 0x48F4F20: ufo_task_process (ufo-task-iface.c:150)
==1293489==    by 0x48F38B9: run_task (ufo-scheduler.c:212)
==1293489==    by 0x4C9DA50: ??? (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7200.4)
==1293489==    by 0x4A83B42: start_thread (pthread_create.c:442)
==1293489==    by 0x4B14BB3: clone (clone.S:100)
==1293489== 
==1293489== 48 (16 direct, 32 indirect) bytes in 1 blocks are definitely lost in loss record 2,221 of 3,760
==1293489==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1293489==    by 0x4C77738: g_malloc (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7200.4)
==1293489==    by 0x4C8EB74: g_slice_alloc (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7200.4)
==1293489==    by 0x4C8F277: g_slist_append (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7200.4)
==1293489==    by 0x48ECE88: ufo_plugin_manager_get_plugin (ufo-plugin-manager.c:192)
==1293489==    by 0x48ED51A: ufo_plugin_manager_get_task (ufo-plugin-manager.c:342)
==1293489==    by 0x1094AD: get_memory_in (resources_leak.c:104)
==1293489==    by 0x1096D7: cartesian_to_polar (resources_leak.c:154)
==1293489==    by 0x1099EF: main (resources_leak.c:204)

Idea for the second leak: I believe this might be related. When the plugin_manager class is finalized, priv->modules is not freed, as explained in the comment:

/* XXX: This is a necessary hack! We return a full reference for
 * ufo_plugin_manager_get_task() so that the Python run-time can cleanup
 * the tasks that are assigned. However, there is no relationship between
 * graphs, tasks and the plugin manager and it might happen, that the plugin
 * manager is destroy before the graph which in turn would unref invalid
 * objects. So, we just don't close the modules and hope for the best.
 */
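
To make the ordering problem in that comment concrete, here is a minimal toy sketch in plain C. It is not ufo-core code: Module, Task, module_open, module_unref, task_new and task_free are hypothetical names invented for illustration. It only shows the general idea that if each task held its own reference on the module it came from, the manager could drop its reference at finalization and the module would still outlive the task. Whether this would be a workable change for ufo-core is not established here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a loaded plugin module (not a ufo-core type). */
typedef struct {
    int ref_count;
    int *closed;    /* set to 1 once the module is actually closed */
} Module;

/* Hypothetical stand-in for a task created from a module. */
typedef struct {
    Module *module; /* strong reference keeps the module alive */
} Task;

/* The "manager" opens the module and owns the initial reference. */
static Module *module_open(int *closed_flag)
{
    Module *m = malloc(sizeof *m);
    if (m == NULL)
        abort();
    m->ref_count = 1;
    m->closed = closed_flag;
    *closed_flag = 0;
    return m;
}

/* Drop one reference; close and free the module when none remain. */
static void module_unref(Module *m)
{
    if (--m->ref_count == 0) {
        *m->closed = 1; /* real code would close the module here */
        free(m);
    }
}

/* Creating a task takes an extra reference on its module. */
static Task *task_new(Module *m)
{
    Task *t = malloc(sizeof *t);
    if (t == NULL)
        abort();
    m->ref_count++;
    t->module = m;
    return t;
}

/* Destroying a task gives its module reference back. */
static void task_free(Task *t)
{
    module_unref(t->module);
    free(t);
}
```

With this layout, calling module_unref from the manager before task_free leaves the module open until the task is gone, so either destruction order is safe in the toy model.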

For the first leak, the last ufo function in the trace before the leaking allocation is found here. This section allocates a buffer object in OpenCL. However, it is correctly freed here when the class is finalized.

tfarago commented 1 year ago

Hm, I see there is no clRetainMemObject call when creating the device array, which is weird. Second, can you check the reference count on release? If that doesn't say 0 we know where the problem is. Let's set the second leak aside for now.
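
As a rough illustration of what the reference count check would show, here is a toy model in plain C. It is not the OpenCL API: ToyMem, toy_create, toy_retain and toy_release are hypothetical names. It only mirrors the documented counting rules, where a newly created memory object starts with one reference, a retain adds one, and a release subtracts one and frees the object at zero. On a real cl_mem the count can be read with clGetMemObjectInfo and CL_MEM_REFERENCE_COUNT, which the OpenCL specification describes as a leak-hunting aid whose value should be treated as immediately stale.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a cl_mem object; not the OpenCL API. */
typedef struct {
    unsigned ref_count;
} ToyMem;

/* Mirrors creation: the new object starts with one reference. */
static ToyMem *toy_create(void)
{
    ToyMem *m = malloc(sizeof *m);
    if (m == NULL)
        abort();
    m->ref_count = 1;
    return m;
}

/* Mirrors a retain call: one more owner. */
static void toy_retain(ToyMem *m)
{
    m->ref_count++;
}

/* Mirrors a release call. Returns the number of references left
 * after the release and frees the object when that number is zero. */
static unsigned toy_release(ToyMem *m)
{
    unsigned remaining = --m->ref_count;
    if (remaining == 0)
        free(m);
    return remaining;
}
```

In this model a create followed by a single release reaches zero without any retain. A non-zero count after the final release would point to an extra retain somewhere that is never matched by a release.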