taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0
25.51k stars 2.28k forks

Memory leaks in Taichi AOT runtime (GFX/Vulkan) #6448

Open k-ye opened 2 years ago

k-ye commented 2 years ago

Describe the bug

I ran Taichi AOT under valgrind, which reported memory leaks in several places (traces below).

To reproduce, run valgrind with this command line:

valgrind --leak-check=full --log-file='mem-leak-test.txt' --track-origins=yes -v ${CMD} ${ARGS}

==3169151== 868,128 (288 direct, 867,840 indirect) bytes in 1 blocks are definitely lost in loss record 3,570 of 3,575
==3169151==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3169151==    by 0x525E254: taichi::lang::gfx::GfxRuntime::register_taichi_kernel(taichi::lang::gfx::GfxRuntime::RegisterParams) (in /libtaichi_c_api.so)
==3169151==    by 0x4D6E8C8: taichi::lang::gfx::KernelImpl::KernelImpl(taichi::lang::gfx::GfxRuntime*, taichi::lang::gfx::GfxRuntime::RegisterParams&&) (in /libtaichi_c_api.so)
==3169151==    by 0x52788A3: taichi::lang::gfx::(anonymous namespace)::AotModuleImpl::make_new_kernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /libtaichi_c_api.so)
==3169151==    by 0x4B3A193: taichi::lang::aot::Module::get_kernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /libtaichi_c_api.so)
==3169151==    by 0x5277D09: taichi::lang::gfx::(anonymous namespace)::AotModuleImpl::get_graph(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /libtaichi_c_api.so)
==3169151==    by 0x4AE2A0E: AotModule::get_cgraph(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /libtaichi_c_api.so)
==3169151==    by 0x4AE4B62: ti_get_aot_module_compute_graph (in /libtaichi_c_api.so)

==3169151== 109,720 (104 direct, 109,616 indirect) bytes in 1 blocks are definitely lost in loss record 3,556 of 3,575
==3169151==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3169151==    by 0x52A3574: std::_Hashtable<unsigned int, std::pair<unsigned int const, taichi::lang::vulkan::VulkanResourceBinder::Set>, std::allocator<std::pair<unsigned int const, taichi::lang::vulkan::VulkanResourceBinder::Set> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_rehash_aux(unsigned long, std::integral_constant<bool, true>) (in /libtaichi_c_api.so)
==3169151==    by 0x52A341B: std::_Hashtable<unsigned int, std::pair<unsigned int const, taichi::lang::vulkan::VulkanResourceBinder::Set>, std::allocator<std::pair<unsigned int const, taichi::lang::vulkan::VulkanResourceBinder::Set> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<unsigned int const, taichi::lang::vulkan::VulkanResourceBinder::Set>, false>*, unsigned long) (in /libtaichi_c_api.so)
==3169151==    by 0x528EADE: taichi::lang::vulkan::VulkanResourceBinder::buffer(unsigned int, unsigned int, taichi::lang::DevicePtr, unsigned long) (in /libtaichi_c_api.so)
==3169151==    by 0x528B25E: taichi::lang::vulkan::VulkanPipeline::create_descriptor_set_layout(taichi::lang::vulkan::VulkanPipeline::Params const&) (in /libtaichi_c_api.so)
==3169151==    by 0x528AE6D: taichi::lang::vulkan::VulkanPipeline::VulkanPipeline(taichi::lang::vulkan::VulkanPipeline::Params const&) (in /libtaichi_c_api.so)
==3169151==    by 0x529A3D5: taichi::lang::vulkan::VulkanDevice::create_pipeline(taichi::lang::PipelineSourceDesc const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (in /libtaichi_c_api.so)
==3169151==    by 0x525D1EB: taichi::lang::gfx::CompiledTaichiKernel::CompiledTaichiKernel(taichi::lang::gfx::CompiledTaichiKernel::Params const&) (in /libtaichi_c_api.so)

==3169151== 20,020 (272 direct, 19,748 indirect) bytes in 1 blocks are definitely lost in loss record 3,533 of 3,575
==3169151==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3169151==    by 0x1739F0FC: ???
==3169151==    by 0x177D2B26: ???
==3169151==    by 0x177F15D6: ???
==3169151==    by 0x177D9094: ???
==3169151==    by 0x177D91AF: ???
==3169151==    by 0x52858FD: vkapi::create_compute_pipeline(VkDevice_T*, unsigned int, VkPipelineShaderStageCreateInfo&, std::shared_ptr<vkapi::DeviceObjVkPipelineLayout>, std::shared_ptr<vkapi::DeviceObjVkPipelineCache>, std::shared_ptr<vkapi::DeviceObjVkPipeline>) (in /libtaichi_c_api.so)
==3169151==    by 0x528C70D: taichi::lang::vulkan::VulkanPipeline::create_compute_pipeline(taichi::lang::vulkan::VulkanPipeline::Params const&) (in libtaichi_c_api.so)
==3169151==    by 0x528AE8B: taichi::lang::vulkan::VulkanPipeline::VulkanPipeline(taichi::lang::vulkan::VulkanPipeline::Params const&) (in libtaichi_c_api.so)

jim19930609 commented 1 year ago

I tried the command valgrind --leak-check=full --log-file='mem-leak-test.txt' --track-origins=yes -v ${CMD} ${ARGS} on taichi-aot-demo. However, neither the tutorial nor the mpm88 demo seems to reproduce the memory leaks mentioned above.

I did notice a memory leak in the C-API, but it seems more related to the Vulkan implementation of vkCmdBindPipeline():

==2628974== 5,496 bytes in 1 blocks are definitely lost in loss record 1,569 of 1,587
==2628974==    at 0x484147B: calloc (vg_replace_malloc.c:1340)
==2628974==    by 0x8D4D58F: ??? 
==2628974==    by 0x90F9DB1: ??? 
==2628974==    by 0x90FB8C4: ??? 
==2628974==    by 0x91096BE: ??? 
==2628974==    by 0x54D75A0: taichi::lang::vulkan::VulkanCommandList::bind_pipeline(taichi::lang::Pipeline*) (taichi/rhi/vulkan/vulkan_device.cpp:831)
==2628974==    by 0x54744CA: taichi::lang::gfx::GfxRuntime::launch_kernel(taichi::lang::gfx::GfxRuntime::KernelHandle, taichi::lang::RuntimeContext*) (taichi/runtime/gfx/runtime.cpp:508)
==2628974==    by 0x54ADF49: taichi::lang::gfx::KernelImpl::launch(taichi::lang::RuntimeContext*) (taichi/runtime/gfx/aot_graph_data.h:14)
==2628974==    by 0x4ED1582: ti_launch_kernel (c_api/src/taichi_core_impl.cpp:607)
==2628974==    by 0x404030: launch (taichi4/_skbuild/linux-x86_64-3.8/cmake-install/c_api/include/taichi/cpp/taichi.hpp:533)
==2628974==    by 0x404030: launch (taichi4/_skbuild/linux-x86_64-3.8/cmake-install/c_api/include/taichi/cpp/taichi.hpp:536)
==2628974==    by 0x404030: App0_tutorial::run() (0_tutorial_kernel/app.cpp:42)
==2628974==    by 0x40387C: main (0_tutorial_kernel/app.cpp:68)

To investigate the memory leak further, we'll probably have to isolate a minimal AOT run from taco.

mpm88_mem_leak_test.txt tutorial_mem_leak_test.txt
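
Comparing logs like the ones above by hand is tedious, so here is a minimal sketch of a helper that lists each "definitely lost" record with its size and the first stack frame that starts with a Taichi symbol. The function name summarize_leaks and the marker parameter are hypothetical, introduced only for illustration; the regular expressions assume the valgrind line layout seen in the traces in this issue and may need adjusting for other valgrind versions or options.

```python
import re

# Header of a "definitely lost" record, with or without the
# "(N direct, M indirect)" part, as seen in the traces in this issue.
HEADER = re.compile(
    r"^==\d+==\s+([\d,]+) (?:\([^)]*\) )?bytes in [\d,]+ blocks are definitely lost"
)
# A stack frame line: "==pid==    at 0x...: symbol (location)".
FRAME = re.compile(r"^==\d+==\s+(?:at|by) 0x[0-9A-Fa-f]+: (.*)$")
# Trailing "(in /path/lib.so)" or "(file.cpp:123)" location suffix.
LOCATION = re.compile(r"\s\((?:in [^)]*|[^()]*:\d+)\)\s*$")


def summarize_leaks(log_text, marker="taichi::"):
    """Return [(bytes_lost, first_frame_starting_with_marker_or_None), ...],
    sorted from the largest record to the smallest."""
    records = []
    current = None  # the record whose frames are currently being read
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = [int(header.group(1).replace(",", "")), None]
            records.append(current)
            continue
        frame = FRAME.match(line)
        if frame is None:
            current = None  # any non-frame line ends the current record
        elif current is not None and current[1] is None:
            symbol = frame.group(1)
            if symbol.startswith(marker):
                current[1] = LOCATION.sub("", symbol)
    return sorted((tuple(r) for r in records), key=lambda r: r[0], reverse=True)
```

For example, summarize_leaks(open("mem-leak-test.txt").read()) would summarize the log file written by the valgrind command above. Matching on the symbol prefix rather than a substring skips frames such as the std::_Hashtable instantiations, which only mention Taichi types as template arguments.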