ml-explore / mlx

MLX: An array framework for Apple silicon
https://ml-explore.github.io/mlx/
MIT License
14.83k stars 845 forks source link

[Feature] Leak memory on exit #1086

Open zcbenz opened 1 week ago

zcbenz commented 1 week ago

Memory allocator is declared static and will free memory on exit when process quit gracefully:

https://github.com/ml-explore/mlx/blob/9814a2ae120385bc903d059a317233fa1be3bcef/mlx/backend/metal/allocator.cpp#L244-L247

Which can take more than 30 seconds after inferencing a LLM:

Call graph:
    16392 Thread_6280670   DispatchQueue_1: com.apple.main-thread  (serial)
    + 16392 start  (in dyld) + 2436  [0x18da7512c]
    +   16392 dyld4::LibSystemHelpers::exit(int) const  (in libdyld.dylib) + 20  [0x18de0eb80]
    +     16392 exit  (in libsystem_c.dylib) + 44  [0x18dcb1d20]
    +       16392 __cxa_finalize_ranges  (in libsystem_c.dylib) + 476  [0x18dcb1f98]
    +         16392 mlx::core::metal::MetalAllocator::~MetalAllocator()  (in mlx.node) + 48  [0x11bee2908]
    +           16354 mlx::core::metal::(anonymous namespace)::BufferCache::~BufferCache()  (in mlx.node) + 112  [0x11bee1c2c]
    +           ! 16318 -[AGXG15XFamilyBuffer dealloc]  (in AGXMetalG15X_B0) + 76  [0x1f9bb6b98]
    +           ! : 16317 -[AGXBuffer dealloc]  (in AGXMetalG15X_B0) + 44  [0x1f9b8d4e4]
    +           ! : | 16272 -[IOGPUMetalBuffer dealloc]  (in IOGPU) + 320  [0x1ac73976c]
    +           ! : | + 16076 -[IOGPUMetalResource dealloc]  (in IOGPU) + 248  [0x1ac748128]
    +           ! : | + ! 16032 _CFRelease  (in CoreFoundation) + 292  [0x18dfa5b90]
    +           ! : | + ! : 16023 ioGPUResourceFinalize  (in IOGPU) + 104  [0x1ac74e6a0]
    +           ! : | + ! : | 16023 iokit_user_client_trap  (in IOKit) + 8  [0x19155626c]
    +           ! : | + ! : 9 ioGPUResourceFinalize  (in IOGPU) + 140  [0x1ac74e6c4]
    +           ! : | + ! :   9 CFRelease  (in CoreFoundation) + 60  [0x18de609f4]
    +           ! : | + ! :     9 CF_IS_OBJC  (in CoreFoundation) + 256  [0x18dfa5520]
    +           ! : | + ! 27 CFRelease  (in CoreFoundation) + 60  [0x18de609f4]
    +           ! : | + ! : 27 CF_IS_OBJC  (in CoreFoundation) + 256,196  [0x18dfa5520,0x18dfa54e4]
    +           ! : | + ! 11 _CFRelease  (in CoreFoundation) + 1216  [0x18dfa5f2c]
    +           ! : | + ! : 6 _nanov2_free  (in libsystem_malloc.dylib) + 0,188,...  [0x18dc292a8,0x18dc29364,...]
    +           ! : | + ! : 2 CFAllocatorDeallocate  (in CoreFoundation) + 0,180  [0x18de5d14c,0x18de5d200]
    +           ! : | + ! : 1 __CFAllocatorSystemDeallocate  (in CoreFoundation) + 40  [0x18de5f140]
    +           ! : | + ! : | 1 malloc_default_zone  (in libsystem_malloc.dylib) + 0  [0x18dc0d2e4]
    +           ! : | + ! : 1 malloc_zone_free  (in libsystem_malloc.dylib) + 12  [0x18dc117fc]
    +           ! : | + ! : 1 nanov2_free  (in libsystem_malloc.dylib) + 0  [0x18dc0e5b0]
    +           ! : | + ! 3 _CFRelease  (in CoreFoundation) + 88,1192,...  [0x18dfa5ac4,0x18dfa5f14,...]
    +           ! : | + ! 2 _CFRelease  (in CoreFoundation) + 1228  [0x18dfa5f38]
    +           ! : | + ! : 1 CFRelease  (in CoreFoundation) + 60  [0x18de609f4]
    +           ! : | + ! : | 1 CF_IS_OBJC  (in CoreFoundation) + 252  [0x18dfa551c]
    +           ! : | + ! : 1 _CFRelease  (in CoreFoundation) + 136  [0x18dfa5af4]
    +           ! : | + ! 1 CFGetAllocator  (in CoreFoundation) + 68  [0x18de5d0ec]
    +           ! : | + 121 -[IOGPUMetalResource dealloc]  (in IOGPU) + 304  [0x1ac748160]
    +           ! : | + ! 119 -[_MTLObjectWithLabel dealloc]  (in Metal) + 64  [0x197f8ca1c]
    +           ! : | + ! : 55 _objc_rootDealloc  (in libobjc.A.dylib) + 80  [0x18da2ae24]
    +           ! : | + ! : | 34 objc_destructInstance  (in libobjc.A.dylib) + 144  [0x18da2aebc]
    +           ! : | + ! : | + 33 objc_object::clearDeallocating_slow()  (in libobjc.A.dylib) + 108  [0x18da360b4]
    +           ! : | + ! : | + ! 30 weak_clear_no_lock  (in libobjc.A.dylib) + 48  [0x18da36184]
    +           ! : | + ! : | + ! : 30 weak_entry_for_referent(weak_table_t*, objc_object*)  (in libobjc.A.dylib) + 92,76,...  [0x18da2e42c,0x18da2e41c,...]
    +           ! : | + ! : | + ! 3 weak_clear_no_lock  (in libobjc.A.dylib) + 28,48  [0x18da36170,0x18da36184]
    +           ! : | + ! : | + 1 objc_object::clearDeallocating_slow()  (in libobjc.A.dylib) + 56  [0x18da36080]
    +           ! : | + ! : | 21 objc_destructInstance  (in libobjc.A.dylib) + 80  [0x18da2ae7c]
    +           ! : | + ! : |   10 object_cxxDestructFromClass(objc_object*, objc_class*)  (in libobjc.A.dylib) + 116  [0x18da335f8]
    +           ! : | + ! : |   ! 8 -[IOGPUMetalResource .cxx_destruct]  (in IOGPU) + 36  [0x1ac748a50]
    +           ! : | + ! : |   ! : 8 objc_destroyWeak  (in libobjc.A.dylib) + 152  [0x18da345f4]
    +           ! : | + ! : |   ! :   7 weak_unregister_no_lock  (in libobjc.A.dylib) + 128,44,...  [0x18da4024c,0x18da401f8,...]
    +           ! : | + ! : |   ! :   1 weak_unregister_no_lock  (in libobjc.A.dylib) + 40  [0x18da401f4]
    +           ! : | + ! : |   ! :     1 weak_entry_for_referent(weak_table_t*, objc_object*)  (in libobjc.A.dylib) + 80  [0x18da2e420]
    +           ! : | + ! : |   ! 2 objc_destroyWeak  (in libobjc.A.dylib) + 32,184  [0x18da3457c,0x18da34614]
    +           ! : | + ! : |   5 object_cxxDestructFromClass(objc_object*, objc_class*)  (in libobjc.A.dylib) + 56,116,...  [0x18da335bc,0x18da335f8,...]
    +           ! : | + ! : |   3 -[IOGPUMetalResource .cxx_destruct]  (in IOGPU) + 52,56,...  [0x1ac748a60,0x1ac748a64,...]
    +           ! : | + ! : |   3 object_cxxDestructFromClass(objc_object*, objc_class*)  (in libobjc.A.dylib) + 76  [0x18da335d0]
    +           ! : | + ! : |     3 lookupMethodInClassAndLoadCache  (in libobjc.A.dylib) + 52  [0x18da33488]
    +           ! : | + ! : |       3 cache_getImp  (in libobjc.A.dylib) + 8,48,...  [0x18da29868,0x18da29890,...]
    +           ! : | + ! : 30 free_tiny  (in libsystem_malloc.dylib) + 100,512,...  [0x18dc0fd28,0x18dc0fec4,...]
    +           ! : | + ! : 23 free_tiny  (in libsystem_malloc.dylib) + 496  [0x18dc0feb4]
    +           ! : | + ! : | 11 tiny_free_no_lock  (in libsystem_malloc.dylib) + 356,1856,...  [0x18dc1019c,0x18dc10778,...]
    +           ! : | + ! : | 8 tiny_free_no_lock  (in libsystem_malloc.dylib) + 644  [0x18dc102bc]
    +           ! : | + ! : | + 8 tiny_free_list_add_ptr  (in libsystem_malloc.dylib) + 568,552,...  [0x18dc10ae8,0x18dc10ad8,...]
    +           ! : | + ! : | 3 tiny_free_no_lock  (in libsystem_malloc.dylib) + 1376  [0x18dc10598]
    +           ! : | + ! : | + 2 tiny_free_reattach_region  (in libsystem_malloc.dylib) + 160,240  [0x18dc171e8,0x18dc17238]
    +           ! : | + ! : | + 1 tiny_free_reattach_region  (in libsystem_malloc.dylib) + 368  [0x18dc172b8]
    +           ! : | + ! : | +   1 tiny_free_list_add_ptr  (in libsystem_malloc.dylib) + 144  [0x18dc10940]
    +           ! : | + ! : | 1 tiny_free_no_lock  (in libsystem_malloc.dylib) + 436  [0x18dc101ec]
    +           ! : | + ! : |   1 tiny_free_list_remove_ptr  (in libsystem_malloc.dylib) + 288  [0x18dc10c44]
    +           ! : | + ! : 3 _szone_free  (in libsystem_malloc.dylib) + 164  [0x18dc2159c]
    +           ! : | + ! : 3 nanov2_try_free_default  (in libsystem_malloc.dylib) + 0  [0x18dc29604]
    +           ! : | + ! : 2 _objc_rootDealloc  (in libobjc.A.dylib) + 8,76  [0x18da2addc,0x18da2ae20]
    +           ! : | + ! : 2 free  (in libsystem_malloc.dylib) + 92,124  [0x18dc0bd1c,0x18dc0bd3c]
    +           ! : | + ! : 1 _nanov2_free  (in libsystem_malloc.dylib) + 368  [0x18dc29418]
    +           ! : | + ! 1 -[_MTLObjectWithLabel dealloc]  (in Metal) + 32  [0x197f8c9fc]
    +           ! : | + ! : 1 objc_retain_x28  (in libobjc.A.dylib) + 68  [0x18da27fd0]
    +           ! : | + ! 1 free_tiny  (in libsystem_malloc.dylib) + 532  [0x18dc0fed8]
    +           ! : | + 42 -[IOGPUMetalResource dealloc]  (in IOGPU) + 216  [0x1ac748108]
    +           ! : | + ! 34 objc_storeWeak  (in libobjc.A.dylib) + 452  [0x18da2e184]
    +           ! : | + ! : 23 weak_unregister_no_lock  (in libobjc.A.dylib) + 316  [0x18da40308]
    +           ! : | + ! : 11 weak_unregister_no_lock  (in libobjc.A.dylib) + 40  [0x18da401f4]
    +           ! : | + ! :   11 weak_entry_for_referent(weak_table_t*, objc_object*)  (in libobjc.A.dylib) + 92,128  [0x18da2e42c,0x18da2e450]
    +           ! : | + ! 6 objc_storeWeak  (in libobjc.A.dylib) + 148  [0x18da2e054]
    +           ! : | + ! : 6 locker_mixin>::lockWith(lockdebug::lock_mixin&)  (in libobjc.A.dylib) + 132,32,...  [0x18da5a294,0x18da5a230,...]
    +           ! : | + ! 2 objc_storeWeak  (in libobjc.A.dylib) + 76,116  [0x18da2e00c,0x18da2e034]
    +           ! : | + 15 -[IOGPUMetalResource dealloc]  (in IOGPU) + 232  [0x1ac748118]
    +           ! : | + ! 6 objc_autorelease  (in libobjc.A.dylib) + 256  [0x18da2d920]
    +           ! : | + ! : 6 AutoreleasePoolPage::add(objc_object*)  (in libobjc.A.dylib) + 156,236,...  [0x18da5b954,0x18da5b9a4,...]
    +           ! : | + ! 4 objc_autorelease  (in libobjc.A.dylib) + 316,308  [0x18da2d95c,0x18da2d954]
    +           ! : | + ! 4 objc_loadWeak  (in libobjc.A.dylib) + 24  [0x18da2e8ec]
    +           ! : | + ! : 4 objc_loadWeakRetained  (in libobjc.A.dylib) + 468,24  [0x18da2eae8,0x18da2e92c]
    +           ! : | + ! 1 AutoreleasePoolPage::add(objc_object*)  (in libobjc.A.dylib) + 164  [0x18da5b95c]
    +           ! : | + 5 -[IOGPUMetalResource dealloc]  (in IOGPU) + 272  [0x1ac748140]
    +           ! : | + ! 3 objc_release  (in libobjc.A.dylib) + 0  [0x18da27fe0]
    +           ! : | + ! 2 objc_retain_x28  (in libobjc.A.dylib) + 68  [0x18da27fd0]
    +           ! : | + 4 -[IOGPUMetalResource dealloc]  (in IOGPU) + 260  [0x1ac748134]
    +           ! : | + ! 4 objc_release  (in libobjc.A.dylib) + 0,28,...  [0x18da27fe0,0x18da27ffc,...]
    +           ! : | + 4 -[IOGPUMetalResource dealloc]  (in IOGPU) + 0,260  [0x1ac748030,0x1ac748134]
    +           ! : | + 3 -[IOGPUMetalResource dealloc]  (in IOGPU) + 196  [0x1ac7480f4]
    +           ! : | + ! 3 -[IOGPUMemoryInfo removeResourceFromList:]  (in IOGPU) + 32  [0x1ac7462d0]
    +           ! : | + !   3 os_unfair_lock_lock  (in libsystem_platform.dylib) + 16,0  [0x18de22f00,0x18de22ef0]
    +           ! : | + 1 -[IOGPUMetalResource dealloc]  (in IOGPU) + 240  [0x1ac748120]
    +           ! : | + ! 1 objc_msgSend$_removeResource:  (in IOGPU) + 16  [0x1ac76f770]
    +           ! : | + 1 objc_loadWeak  (in libobjc.A.dylib) + 28  [0x18da2e8f0]
    +           ! : | 41 -[IOGPUMetalBuffer dealloc]  (in IOGPU) + 172  [0x1ac7396d8]
    +           ! : | + 41 -[IOGPUMetalDevice deallocBufferSubData:heapIndex:bufferIndex:bufferOffset:length:]  (in IOGPU) + 56  [0x1ac7414f8]
    +           ! : | +   17 IOGPUMetalSuballocatorFree  (in IOGPU) + 408,304,...  [0x1ac7508ec,0x1ac750884,...]
    +           ! : | +   9 IOGPUMetalSuballocatorFree  (in IOGPU) + 564  [0x1ac750988]
    +           ! : | +   ! 9 -[AGXG15XFamilyBuffer dealloc]  (in AGXMetalG15X_B0) + 76  [0x1f9bb6b98]
    +           ! : | +   !   9 -[AGXBuffer dealloc]  (in AGXMetalG15X_B0) + 44  [0x1f9b8d4e4]
    +           ! : | +   !     9 -[IOGPUMetalBuffer dealloc]  (in IOGPU) + 320  [0x1ac73976c]
    +           ! : | +   !       6 -[IOGPUMetalResource dealloc]  (in IOGPU) + 248  [0x1ac748128]
    +           ! : | +   !       : 6 _CFRelease  (in CoreFoundation) + 292  [0x18dfa5b90]
    +           ! : | +   !       :   6 ioGPUResourceFinalize  (in IOGPU) + 104  [0x1ac74e6a0]
    +           ! : | +   !       :     6 iokit_user_client_trap  (in IOKit) + 8  [0x19155626c]
    +           ! : | +   !       2 -[IOGPUMetalResource dealloc]  (in IOGPU) + 216  [0x1ac748108]
    +           ! : | +   !       : 2 objc_storeWeak  (in libobjc.A.dylib) + 452  [0x18da2e184]
    +           ! : | +   !       :   1 weak_unregister_no_lock  (in libobjc.A.dylib) + 40  [0x18da401f4]
    +           ! : | +   !       :   | 1 weak_entry_for_referent(weak_table_t*, objc_object*)  (in libobjc.A.dylib) + 128  [0x18da2e450]
    +           ! : | +   !       :   1 weak_unregister_no_lock  (in libobjc.A.dylib) + 16  [0x18da401dc]
    +           ! : | +   !       1 -[IOGPUMetalResource dealloc]  (in IOGPU) + 304  [0x1ac748160]
    +           ! : | +   !         1 -[_MTLObjectWithLabel dealloc]  (in Metal) + 64  [0x197f8ca1c]
    +           ! : | +   !           1 _objc_rootDealloc  (in libobjc.A.dylib) + 80  [0x18da2ae24]
    +           ! : | +   !             1 objc_destructInstance  (in libobjc.A.dylib) + 144  [0x18da2aebc]
    +           ! : | +   !               1 objc_object::clearDeallocating_slow()  (in libobjc.A.dylib) + 108  [0x18da360b4]
    +           ! : | +   !                 1 weak_clear_no_lock  (in libobjc.A.dylib) + 48  [0x18da36184]
    +           ! : | +   !                   1 weak_entry_for_referent(weak_table_t*, objc_object*)  (in libobjc.A.dylib) + 76  [0x18da2e41c]
    +           ! : | +   7 IOGPUMetalSuballocatorFree  (in IOGPU) + 180  [0x1ac750808]
    +           ! : | +   ! 7 MTLRangeAllocatorDeallocate  (in Metal) + 84,112,...  [0x197fc51d8,0x197fc51f4,...]
    +           ! : | +   3 IOGPUMetalSuballocatorFree  (in IOGPU) + 532  [0x1ac750968]
    +           ! : | +   ! 2 std::__tree, std::__map_value_compare, true>, IOGPUMetalSuballocatorHeap::Allocator>>::__emplace_multi>(std::pair&&)  (in IOGPU) + 76,112  [0x1ac751174,0x1ac751198]
    +           ! : | +   ! 1 std::__tree, std::__map_value_compare, true>, IOGPUMetalSuballocatorHeap::Allocator>>::__emplace_multi>(std::pair&&)  (in IOGPU) + 44  [0x1ac751154]
    +           ! : | +   !   1 IOGPUMetalSuballocatorHeap::Allocator, void*>>::allocate(unsigned long)  (in IOGPU) + 40  [0x1ac751248]
    +           ! : | +   !     1 posix_memalign  (in libsystem_malloc.dylib) + 40  [0x18dc13b68]
    +           ! : | +   !       1 _malloc_zone_memalign  (in libsystem_malloc.dylib) + 16  [0x18dc30bbc]
    +           ! : | +   2 IOGPUMetalSuballocatorFree  (in IOGPU) + 160  [0x1ac7507f4]
    +           ! : | +   ! 1 -[IOGPUMetalResource gpuAddress]  (in IOGPU) + 0  [0x1ac7481a0]
    +           ! : | +   ! 1 objc_msgSend$gpuAddress  (in IOGPU) + 0  [0x1ac770520]
    +           ! : | +   2 IOGPUMetalSuballocatorFree  (in IOGPU) + 472  [0x1ac75092c]
    +           ! : | +   ! 2 std::__tree, std::__map_value_compare, true>, IOGPUMetalSuballocatorHeap::Allocator>>::__remove_node_pointer(std::__tree_node, void*>*)  (in IOGPU) + 100  [0x1ac750d94]
    +           ! : | +   !   2 std::__tree_remove[abi:v160006]*>(std::__tree_node_base*, std::__tree_node_base*)  (in IOGPU) + 68,88  [0x1ac750de8,0x1ac750dfc]
    +           ! : | +   1 IOGPUMetalSuballocatorFree  (in IOGPU) + 556  [0x1ac750980]
    +           ! : | +     1 free_small  (in libsystem_malloc.dylib) + 876  [0x18dc0ee60]
    +           ! : | +       1 small_free_list_add_ptr  (in libsystem_malloc.dylib) + 156  [0x18dc10db8]
    +           ! : | 1 -[IOGPUMetalBuffer dealloc]  (in IOGPU) + 104  [0x1ac739694]
    +           ! : | + 1 AutoreleasePoolPage::push()  (in libobjc.A.dylib) + 124  [0x18da5ca1c]
    +           ! : | +   1 AutoreleasePoolPage::add(objc_object*)  (in libobjc.A.dylib) + 0  [0x18da5b8b8]
    +           ! : | 1 -[IOGPUMetalBuffer dealloc]  (in IOGPU) + 16  [0x1ac73963c]
    +           ! : | 1 -[IOGPUMetalResource dealloc]  (in IOGPU) + 308  [0x1ac748164]
    +           ! : | 1 objc_msgSendSuper2  (in libobjc.A.dylib) + 56  [0x18da29638]
    +           ! : 1 -[IOGPUMetalBuffer dealloc]  (in IOGPU) + 324  [0x1ac739770]
    +           ! 27 objc_msgSend  (in libobjc.A.dylib) + 0,8,...  [0x18da29400,0x18da29408,...]
    +           ! 9 _objc_rootRelease  (in libobjc.A.dylib) + 108  [0x18da2ed3c]
    +           29 mlx::core::metal::(anonymous namespace)::BufferCache::~BufferCache()  (in mlx.node) + 132,168,...  [0x11bee1c40,0x11bee1c64,...]
    +           9 mlx::core::metal::(anonymous namespace)::BufferCache::~BufferCache()  (in mlx.node) + 124  [0x11bee1c38]
    +             8 _nanov2_free  (in libsystem_malloc.dylib) + 668,784,...  [0x18dc29544,0x18dc295b8,...]
    +             1 free  (in libsystem_malloc.dylib) + 20  [0x18dc0bcd4]

How do you think if we just leak the memory on exit to save the time?

awni commented 1 week ago

I wonder why it's so slow / if there is a way to asynchronously free resources? Indeed I've noticed in the past that the program can take a while to exit when you are holding a lot of RAM. Leaking would be fairly easy ... just avoid releasing buffers in the buffer cache when the allocator is destroyed. Do we lose anything from doing that?

zcbenz commented 1 week ago

Memory allocators are slow. Doing it asynchronously does not help here because the program still has to wait for tasks to finish before exiting.

Leaking memory on exit is safe and standard behavior, for example when closing a Chrome tab all DOM elements are just leaked you can rely on OS safely collecting all the RAM used.

madrob commented 1 week ago

One danger of leaking memory on close is that in the future it makes things like ValGrind less useful when chasing a runtime memory leak.

zcbenz commented 1 week ago

You can tell sanitizers a specific leak is expected. Chromium explicitly leaks memory for static objects everywhere (base::NoDestructor) and still make use of various sanitizers.

awni commented 1 week ago

I'm not opposed to this. I can mark it as an enhancement. @zcbenz I'm not sure if you are planning to send a PR. Happy to take a look if so / try it out.

zcbenz commented 1 week ago

I will send a PR sometime later this month if no one else had worked on it.