ml-explore / mlx-swift

Swift API for MLX
https://ml-explore.github.io/mlx-swift/
MIT License

[Bug] Random `Address size fault` crashes on iOS devices #121

Open tqtifnypmb opened 3 months ago

tqtifnypmb commented 3 months ago

Describe the issue:

MLX occasionally crashes with an `Address size fault` while evaluating the graph on iOS devices. The issue rarely occurs on macOS devices.

Configuration:

Devices: iPhone/iPad
System Version: iOS 16.7 to iOS 17.5
MLX-Swift version: 0.16.0

Crash Logs:

0   libsystem_kernel.dylib          0x00000001ddd0742c __pthread_kill + 8 (:-1)
1   libsystem_pthread.dylib         0x00000001f1aa6c0c pthread_kill + 268 (pthread.c:1721)
2   libsystem_c.dylib               0x000000019cbaaba0 abort + 180 (abort.c:118)
3   libc++abi.dylib                 0x00000001f19c4ca4 abort_message + 132 (abort_message.cpp:78)
4   libc++abi.dylib                 0x00000001f19b4e40 demangling_terminate_handler() + 320 (cxa_default_handlers.cpp:72)
5   libobjc.A.dylib                 0x000000018cb21e3c _objc_terminate() + 160 (objc-exception.mm:499)
6   libc++abi.dylib                 0x00000001f19c4068 std::__terminate(void (*)()) + 16 (cxa_handlers.cpp:59)
7   libc++abi.dylib                 0x00000001f19c735c __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88 (cxa_exception.cpp:152)
8   libc++abi.dylib                 0x00000001f19c72a0 __cxa_throw + 308 (cxa_exception.cpp:283)
9   Test                0x00000001047756f4 mlx::core::metal::Device::get_kernel_(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, MTL::Function const*) + 228 (device.cpp:396)
10  Test                0x0000000104775ce8 mlx::core::metal::Device::get_kernel(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, MTL::Library*, std::__1::basic_string<char, std::__1::char_traits<c... + 188 (device.cpp:541)
11  Test                0x0000000104789774 mlx::core::get_reduce_kernel(mlx::core::metal::Device&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_trai... + 824 (jit_kernels.cpp:353)
12  Test                0x00000001047a72dc mlx::core::row_reduce_general_dispatch(mlx::core::array const&, mlx::core::array&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, mlx::core::ReductionP... + 1028 (reduce.cpp:178)
13  Test                0x00000001047a96dc mlx::core::Reduce::eval_gpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 1380 (reduce.cpp:627)
14  Test                0x000000010479b4bc mlx::core::metal::make_task(mlx::core::array, bool)::$_1::operator()() + 256 (metal.cpp:66)
15  Test                0x000000010479b4bc decltype(std::declval<mlx::core::metal::make_task(mlx::core::array, bool)::$_1&>()()) std::__1::__invoke[abi:ue170006]<mlx::core::metal::make_task(mlx::core::array, bool)::$_1&>(mlx::core::metal::m... + 256 (invoke.h:340)
16  Test                0x000000010479b4bc void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ue170006]<mlx::core::metal::make_task(mlx::core::array, bool)::$_1&>(mlx::core::metal::make_task(mlx::core::array, bool)::$_1&) + 256 (invoke.h:415)
17  Test                0x000000010479b4bc std::__1::__function::__alloc_func<mlx::core::metal::make_task(mlx::core::array, bool)::$_1, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array, bool)::$_1>, void ()>::operator()[abi:... + 256 (function.h:193)
18  Test                0x000000010479b4bc std::__1::__function::__func<mlx::core::metal::make_task(mlx::core::array, bool)::$_1, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array, bool)::$_1>, void ()>::operator()() + 308 (function.h:364)
19  Test                0x000000010482644c std::__1::__function::__value_func<void ()>::operator()[abi:ue170006]() const + 20 (function.h:518)
20  Test                0x000000010482644c std::__1::function<void ()>::operator()() const + 20 (function.h:1169)
21  Test                0x000000010482644c mlx::core::scheduler::StreamThread::thread_fn() + 196 (scheduler.h:54)
22  Test                0x000000010482663c decltype(*std::declval<mlx::core::scheduler::StreamThread*>().*std::declval<void (mlx::core::scheduler::StreamThread::*)()>()()) std::__1::__invoke[abi:ue170006]<void (mlx::core::scheduler::StreamT... + 28 (invoke.h:308)
23  Test                0x000000010482663c void std::__1::__thread_execute[abi:ue170006]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), m... + 28 (thread.h:227)
24  Test                0x000000010482663c void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::Stream... + 72 (thread.h:238)
25  libsystem_pthread.dylib         0x00000001f1aa506c _pthread_start + 136 (pthread.c:931)
26  libsystem_pthread.dylib         0x00000001f1aa00d8 thread_start + 8 (:-1)

How to reproduce?

Don't know. The issue seems to occur only occasionally, and I can't provide meaningful code to reproduce it either.

davidkoski commented 3 months ago

This is throwing an exception in the evaluation thread -- this should have printed a message in the crash log. Did you see anything like that?

tqtifnypmb commented 3 months ago

> This is throwing an exception in the evaluation thread -- this should have printed a message in the crash log. Did you see anything like that?

This issue never occurred during development, so I did not see any printed message. For now I only have crash logs from iOS devices.

davidkoski commented 3 months ago

Near the top of the crash log it may have some more specific information. Something along these lines:

Crashed Thread:        1  Dispatch queue: setup-queue

Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x0000000000000000, 0x0000000000000000

Termination Reason:    Namespace SIGNAL, Code 6 Abort trap: 6
Terminating Process:   fileproviderd [12223]

Application Specific Information:
Assertion failed: (Unexpected function requested: existing (null), requested (null)), function xxx

tqtifnypmb commented 3 months ago

I found no Application Specific Information section among the logs I have.

However, I noticed that all the devices that crashed are quite old models (iPhone 10, iPhone 11, iPad 8, iPad 11). I'm not sure if the issue is related to their chips.

AppVariant:          1:iPad11,3:16
Code Type:           ARM-64 (Native)
Role:                Foreground
Parent Process:      launchd [1]

Date/Time:           2024-07-27 10:43:59.2120 +0200
Launch Time:         2024-07-27 10:43:45.7787 +0200
OS Version:          iPhone OS 17.5.1 (21F90)
Release Type:        User
Report Version:      104

Exception Type:  EXC_CRASH (SIGABRT)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Termination Reason: SIGNAL 6 Abort trap: 6

Triggered by Thread:  12
Thread 12 crashed with ARM Thread State (64-bit):
    x0: 0x0000000000000000   x1: 0x0000000000000000   x2: 0x0000000000000000   x3: 0x0000000000000000
    x4: 0x00000001fb8feea6   x5: 0x000000000000000e   x6: 0x0000000000000020   x7: 0x0000000000000450
    x8: 0xa0a9d41e6bff9b04   x9: 0xa0a9d41f1bc16b04  x10: 0x0000000000001b00  x11: 0x00000001703edd60
   x12: 0x00000000000007fb  x13: 0x00000000000007fd  x14: 0x0000000000000010  x15: 0x0000000000000037
   x16: 0x0000000000000148  x17: 0x00000001703ef000  x18: 0x0000000000000000  x19: 0x0000000000000006
   x20: 0x000000000000c407  x21: 0x00000001703ef0e0  x22: 0x0000000205e678a0  x23: 0x000000010617e740
   x24: 0x0000000301c7b780  x25: 0x00000001703eea68  x26: 0x00000001703eec40  x27: 0x0000000300b0ed60
   x28: 0x7ffffffffffffff8   fp: 0x00000001703ee1a0   lr: 0x00000001f6df1c0c
    sp: 0x00000001703ee180   pc: 0x00000001e309342c cpsr: 0x40000000
   esr: 0x56000080  Address size fault
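
The `esr` value in this thread state can be decoded by hand, which helps when judging how much weight the `Address size fault` label deserves. The sketch below splits an AArch64 exception syndrome value into its architectural fields (EC in bits 31:26, IL in bit 25, ISS in bits 24:0). The type and function names are hypothetical and exist only for this illustration. For `0x56000080` the exception class is `0x15`, which the Arm architecture defines as an SVC instruction executed in AArch64 state, and the ISS is `0x80`. That is consistent with the thread sitting in a system call (`__pthread_kill` in frame 0) rather than with a memory fault. The label is probably what you get when the low ISS bits are read as a fault status code, but that is an assumption about the crash reporter, not something confirmed here.

```swift
// Hypothetical helper types for illustration; not part of MLX or any Apple API.
struct ESRFields {
    let exceptionClass: UInt32    // EC, bits 31:26
    let is32BitInstruction: Bool  // IL, bit 25
    let syndrome: UInt32          // ISS, bits 24:0
}

// Split a 32-bit AArch64 exception syndrome value into its fields.
func decodeESR(_ esr: UInt32) -> ESRFields {
    return ESRFields(
        exceptionClass: (esr >> 26) & 0x3F,
        is32BitInstruction: ((esr >> 25) & 1) == 1,
        syndrome: esr & 0x01FF_FFFF
    )
}

// The value reported in the crash log above.
let fields = decodeESR(0x5600_0080)
print(String(fields.exceptionClass, radix: 16))  // prints "15" (SVC from AArch64)
print(String(fields.syndrome, radix: 16))        // prints "80"
```

Read this way, the useful information is in the backtrace (`__cxa_throw` followed by `abort`), not in the `esr` line.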

davidkoski commented 3 months ago

Yes, that likely is an issue -- older GPUs didn't support the same Metal capabilities. I know mlx-swift works on an iPhone 12 Pro Max but I haven't tested on anything older.

There are some more details here: https://github.com/ml-explore/mlx-swift-examples/issues/21
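
If older GPUs are the cause, one way to avoid the crash is to check the Metal GPU family before evaluating anything on the GPU. The sketch below uses the standard `MTLCreateSystemDefaultDevice()` and `MTLDevice.supportsFamily(_:)` calls. The function name `gpuLikelySupportsMLX` is hypothetical, and the `.apple7` threshold (the A14 generation, which includes the iPhone 12 Pro Max mentioned above) is an assumption drawn from this thread, not a documented MLX requirement.

```swift
#if canImport(Metal)
import Metal

/// Hypothetical pre-flight check, not an MLX API.
/// Returns true when the default Metal device reports at least `minimumFamily`.
/// The `.apple7` default is an assumption based on this thread: the iPhone 12
/// Pro Max (A14) is reported to work, and nothing older has been tested.
func gpuLikelySupportsMLX(minimumFamily: MTLGPUFamily = .apple7) -> Bool {
    // No Metal device at all (for example, some simulators or CI hosts).
    guard let device = MTLCreateSystemDefaultDevice() else { return false }
    return device.supportsFamily(minimumFamily)
}
#endif
```

An app could call this once at launch and fall back to a CPU path, or disable the feature, when it returns false. The right threshold would need to be confirmed by testing on the affected devices.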