tqtifnypmb opened this issue 4 weeks ago
Sure, there may be more to the message -- it is throwing an exception here:
6 libc++abi.dylib 0x00000001fd428090 std::__terminate(void (*)()) + 16 (cxa_handlers.cpp:59)
7 libc++abi.dylib 0x00000001fd42b2e4 __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88 (cxa_exception.cpp:152)
8 libc++abi.dylib 0x00000001fd42b228 __cxa_throw + 308 (cxa_exception.cpp:283)
9 AI Transcription 0x000000010360acf0 mlx::core::reshape(mlx::core::array const&, std::__1::vector<int, std::__1::allocator<int>>, std::__1::variant<std::__1::monostate, mlx::core::Stream, mlx::core::Device>) + 916 (ops.cpp:328)
That is this line:
https://github.com/ml-explore/mlx/blob/main/mlx/ops.cpp#L328
msg << "[reshape] Cannot reshape array of size " << a.size()
<< " into shape " << shape << ".";
throw std::invalid_argument(msg.str());
You are doing an array index operation, not a reshape, but there are reshapes inside the indexing code. It would be good to see what the exception message is -- if there is a crash log it might be in there.
If you get this in Xcode you can go to frame 9 and:
p msg.str()
(I think that is the right call, but since msg is a std::ostringstream it may need to be p msg.str().c_str().)
Another two cases. They appear to be due to the same issue (Crash when eval).
I wonder if running it on the CPU will make a difference?
Case 1:
#0 0x00000001084a0428 in pthread_mutex_lock ()
#1 0x000000018c7f0300 in std::__1::mutex::lock() ()
#2 0x00000001035be5c8 in std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/usr/include/c++/v1/__mutex/unique_lock.h:41
#3 0x00000001035be504 in std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/usr/include/c++/v1/__mutex/unique_lock.h:40
#4 0x000000010425fd10 in void mlx::core::scheduler::StreamThread::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:61
#5 0x000000010425fce0 in void mlx::core::scheduler::Scheduler::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::Stream const&, mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /gsebzzaktrijzzdiyqoncozizhfk/SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:150
#6 0x000000010424d908 in void mlx::core::scheduler::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::Stream const&, mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:157
#7 0x000000010424d2cc in mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/transforms.cpp:165
#8 0x000000010424da74 in mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/transforms.cpp:176
#9 0x00000001035c0034 in mlx::core::array::eval() at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/array.cpp:101
#10 0x0000000103554b84 in ::mlx_array_eval(mlx_array) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/array.cpp:189
#11 0x00000001042f3578 in MLXArray.asArray<τ_0_0>(_:) at /SourcePackages/checkouts/mlx-swift/Source/MLX/MLXArray.swift:264
Case 2:
0 libsystem_pthread.dylib 0x00000001ffd2f400 pthread_mutex_lock + 12 (pthread_mutex.c:1709)
1 libc++.1.dylib 0x00000001a5e3ad20 std::__1::mutex::lock() + 16 (mutex.cpp:38)
2 Demo 0x0000000100f9505c std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) + 4 (unique_lock.h:41)
3 Demo 0x0000000100f9505c std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) + 4 (unique_lock.h:40)
4 Demo 0x0000000100f9505c void mlx::core::scheduler::StreamThread::enqueue<std::__1::function<void ()>>(std::__1::function<void ()>&&) + 28 (scheduler.h:61)
5 Demo 0x0000000100f958b0 void mlx::core::scheduler::Scheduler::enqueue<std::__1::function<void ()>>(mlx::core::Stream const&, std::__1::function<void ()>&&) + 16 (scheduler.h:150)
6 Demo 0x0000000100f958b0 void mlx::core::scheduler::enqueue<std::__1::function<void ()>>(mlx::core::Stream const&, std::__1::function<void ()>&&) + 20 (scheduler.h:157)
7 Demo 0x0000000100f958b0 mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool) + 1788 (transforms.cpp:144)
8 Demo 0x0000000100f95e54 mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>) + 56 (transforms.cpp:176)
9 Demo 0x0000000100b0567c mlx::core::array::eval() + 196 (array.cpp:101)
10 Demo 0x0000000101003de8 MLXArray.item<A>(_:) + 236 (MLXArray.swift:169)
11 Demo 0x00000001009b8408 specialized MLXArray.item<A>() + 24 (<compiler-generated>:0)
These look a little different -- I think it might be an assertion failure inside the pthread mutex. Are you using threads or async?
If you have a snippet that can reproduce the error I can investigate it.
All three cases above occur in a thread.
I can't provide a meaningful snippet. I simply call item() and the crash occurs occasionally -- I've found no pattern to reproduce it.
MLXArray is not thread-safe -- you can't create one on one thread and evaluate it (e.g. by calling item()) on another. This includes loading models and doing inference on different threads.
Try it without threads and see how it fares, if you can.
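If you do need background work, one pattern that follows the single-thread rule is to confine every MLXArray -- from creation through eval/item() -- to one serial DispatchQueue. This is only a sketch of the idea; the model call and shapes are placeholders, not from this issue:

```swift
import Foundation
// import MLX  // mlx-swift

// One serial queue owns all MLX work. Arrays are created, evaluated,
// and read on this queue only -- never handed across threads.
let mlxQueue = DispatchQueue(label: "mlx.serial")

func predict(_ input: [Float], completion: @escaping (Float) -> Void) {
    mlxQueue.async {
        // create, evaluate, and read the array all on this one queue
        let x = MLXArray(input)
        let y = MLX.softmax(x, axis: -1)
        let value = y[0].item(Float.self)  // eval happens here, same queue
        completion(value)
    }
}
```

Callers can hop back to the main queue in the completion handler; the key point is that no MLXArray escapes the serial queue.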
throw std::invalid_argument(
"[eval] Attempting to eval an array without a primitive.");
and I can reproduce that in python -- it looks like an issue in mlx.core.
Strange, there used to be a snippet of code in here! Anyway, per the discussion in the other issue, this is the problem:
func foo(x: MLXArray) -> MLXArray {
    let y = MLX.softmax(x, axis: -1)
    let logits = y[0, 0, 0].item(Float.self)
    return MLX.full([1, 100], values: MLXArray(logits))
}
The call to item() inside this function is illegal inside compile (and it doesn't give a clear message) -- compile requires that the code produce an unbroken graph. Change it like this:
func foo(x: MLXArray) -> MLXArray {
    let y = MLX.softmax(x, axis: -1)
    let logits = y[0, 0, 0]
    return MLX.full([1, 100], values: logits)
}
Oops! I deleted the snippet because it didn't accurately reflect my question.
More crash info:
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00005fff69a438b0
Exception Codes: 0x0000000000000001, 0x00005fff69a438b0
Termination Reason: Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process: exc handler [3477]
VM Region Info: 0x5fff69a438b0 is not in any region. Bytes after previous region: 105069557332145 Bytes before following region: 2522597200
REGION TYPE START - END [ VSIZE] PRT/MAX SHRMOD REGION DETAIL
commpage (reserved) 1000000000-7000000000 [384.0G] ---/--- SM=NUL ...(unallocated)
---> GAP OF 0x5f9000000000 BYTES
MALLOC_NANO 600000000000-600020000000 [512.0M] rw-/rwx SM=PRV
OK, that is trying to access memory that is not mapped. The most likely cause is out-of-bounds indexing: the underlying mlx::core::array does not do bounds checking.
What does idx look like here, and what is the shape of your array?
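Since the core does not bounds-check, one defensive option while debugging is to validate indices against the shape on the Swift side before subscripting. checkedItem below is a hypothetical helper, not part of mlx-swift -- just a sketch of the check:

```swift
// import MLX  // mlx-swift

// Hypothetical helper: return nil instead of reading unmapped memory
// when an index falls outside the array's shape.
func checkedItem(_ a: MLXArray, _ indices: [Int]) -> Float? {
    guard indices.count == a.shape.count else { return nil }
    for (i, dim) in zip(indices, a.shape) {
        guard i >= 0 && i < dim else { return nil }  // out of bounds
    }
    var result = a
    for i in indices { result = result[i] }  // subscript one axis at a time
    return result.item(Float.self)
}
```

If checkedItem ever returns nil where the crash used to occur, the index really was out of range; if it never does and the crash persists, that would point elsewhere (e.g. a race).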
I am pretty sure that idx is logically within bounds (disregarding the possibility that MLX's array operations themselves produce incorrect results).
I think there are some data-race issues, and changing the default device seems to trigger some synchronization mechanism. After testing for a few days, I have found the following workaround (at least so far, there have been no further crashes):
// run on gpu
let logits = model(x)

MLX.Device.setDefault(device: .cpu)
// all array operations on `logits` that call `eval` under the hood now run on the cpu
Well, running everything on the cpu probably isn't getting all the advantages of mlx :-) If you ever have a repro that demonstrates the issue I would be happy to investigate further.
If not, are we ok to close this?
It might be an issue in core as well -- hard to tell because there is no code to repro here. But @tqtifnypmb, if you are able to provide the Swift repro code, or similar code in Python, that would be useful.
I would very much like to help, and I've tried to extract a minimal repro. In the process, though, I found that the logic is the same as whisper-mlx, so I guess using whisper should behave the same. It's worth noting that most of the crashes come from iOS; the frequency on macOS is very low.
Code:
Occurs occasionally, not 100% reproducible
Crash Stacks:
Any hints about what is happening behind the scenes or how to fix it?
Thanks