ml-explore / mlx-swift

Swift API for MLX
https://ml-explore.github.io/mlx-swift/
MIT License
427 stars 34 forks

[Bug] Crash in `item()`. #82

Open tqtifnypmb opened 4 weeks ago

tqtifnypmb commented 4 weeks ago

Code:

let tokenId: Int = logits[idx].item()

Occurs occasionally, not 100% reproducible

Crash Stacks:

0   libsystem_kernel.dylib          0x00000001e9a86974 __pthread_kill + 8 (:-1)
1   libsystem_pthread.dylib         0x00000001fd5090ec pthread_kill + 268 (pthread.c:1717)
2   libsystem_c.dylib               0x00000001a94d7b80 abort + 180 (abort.c:118)
3   libc++abi.dylib                 0x00000001fd428ccc abort_message + 132 (abort_message.cpp:78)
4   libc++abi.dylib                 0x00000001fd418e68 demangling_terminate_handler() + 320 (cxa_default_handlers.cpp:72)
5   libobjc.A.dylib                 0x00000001994860cc _objc_terminate() + 160 (objc-exception.mm:499)
6   libc++abi.dylib                 0x00000001fd428090 std::__terminate(void (*)()) + 16 (cxa_handlers.cpp:59)
7   libc++abi.dylib                 0x00000001fd42b2e4 __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88 (cxa_exception.cpp:152)
8   libc++abi.dylib                 0x00000001fd42b228 __cxa_throw + 308 (cxa_exception.cpp:283)
9  Demo                                  0x000000010360acf0 mlx::core::reshape(mlx::core::array const&, std::__1::vector<int, std::__1::allocator<int>>, std::__1::variant<std::__1::monostate, mlx::core::Stream, mlx::core::Device>) + 916 (ops.cpp:328)
10  Demo                                 0x00000001031cb78c mlx_reshape + 108 (ops.cpp:633)
11  Demo                                 0x00000001036c2cd8 MLXArray.reshaped(_:stream:) + 256 (MLXArray+Ops.swift:1974)
12  Demo                                 0x00000001036b2610 getItemND(src:operations:stream:) + 2404 (MLXArray+Indexing.swift:722)
13  Demo                                 0x00000001036b5190 MLXArray.subscript.getter + 436

Any hints about what is happening behind the scenes or how to fix it?

Thanks

davidkoski commented 4 weeks ago

Sure, there may be more to the message -- it is throwing an exception here:

6   libc++abi.dylib                 0x00000001fd428090 std::__terminate(void (*)()) + 16 (cxa_handlers.cpp:59)
7   libc++abi.dylib                 0x00000001fd42b2e4 __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88 (cxa_exception.cpp:152)
8   libc++abi.dylib                 0x00000001fd42b228 __cxa_throw + 308 (cxa_exception.cpp:283)
9   AI Transcription                0x000000010360acf0 mlx::core::reshape(mlx::core::array const&, std::__1::vector<int, std::__1::allocator<int>>, std::__1::variant<std::__1::monostate, mlx::core::Stream, mlx::core::Device>) + 916 (ops.cpp:328)

That is this line:

https://github.com/ml-explore/mlx/blob/main/mlx/ops.cpp#L328

    msg << "[reshape] Cannot reshape array of size " << a.size()
        << " into shape " << shape << ".";
    throw std::invalid_argument(msg.str());

You are doing an array index operation, not a reshape, but there are reshapes inside the indexing code. It would be good to see what the exception message is -- if there is a crashlog it might be in there.

If you get this in Xcode you can go to frame 9 and:

p msg.str()

(I think that is the right call, but it may be `msg.c_str()`.)

tqtifnypmb commented 4 weeks ago

Another two cases.

It appears they are due to the same issue (a crash during eval).

I wonder if running it on the CPU would make a difference?


Case 1:

#0  0x00000001084a0428 in pthread_mutex_lock ()
#1  0x000000018c7f0300 in std::__1::mutex::lock() ()
#2  0x00000001035be5c8 in std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/usr/include/c++/v1/__mutex/unique_lock.h:41
#3  0x00000001035be504 in std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/usr/include/c++/v1/__mutex/unique_lock.h:40
#4  0x000000010425fd10 in void mlx::core::scheduler::StreamThread::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:61
#5  0x000000010425fce0 in void mlx::core::scheduler::Scheduler::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::Stream const&, mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /gsebzzaktrijzzdiyqoncozizhfk/SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:150
#6  0x000000010424d908 in void mlx::core::scheduler::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::Stream const&, mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:157
#7  0x000000010424d2cc in mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/transforms.cpp:165
#8  0x000000010424da74 in mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/transforms.cpp:176
#9  0x00000001035c0034 in mlx::core::array::eval() at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/array.cpp:101
#10 0x0000000103554b84 in ::mlx_array_eval(mlx_array) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/array.cpp:189
#11 0x00000001042f3578 in MLXArray.asArray<τ_0_0>(_:) at /SourcePackages/checkouts/mlx-swift/Source/MLX/MLXArray.swift:264

Case 2:

0   libsystem_pthread.dylib         0x00000001ffd2f400 pthread_mutex_lock + 12 (pthread_mutex.c:1709)
1   libc++.1.dylib                  0x00000001a5e3ad20 std::__1::mutex::lock() + 16 (mutex.cpp:38)
2   Demo                0x0000000100f9505c std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) + 4 (unique_lock.h:41)
3   Demo                0x0000000100f9505c std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) + 4 (unique_lock.h:40)
4   Demo                0x0000000100f9505c void mlx::core::scheduler::StreamThread::enqueue<std::__1::function<void ()>>(std::__1::function<void ()>&&) + 28 (scheduler.h:61)
5   Demo                0x0000000100f958b0 void mlx::core::scheduler::Scheduler::enqueue<std::__1::function<void ()>>(mlx::core::Stream const&, std::__1::function<void ()>&&) + 16 (scheduler.h:150)
6   Demo                0x0000000100f958b0 void mlx::core::scheduler::enqueue<std::__1::function<void ()>>(mlx::core::Stream const&, std::__1::function<void ()>&&) + 20 (scheduler.h:157)
7   Demo                0x0000000100f958b0 mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool) + 1788 (transforms.cpp:144)
8   Demo                0x0000000100f95e54 mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>) + 56 (transforms.cpp:176)
9   Demo                0x0000000100b0567c mlx::core::array::eval() + 196 (array.cpp:101)
10  Demo                0x0000000101003de8 MLXArray.item<A>(_:) + 236 (MLXArray.swift:169)
11  Demo                0x00000001009b8408 specialized MLXArray.item<A>() + 24 (<compiler-generated>:0)

davidkoski commented 4 weeks ago

These look a little different -- I think it might be an assertion failure inside the pthread mutex. Are you using threads or async?

davidkoski commented 4 weeks ago

If you have a snippet that can reproduce the error I can investigate it.

tqtifnypmb commented 4 weeks ago

All three cases above occurred on a background thread.

I can't provide a meaningful snippet. I simply call `item()` and the crash occurs occasionally; I have found no pattern that reproduces it.

davidkoski commented 3 weeks ago

MLXArray is not thread safe -- you can't create one on one thread and evaluate it (e.g. by calling `item()`) on another. This includes loading a model on one thread and running inference on another.

Try it without threads and see how it fares, if you can.
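One way to follow that advice while keeping an async app structure is to funnel every MLXArray creation and evaluation through a single serial queue. A minimal sketch (not part of mlx-swift), assuming a `model` closure plus the `logits`/`idx` names from the original report:

```swift
import Foundation
import MLX

// Sketch only: confine all MLXArray work to one serial queue so each
// array is created and evaluated on the same thread.
let mlxQueue = DispatchQueue(label: "mlx.worker")

func nextToken(model: @escaping (MLXArray) -> MLXArray, x: MLXArray, idx: Int) -> Int {
    mlxQueue.sync {
        let logits = model(x)                  // array created on mlxQueue...
        let tokenId: Int = logits[idx].item()  // ...and evaluated on mlxQueue
        return tokenId
    }
}
```

Callers can dispatch onto `mlxQueue` from any thread; the point is that no MLXArray crosses a thread boundary between creation and eval.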

davidkoski commented 3 weeks ago
The snippet fails with:

    throw std::invalid_argument(
        "[eval] Attempting to eval an array without a primitive.");

and I can reproduce that in Python -- it looks like an issue in mlx.core.

davidkoski commented 3 weeks ago

Filed https://github.com/ml-explore/mlx/issues/1083.

davidkoski commented 3 weeks ago

Strange, there used to be a snippet of code in here! Anyway, per the discussion in the other issue, this is the problem:

func foo(x: MLXArray) -> MLXArray {
    let y = MLX.softmax(x, axis: -1)
    let logits = y[0,0,0].item(Float.self)
    return MLX.full([1, 100], values: MLXArray(logits))
}

The call to `item()` inside this function is illegal inside `compile` (and it doesn't give a clear message) -- `compile` requires that the code produce an unbroken graph. Change it like this:

func foo(x: MLXArray) -> MLXArray {
    let y = MLX.softmax(x, axis: -1)
    let logits = y[0,0,0]
    return MLX.full([1, 100], values: logits)
}

tqtifnypmb commented 3 weeks ago

Oops! I deleted the snippet because it didn't accurately reflect my question.

tqtifnypmb commented 3 weeks ago

More crash info:

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x00005fff69a438b0
Exception Codes:       0x0000000000000001, 0x00005fff69a438b0

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [3477]

VM Region Info: 0x5fff69a438b0 is not in any region.  Bytes after previous region: 105069557332145  Bytes before following region: 2522597200
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      commpage (reserved)        1000000000-7000000000   [384.0G] ---/--- SM=NUL  ...(unallocated)
--->  GAP OF 0x5f9000000000 BYTES
      MALLOC_NANO              600000000000-600020000000 [512.0M] rw-/rwx SM=PRV  

davidkoski commented 3 weeks ago

OK, that is trying to access memory that is not mapped. The most likely cause is out of bounds indexing: the underlying mlx::core::array does not do bounds checking.

What does idx look like here and what is the shape of your array?
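For example, a defensive check like the following (a sketch using the names from the original report) would turn a silent out-of-bounds read into a clear failure:

```swift
// mlx's core array does no bounds checking, so validate idx first (sketch).
precondition(idx >= 0 && idx < logits.dim(0),
             "idx \(idx) out of bounds for shape \(logits.shape)")
let tokenId: Int = logits[idx].item()
```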

tqtifnypmb commented 2 weeks ago

I am pretty sure that idx is logically within bounds (disregarding the possibility of MLX's array operations themselves producing incorrect results).

I think there are some data races, and changing the default device seems to trigger some synchronization mechanism. After testing for a few days, I have found the following workaround (at least so far, there have been no further crashes):

// run on gpu
let logits = model(x)

MLX.Device.setDefault(device: .cpu)

// all array operations manipulating `logits` that call `eval` under the hood now run on the cpu
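If the CPU workaround holds, a narrower variant may be to keep the global default on the GPU and route only the post-processing onto the CPU: most mlx-swift operations accept a `stream:` argument. A sketch, assuming a hypothetical `softmax` post-processing step on the model output:

```swift
// run the model on the default (gpu) device
let logits = model(x)

// route only this op to the cpu instead of changing the global default
let probs = softmax(logits, axis: -1, stream: .device(.cpu))
```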
davidkoski commented 1 week ago

Well, running everything on the CPU probably isn't getting all the advantages of mlx :-) If you ever have a repro case I can run, I would be happy to investigate further.

If not, are we ok to close this?

awni commented 1 week ago

It might be an issue with core as well -- hard to tell because there is no code to repro here. But @tqtifnypmb, if you are able to provide the Swift repro code (or similar code in Python) that would be useful.

tqtifnypmb commented 1 week ago

I would very much like to help, and I've tried to extract a minimal repro. In the process I found that my logic is the same as whisper-mlx, so I would guess that using whisper reproduces it too. It's worth noting that most of the crashes occur on iOS; the frequency on macOS is very low.