Closed forthrin closed 7 months ago
One thing to try is to create a debug build of mpv with the AddressSanitizer enabled and run with that build for a while.
In issue #11626 I showed how I enabled the sanitizer when building using mpv-build:
printf "%s\n" -Dbuildtype=debug > mpv_options
printf "%s\n" -Db_sanitize=address,undefined >> mpv_options
If the problem reproduces the address sanitizer may be able to identify the offending code.
@low-batt: Well! I got my first crash since compiling with the address sanitizer. But is this the same crash? What parts of the report are relevant? The address sanitizer, the crashed thread, or all threads from 0 up to the crashed thread?
AddressSanitizer:DEADLYSIGNAL
=================================================================
==48985==ERROR: AddressSanitizer: SEGV on unknown address 0x00000001a8f0 (pc 0x0001f4fec174 bp 0x00016b131fd0 sp 0x00016b131f70 T1)
==48985==The signal is caused by a READ memory access.
#0 0x1f4fec174 in gleUnbindDeleteHashNamesAndObjects+0x48 (GLEngine:arm64e+0xf7174) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
#1 0x1f4f10550 in glDeleteTextures_Exec+0x2b8 (GLEngine:arm64e+0x1b550) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
#2 0x198122894 in CA::(anonymous namespace)::AsyncDeletion::callback(void*)+0x54 (QuartzCore:arm64e+0xba894) (BuildId: 8cd4195a97373f66840e04904f0280c132000000200000000100000000030d00)
#3 0x1098457ec in asan_dispatch_call_block_and_release+0xe0 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x417ec) (BuildId: f0a7ac5c49bc3abc851181b6f92b308a32000000200000000100000000000b00)
#4 0x1909043fc in _dispatch_client_callout+0x10 (libdispatch.dylib:arm64e+0x43fc) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
#5 0x19090ba84 in _dispatch_lane_serial_drain+0x298 (libdispatch.dylib:arm64e+0xba84) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
#6 0x19090c5f4 in _dispatch_lane_invoke+0x17c (libdispatch.dylib:arm64e+0xc5f4) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
#7 0x190917240 in _dispatch_workloop_worker_thread+0x284 (libdispatch.dylib:arm64e+0x17240) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
#8 0x190ab0070 in _pthread_wqthread+0x11c (libsystem_pthread.dylib:arm64e+0x3070) (BuildId: b401cfb38dfe32db92b3ba8af0f8ca6e32000000200000000100000000030d00)
#9 0x190aaed90 in start_wqthread+0x4 (libsystem_pthread.dylib:arm64e+0x1d90) (BuildId: b401cfb38dfe32db92b3ba8af0f8ca6e32000000200000000100000000030d00)
==48985==Register values:
x[0] = 0x000000010b2b0060 x[1] = 0x000000010b2b00e0 x[2] = 0x000000016b131730 x[3] = 0x00000000000001a1
x[4] = 0x00000000000003e0 x[5] = 0x0000000000000001 x[6] = 0x000000016b0b0000 x[7] = 0x0000000000000001
x[8] = 0x0000000000000002 x[9] = 0x00000000000001a2 x[10] = 0x000000000000eccc x[11] = 0x000000010c8a5458
x[12] = 0x000000016b131798 x[13] = 0xb010ddcc49d497b2 x[14] = 0x0000000000000000 x[15] = 0x00000001ebf1ebb0
x[16] = 0x0000000190aaee10 x[17] = 0x0000000109898730 x[18] = 0x0000000000000000 x[19] = 0x000000013082cfb0
x[20] = 0x00000001f5018398 x[21] = 0x000000010d568e48 x[22] = 0x0000000113c00400 x[23] = 0x000000000001a8e8
x[24] = 0x000000000000000a x[25] = 0x000000000000000b x[26] = 0x0000000000000008 x[27] = 0x000000000000000a
x[28] = 0x000000000000000b fp = 0x000000016b131fd0 lr = 0x00000001f4fec1f0 sp = 0x000000016b131f70
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (GLEngine:arm64e+0xf7174) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00) in gleUnbindDeleteHashNamesAndObjects+0x48
Thread T1 created by T0 here:
<empty stack>
==48985==ABORTING
----------------------------------------------------------------------
Process: mpv [48985]
Date/Time: 2023-05-24
OS Version: macOS 13.3.1 (22E772610a)
Exception Type: EXC_BAD_ACCESS (SIGABRT)
Exception Codes: KERN_INVALID_ADDRESS at 0x000000000001a8f0
Exception Codes: 0x0000000000000001, 0x000000000001a8f0
VM Region Info: 0x1a8f0 is not in any region. Bytes before following region: 4375942928
REGION TYPE START - END [ VSIZE] PRT/MAX SHRMOD REGION DETAIL
UNUSED SPACE AT START
--->
__TEXT 104d54000-106fc8000 [ 34.5M] r-x/r-x SM=COW ...wnloads/*/mpv
Application Specific Information:
abort() called
Thread 0:: Dispatch queue: com.apple.main-thread
0 libsystem_pthread.dylib 0x190aafd00 pthread_once + 108
1 libclang_rt.asan_osx_dynamic.dylib 0x10987bd78 __lsan::DisabledInThisThread() + 40
2 libclang_rt.asan_osx_dynamic.dylib 0x10980aa5c __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) + 1440
3 libclang_rt.asan_osx_dynamic.dylib 0x10980a490 __asan::asan_malloc(unsigned long, __sanitizer::BufferedStackTrace*) + 48
4 libclang_rt.asan_osx_dynamic.dylib 0x109846ed0 wrap_malloc + 252
5 libsystem_trace.dylib 0x190835b6c _os_log_fmt_flatten_object + 352
6 libsystem_trace.dylib 0x190833968 _os_log_impl_flatten_and_send + 1860
7 libsystem_trace.dylib 0x19083320c _os_log + 152
8 libsystem_trace.dylib 0x19083316c _os_log_impl + 24
9 MediaPlayer 0x1b4957c38 __73-[MPRemoteCommandCenter _locked_updateCanBeNowPlayingApplicationIfNeeded]_block_invoke + 180
10 libclang_rt.asan_osx_dynamic.dylib 0x109845f78 __wrap_dispatch_async_block_invoke + 196
11 libdispatch.dylib 0x190902874 _dispatch_call_block_and_release + 32
12 libdispatch.dylib 0x190904400 _dispatch_client_callout + 20
13 libdispatch.dylib 0x190912bf8 _dispatch_main_queue_drain + 928
14 libdispatch.dylib 0x190912848 _dispatch_main_queue_callback_4CF + 44
15 CoreFoundation 0x190bd3c88 __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__ + 16
16 CoreFoundation 0x190b914a8 __CFRunLoopRun + 1992
17 CoreFoundation 0x190b9058c CFRunLoopRunSpecific + 612
18 HIToolbox 0x19a3c5df4 RunCurrentEventLoopInMode + 292
19 HIToolbox 0x19a3c5c30 ReceiveNextEventCommon + 648
20 HIToolbox 0x19a3c5988 _BlockUntilNextEventMatchingListInModeWithFilter + 76
21 AppKit 0x193daff58 _DPSNextEvent + 636
22 AppKit 0x193daf0f4 -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 716
23 AppKit 0x193da3558 -[NSApplication run] + 464
24 mpv 0x105937f9c cocoa_run_runloop + 120 (macosx_application.m:271)
25 mpv 0x105937364 cocoa_main + 1928 (macosx_application.m:368)
26 mpv 0x10594fee4 main + 36 (main-fn-cocoa.c:9)
27 dyld 0x19075bf28 start + 2236
Thread 1 Crashed:: Dispatch queue: CA DispatchGroup
0 libsystem_kernel.dylib 0x190a7c724 __pthread_kill + 8
1 libsystem_pthread.dylib 0x190ab3c28 pthread_kill + 288
2 libsystem_c.dylib 0x1909c1ae8 abort + 180
3 libclang_rt.asan_osx_dynamic.dylib 0x10986b55c __sanitizer::Abort() + 68
4 libclang_rt.asan_osx_dynamic.dylib 0x10986ac4c __sanitizer::Die() + 212
5 libclang_rt.asan_osx_dynamic.dylib 0x10984ec2c __asan::ScopedInErrorReport::~ScopedInErrorReport() + 1124
6 libclang_rt.asan_osx_dynamic.dylib 0x10984b3fc __asan::ReportDeadlySignal(__sanitizer::SignalContext const&) + 344
7 libclang_rt.asan_osx_dynamic.dylib 0x10984a7f4 __asan::AsanOnDeadlySignal(int, void*, void*) + 100
8 libsystem_platform.dylib 0x190ae2a84 _sigtramp + 56
9 ??? 0xffff8001f4fec1f0 ???
10 GLEngine 0x1f4f10554 glDeleteTextures_Exec + 700
11 QuartzCore 0x198122898 CA::(anonymous namespace)::AsyncDeletion::callback(void*) + 88
12 libclang_rt.asan_osx_dynamic.dylib 0x1098457f0 asan_dispatch_call_block_and_release + 228
13 libdispatch.dylib 0x190904400 _dispatch_client_callout + 20
14 libdispatch.dylib 0x19090ba88 _dispatch_lane_serial_drain + 668
15 libdispatch.dylib 0x19090c5f8 _dispatch_lane_invoke + 384
16 libdispatch.dylib 0x190917244 _dispatch_workloop_worker_thread + 648
17 libsystem_pthread.dylib 0x190ab0074 _pthread_wqthread + 288
18 libsystem_pthread.dylib 0x190aaed94 start_wqthread + 8
On whether the root cause of this crash is the same defect that caused the previous crash…
We need to track down and be able to explain the defective code to say for sure. We suspect a code defect is causing a memory overwrite. Memory overwrites can manifest in all kinds of different ways.
On what parts of the crash report are relevant…
In general this depends upon the kind of software problem. Some problems are obvious from just the stack traceback of the thread that crashed. For other problems it is desirable to have all of the available evidence. For example, looking over what else is going on with the other threads may show a pattern that provides a clue to the root cause.
Back to the crash at hand…
Did this crash also occur while mpv was quitting?
This part is very interesting:
#0 0x1f4fec174 in gleUnbindDeleteHashNamesAndObjects+0x48 (GLEngine:arm64e+0xf7174) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
#1 0x1f4f10550 in glDeleteTextures_Exec+0x2b8 (GLEngine:arm64e+0x1b550) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
#2 0x198122894 in CA::(anonymous namespace)::AsyncDeletion::callback(void*)+0x54 (QuartzCore:arm64e+0xba894) (BuildId: 8cd4195a97373f66840e04904f0280c132000000200000000100000000030d00)
This is the thread that crashed:
Thread 1 Crashed:: Dispatch queue: CA DispatchGroup
The following is from the crash report in issue #10986:
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 GLEngine 0x7ffa1b19f82d gleUnbindDeleteHashNamesAndObjects + 61
1 GLEngine 0x7ffa1b0bc759 glDeleteTextures_Exec + 700
2 mpv 0x102efe849 gl_tex_destroy + 73
3 mpv 0x102ee47c7 ra_tex_free + 23
4 mpv 0x102ee3a22 mpgl_osd_destroy + 82
5 mpv 0x102ef123f gl_video_uninit + 175
6 mpv 0x102ee379c destroy + 28
7 mpv 0x102f053bb mpv_render_context_free + 219
8 mpv 0x102f297af CocoaCB.shutdown(_:) + 367
The crash report in #10986 is suggestive of a problem with coordinating shutdown of the OpenGL context. Looking at the code for mpv_render_context_free shows a lot of code that is intended to properly coordinate freeing of the context. Possibly that code is not quite correct and there is a thread race condition lurking that is responsible for the problem at hand.
This is just speculation. Remember I am not a mpv developer.
Continue to run with sanity checking enabled. Lets see if we get another crash report that provides additional clues.
New incident. Don't know if it's the same one. Happened when quitting at the end of a video.
AddressSanitizer:DEADLYSIGNAL
=================================================================
==67503==ERROR: AddressSanitizer: BUS on unknown address (pc 0x0001f4fec174 bp 0x00016ae2be90 sp 0x00016ae2be30 T0)
==67503==The signal is caused by a READ memory access.
==67503==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used.
#0 0x1f4fec174 in gleUnbindDeleteHashNamesAndObjects+0x48 (GLEngine:arm64e+0xf7174) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
#1 0x1f4f10550 in glDeleteTextures_Exec+0x2b8 (GLEngine:arm64e+0x1b550) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
#2 0x105d2cc18 in gl_tex_destroy ra_gl.c:277
#3 0x1059d5210 in ra_tex_free ra.c:45
#4 0x1059c3d68 in mpgl_osd_destroy osd.c:107
#5 0x105a4c968 in gl_video_uninit video.c:3954
#6 0x1059c23d4 in destroy libmpv_gpu.c:223
#7 0x105b20fec in mpv_render_context_free vo_libmpv.c:307
#8 0x104feaa5c in LibmpvHelper.deinitRender() libmpv_helper.swift:226
#9 0x105018f68 in CocoaCB.shutdown(_:) cocoa_cb_common.swift:214
#10 0x1050190a4 in CocoaCB.processEvent(_:) cocoa_cb_common.swift:227
#11 0x1050190e8 in @objc CocoaCB.processEvent(_:) <compiler-generated>
#12 0x105bb1dec in -[Application processEvent:] macosx_application.m:184
#13 0x105bb7034 in -[EventsResponder processEvent:] macosx_events.m:277
#14 0x105bb6eb8 in __29-[EventsResponder readEvents]_block_invoke macosx_events.m:269
#15 0x109ac1f74 in __wrap_dispatch_async_block_invoke+0xc0 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x41f74) (BuildId: f0a7ac5c49bc3abc851181b6f92b308a32000000200000000100000000000b00)
#16 0x190902870 in _dispatch_call_block_and_release+0x1c (libdispatch.dylib:arm64e+0x2870) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
#17 0x1909043fc in _dispatch_client_callout+0x10 (libdispatch.dylib:arm64e+0x43fc) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
#18 0x190912bf4 in _dispatch_main_queue_drain+0x39c (libdispatch.dylib:arm64e+0x12bf4) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
#19 0x190912844 in _dispatch_main_queue_callback_4CF+0x28 (libdispatch.dylib:arm64e+0x12844) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
#20 0x190bd3c84 in __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__+0xc (CoreFoundation:arm64e+0xc0c84) (BuildId: b4fdaece97273969b01427f7f24c8e0132000000200000000100000000030d00)
#21 0x190b914a4 in __CFRunLoopRun+0x7c4 (CoreFoundation:arm64e+0x7e4a4) (BuildId: b4fdaece97273969b01427f7f24c8e0132000000200000000100000000030d00)
#22 0x190b90588 in CFRunLoopRunSpecific+0x260 (CoreFoundation:arm64e+0x7d588) (BuildId: b4fdaece97273969b01427f7f24c8e0132000000200000000100000000030d00)
#23 0x19a3c5df0 in RunCurrentEventLoopInMode+0x120 (HIToolbox:arm64e+0x30df0) (BuildId: 5f34bbf5653a31a5b4b30a02c91ab48832000000200000000100000000030d00)
#24 0x19a3c5c2c in ReceiveNextEventCommon+0x284 (HIToolbox:arm64e+0x30c2c) (BuildId: 5f34bbf5653a31a5b4b30a02c91ab48832000000200000000100000000030d00)
#25 0x19a3c5984 in _BlockUntilNextEventMatchingListInModeWithFilter+0x48 (HIToolbox:arm64e+0x30984) (BuildId: 5f34bbf5653a31a5b4b30a02c91ab48832000000200000000100000000030d00)
#26 0x193daff54 in _DPSNextEvent+0x278 (AppKit:arm64e+0x38f54) (BuildId: cc3e52154cd7364c87506eb7002720ce32000000200000000100000000030d00)
#27 0x193daf0f0 in -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:]+0x2c8 (AppKit:arm64e+0x380f0) (BuildId: cc3e52154cd7364c87506eb7002720ce32000000200000000100000000030d00)
#28 0x193da3554 in -[NSApplication run]+0x1cc (AppKit:arm64e+0x2c554) (BuildId: cc3e52154cd7364c87506eb7002720ce32000000200000000100000000030d00)
#29 0x105bb3f98 in cocoa_run_runloop macosx_application.m:271
#30 0x105bb3360 in cocoa_main macosx_application.m:368
#31 0x105bcbee0 in main main-fn-cocoa.c:9
#32 0x19075bf24 (<unknown module>)
==67503==Register values:
x[0] = 0x0000000113f10400 x[1] = 0x000000010d8690c8 x[2] = 0x00000001f5018398 x[3] = 0x0000000000000009
x[4] = 0x0000000000000001 x[5] = 0x0000000000000001 x[6] = 0x0000000000000000 x[7] = 0x0000000000000001
x[8] = 0x0000000000000001 x[9] = 0x0000000000000000 x[10] = 0x0000000000000001 x[11] = 0x0000000000000000
x[12] = 0x000000016ae2bd28 x[13] = 0x000000016ae2bd20 x[14] = 0x0000000000007e01 x[15] = 0x0000000000000006
x[16] = 0x00000001f5018398 x[17] = 0x0000000109b145d8 x[18] = 0x0000000000000000 x[19] = 0x000000013fc38090
x[20] = 0x00000001f5018398 x[21] = 0x000000010d8690c8 x[22] = 0x0000000113f10400 x[23] = 0x000000210001d51e
x[24] = 0x0000000000000009 x[25] = 0x000000000000000a x[26] = 0x0000000000000006 x[27] = 0x0000000000000001
x[28] = 0x0000000000000007 fp = 0x000000016ae2be90 lr = 0x00000001f4f10554 sp = 0x000000016ae2be30
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: BUS (GLEngine:arm64e+0xf7174) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00) in gleUnbindDeleteHashNamesAndObjects+0x48
==67503==ABORTING
Yeah it seems to be the same root cause. If this is indeed a bug with libmpv, I wonder if it can be reproduced on other platforms?
Don't the reports pinpoint (or at least narrow down considerably) the culprit?
It has happened at least three times the last couple of weeks, and I think it was always:
Though this may not be necessary factors for reproduction.
PS! Is it possible in some way to replay a mpv session, like tool-assisted speedruns? (Though this may not trigger the same issue every time because of the arbitrary nature of memory allocation.)
PPS! Then again, is there any way to sandbox an application, thereby forcing the exact same execution every time? Like a docker container, VirtualBox, etc. Maybe this is overkill, and still impossible to control, but interesting to know.
Don't the reports pinpoint (or at least narrow down considerably) the culprit?
Yes. I'm not an mpv dev though, so besides indicating a bug somewhere in mpv_render_context_free
as low-batt mentioned, I don't really know enough to pin it down further.
I wonder if also compiling with ThreadSanitizer
might be illuminating? You can see more about how TSan works internally at https://wwdctogether.com/wwdc2016/412 starting at 15:10.
https://clang.llvm.org/docs/ThreadSanitizer.html
Is it possible in some way to replay a mpv session
This is what https://rr-project.org/ is supposed to do, but I don't think such tools work on mac due to reasons mentioned in https://joneschrisg.wordpress.com/2015/01/29/rr-on-os-x-can-it-be-ported/
is there any way to sandbox an application, thereby forcing the exact same execution every time?
Normal sandboxing is not sufficient, since if it's a race condition then the timing with which memory accesses occur has an impact on the state, and to my knowledge I don't believe there's a way to perfectly ensure same state (would love to know if there is such a tool though!)
I think the best equivalent is using a race detector like TSan, since it "kind of" works around the memory access timing issue because it can instrument every memory access to do checking.
Sorry I've been slient. Super busy these days. On this:
Don't the reports pinpoint (or at least narrow down considerably) the culprit?
Yes and no. These two lines of the stack trace:
#1 0x1f4f10550 in glDeleteTextures_Exec+0x2b8 (GLEngine:arm64e+0x1b550) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
#2 0x105d2cc18 in gl_tex_destroy ra_gl.c:277
Show mpv gl_tex_destroy calling OpenGL glDeleteTextures. From OpenGL Context:
In order for any OpenGL commands to work, a context must be current; all OpenGL commands affect the state of whichever context is current. The current context is a thread-local variable, so a single process can have several threads, each of which has its own current context. However, a single context cannot be current in multiple threads at the same time.
Under macOS "thread-local variable" corresponds to NSThread.threadDictionary. Using thread local storage avoids the need to add a context
parameter to every OpenGL method. A downside of using thread local storage is that programmers must always be thinking about this thread state whenever using an API that makes use of this technique.
So one concern is the state of the OpenGL context. In the crash reports so far we have seen the main thread crash as well as a thread from the CA DispatchGroup.
I took @krackers advice and built a version of mpv with TSan enabled. That showed a problem in the logger at startup, which may explain a startup hang that I have experienced. I also ran into a reproducible crash in the playback core. Not sure what that is about. As the crashes are happening in OpenGL code, not mpv code, it limits the amount of help some of these santizers can provide.
I tried to reproduce this problem using mpv built with the address sanitizer and have only managed to trigger it once. I was using variations on the follow script to test mpv shutdown.
That detected the same problem reported here:
I'm expecting this requires the attention of a mpv developer with knowledge of different subsystems and how shutdown is supposed to be coordinated.
Thank you for chiming in low-batt, it is always a pleasure to read your posts!
so far we have seen the main thread crash as well as a thread from the CA DispatchGroup.
Have all instances of the crash been with people using OSX on ARM machines? On those machines OpenGL commands go through a translation layer, and it is known that it is more strict about requiring serialization for context access (see https://github.com/glfw/glfw/issues/1997#issuecomment-1292774964)
I wonder if it is a similar issue here?
Although this OpenGL requirement from OpenGL Programming Guide for Mac is very important:
OpenGL is not reentrant. If you modify the same context from multiple threads simultaneously, the results are unpredictable. Your application might crash or it might render improperly.
The situation at hand is worse. This is during shutdown, so single threading context access is not enough. All systems must be coordinated to ensure no thread tries to use the context after it has been shutdown. So even if access is single threaded it would be like a thread trying to access a file after another thread closed the channel.
The mpv code tries to coordinate this. But it could be some coordination is missing or something is not quite correct in the thread coordination.
This feels like a thread race condition with a very small window of vulnerability.
With that kind of problem the failure is usually, but not always, more likely to occur when running on hardware that is actually concurrently running threads on multiple CPU cores. That may explain why this is being encountered when running on Apple Silicon with its many cores.
Made me think about this some more. I'm thinking your comment:
Also should be noted that by mutex locking in the apple docs they seem to specifically mean using
CGLLockContext
. Using just regular thread mutex is not 100% effective.
May have nailed it.
The Apple documentation doesn't really spell this out, as in explain what can go wrong if you do your own locking. I'm thinking the one crash we have seen outside of the main thread may be an example. This stack trace:
Thread 1 Crashed:: Dispatch queue: CA DispatchGroup
…
10 GLEngine 0x1f4f10554 glDeleteTextures_Exec + 700
11 QuartzCore 0x198122898 CA::(anonymous namespace)::AsyncDeletion::callback(void*) + 88
Seems to indicate Quartz is performing some operations asynchronously. If the Core Graphics framework is asynchronously accessing the OpenGL context then a lock declared at the application level will be ineffective in single threading access to the context. That means CGLLockContext
/CGLUnlockContext
must be used in order to coordinate access with the framework threads.
The failure to use these locking methods for single threading access to the OpenGL context may be the root cause of the crashes reported in this issue.
IINA experienced crashes in glDeleteFramebuffers
, glDeleteTextures
, glGenTextures
due to failure to properly manage the OpenGL context. See this comment on IINA issue #3475. Since the fix in PR #3581 was merged no such crashes have been reported. IINA is using CGLLockContext
/CGLUnlockContext
.
Yup, that makes perfect sense to me, and would fit with what's been observed! Nice debugging work!
I was curious and took a look at disassembly of CGLLockContext:
int _CGLLockContext(int arg0) {
rbx = arg0;
rcx = sub_3aa3(arg0); /* Some initialization performed only once at startup */
rax = 0x2714;
if (rcx != 0x0) {
pthread_mutex_lock(*(rbx + 0x1e78) + 0x18);
rax = 0x0;
}
return rax;
}
So it is basically just a regular pthread mutex lock, I guess as you mentioned the key is that Apple's own libraries (e.g. Quartz) make use of this, so if you want to avoid race-conditions with Apple's own libraries trying to access the same context, then you must make use of CGLLockContext
so you use the same mutex they do.
Yes, exactly. The mpv code locked the side door and the asynchronous Quartz framework code walked in the front door. All the code needs to be using the same lock. At least this sure seems like the issue.
Frustrating that it is not easy to find detailed documentation regarding this. I always like to know the reasons behind requirements and what can go wrong if they are not followed. I'm expecting there isn't a lot of discussion on the net about this because most coders use CGLLockContext
/CGLUnlockContext
as discussed in the Apple documentation.
Had another one of these
Yet another one AddressSanitizer:DEADLYSIGNAL
at gleUnbindDeleteHashNamesAndObjects
.
Had another one of these on a different Mac/macOS:
System Version: macOS 11.7.10 (20G1427)
Model Identifier: MacBookAir6,2
mpv 0.37.0 built on Nov 30 2023 14:35:43 FFmpeg version: 5.1.4
0 libsystem_kernel.dylib 0x00007fff2047229a mach_msg_trap + 10
1 libsystem_kernel.dylib 0x00007fff2047260c mach_msg + 60
2 com.apple.framework.IOKit 0x00007fff22c63793 io_connect_method + 384
3 com.apple.framework.IOKit 0x00007fff22c635af IOConnectCallMethod + 186
4 com.apple.IOAccelerator 0x00007fff28427577 ioAccelResourceFinalize + 162
5 com.apple.CoreFoundation 0x00007fff2066a72a _CFRelease + 244
6 libGPUSupportMercury.dylib 0x00007fff4640ff7d gldReclaimTexture + 81
7 libGPUSupportMercury.dylib 0x00007fff4640ffd7 gldDestroyTexture + 31
8 libGFXShared.dylib 0x00007fff6b9d66bf gfxDestroyPluginTexture + 59
9 GLEngine 0x00007fff6bd0ee9d gleFreeTextureObject + 36
10 GLEngine 0x00007fff6bce169b gleUnbindDeleteHashNamesAndObjects + 182
11 GLEngine 0x00007fff6bc07fbc glDeleteTextures_Exec + 723
12 io.mpv 0x00000001047f3c39 gl_tex_destroy + 73
13 io.mpv 0x00000001047ab497 ra_tex_free + 23
14 io.mpv 0x00000001047bd200 uninit_rendering + 224
15 io.mpv 0x00000001047b79a2 uninit_video + 18
16 io.mpv 0x00000001047b784c gl_video_uninit + 28
17 io.mpv 0x00000001047aa4cc destroy + 28
18 io.mpv 0x00000001047c6182 mpv_render_context_free + 178
19 io.mpv 0x00000001046cc395 CocoaCB.shutdown(_:) + 325
20 io.mpv 0x00000001046cc44f @objc CocoaCB.processEvent(_:) + 31
21 io.mpv 0x00000001047d994d -[EventsResponder processEvent:] + 45
22 io.mpv 0x00000001047d98fa __29-[EventsResponder readEvents]_block_invoke + 74
i opened a PR #13494 with a fix. thank you @low-batt for your thorough investigation.
Great work, everyone. I've applied the patch and will notify if it happens again (though it's been very rare).
I tried to verify the fix. I built master
with the address sanitizer and the fix reverted and tried to reproduce the original failure, but it appears the time window of vulnerability where threads can interfere with each other is so small that it is very difficult to trigger the failure. I remember this being the case when I last looked into this problem. I ran mpv
through hundreds of play/quit cycles without experiencing the error.
I also ran hundreds of cycles with the fix. And tested without enabling the address sanitizer. No failures were encountered.
The changes should fix this one, but I just wasn't able to verify it.
yeah, did the same. i wrote a small shell script that launched mpv, quit it after a 1-2 seconds and let it run for nearly 2 hours. i didn't encounter a crash once sadly.
Important Information
Provide following Information:
Reproduction steps
First time incident. Happened when quitting movie (or possibly fast-forwarding towards end).
Since my code is patched with #11627, I can't see how it can be the same memory overwrite.
Also thread 1 contains references to
MRPlayer
etc. that I don't find in any other issues. (Or are threads other than the offendeng one arbitrary and irrelevant?)Expected behavior
No exception
Actual behavior
Log file
crash.log.gz
Sample files
N/A