mpv-player / mpv

🎥 Command line video player
https://mpv.io
Other
27.87k stars 2.87k forks source link

Memory Overwrite #11681

Closed forthrin closed 7 months ago

forthrin commented 1 year ago

Important Information

Provide following Information:

Reproduction steps

First time incident. Happened when quitting movie (or possibly fast-forwarding towards end).

Since my code is patched with #11627, I can't see how it can be the same memory overwrite.

Also thread 1 contains references to MRPlayer etc. that I don't find in any other issues. (Or are threads other than the offendeng one arbitrary and irrelevant?)

Expected behavior

No exception

Actual behavior

mpv(96649,0x16f967000) malloc: Heap corruption detected, free list is damaged at 0x600001d8cf30
*** Incorrect guard value: 0
mpv(96649,0x16f967000) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

Log file

crash.log.gz

Sample files

N/A

low-batt commented 1 year ago

One thing to try is to create a debug build of mpv with the AddressSanitizer enabled and run with that build for a while.

In issue #11626 I showed how I enabled the sanitizer when building using mpv-build:

printf "%s\n" -Dbuildtype=debug > mpv_options
printf "%s\n" -Db_sanitize=address,undefined >> mpv_options

If the problem reproduces the address sanitizer may be able to identify the offending code.

forthrin commented 1 year ago

@low-batt: Well! I got my first crash since compiling with the address sanitizer. But is this the same crash? What parts of the report are relevant? The address sanitizer, the crashed thread, or all threads from 0 up to the crashed thread?

AddressSanitizer:DEADLYSIGNAL
=================================================================
==48985==ERROR: AddressSanitizer: SEGV on unknown address 0x00000001a8f0 (pc 0x0001f4fec174 bp 0x00016b131fd0 sp 0x00016b131f70 T1)
==48985==The signal is caused by a READ memory access.
    #0 0x1f4fec174 in gleUnbindDeleteHashNamesAndObjects+0x48 (GLEngine:arm64e+0xf7174) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
    #1 0x1f4f10550 in glDeleteTextures_Exec+0x2b8 (GLEngine:arm64e+0x1b550) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
    #2 0x198122894 in CA::(anonymous namespace)::AsyncDeletion::callback(void*)+0x54 (QuartzCore:arm64e+0xba894) (BuildId: 8cd4195a97373f66840e04904f0280c132000000200000000100000000030d00)
    #3 0x1098457ec in asan_dispatch_call_block_and_release+0xe0 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x417ec) (BuildId: f0a7ac5c49bc3abc851181b6f92b308a32000000200000000100000000000b00)
    #4 0x1909043fc in _dispatch_client_callout+0x10 (libdispatch.dylib:arm64e+0x43fc) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
    #5 0x19090ba84 in _dispatch_lane_serial_drain+0x298 (libdispatch.dylib:arm64e+0xba84) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
    #6 0x19090c5f4 in _dispatch_lane_invoke+0x17c (libdispatch.dylib:arm64e+0xc5f4) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
    #7 0x190917240 in _dispatch_workloop_worker_thread+0x284 (libdispatch.dylib:arm64e+0x17240) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
    #8 0x190ab0070 in _pthread_wqthread+0x11c (libsystem_pthread.dylib:arm64e+0x3070) (BuildId: b401cfb38dfe32db92b3ba8af0f8ca6e32000000200000000100000000030d00)
    #9 0x190aaed90 in start_wqthread+0x4 (libsystem_pthread.dylib:arm64e+0x1d90) (BuildId: b401cfb38dfe32db92b3ba8af0f8ca6e32000000200000000100000000030d00)

==48985==Register values:
 x[0] = 0x000000010b2b0060   x[1] = 0x000000010b2b00e0   x[2] = 0x000000016b131730   x[3] = 0x00000000000001a1
 x[4] = 0x00000000000003e0   x[5] = 0x0000000000000001   x[6] = 0x000000016b0b0000   x[7] = 0x0000000000000001
 x[8] = 0x0000000000000002   x[9] = 0x00000000000001a2  x[10] = 0x000000000000eccc  x[11] = 0x000000010c8a5458
x[12] = 0x000000016b131798  x[13] = 0xb010ddcc49d497b2  x[14] = 0x0000000000000000  x[15] = 0x00000001ebf1ebb0
x[16] = 0x0000000190aaee10  x[17] = 0x0000000109898730  x[18] = 0x0000000000000000  x[19] = 0x000000013082cfb0
x[20] = 0x00000001f5018398  x[21] = 0x000000010d568e48  x[22] = 0x0000000113c00400  x[23] = 0x000000000001a8e8
x[24] = 0x000000000000000a  x[25] = 0x000000000000000b  x[26] = 0x0000000000000008  x[27] = 0x000000000000000a
x[28] = 0x000000000000000b     fp = 0x000000016b131fd0     lr = 0x00000001f4fec1f0     sp = 0x000000016b131f70
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (GLEngine:arm64e+0xf7174) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00) in gleUnbindDeleteHashNamesAndObjects+0x48
Thread T1 created by T0 here:
    <empty stack>

==48985==ABORTING

----------------------------------------------------------------------

Process:               mpv [48985]
Date/Time:             2023-05-24
OS Version:            macOS 13.3.1 (22E772610a)
Exception Type:        EXC_BAD_ACCESS (SIGABRT)
Exception Codes:       KERN_INVALID_ADDRESS at 0x000000000001a8f0
Exception Codes:       0x0000000000000001, 0x000000000001a8f0

VM Region Info: 0x1a8f0 is not in any region.  Bytes before following region: 4375942928
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      UNUSED SPACE AT START
--->
      __TEXT                      104d54000-106fc8000    [ 34.5M] r-x/r-x SM=COW  ...wnloads/*/mpv

Application Specific Information:
abort() called

Thread 0::  Dispatch queue: com.apple.main-thread
0   libsystem_pthread.dylib                0x190aafd00 pthread_once + 108
1   libclang_rt.asan_osx_dynamic.dylib         0x10987bd78 __lsan::DisabledInThisThread() + 40
2   libclang_rt.asan_osx_dynamic.dylib         0x10980aa5c __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) + 1440
3   libclang_rt.asan_osx_dynamic.dylib         0x10980a490 __asan::asan_malloc(unsigned long, __sanitizer::BufferedStackTrace*) + 48
4   libclang_rt.asan_osx_dynamic.dylib         0x109846ed0 wrap_malloc + 252
5   libsystem_trace.dylib                  0x190835b6c _os_log_fmt_flatten_object + 352
6   libsystem_trace.dylib                  0x190833968 _os_log_impl_flatten_and_send + 1860
7   libsystem_trace.dylib                  0x19083320c _os_log + 152
8   libsystem_trace.dylib                  0x19083316c _os_log_impl + 24
9   MediaPlayer                            0x1b4957c38 __73-[MPRemoteCommandCenter _locked_updateCanBeNowPlayingApplicationIfNeeded]_block_invoke + 180
10  libclang_rt.asan_osx_dynamic.dylib         0x109845f78 __wrap_dispatch_async_block_invoke + 196
11  libdispatch.dylib                      0x190902874 _dispatch_call_block_and_release + 32
12  libdispatch.dylib                      0x190904400 _dispatch_client_callout + 20
13  libdispatch.dylib                      0x190912bf8 _dispatch_main_queue_drain + 928
14  libdispatch.dylib                      0x190912848 _dispatch_main_queue_callback_4CF + 44
15  CoreFoundation                         0x190bd3c88 __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__ + 16
16  CoreFoundation                         0x190b914a8 __CFRunLoopRun + 1992
17  CoreFoundation                         0x190b9058c CFRunLoopRunSpecific + 612
18  HIToolbox                              0x19a3c5df4 RunCurrentEventLoopInMode + 292
19  HIToolbox                              0x19a3c5c30 ReceiveNextEventCommon + 648
20  HIToolbox                              0x19a3c5988 _BlockUntilNextEventMatchingListInModeWithFilter + 76
21  AppKit                                 0x193daff58 _DPSNextEvent + 636
22  AppKit                                 0x193daf0f4 -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 716
23  AppKit                                 0x193da3558 -[NSApplication run] + 464
24  mpv                                    0x105937f9c cocoa_run_runloop + 120 (macosx_application.m:271)
25  mpv                                    0x105937364 cocoa_main + 1928 (macosx_application.m:368)
26  mpv                                    0x10594fee4 main + 36 (main-fn-cocoa.c:9)
27  dyld                                   0x19075bf28 start + 2236

Thread 1 Crashed::  Dispatch queue: CA DispatchGroup
0   libsystem_kernel.dylib                 0x190a7c724 __pthread_kill + 8
1   libsystem_pthread.dylib                0x190ab3c28 pthread_kill + 288
2   libsystem_c.dylib                      0x1909c1ae8 abort + 180
3   libclang_rt.asan_osx_dynamic.dylib         0x10986b55c __sanitizer::Abort() + 68
4   libclang_rt.asan_osx_dynamic.dylib         0x10986ac4c __sanitizer::Die() + 212
5   libclang_rt.asan_osx_dynamic.dylib         0x10984ec2c __asan::ScopedInErrorReport::~ScopedInErrorReport() + 1124
6   libclang_rt.asan_osx_dynamic.dylib         0x10984b3fc __asan::ReportDeadlySignal(__sanitizer::SignalContext const&) + 344
7   libclang_rt.asan_osx_dynamic.dylib         0x10984a7f4 __asan::AsanOnDeadlySignal(int, void*, void*) + 100
8   libsystem_platform.dylib               0x190ae2a84 _sigtramp + 56
9   ???                             0xffff8001f4fec1f0 ???
10  GLEngine                               0x1f4f10554 glDeleteTextures_Exec + 700
11  QuartzCore                             0x198122898 CA::(anonymous namespace)::AsyncDeletion::callback(void*) + 88
12  libclang_rt.asan_osx_dynamic.dylib         0x1098457f0 asan_dispatch_call_block_and_release + 228
13  libdispatch.dylib                      0x190904400 _dispatch_client_callout + 20
14  libdispatch.dylib                      0x19090ba88 _dispatch_lane_serial_drain + 668
15  libdispatch.dylib                      0x19090c5f8 _dispatch_lane_invoke + 384
16  libdispatch.dylib                      0x190917244 _dispatch_workloop_worker_thread + 648
17  libsystem_pthread.dylib                0x190ab0074 _pthread_wqthread + 288
18  libsystem_pthread.dylib                0x190aaed94 start_wqthread + 8
low-batt commented 1 year ago

On whether the root cause of this crash is the same defect that caused the previous crash…

We need to track down and be able to explain the defective code to say for sure. We suspect a code defect is causing a memory overwrite. Memory overwrites can manifest in all kinds of different ways.

On what parts of the crash report are relevant…

In general this depends upon the kind of software problem. Some problems are obvious from just the stack traceback of the thread that crashed. For other problems it is desirable to have all of the available evidence. For example, looking over what else is going on with the other threads may show a pattern that provides a clue to the root cause.

Back to the crash at hand…

Did this crash also occur while mpv was quitting?

This part is very interesting:

    #0 0x1f4fec174 in gleUnbindDeleteHashNamesAndObjects+0x48 (GLEngine:arm64e+0xf7174) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
    #1 0x1f4f10550 in glDeleteTextures_Exec+0x2b8 (GLEngine:arm64e+0x1b550) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
    #2 0x198122894 in CA::(anonymous namespace)::AsyncDeletion::callback(void*)+0x54 (QuartzCore:arm64e+0xba894) (BuildId: 8cd4195a97373f66840e04904f0280c132000000200000000100000000030d00)

This is the thread that crashed:

Thread 1 Crashed::  Dispatch queue: CA DispatchGroup

The following is from the crash report in issue #10986:

Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   GLEngine                            0x7ffa1b19f82d gleUnbindDeleteHashNamesAndObjects + 61
1   GLEngine                            0x7ffa1b0bc759 glDeleteTextures_Exec + 700
2   mpv                                    0x102efe849 gl_tex_destroy + 73
3   mpv                                    0x102ee47c7 ra_tex_free + 23
4   mpv                                    0x102ee3a22 mpgl_osd_destroy + 82
5   mpv                                    0x102ef123f gl_video_uninit + 175
6   mpv                                    0x102ee379c destroy + 28
7   mpv                                    0x102f053bb mpv_render_context_free + 219
8   mpv                                    0x102f297af CocoaCB.shutdown(_:) + 367

The crash report in #10986 is suggestive of a problem with coordinating shutdown of the OpenGL context. Looking at the code for mpv_render_context_free shows a lot of code that is intended to properly coordinate freeing of the context. Possibly that code is not quite correct and there is a thread race condition lurking that is responsible for the problem at hand.

This is just speculation. Remember I am not a mpv developer.

Continue to run with sanity checking enabled. Lets see if we get another crash report that provides additional clues.

forthrin commented 1 year ago

New incident. Don't know if it's the same one. Happened when quitting at the end of a video.

AddressSanitizer:DEADLYSIGNAL
=================================================================
==67503==ERROR: AddressSanitizer: BUS on unknown address (pc 0x0001f4fec174 bp 0x00016ae2be90 sp 0x00016ae2be30 T0)
==67503==The signal is caused by a READ memory access.
==67503==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
    #0 0x1f4fec174 in gleUnbindDeleteHashNamesAndObjects+0x48 (GLEngine:arm64e+0xf7174) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
    #1 0x1f4f10550 in glDeleteTextures_Exec+0x2b8 (GLEngine:arm64e+0x1b550) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
    #2 0x105d2cc18 in gl_tex_destroy ra_gl.c:277
    #3 0x1059d5210 in ra_tex_free ra.c:45
    #4 0x1059c3d68 in mpgl_osd_destroy osd.c:107
    #5 0x105a4c968 in gl_video_uninit video.c:3954
    #6 0x1059c23d4 in destroy libmpv_gpu.c:223
    #7 0x105b20fec in mpv_render_context_free vo_libmpv.c:307
    #8 0x104feaa5c in LibmpvHelper.deinitRender() libmpv_helper.swift:226
    #9 0x105018f68 in CocoaCB.shutdown(_:) cocoa_cb_common.swift:214
    #10 0x1050190a4 in CocoaCB.processEvent(_:) cocoa_cb_common.swift:227
    #11 0x1050190e8 in @objc CocoaCB.processEvent(_:) <compiler-generated>
    #12 0x105bb1dec in -[Application processEvent:] macosx_application.m:184
    #13 0x105bb7034 in -[EventsResponder processEvent:] macosx_events.m:277
    #14 0x105bb6eb8 in __29-[EventsResponder readEvents]_block_invoke macosx_events.m:269
    #15 0x109ac1f74 in __wrap_dispatch_async_block_invoke+0xc0 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x41f74) (BuildId: f0a7ac5c49bc3abc851181b6f92b308a32000000200000000100000000000b00)
    #16 0x190902870 in _dispatch_call_block_and_release+0x1c (libdispatch.dylib:arm64e+0x2870) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
    #17 0x1909043fc in _dispatch_client_callout+0x10 (libdispatch.dylib:arm64e+0x43fc) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
    #18 0x190912bf4 in _dispatch_main_queue_drain+0x39c (libdispatch.dylib:arm64e+0x12bf4) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
    #19 0x190912844 in _dispatch_main_queue_callback_4CF+0x28 (libdispatch.dylib:arm64e+0x12844) (BuildId: 8e87dc0ea5703933b37d5e05ad51620632000000200000000100000000030d00)
    #20 0x190bd3c84 in __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__+0xc (CoreFoundation:arm64e+0xc0c84) (BuildId: b4fdaece97273969b01427f7f24c8e0132000000200000000100000000030d00)
    #21 0x190b914a4 in __CFRunLoopRun+0x7c4 (CoreFoundation:arm64e+0x7e4a4) (BuildId: b4fdaece97273969b01427f7f24c8e0132000000200000000100000000030d00)
    #22 0x190b90588 in CFRunLoopRunSpecific+0x260 (CoreFoundation:arm64e+0x7d588) (BuildId: b4fdaece97273969b01427f7f24c8e0132000000200000000100000000030d00)
    #23 0x19a3c5df0 in RunCurrentEventLoopInMode+0x120 (HIToolbox:arm64e+0x30df0) (BuildId: 5f34bbf5653a31a5b4b30a02c91ab48832000000200000000100000000030d00)
    #24 0x19a3c5c2c in ReceiveNextEventCommon+0x284 (HIToolbox:arm64e+0x30c2c) (BuildId: 5f34bbf5653a31a5b4b30a02c91ab48832000000200000000100000000030d00)
    #25 0x19a3c5984 in _BlockUntilNextEventMatchingListInModeWithFilter+0x48 (HIToolbox:arm64e+0x30984) (BuildId: 5f34bbf5653a31a5b4b30a02c91ab48832000000200000000100000000030d00)
    #26 0x193daff54 in _DPSNextEvent+0x278 (AppKit:arm64e+0x38f54) (BuildId: cc3e52154cd7364c87506eb7002720ce32000000200000000100000000030d00)
    #27 0x193daf0f0 in -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:]+0x2c8 (AppKit:arm64e+0x380f0) (BuildId: cc3e52154cd7364c87506eb7002720ce32000000200000000100000000030d00)
    #28 0x193da3554 in -[NSApplication run]+0x1cc (AppKit:arm64e+0x2c554) (BuildId: cc3e52154cd7364c87506eb7002720ce32000000200000000100000000030d00)
    #29 0x105bb3f98 in cocoa_run_runloop macosx_application.m:271
    #30 0x105bb3360 in cocoa_main macosx_application.m:368
    #31 0x105bcbee0 in main main-fn-cocoa.c:9
    #32 0x19075bf24  (<unknown module>)

==67503==Register values:
 x[0] = 0x0000000113f10400   x[1] = 0x000000010d8690c8   x[2] = 0x00000001f5018398   x[3] = 0x0000000000000009  
 x[4] = 0x0000000000000001   x[5] = 0x0000000000000001   x[6] = 0x0000000000000000   x[7] = 0x0000000000000001  
 x[8] = 0x0000000000000001   x[9] = 0x0000000000000000  x[10] = 0x0000000000000001  x[11] = 0x0000000000000000  
x[12] = 0x000000016ae2bd28  x[13] = 0x000000016ae2bd20  x[14] = 0x0000000000007e01  x[15] = 0x0000000000000006  
x[16] = 0x00000001f5018398  x[17] = 0x0000000109b145d8  x[18] = 0x0000000000000000  x[19] = 0x000000013fc38090  
x[20] = 0x00000001f5018398  x[21] = 0x000000010d8690c8  x[22] = 0x0000000113f10400  x[23] = 0x000000210001d51e  
x[24] = 0x0000000000000009  x[25] = 0x000000000000000a  x[26] = 0x0000000000000006  x[27] = 0x0000000000000001  
x[28] = 0x0000000000000007     fp = 0x000000016ae2be90     lr = 0x00000001f4f10554     sp = 0x000000016ae2be30  
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: BUS (GLEngine:arm64e+0xf7174) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00) in gleUnbindDeleteHashNamesAndObjects+0x48
==67503==ABORTING
krackers commented 1 year ago

Yeah it seems to be the same root cause. If this is indeed a bug with libmpv, I wonder if it can be reproduced on other platforms?

forthrin commented 1 year ago

Don't the reports pinpoint (or at least narrow down considerably) the culprit?

It has happened at least three times the last couple of weeks, and I think it was always:

Though this may not be necessary factors for reproduction.

PS! Is it possible in some way to replay a mpv session, like tool-assisted speedruns? (Though this may not trigger the same issue every time because of the arbitrary nature of memory allocation.)

PPS! Then again, is there any way to sandbox an application, thereby forcing the exact same execution every time? Like a docker container, VirtualBox, etc. Maybe this is overkill, and still impossible to control, but interesting to know.

krackers commented 1 year ago

Don't the reports pinpoint (or at least narrow down considerably) the culprit?

Yes. I'm not an mpv dev though, so besides indicating a bug somewhere in mpv_render_context_free as low-batt mentioned, I don't really know enough to pin it down further.

I wonder if also compiling with ThreadSanitizer might be illuminating? You can see more about how TSan works internally at https://wwdctogether.com/wwdc2016/412 starting at 15:10.

https://clang.llvm.org/docs/ThreadSanitizer.html

Is it possible in some way to replay a mpv session

This is what https://rr-project.org/ is supposed to do, but I don't think such tools work on mac due to reasons mentioned in https://joneschrisg.wordpress.com/2015/01/29/rr-on-os-x-can-it-be-ported/

is there any way to sandbox an application, thereby forcing the exact same execution every time?

Normal sandboxing is not sufficient, since if it's a race condition then the timing with which memory accesses occur has an impact on the state, and to my knowledge I don't believe there's a way to perfectly ensure same state (would love to know if there is such a tool though!)

I think the best equivalent is using a race detector like TSan, since it "kind of" works around the memory access timing issue because it can instrument every memory access to do checking.

low-batt commented 1 year ago

Sorry I've been slient. Super busy these days. On this:

Don't the reports pinpoint (or at least narrow down considerably) the culprit?

Yes and no. These two lines of the stack trace:

    #1 0x1f4f10550 in glDeleteTextures_Exec+0x2b8 (GLEngine:arm64e+0x1b550) (BuildId: 49b25d98bfc530ce9a4838d94d1f2e7c32000000200000000100000000030d00)
    #2 0x105d2cc18 in gl_tex_destroy ra_gl.c:277

Show mpv gl_tex_destroy calling OpenGL glDeleteTextures. From OpenGL Context:

In order for any OpenGL commands to work, a context must be current; all OpenGL commands affect the state of whichever context is current. The current context is a thread-local variable, so a single process can have several threads, each of which has its own current context. However, a single context cannot be current in multiple threads at the same time.

Under macOS "thread-local variable" corresponds to NSThread.threadDictionary. Using thread local storage avoids the need to add a context parameter to every OpenGL method. A downside of using thread local storage is that programmers must always be thinking about this thread state whenever using an API that makes use of this technique.

So one concern is the state of the OpenGL context. In the crash reports so far we have seen the main thread crash as well as a thread from the CA DispatchGroup.

I took @krackers advice and built a version of mpv with TSan enabled. That showed a problem in the logger at startup, which may explain a startup hang that I have experienced. I also ran into a reproducible crash in the playback core. Not sure what that is about. As the crashes are happening in OpenGL code, not mpv code, it limits the amount of help some of these santizers can provide.

TSan playback core crash: ```text Thread 4 Crashed:: */playback core (OSX) 0 libsystem_kernel.dylib 0x1aa124724 __pthread_kill + 8 1 libsystem_pthread.dylib 0x1aa15bc28 pthread_kill + 288 2 libclang_rt.tsan_osx_dynamic.dylib 0x1062489fc wrap_pthread_kill + 288 3 libsystem_c.dylib 0x1aa069ae8 abort + 180 4 libclang_rt.tsan_osx_dynamic.dylib 0x10622d2cc __sanitizer::Abort() + 68 5 libclang_rt.tsan_osx_dynamic.dylib 0x10622c9bc __sanitizer::Die() + 212 6 libclang_rt.tsan_osx_dynamic.dylib 0x106261d8c __tsan::finalize(void*) + 80 7 libsystem_c.dylib 0x1aa019ec4 __cxa_finalize_ranges + 476 8 libsystem_c.dylib 0x1aa019c4c exit + 44 9 mpv 0x102e96a4c playback_thread + 184 (macosx_application.m:283) 10 libclang_rt.tsan_osx_dynamic.dylib 0x106243d0c __tsan_thread_start_func + 144 11 libsystem_pthread.dylib 0x1aa15bfa8 _pthread_start + 148 12 libsystem_pthread.dylib 0x1aa156da0 thread_start + 8 ```

I tried to reproduce this problem using mpv built with the address sanitizer and have only managed to trigger it once. I was using variations on the follow script to test mpv shutdown.

startQuitLoop.sh: ```bash #!/bin/bash while true; do echo Starting mpv /Users/low-batt/Documents/builds/mpv-build/mpv-build/mpv/build/mpv --autofit=30% --config=no --input-ipc-server=/tmp/mpv-socket --mute ~/Movies/Airplaneski.mp4 & sleep 2 echo Instructing mpv to Quit echo '{ "command": ["quit"] }' | socat - /tmp/mpv-socket | jq sleep 2 done ```

That detected the same problem reported here:

Address sanitizer abort: ```text Starting mpv (+) Video --vid=1 (*) (h264 682x438 25.000fps) (+) Audio --aid=1 (*) (aac 2ch 48000Hz) AO: [coreaudio] 48000Hz stereo 2ch floatp VO: [libmpv] 682x438 => 683x438 yuv420p Instructing mpv to Quit (0%) A-V: 0.000 Exiting... (Quit) { "data": null, "request_id": 0, "error": "success" } AddressSanitizer:DEADLYSIGNAL ================================================================= ==74741==ERROR: AddressSanitizer: BUS on unknown address (pc 0x00020efc4174 bp 0x00016cf8c990 sp 0x00016cf8c930 T0) ==74741==The signal is caused by a READ memory access. ==74741==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used. #0 0x20efc4174 in gleUnbindDeleteHashNamesAndObjects+0x48 (GLEngine:arm64e+0xf7174) (BuildId: 07f4fb2fb29a3ae29179e379a4ac0d1a32000000200000000100000000040d00) #1 0x20eee8550 in glDeleteTextures_Exec+0x2b8 (GLEngine:arm64e+0x1b550) (BuildId: 07f4fb2fb29a3ae29179e379a4ac0d1a32000000200000000100000000040d00) #2 0x1035033b8 in gl_tex_destroy ra_gl.c:277 #3 0x103351164 in ra_tex_free ra.c:45 #4 0x103348ca4 in mpgl_osd_destroy osd.c:107 #5 0x1033870ec in gl_video_uninit video.c:3954 #6 0x10334827c in destroy libmpv_gpu.c:223 #7 0x1033e41c8 in mpv_render_context_free vo_libmpv.c:307 #8 0x102e8be1c in LibmpvHelper.deinitRender() libmpv_helper.swift:226 #9 0x102eba2cc in CocoaCB.shutdown(_:) cocoa_cb_common.swift:214 #10 0x102eba408 in CocoaCB.processEvent(_:) cocoa_cb_common.swift:227 #11 0x102eba44c in @objc CocoaCB.processEvent(_:) #12 0x10341f068 in -[Application processEvent:] macosx_application.m:184 #13 0x103422d18 in -[EventsResponder processEvent:] macosx_events.m:277 #14 0x103422c00 in __29-[EventsResponder readEvents]_block_invoke macosx_events.m:269 #15 0x106f0df74 in __wrap_dispatch_async_block_invoke+0xc0 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x41f74) (BuildId: f0a7ac5c49bc3abc851181b6f92b308a32000000200000000100000000000b00) #16 0x1a9faa870 in _dispatch_call_block_and_release+0x1c (libdispatch.dylib:arm64e+0x2870) (BuildId: c2fd3094b46539a4b77416583ff53c4b32000000200000000100000000040d00) #17 0x1a9fac3fc in _dispatch_client_callout+0x10 (libdispatch.dylib:arm64e+0x43fc) (BuildId: c2fd3094b46539a4b77416583ff53c4b32000000200000000100000000040d00) #18 0x1a9fbabf4 in _dispatch_main_queue_drain+0x39c (libdispatch.dylib:arm64e+0x12bf4) (BuildId: c2fd3094b46539a4b77416583ff53c4b32000000200000000100000000040d00) #19 0x1a9fba844 in _dispatch_main_queue_callback_4CF+0x28 (libdispatch.dylib:arm64e+0x12844) (BuildId: c2fd3094b46539a4b77416583ff53c4b32000000200000000100000000040d00) #20 0x1aa27bc50 in __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__+0xc (CoreFoundation:arm64e+0xc0c50) (BuildId: 203e44018c2e3157a24b92f52551d43e32000000200000000100000000040d00) #21 0x1aa2393d0 in __CFRunLoopRun+0x7c4 (CoreFoundation:arm64e+0x7e3d0) (BuildId: 203e44018c2e3157a24b92f52551d43e32000000200000000100000000040d00) #22 0x1aa2384b4 in CFRunLoopRunSpecific+0x260 (CoreFoundation:arm64e+0x7d4b4) (BuildId: 203e44018c2e3157a24b92f52551d43e32000000200000000100000000040d00) #23 0x1b3a82c3c in RunCurrentEventLoopInMode+0x120 (HIToolbox:arm64e+0x30c3c) (BuildId: d800278b4e6c3032b56f027a938a51d632000000200000000100000000040d00) #24 0x1b3a82a78 in ReceiveNextEventCommon+0x284 (HIToolbox:arm64e+0x30a78) (BuildId: d800278b4e6c3032b56f027a938a51d632000000200000000100000000040d00) #25 0x1b3a827d0 in _BlockUntilNextEventMatchingListInModeWithFilter+0x48 (HIToolbox:arm64e+0x307d0) (BuildId: d800278b4e6c3032b56f027a938a51d632000000200000000100000000040d00) #26 0x1ad459d40 in _DPSNextEvent+0x278 (AppKit:arm64e+0x38d40) (BuildId: 041c7d2121cc3eeaae53d39f5ca6a24e32000000200000000100000000040d00) #27 0x1ad458edc in -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:]+0x2c8 (AppKit:arm64e+0x37edc) (BuildId: 041c7d2121cc3eeaae53d39f5ca6a24e32000000200000000100000000040d00) #28 0x1ad44d340 in -[NSApplication run]+0x1cc (AppKit:arm64e+0x2c340) (BuildId: 041c7d2121cc3eeaae53d39f5ca6a24e32000000200000000100000000040d00) #29 0x1034208c4 in cocoa_run_runloop macosx_application.m:271 #30 0x103420190 in cocoa_main macosx_application.m:368 #31 0x1034372e8 in main main-fn-cocoa.c:9 #32 0x1a9e03f24 () ==74741==Register values: x[0] = 0x0000000112200400 x[1] = 0x000000010ab26a48 x[2] = 0x000000020eff0398 x[3] = 0x0000000000000007 x[4] = 0x0000000000000001 x[5] = 0x0000000000000001 x[6] = 0x0000000000000000 x[7] = 0x0000000000000001 x[8] = 0x0000000000000001 x[9] = 0x0000000000000000 x[10] = 0x0000000000000001 x[11] = 0x0000000000000000 x[12] = 0x000000016cf8c8a8 x[13] = 0x000000016cf8c8a0 x[14] = 0x0000000000007e01 x[15] = 0x0000000000000006 x[16] = 0x000000020eff0398 x[17] = 0x0000000106f605d8 x[18] = 0x0000000000000000 x[19] = 0x000000010a72bf70 x[20] = 0x000000020eff0398 x[21] = 0x000000010ab26a48 x[22] = 0x0000000112200400 x[23] = 0x0000001f00014ab8 x[24] = 0x0000000000000007 x[25] = 0x0000000000000008 x[26] = 0x0000000000000005 x[27] = 0x0000000000000001 x[28] = 0x0000000000000006 fp = 0x000000016cf8c990 lr = 0x000000020eee8554 sp = 0x000000016cf8c930 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: BUS (GLEngine:arm64e+0xf7174) (BuildId: 07f4fb2fb29a3ae29179e379a4ac0d1a32000000200000000100000000040d00) in gleUnbindDeleteHashNamesAndObjects+0x48 ==74741==ABORTING ./startQuitLoop.sh: line 18: 74741 Abort trap: 6 /Users/low-batt/Documents/builds/mpv-build/mpv-build/mpv/build/mpv --autofit=30% --config=no --input-ipc-server=/tmp/mpv-socket --mute ~/Movies/Airplaneski.mp4 ```

I'm expecting this requires the attention of a mpv developer with knowledge of different subsystems and how shutdown is supposed to be coordinated.

krackers commented 1 year ago

Thank you for chiming in low-batt, it is always a pleasure to read your posts!

so far we have seen the main thread crash as well as a thread from the CA DispatchGroup.

Have all instances of the crash been with people using OSX on ARM machines? On those machines OpenGL commands go through a translation layer, and it is known that it is more strict about requiring serialization for context access (see https://github.com/glfw/glfw/issues/1997#issuecomment-1292774964)

I wonder if it is a similar issue here?

low-batt commented 1 year ago

Although this OpenGL requirement from OpenGL Programming Guide for Mac is very important:

OpenGL is not reentrant. If you modify the same context from multiple threads simultaneously, the results are unpredictable. Your application might crash or it might render improperly.

The situation at hand is worse. This is during shutdown, so single threading context access is not enough. All systems must be coordinated to ensure no thread tries to use the context after it has been shutdown. So even if access is single threaded it would be like a thread trying to access a file after another thread closed the channel.

The mpv code tries to coordinate this. But it could be some coordination is missing or something is not quite correct in the thread coordination.

This feels like a thread race condition with a very small window of vulnerability.

With that kind of problem the failure is usually, but not always, more likely to occur when running on hardware that is actually concurrently running threads on multiple CPU cores. That may explain why this is being encountered when running on Apple Silicon with its many cores.

low-batt commented 1 year ago

Made me think about this some more. I'm thinking your comment:

Also should be noted that by mutex locking in the apple docs they seem to specifically mean using CGLLockContext. Using just regular thread mutex is not 100% effective.

May have nailed it.

The Apple documentation doesn't really spell this out, as in explain what can go wrong if you do your own locking. I'm thinking the one crash we have seen outside of the main thread may be an example. This stack trace:

Thread 1 Crashed::  Dispatch queue: CA DispatchGroup
…
10  GLEngine                               0x1f4f10554 glDeleteTextures_Exec + 700
11  QuartzCore                             0x198122898 CA::(anonymous namespace)::AsyncDeletion::callback(void*) + 88

Seems to indicate Quartz is performing some operations asynchronously. If the Core Graphics framework is asynchronously accessing the OpenGL context then a lock declared at the application level will be ineffective in single threading access to the context. That means CGLLockContext/CGLUnlockContext must be used in order to coordinate access with the framework threads.

The failure to use these locking methods for single threading access to the OpenGL context may be the root cause of the crashes reported in this issue.

IINA experienced crashes in glDeleteFramebuffers, glDeleteTextures, glGenTextures due to failure to properly manage the OpenGL context. See this comment on IINA issue #3475. Since the fix in PR #3581 was merged no such crashes have been reported. IINA is using CGLLockContext/CGLUnlockContext.

krackers commented 1 year ago

Yup, that makes perfect sense to me, and would fit with what's been observed! Nice debugging work!

krackers commented 1 year ago

I was curious and took a look at disassembly of CGLLockContext:

int _CGLLockContext(int arg0) {
    rbx = arg0;
    rcx = sub_3aa3(arg0); /* Some initialization performed only once at startup */
    rax = 0x2714;
    if (rcx != 0x0) {
            pthread_mutex_lock(*(rbx + 0x1e78) + 0x18);
            rax = 0x0;
    }
    return rax;
}

So it is basically just a regular pthread mutex lock, I guess as you mentioned the key is that Apple's own libraries (e.g. Quartz) make use of this, so if you want to avoid race-conditions with Apple's own libraries trying to access the same context, then you must make use of CGLLockContext so you use the same mutex they do.

low-batt commented 1 year ago

Yes, exactly. The mpv code locked the side door and the asynchronous Quartz framework code walked in the front door. All the code needs to be using the same lock. At least this sure seems like the issue.

Frustrating that it is not easy to find detailed documentation regarding this. I always like to know the reasons behind requirements and what can go wrong if they are not followed. I'm expecting there isn't a lot of discussion on the net about this because most coders use CGLLockContext/CGLUnlockContext as discussed in the Apple documentation.

forthrin commented 1 year ago

Had another one of these

forthrin commented 1 year ago

Yet another one AddressSanitizer:DEADLYSIGNAL at gleUnbindDeleteHashNamesAndObjects.

forthrin commented 8 months ago

Had another one of these on a different Mac/macOS:

System Version: macOS 11.7.10 (20G1427)
Model Identifier: MacBookAir6,2
mpv 0.37.0 built on Nov 30 2023 14:35:43 FFmpeg version: 5.1.4

0   libsystem_kernel.dylib          0x00007fff2047229a mach_msg_trap + 10
1   libsystem_kernel.dylib          0x00007fff2047260c mach_msg + 60
2   com.apple.framework.IOKit       0x00007fff22c63793 io_connect_method + 384
3   com.apple.framework.IOKit       0x00007fff22c635af IOConnectCallMethod + 186
4   com.apple.IOAccelerator         0x00007fff28427577 ioAccelResourceFinalize + 162
5   com.apple.CoreFoundation        0x00007fff2066a72a _CFRelease + 244
6   libGPUSupportMercury.dylib      0x00007fff4640ff7d gldReclaimTexture + 81
7   libGPUSupportMercury.dylib      0x00007fff4640ffd7 gldDestroyTexture + 31
8   libGFXShared.dylib              0x00007fff6b9d66bf gfxDestroyPluginTexture + 59
9   GLEngine                        0x00007fff6bd0ee9d gleFreeTextureObject + 36
10  GLEngine                        0x00007fff6bce169b gleUnbindDeleteHashNamesAndObjects + 182
11  GLEngine                        0x00007fff6bc07fbc glDeleteTextures_Exec + 723
12  io.mpv                          0x00000001047f3c39 gl_tex_destroy + 73
13  io.mpv                          0x00000001047ab497 ra_tex_free + 23
14  io.mpv                          0x00000001047bd200 uninit_rendering + 224
15  io.mpv                          0x00000001047b79a2 uninit_video + 18
16  io.mpv                          0x00000001047b784c gl_video_uninit + 28
17  io.mpv                          0x00000001047aa4cc destroy + 28
18  io.mpv                          0x00000001047c6182 mpv_render_context_free + 178
19  io.mpv                          0x00000001046cc395 CocoaCB.shutdown(_:) + 325
20  io.mpv                          0x00000001046cc44f @objc CocoaCB.processEvent(_:) + 31
21  io.mpv                          0x00000001047d994d -[EventsResponder processEvent:] + 45
22  io.mpv                          0x00000001047d98fa __29-[EventsResponder readEvents]_block_invoke + 74
Akemi commented 7 months ago

i opened a PR #13494 with a fix. thank you @low-batt for your thorough investigation.

forthrin commented 7 months ago

Great work, everyone. I've applied the patch and will notify if it happens again (though it's been very rare).

low-batt commented 7 months ago

I tried to verify the fix. I built master with the address sanitizer and the fix reverted and tried to reproduce the original failure, but it appears the time window of vulnerability where threads can interfere with each other is so small that it is very difficult to trigger the failure. I remember this being the case when I last looked into this problem. I ran mpv through hundreds of play/quit cycles without experiencing the error.

I also ran hundreds of cycles with the fix. And tested without enabling the address sanitizer. No failures were encountered.

The changes should fix this one, but I just wasn't able to verify it.

Akemi commented 7 months ago

yeah, did the same. i wrote a small shell script that launched mpv, quit it after a 1-2 seconds and let it run for nearly 2 hours. i didn't encounter a crash once sadly.