obsproject / obs-studio

OBS Studio - Free and open source software for live streaming and screen recording
https://obsproject.com
GNU General Public License v2.0
60.31k stars 7.99k forks source link

OBS Virtual Camera crashes when switching breakout rooms in Zoom. #7287

Closed attaboyBrad closed 2 years ago

attaboyBrad commented 2 years ago

Operating System Info

macOS 12

Other OS

No response

OBS Studio Version

28.0.0

OBS Studio Version (Other)

28.0.1

OBS Studio Log URL

https://obsproject.com/logs/Chwt_hwR_9cbLuE8

OBS Studio Crash Log URL

No response

Expected Behavior

I expected to switch from one breakout room to another without issue.

Current Behavior

The OBS virtual camera froze in Zoom, and in attempting to restart it the virtual camera ceased to show up as an option in Zoom and Zoom fell back to a raw webcam feed. This has been an issue since the first OBS 28 beta and was mentioned a couple of times in the beta announcement forum thread.

Steps to Reproduce

  1. Open OBS 28
  2. Open Zoom (fully updated as of 22-09-04) and start a meeting.
  3. Start virtual camera in OBS 28.
  4. Enter a breakout room or try to switch breakout rooms in Zoom.

Anything else we should know?

This has been happening basically every time I try to enter, or especially to switch, breakout rooms in Zoom since the first OBS 28 beta (still an issue with 28.0.1).

It's not an issue on 27.2.4, and I've reverted back to that for the time being.

steven-michaud commented 2 years ago

Here's the last output before the crash:

(Sat Oct  8 14:01:56 2022) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[19894] [0x70001367b000] Hook.mm: _CFMachPortCreateWithPort2(): port 0x20583, shouldFreeInfo 1, returning
<CFMachPort 0x148e11580 [0x7ff859d07d70]>{valid = Yes, port = 20583, source = 0x0, callout = __NSFireMachPort (0x7ff818107788), context = <CFMachPort context 0x13e15fcd0>}

(Sat Oct  8 14:01:56 2022) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[19894] [0x70001367b000] Hook.mm: IONotificationPortGetRunLoopSource(): notify 0x14b6dd5b0, notify.masterPort 0x7907, notify.wakePort 0x20583, notify.cfmachPort 0x0, notify.source 0x0, notify.dispatch_source 0x0, notify.refcount 0, returning
(null)

I took the port number (20583) from the port 0x20583 parameter to _CFMachPortCreateWithPort2() and used it to search through the log.

The first hit is:

(Sat Oct  8 14:01:53 2022) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[19894] [0x70000e786000] Hook.mm: _CFMachPortCreateWithPort2(): port 0x20583, shouldFreeInfo 0, returning
<CFMachPort 0x148e11580 [0x7ff859d07d70]>{valid = Yes, port = 20583, source = 0x0, callout = __NSFireMachPort (0x7ff818107788), context = <CFMachPort context 0x13e15fcd0>}
    (hook.dylib) PrintStackTrace() + 0x68
    (hook.dylib) Hooked__CFMachPortCreateWithPort2(__CFAllocator const*, unsigned int, void (*)(__CFMachPort*, void*, long, void*), CFMachPortContext*, unsigned char*, unsigned char) + 0xbd
    (Foundation) -[NSMachPort initWithMachPort:options:] + 0x14d
    (Foundation) +[NSMachPort parseMachMessage:localPort:remotePort:msgid:components:] + 0x167
    (hook.dylib) +[NSObject(NSMachPortSwizzling) NSMachPort_parseMachMessage:localPort:remotePort:msgid:components:] + 0x4d
    (Foundation) __NSFireMachPort + 0xa4
    (hook.dylib) Hooked___NSFireMachPort(__CFMachPort*, void*, long, void*) + 0x31
    (CoreFoundation) __CFMachPortPerform + 0x120
    (CoreFoundation) __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 0x29
    (CoreFoundation) __CFRunLoopDoSource1 + 0x26b
    (CoreFoundation) __CFRunLoopRun + 0x96f
    (CoreFoundation) CFRunLoopRunSpecific + 0x232
    (Foundation) -[NSRunLoop(NSRunLoop) runMode:beforeDate:] + 0xd8
    (Foundation) -[NSRunLoop(NSRunLoop) runUntilDate:] + 0x67
    (unknown) unknown + 0x0
    (libdispatch.dylib) _dispatch_call_block_and_release + 0xc
    (libdispatch.dylib) _dispatch_client_callout + 0x8
    (libdispatch.dylib) _dispatch_root_queue_drain + 0x2a1
    (libdispatch.dylib) _dispatch_worker_thread2 + 0xa0
    (libsystem_pthread.dylib) _pthread_wqthread + 0x100
    (libsystem_pthread.dylib) start_wqthread + 0xf

This mac-virtualcam code code is running to deal with a mach port that's just "fired". Notice that _CFMachPortCreateWithPort2() has returned CFMachPort 0x148e11580.

The next line in the log is:

(Sat Oct  8 14:01:53 2022) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[19894] [0x70000e786000] [NSMachPort parseMachMessage:]: localPort <CFMachPort 0x148c45a00 [0x7ff859d07d70]>{valid = Yes, port = 2090b, source = 0x148c78940, callout = __NSFireMachPort (0x7ff818107788), context = <CFMachPort context 0x14d2e9160>}, remotePort (null), msgid 0xe783ea4, components (
    "<NSMachPort: 0x148e11580>",
    {length = 8, bytes = 0x9e6505c9ce0e0000},
    {length = 4, bytes = 0x1e000000},
    {length = 4, bytes = 0x01000000}
)

Notice that the first item in components is <NSMachPort: 0x148e11580>, which has the same address as the CFMachPortRef returned by _CFMachPortCreateWithPort2(). From looking at -(void)OBSDALMachClient handlePortMessage:(NSPortMessage *)message you can tell that this object, at components[0], is used to initialize NSMachPort *framePort.

So we've found what the CFMachPortRef that triggered this bug's crashes originally was, before it turned into a zombie.

steven-michaud commented 2 years ago

@PatTheMav: I discovered that if, in my patch above, I restore the call to mach_port_dealloc() (so that I'm calling it just before [framePort invalidate]), it gets compiled out! I've never seen anything like this before, but then I've never used ARC (automatic reference counting).

I also had the idea of calling [framePort release] instead of [framePort invalidate], but the compiler tells me this isn't allowed when using ARC. However I found I could play a trick on the compiler and call CFRelease((CFMachPortRef)framePort) (like you do below using CFRelease(surface)). This resolves the crashes. CFMachPortInvalidate() still gets called on framePort, and my STR no longer works. I wonder if it will also resolve the leaks.

diff --git a/plugins/mac-virtualcam/src/dal-plugin/OBSDALMachClient.mm b/plugins/mac-virtualcam/src/dal-plugin/OBSDALMachClient.mm
index c6126398c..a9c7212bf 100644
--- a/plugins/mac-virtualcam/src/dal-plugin/OBSDALMachClient.mm
+++ b/plugins/mac-virtualcam/src/dal-plugin/OBSDALMachClient.mm
@@ -114,8 +114,9 @@

                        IOSurfaceRef surface = IOSurfaceLookupFromMachPort(
                                [framePort machPort]);
-                       mach_port_deallocate(mach_task_self(),
-                                            [framePort machPort]);
+                       //mach_port_deallocate(mach_task_self(),
+                       //                     [framePort machPort]);
+                       CFRelease((CFMachPortRef)framePort);

                        if (!surface) {
                                ELog(@"Failed to obtain IOSurface from Mach port");
steven-michaud commented 2 years ago

ARC quite apparently doesn't work on framePort. I have a hunch the reason is that it's really a CFMachPortRef (created by _CFMachPortCreateWithPort2()), and those aren't really covered by ARC. In many other respects NSMachPort * and CFMachPortRef are equivalent (toll-free bridged). But not this one.

steven-michaud commented 2 years ago

I spoke too soon. My second patch doesn't work. My changes somehow didn't register unless I rebuilt OBS Studio entirely from scratch :-( This may also be the reason for the call to mach_port_deallocate() apparently being "compiled away".

From this point, I'll re-clone each time and rebuild from scratch. Anybody know of a way to do make clean?

steven-michaud commented 2 years ago

For what it's worth, I have decent results with the following patch. I double-checked the obs-mac-virtualcam module (in a disassembler) to make sure it really matched my change. Both [framePort invalidate] and mach_port_deallocate() showed up where they should be. I ran Activity Monitor on Firefox Nightly (the main process), assuming that any major leaks in obs-mac-virtualcam would show up there. I didn't see any.

diff --git a/plugins/mac-virtualcam/src/dal-plugin/OBSDALMachClient.mm b/plugins/mac-virtualcam/src/dal-plugin/OBSDALMachClient.mm
index c6126398c..c6c0cde2a 100644
--- a/plugins/mac-virtualcam/src/dal-plugin/OBSDALMachClient.mm
+++ b/plugins/mac-virtualcam/src/dal-plugin/OBSDALMachClient.mm
@@ -114,6 +114,7 @@

                        IOSurfaceRef surface = IOSurfaceLookupFromMachPort(
                                [framePort machPort]);
+                       [framePort invalidate];
                        mach_port_deallocate(mach_task_self(),
                                             [framePort machPort]);
steven-michaud commented 2 years ago

@PatTheMav, how do you measure leaks?

Anyone know of an "activity monitor" that will give you per-module memory use statistics?

PatTheMav commented 2 years ago

Instruments has this built-in, it does leak checks automatically.

steven-michaud commented 2 years ago

Using Instruments to detect leaks is a royal pain. I eventually figured out that I need to resign Firefox to give it a com.apple.security.get-task-allow entitlement. Then Instruments crashed Firefox almost immediately. From the crash stack, it seems Instruments doesn't get along with Firefox's implementation of jemalloc.

So ... please try my latest patch and let me know if you see leaks with it.

steven-michaud commented 2 years ago

I did a local mozilla-central build (a Firefox build from current trunk code) with jemalloc disabled (ac_add_options --disable-jemalloc). Instruments' leak detector works fine with that. Then I ran my "fixed" OBS and performed the STR with which I was originally able to reproduce this bug's crashes. I switched cameras back and forth ten or more times, and saw no leaks that could be attributed to the obs-mac-virtualcam plugin.

@PatTheMav, I await the results of your tests. If all goes well I'll do a pull request for my latest patch.

PatTheMav commented 2 years ago

Alas without any changes or fixes Instruments reported hundreds of memory leaks with our current virtual cam implementation that I'm currently trying to get to the bottom of. There are several places where the handling of mach port rights don't seem right, as well as unnecessary IOSurface locks.

This will take me quite a bit to figure out, but as I mentioned above we're in the process of replacing this virtual camera implementation with a Camera Extension anyway, so we're probably only getting the legacy DAL plugin into a workable state and leave it at that.

steven-michaud commented 2 years ago

Then I'll let you take the lead on this. Let me know when you want me to do a pull request.

But don't wait too long :-) These crashes are among the top five Mac crashes on Firefox betas and releases, and their numbers seem to be growing.

steven-michaud commented 2 years ago

For what it's worth, I don't see any leaks (with my STR) in a local mozilla-central build with ASAN (Address Sanitizer) support. (This build also disables jemalloc.)

PatTheMav commented 2 years ago

Annoyingly enough the leaks supposedly all happen within -[NSMachPort initWithMachPort:options:] which calls _CFMachPortCreateWithPort2 down the line as you expected, initiated by +[NSMachPort portWithMachPort:options:].

These leaks happen continuously as long as virtual camera clients are connected and there is a visible spike in memory usage after sending a frame message, but overall memory consumption seems stable overall.

Also here's what leaks has to say:

      6 (464 bytes) ROOT CYCLE: <OS_dispatch_source 0x600001e01480> [128]
         5 (336 bytes) ROOT CYCLE: 0x600001be2ae0 [112]
            2 (112 bytes) ROOT CYCLE: 0x6000025486c0 [64]
               1 (48 bytes) ROOT CYCLE: <__NSMallocBlock__ 0x600003fc26d0> [48]  CoreFoundation  ___CFMachPortCreateWithPort2_block_invoke  0x194601260
                  __strong [capture] --> CYCLE BACK TO <OS_dispatch_source 0x600001e01480> [128]
            2 (112 bytes) 0x600002549ec0 [64]
               1 (48 bytes) <__NSMallocBlock__ 0x600003fc3210> [48]  CoreFoundation  ___CFMachPortCreateWithPort2_block_invoke.3  0x194662490

Overall your suggested fix does get rid of the crashes, so I'll throw this in with some other fixes I have implemented. If you don't mind, I'd use the email address on your GitHub profile for the co-author attribution?

attaboyBrad commented 2 years ago

I've been doing my best to keep up with the back and forth here over the last few days, but I'm afraid it's almost entirely over my head.

I'll continue to try to follow along, but do let me know if there's any sort of testing I can help with in the meantime.

I'm so grateful @PatTheMav and @steven-michaud for the effort you're putting in.

PatTheMav commented 2 years ago

@attaboyBrad Long story short:

It seems to be that if we don't invalidate the port (a sharing mechanism between processes on macOS) for the IOSurface (a macOS thing that represents image data) after receiving it in the Virtual Camera client (the thing that gets loaded into your browser or conferencing software), we exhaust the amount of ports available to IOKit (the macOS framework responsible for handling webcams, among other things), leading macOS to recycle ports, which IOKit really doesn't like and crashes.

I don't know why explicit deallocation doesn't do the trick, but I take any fix I can get.

steven-michaud commented 2 years ago

Overall your suggested fix does get rid of the crashes, so I'll throw this in with some other fixes I have implemented. If you don't mind, I'd use the email address on your GitHub profile for the co-author attribution?

That's fine with me.

steven-michaud commented 2 years ago

I get different results from leaks. Here are some that I've picked out that seem to be relevant:

      5 (192 bytes) ROOT LEAK: <AVCaptureDeviceFormat_Tundra 0x6000006d0ac0> [32]
         4 (160 bytes) _internal --> <AVCaptureDeviceFormatInternal_Tundra 0x600000cbf450> [48]
            3 (112 bytes) videoSupportedFrameRateRanges --> <NSArray 0x600000206b30> [16]
               2 (96 bytes) __strong _object --> <AVFrameRateRange_Tundra 0x6000002060d0> [16]
                  1 (80 bytes) _internal --> <AVFrameRateRangeInternal_Tundra 0x60000203c690> [80]

      5 (192 bytes) ROOT LEAK: <AVCaptureDeviceFormat_Tundra 0x6000006d8560> [32]
         4 (160 bytes) _internal --> <AVCaptureDeviceFormatInternal_Tundra 0x600000c5d8f0> [48]
            3 (112 bytes) videoSupportedFrameRateRanges --> <NSArray 0x60000026ea50> [16]
               2 (96 bytes) __strong _object --> <AVFrameRateRange_Tundra 0x60000026ea30> [16]
                  1 (80 bytes) _internal --> <AVFrameRateRangeInternal_Tundra 0x6000022fc5a0> [80]
      3 (96 bytes) ROOT LEAK: <AVCaptureDeviceFormat_Tundra 0x60000074c0e0> [32]
         2 (64 bytes) _internal --> <AVCaptureDeviceFormatInternal_Tundra 0x600000c5adf0> [48]
            1 (16 bytes) videoSupportedFrameRateRanges --> <NSArray 0x600000254cf0> [16]
      2 (80 bytes) ROOT LEAK: <AVCaptureDeviceFormat_Tundra 0x6000001c3180> [32]
         1 (48 bytes) _internal --> <AVCaptureDeviceFormatInternal_Tundra 0x600000cbb9c0> [48]

None of this, of course, comes directly from the plugin itself (obs-mac-virtualcam).

I'm just doing leaks [firefox main process pid]. These "tree format" results come at the end of a very long list of single-line items. (This is my first time using leaks or Instruments.)

Edit: I don't see ROOT CYCLE anywhere in my leaks output. Do you need to play tricks to generate it?

PatTheMav commented 2 years ago

leaks OBS was all I had to do while running a debug build of the app.

The leaks don't come from the Mach client part, but from the Mach server inside OBS when creating a new NSMachPort with the port given by the IOSurface.

steven-michaud commented 2 years ago

OK. Now I see your results, in both leaks and Instruments.

Here's a stack trace from Instruments:

_malloc_zone_malloc 
_Block_copy 
_dispatch_Block_copy    
_dispatch_source_set_handler    
_CFMachPortCreateWithPort2  
-[NSMachPort initWithMachPort:options:] 
+[NSMachPort portWithMachPort:options:] 
-[OBSDALMachServer sendPixelBuffer:timestamp:fpsNumerator:fpsDenominator:]  
virtualcam_output_raw_video(void*, video_data*) 
default_raw_video_callback  
video_output_cur_frame  
video_thread    
_pthread_start  
thread_start    
attaboyBrad commented 2 years ago

@PatTheMav,

If it's not too much trouble, would you be so kind as to provide instructions for how a layperson might test/implement your fixes from the linked pull request?

OBS 29 seems like a long time to wait...

sa1 commented 2 years ago

@PatTheMav,

If it's not too much trouble, would you be so kind as to provide instructions for how a layperson might test/implement your fixes from the linked pull request?

OBS 29 seems like a long time to wait...

The artifacts are available here to download: https://github.com/obsproject/obs-studio/actions/runs/3257497328