obsproject / obs-studio

OBS Studio - Free and open source software for live streaming and screen recording
https://obsproject.com
GNU General Public License v2.0
58.93k stars 7.85k forks source link

Firefox crashes with [@ obs-virtualcam-module64.dll | BaseThreadInitThunk ] #4243

Closed Archaeopteryx closed 1 year ago

Archaeopteryx commented 3 years ago

Platform

Operating system and version: Windows (7, 8.1, 10) OBS Studio version: 26.*

Expected Behavior

Firefox doesn't crash on websites accessing webcams.

Current Behavior

I (person looking at crash reports for Firefox) and people I asked cannot reproduce the issue but some users encounters on websites which access web cams (e.g. Google Meet, https://webcamtests.com/ , BigBlueButton installs; interestingly no mention of Zoom). Recent crash reports (the "Reports" tab provides access to individual crash reports).

Steps to Reproduce

unknown

Additional information

Bug in Mozilla's bug tracker: bug 1691902

notr1ch commented 3 years ago

Unfortunately our build process excluded symbols for the virtual camera module which makes this difficult to diagnose. We'll try to have this fixed for the next version and I'll keep an eye on the Mozilla crash reports to try and get a useful stack trace.

msmania commented 3 years ago

Hi, I'm working on this issue from Mozilla side.

To locate the crashing address, I built obs-virtualcam-module64.dll on tags/26.1.0 locally and compared its binary pattern to the minidumps in our crash reports. So far I could locate four different crashing patterns as below.

  1. Write AV at nv12_scale_nearest_to_yuy2. dst was inaccessible.
  2. Write AV at nv12_convert_to_yuy2. dst was inaccessible.
  3. Read AV at nv12_convert_to_yuy2. src_uv was inaccessible.
  4. Write AV at nv12_convert_to_yuy2. dst was inaccessible.

I hope this information helps investigation on the OBS project side.

jp9000 commented 3 years ago

Thanks for the information. Those hints are useful. I apologize for any issues it might be causing.

If anyone has any other information, particularly any information on how to reproduce, please let us know.

gcp commented 3 years ago

@jp9000 I notice that all the crashes are in the yuy2 conversion, which was added recently (4 months ago).

Do you know what the use case for this is? Like, what kind of setup is likely to use this and thus lead to STR?

If we can see that this code is buggy/crashing, can it be disabled in OBS? Or is it required for something? (And then what is it required for?)

randomascii commented 3 years ago

Have any updated versions with symbols been released? We are continuing to get crash reports from Chrome that might also be this issue and I could analyze these much more easily with obs_virtualcam_module64 symbols.

I've only looked at one crash and it happens when chrome!media::VideoCaptureDeviceClient::OnIncomingCapturedData tries to make a virtual function call. The vtable pointer is invalid (c08f2200`a8280000) so this doesn't work. I am assuming that there is a use-after-free happening here but I can't be certain. The call stack is:

00 chrome!media::VideoCaptureDeviceClient::OnIncomingCapturedData 01 chrome!media::VideoCaptureDeviceWin::FrameReceived 02 chrome!media::SinkInputPin::Receive 03 obs_virtualcam_module64 ???

The module information is:

Timestamp:        Mon Dec 14 09:57:59 2020 (5FD7C3C7)
CheckSum:         00000000
ImageSize:        000E0000
File version:     26.1.0.0
jp9000 commented 3 years ago

That stack trace is actually quite informative. I appreciate that, and I'm currently looking into it. Unfortunately it seems the symbol file for the virtualcam module for the current release was annoyingly lost, but I think we shouldn't need it as long as we have the addresses and image base. I'm really sorry for the trouble.

slivermeteor commented 3 years ago

I also found a crash with obs-virtualcam-module32.dll. When my program starts, it will enums all camera devices in pc. Sometimes a crash occurs when it's the turn of obs-virtualcam-module32.dll Here is the crash stack: image I compiled the obs-virtualcam-module32.dll symbol locally and forced it to load. obs-virtualcam-module.dll info: image Any ideas for this crash?

randomascii commented 3 years ago

Any updates?

Also, has there been consideration given to re-releasing the current version, but this time with symbols available? If I could download the PE file and PDB file then I could give a much more helpful analysis of what is happening in Chrome. Right now I don't even know if this Firefox crash is the same crash that we see in Chrome.

notr1ch commented 3 years ago

We're aiming to start the release process for v27 near the end of the month. This will include PDBs for the virtual camera module. Based on previous update roll-outs, we expect a majority of users to have updated within 24-48 hours of a final release.

Unfortunately we weren't able to figure out exactly where we are crashing, so there is no fix for the crash yet.

notr1ch commented 3 years ago

We've identified one cause of crashing that could occur when the user replaced the default placeholder.png with an image that was not the same resolution. The scaling could go out of bounds when the placeholder was displayed, which seems to correlate with the original report.

@slivermeteor Can you confirm if you had a modified placeholder.png when you experienced your crash?

msmania commented 3 years ago

I could manually resolve the callstack of the crash. The crash on Firefox is different from Chrome's crash or the placeholder's crash.

obs_virtualcam_module64!nv12_scale_nearest_to_yuy2
obs_virtualcam_module64!nv12_do_scale
obs_virtualcam_module64!video_queue_read
obs_virtualcam_module64!VCamFilter::ShowOBSFrame
obs_virtualcam_module64!VCamFilter::Frame
obs_virtualcam_module64!VCamFilter::Thread
obs_virtualcam_module64!std::thread::_Invoke<std::tuple<<lambda_4e6f779957cbddf8fba6c1db701dac17> >,0>
obs_virtualcam_module64!thread_start<void (__cdecl*)(void *),0>

I think the crash happened because the allocated buffer was too small. In the example below, accessing 00000130e4a9c000 caused AV because it was a free region. Probably the allocated buffer was [130e4800000, 130e4a9c000) located above.

0:104> .excr
rax=0000000000040bf0 rbx=0000000000000186 rcx=0000000000000059
rdx=00000000000003c0 rsi=0000000000006db0 rdi=000000000000db60
rip=00007ffc917cacde rsp=0000001cd30bfc20 rbp=0000000000000000
 r8=0000000000000551  r9=00000130e4a9c000 r10=00000000001032cc
r11=00000000000003c0 r12=000000000000030c r13=00000130e3e0db08
r14=0000000000000780 r15=00000130e3dba480
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
obs_virtualcam_module64+0xacde:
00007ffc`917cacde 418809          mov     byte ptr [r9],cl ds:00000130`e4a9c000=??

0:104> !address
...snip...
       130`e47c2000      130`e4800000        0`0003e000 MEM_PRIVATE MEM_RESERVE                                    <unknown>  
+      130`e4800000      130`e4a9c000        0`0029c000 MEM_PRIVATE MEM_COMMIT  PAGE_READWRITE                     <unknown>  
+      130`e4a9c000      130`e4b00000        0`00064000             MEM_FREE    PAGE_NOACCESS                      Free       
...snip...

Maybe this can happen if the video size gets smaller while streaming? In every frame, VCamFilter::Frame handles such size change, but I cannot find how the size of a buffer, which is set in OutputPin::LockSampleData, is updated.

RytoEX commented 3 years ago

OBS Studio 27.0.0-rc2 has been released with PDBs included for the OBS Virtual Camera DLLs, and the PDBs should be included in all future releases. While this won't help with tracking down crashes related to previous versions of OBS, this should help track down any crashes related to OBS Studio 27 and newer.

@notr1ch has been working on some bug fixes for the virtual camera. Some of those changes have already shipped (#4417, https://github.com/obsproject/obs-studio/commit/5c8587eab3164f8711e8a19791b5c4d54edf9fa0, https://github.com/obsproject/obs-studio/commit/670156db8bde1ab940f2f2489c70653334b7ec3b, https://github.com/obsproject/obs-studio/commit/45643adb036c3bb40a7e400203499abb4ad7a3dd, https://github.com/obsproject/obs-studio/commit/642490112036e1e6f51356d64cb39e3612786b09) and some are still in the works. He is more familiar with the code than I am, so he might be able to comment further on those changes.

If it's determined that there are multiple unique crashes being reported in this thread, could I trouble the reporters to open new issues for the ones that don't match the originally reported crash? That should help us target each crash separately without mixing up information.

kent-chan-dev commented 3 years ago

OBS Studio 27.0.0-rc2 has been released with PDBs included for the OBS Virtual Camera DLLs, and the PDBs should be included in all future releases. While this won't help with tracking down crashes related to previous versions of OBS, this should help track down any crashes related to OBS Studio 27 and newer.

@notr1ch has been working on some bug fixes for the virtual camera. Some of those changes have already shipped (#4417, 5c8587e, 670156d, 45643ad, 6424901) and some are still in the works. He is more familiar with the code than I am, so he might be able to comment further on those changes.

If it's determined that there are multiple unique crashes being reported in this thread, could I trouble the reporters to open new issues for the ones that don't match the originally reported crash? That should help us target each crash separately without mixing up information.

Thanks, I had this happen to me twice the last 7 days on Firefox 88. Is this related to hardware problems, or entirely it is software?

I had been having some hardware problems that I hope were fixed by swapping the GPU.

gcp commented 3 years ago

@kgc35 If you go to about:crashes, and link some of the related reports, we may be able to investigate.

randomascii commented 3 years ago

I did some crash queries and found four related crashes with the latest OBS and latest Chrome. It's too early to tell if the crash rate will be affected by the newest version but I don't think so. Here is the data from one of the crashes, with the OBS symbols:

0:016> .ecxr rax=000000520e5ff618 rbx=000032f4002778e0 rcx=0000000000000000 rdx=0000000000000000 rsi=0000000000000000 rdi=000000520e5ff618 rip=00007ff837d63fe9 rsp=000000520e5ff5e0 rbp=000000520e5ff8e0 r8=0000000000000000 r9=000000520e5ff608 r10=0000000000000000 r11=0000000000000246 r12=000000520e5ff8f0 r13=000000520e5ff8e0 r14=000032f40043a080 r15=000000520e5ff8c8 iopl=0 nv up ei pl nz na po nc cs=0033 ss=0000 ds=0000 es=0000 fs=0053 gs=002b efl=00010206 chrome!media::VideoCaptureDeviceClient::OnIncomingCapturedBufferExt+0x199: 00007ff837d63fe9 488b06 mov rax,qword ptr [rsi] ds:0000000000000000=???????????????? 0:016> kc ** Stack trace for last set context - .thread/.cxr resets it Call Site 00 chrome!media::VideoCaptureDeviceClient::OnIncomingCapturedBufferExt 01 chrome!media::VideoCaptureDeviceClient::OnIncomingCapturedData 02 chrome!media::VideoCaptureDeviceWin::FrameReceived 03 chrome!media::SinkInputPin::Receive 04 obs_virtualcam_module64!DShow::OutputPin::UnlockSampleData 05 obs_virtualcam_module64!DShow::OutputFilter::UnlockSampleData 06 obs_virtualcam_module64!VCamFilter::Frame 07 obs_virtualcam_module64!VCamFilter::Thread 08 obs_virtualcam_module64!VCamFilter::{ctor}::l2::::operator() 09 obs_virtualcam_module64!std::invoke 0a obs_virtualcam_module64!std::threed::_Invoke<std::tuple< >,0> 0b obs_virtualcam_module64!thread_start<unsigned int (cdecl)(void *),1> 0c KERNEL32!BaseThreadInitThunk 0d ntdll!RtlUserThreadStart

(note that I change the spelling of thread to threed to stop an emoji from being inserted there)

Does that help?

BTW, one of the four crashes had a slightly different build of the DLL. The version number is the same but the build date was April 2nd instead of April 12. FWIW:

0:015> lmv m obs_virtualcam_module64 Browse full module list start end module name 00007ffcca120000 00007ffcca201000 obs_virtualcam_module64 T (no symbols)
Loaded symbol image file: obs-virtualcam-module64.dll Image path: C:\Program Files\obs-studio\data\obs-plugins\win-dshow\obs-virtualcam-module64.dll Image name: obs-virtualcam-module64.dll Browse all global symbols functions data Timestamp: Fri Apr 2 08:52:08 2021 (60673DA8) CheckSum: 00000000 ImageSize: 000E1000 File version: 27.0.0.0

randomascii commented 3 years ago

Note: the crash rate with 27.0 does seem to be much lower, so maybe the bug is fixed? We'll need a lot more users to upgrade to be sure. The low 27.0 crash rate might just represent low usage. Are there usage numbers you can share?

kent-chan-dev commented 3 years ago

Note: the crash rate with 27.0 does seem to be much lower, so maybe the bug is fixed? We'll need a lot more users to upgrade to be sure. The low 27.0 crash rate might just represent low usage. Are there usage numbers you can share?

@kgc35 If you go to about:crashes, and link some of the related reports, we may be able to investigate.

Thanks for the updates. Unfortunately Firefox did not create a crash dump, so there is no file to analyze.

@randomascii how can I update to 27.0? It seems like the official release is still at 26.

randomascii commented 3 years ago

I didn't update to 27.0 (I'm not an OBS user) but I grabbed the DLL and PDB files for 27.0 from here:

https://github.com/obsproject/obs-studio/releases/tag/27.0.0-rc2

There's also an installer

gcp commented 3 years ago

@randomascii As pointed out in https://github.com/obsproject/obs-studio/issues/4243#issuecomment-819011643, the Chrome crash has a different signature and may very well have a different root cause (so the lower crash volume you see may or may not have anything to do with the problem this issue is about...). Can you file a different issue (as already requested here https://github.com/obsproject/obs-studio/issues/4243#issuecomment-819159085)? This will prevents mixups in the analysis.

randomascii commented 3 years ago

Good point. I filed this issue: https://github.com/obsproject/obs-studio/issues/4615

gcp commented 3 years ago

Unfortunately Firefox did not create a crash dump, so there is no file to analyze.

This makes it somewhat less likely it's this issue, as that is caught by our crash reporter. If the problem is persistent, it may be worth trying Firefox Nightly, which is able to catch and report some hardware or driver related crashes previous versions couldn't.

kent-chan-dev commented 3 years ago

Yeah, Firefox did not generate any crash dumps, nor did Windows. However, I do have these error messages from Windows Reliability. image image image

msmania commented 3 years ago

@kgc35 Thank you for sharing those screenshots! The exact same crash (the module was unloaded and the offset is 0xcdf7) has been filed as bug 1707060. Our WebRTC team is still looking, but at least it's different from the issue discussed here or bug 1691902.

kent-chan-dev commented 3 years ago

@kgc35 Thank you for sharing those screenshots! The exact same crash (the module was unloaded and the offset is 0xcdf7) has been filed as bug 1707060. Our WebRTC team is still looking, but at least it's different from the issue discussed here or bug 1691902.

Thanks. What should I do in the meantime in order to avoid this problem? I regularly use OBS virtual camera in Firefox.

gcp commented 3 years ago

What should I do in the meantime in order to avoid this problem?

If you can comment in https://bugzilla.mozilla.org/show_bug.cgi?id=1707060 and provide more details or steps to reproduce the problem, it may help us track it down more quickly. As we don't understand the root cause yet I don't know of any workaround.

kent-chan-dev commented 3 years ago

What should I do in the meantime in order to avoid this problem?

If you can comment in https://bugzilla.mozilla.org/show_bug.cgi?id=1707060 and provide more details or steps to reproduce the problem, it may help us track it down more quickly. As we don't understand the root cause yet I don't know of any workaround.

No problem. I am glad to help with this problem. I have mentioned that it is due to unloading the camera, when I end a call or switch a type of call on Doxy.me.

I also linked the images here at Bugzilla.

RytoEX commented 3 years ago

What should I do in the meantime in order to avoid this problem?

If you can comment in https://bugzilla.mozilla.org/show_bug.cgi?id=1707060 and provide more details or steps to reproduce the problem, it may help us track it down more quickly. As we don't understand the root cause yet I don't know of any workaround.

No problem. I am glad to help with this problem. I have mentioned that it is due to unloading the camera, when I end a call or switch a type of call on Doxy.me.

I also linked the images here at Bugzilla.

Are you able to reproduce the crash in something that doesn't require specific meeting rooms or signing up for the service, such as https://webcamtests.com/ or some other testing tool? Does this also happen with the doxybot precall test from https://intercom.help/doxyme/en/articles/2350624-video-issue-my-patient-can-t-see-me ?

Are you able to reproduce this with the virtual camera from OBS Studio 27.0.0-rc2? This will require either uninstalling OBS Studio 26.1.1 and installing 27.0.0-rc2, or running (as administrator) virtualcam-uninstall.bat from OBS Studio 26.1.1 and running virtualcam-install.bat from OBS Studio 27.0.0-rc2.

It seems to me that https://bugzilla.mozilla.org/show_bug.cgi?id=1691902 and https://bugzilla.mozilla.org/show_bug.cgi?id=1707060 are different crashes, so I don't think we should track both in this Issue. The replies on 1707060 also seem unclear if this is something we need to address in OBS or if it needs to be addressed in Firefox.

kent-chan-dev commented 3 years ago

What should I do in the meantime in order to avoid this problem?

If you can comment in https://bugzilla.mozilla.org/show_bug.cgi?id=1707060 and provide more details or steps to reproduce the problem, it may help us track it down more quickly. As we don't understand the root cause yet I don't know of any workaround.

No problem. I am glad to help with this problem. I have mentioned that it is due to unloading the camera, when I end a call or switch a type of call on Doxy.me. I also linked the images here at Bugzilla.

Are you able to reproduce the crash in something that doesn't require specific meeting rooms or signing up for the service, such as https://webcamtests.com/ or some other testing tool? Does this also happen with the doxybot precall test from https://intercom.help/doxyme/en/articles/2350624-video-issue-my-patient-can-t-see-me ?

Are you able to reproduce this with the virtual camera from OBS Studio 27.0.0-rc2? This will require either uninstalling OBS Studio 26.1.1 and installing 27.0.0-rc2, or running (as administrator) virtualcam-uninstall.bat from OBS Studio 26.1.1 and running virtualcam-install.bat from OBS Studio 27.0.0-rc2.

It seems to me that https://bugzilla.mozilla.org/show_bug.cgi?id=1691902 and https://bugzilla.mozilla.org/show_bug.cgi?id=1707060 are different crashes, so I don't think we should track both in this Issue. The replies on 1707060 also seem unclear if this is something we need to address in OBS or if it needs to be addressed in Firefox.

Yes, I can try in the next two days.

Regarding the crashes, well I am inclined to think it is a problem on OBS' end, since people are reporting crashes in both Chrome and Firefox, independent of hardware. It seems that the virtual camera driver is the commonality.

What was changed in Release 27 that might address this issue?

notr1ch commented 3 years ago

While these may all look like "OBS virtual camera crashes" from the outside, there are distinct differences between them:

kent-chan-dev commented 3 years ago

While these may all look like "OBS virtual camera crashes" from the outside, there are distinct differences between them:

* The original report in this thread is one crash reported in Firefox that we're looking in to as it clearly points to our code. This might be caused by custom placeholder images - if a user used the wrong size image it would cause a crash. This was fixed in v27, further testing is needed to confirm if this was the same crash as identified in this issue.

* The Chrome report is a different crash in Chrome's code, de-referencing a null pointer. This may be our fault for not shutting down the virtual output properly, but it's a very different crash from the original report which is why it was moved to #4615.

* Finally there is this report, where Firefox is crashing trying to access code in the filter that has been unloaded. Since we wait on the output thread as part of our filter destruction, it is unlikely that our own filter code is trying to execute after unloading, so Firefox may be holding a stale reference to the filter somewhere. It's again, unrelated to the original report in this thread, so I would prefer to keep discussion to the Firefox bug tracker at https://bugzilla.mozilla.org/show_bug.cgi?id=1707060. If after investigation it turns out to be something that's more likely to be on our end, we can open a separate issue for it.

Understood, the root cause for each of the problems are different, although they come from the interaction between the browser and OBS virtual camera.

What are the steps to update my virtual camera to version 27? Will I need to uninstall all of OBS? I want to do some testing that will help us clarify this problem.

RytoEX commented 3 years ago

Regarding the crashes, well I am inclined to think it is a problem on OBS' end, since people are reporting crashes in both Chrome and Firefox, independent of hardware. It seems that the virtual camera driver is the commonality.

Minor pedantry: on Windows, the virtual camera is not a driver. It's a DirectShow filter.

What was changed in Release 27 that might address this issue?

See my earlier comment for changes made to the virtual camera code for OBS Studio 27. Though I don't know for sure that any of those changes would address the crash that you are seeing.

What are the steps to update my virtual camera to version 27? Will I need to uninstall all of OBS? I want to do some testing that will help us clarify this problem.

See my other earlier comment about this procedure:

This will require either uninstalling OBS Studio 26.1.1 and installing 27.0.0-rc2, or running (as administrator) virtualcam-uninstall.bat from OBS Studio 26.1.1 and running virtualcam-install.bat from OBS Studio 27.0.0-rc2.

Since 27.0.0-rc3 is now out, use that instead. You don't need to uninstall your userdata (profiles and scene collections). You just need to either:

  1. Uninstall OBS 26.1.1 and install OBS 27.0.0-rc3. or
  2. Run the appropriate uninstall/install scripts for the virtual camera for each version while having both installed in separate locations (or using a portable/zip installation of OBS 27).

Though, please also understand that while we try our best to prevent bugs, we don't consider RCs to be stable releases.

msmania commented 3 years ago

We're still receiving the crash reports from obs-virtualcam-module64.dll v27.

The crashing stack is the same as I mentioned in the earlier comment. Either the destination or source buffer was not accessible in nv12_convert_to_yuy2.

The first crash (ed786d15-8ed2-49d9-ae5d-a2bd60210611) is read AV for 00000253015de000. In my understanding the source buffer is a mapped file created by OBS Studio. As you can see below, there is a mapped region just above the crashing address. So I believe the problem was buffer copy was started with a wrong (or old) frame size.

0:120> r
Last set context:
rax=000000000000004a rbx=0000025302b18c00 rcx=00000253015de002
rdx=0000025302b18480 rsi=000000e47070bca0 rdi=0000000000000002
rip=00007ff9685db66a rsp=0000009aebe3fce8 rbp=0000000000000000
 r8=000002530158c300  r9=0000025302b18c00 r10=00000253015ddb40
r11=0000000000001100 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na pe cy
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010283
obs_virtualcam_module64!nv12_convert_to_yuy2+0x4a:
00007ff9`685db66a 0fb641fe        movzx   eax,byte ptr [rcx-2] ds:00000253`015de000=??

0:120> !address
...
       253`012fe000      253`01300000        0`00002000 MEM_PRIVATE MEM_RESERVE                                    <unknown>  
+      253`01300000      253`015de000        0`002de000 MEM_MAPPED  MEM_COMMIT  PAGE_READONLY                      <unknown>  [................]
+      253`015de000      253`01700000        0`00122000             MEM_FREE    PAGE_NOACCESS                      Free       
...
crossan007 commented 2 years ago

I'm seeing Firefox 95.0.2 crash everytime I end a Google Meet meeting (or disable my camera in the meeting) with OBS 27.1.3

notr1ch commented 2 years ago

Are you using the virtual camera as your meeting camera? If so, what canvas resolution are you using in OBS?

SD777x commented 2 years ago

Same here, FF 95.0.2 crashes every time I close out of a google meeting with OBS 27.1.3 with virtual camera running at 1280x720. I'm not sure if it's 100% of the time, but it's pretty close. obs-virtualcam-module64.dll | CFGControl::SetFirstVW is the culprit in FF crash reports, EXCEPTION_ACCESS_VIOLATION_EXEC

SD777x commented 2 years ago

Here is one of my crashes: https://crash-stats.mozilla.org/report/index/16bd3089-8c58-4407-9cec-85ca30220104

notr1ch commented 2 years ago

Does it still occur if you run OBS at 1920x1080?

SD777x commented 2 years ago

Does it still occur if you run OBS at 1920x1080?

Yes it occurs if I run the virtual camera at 1920x1080 and 1280x720.

crossan007 commented 2 years ago

https://crash-stats.mozilla.org/report/index/88da9854-bc4e-4e99-a14f-78b930220111

and https://crash-stats.mozilla.org/report/index/df95af1b-af6d-4c15-a7ce-9812e0220111

Used Google Meet for a little over an hour; Firefox crashed immediately upon clicking end call. OBS base and output resolutions are 1920x1080

crossan007 commented 2 years ago

Another crash https://crash-stats.mozilla.org/report/index/a70c755a-0c67-4ec8-922c-c50c80220113

Used Google Meet for a little over an hour; Firefox crashed immediately upon clicking end call. OBS base and output resolutions are 1920x1080

RytoEX commented 2 years ago

Another crash https://crash-stats.mozilla.org/report/index/a70c755a-0c67-4ec8-922c-c50c80220113

Is this with the same steps-to-reproduce as your previous comment? Please include that information when submitting crash reports. Thanks!

crossan007 commented 2 years ago

Yes; same steps. Updated comment.

crossan007 commented 2 years ago

Frame 0 and 1 seem to always be consistent (pasting here for search visibility)

Frame 0: obs-virtualcam-module64.dll@0x000000000000e247 Frame 1: void CFGControl::SetFirstVW(struct IVideoWindow*)

Frame 2 seems to be variable (probably a memory address?:)

and then the rest of the stack seems consistent as well:

Frame 3: webrtc::videocapturemodule::VideoCaptureDS::~VideoCaptureDS() Frame 4: rtc::RefCountedObject<webrtc::videocapturemodule::VideoCaptureDS>::~RefCountedObject() ...

RytoEX commented 2 years ago

@crossan007 @SD777x Your reports resemble Mozilla Bug 1707060 more than the original crash for this Issue, Mozilla Bug 1691902. See the following in each crash stack:

xul.dll webrtc::videocapturemodule::VideoCaptureDS::~VideoCaptureDS
xul.dll rtc::RefCountedObject<webrtc::videocapturemodule::VideoCaptureDS>::~RefCountedObject
xul.dll rtc::RefCountedObject<webrtc::videocapturemodule::VideoCaptureDS>::Release

As previously discussed in this Issue, I'd recommend commenting there. If the Mozilla folks monitoring that bug think that what you have experienced is actually an OBS bug, I'd recommend opening a new GitHub Issue for the crash you're experiencing to keep that tracked separately from this one. If you do open a new GitHub Issue with us, please make sure to include an OBS log from a session where the issue occurred.

Please include as much detail as you can regarding the steps-to-reproduce when commenting on Mozilla Bug 1707060 or if you end up opening a new GitHub Issue with us.

  1. Can you reproduce it in a test, or does it have to be a live call?
  2. Does it have to be a call with a certain length/duration, or does it happen with short calls under a minute?
  3. Does it happen when you're the participant, organizer, or both?

Having said that, we've thus far been unable to reproduce the crash you've described in Firefox 95.0.2 or 96.0.1.

jp9000 commented 2 years ago

I neglected to follow up on this post with an update. Quite some time ago, we were able to reproduce a crash that we suspect was this one and determined the cause. The crash that we saw happens when the camera is being stopped/unloaded.

Something in Firefox (or maybe the WebRTC library or elsewhere) is fully unloading the DLL without waiting for the camera's filter to finish being destroyed. The destroy is a synchronous operation so it means that there's likely some race condition in Firefox or one of its libraries.

The virtualcam module stops its thread in its filter destructor by signaling the thread to stop and then joining the thread. However, that thread can take some number of milliseconds to stop because it may be waiting or sleeping, thus it doesn't always immediately stop, and thus can possibly stall for some number of milliseconds while waiting for the thread to join. It seems that Firefox is not waiting for the filter to destruct, and in some other thread, UnloadLibrary is called on the virtualcam module DLL, causing data that the destructor uses to become invalidated before it can finish, and thus causing a crash when code tries to use that data.

Basically, there's a likely a race condition where in Firefox which will unload the DLL regardless of whether it's still waiting to destroy the filter elsewhere.

I originally tried to patch our DLL to not wait in the filter destructor, but the filter is designed to spawn a thread on construction and cleanup on destroy, rather than spawn a thread on filter start and kill the thread on filter stop. When I first tried to fix it, the fix to this was not as trivial as it sounds, and I unintentionally introduced more bugs which were even worse than Firefox's race condition. So I had to revert it at the time because I had a pressing release to make, and thus the bug is still present and I need to go back to it for a more thorough patch, but ultimately I do not believe that the bug is technically the fault of OBS. It's just triggering a race condition bug that something inside of Firefox itself has due to its somewhat unconventional design.

I still intend to fix that design flaw, I've just been pulled left and right constantly and it unintentionally fell by the wayside yet again. But again, I do want to emphasize that I believe that this is a bug with something within Firefox itself (not sure if it's WebRTC or what exactly is the cause, I don't recall hearing about this happening on Chrome but I could be wrong).

gcp commented 2 years ago

Firefox doesn't know anything about OBS at all, let alone this DLL, so this analysis seems a bit weird to me. We can't call UnloadLibary on a DLL we don't even know the existence of.

Firefox just uses the DirectShow Capture API, and OBS is presenting itself as a camera there. This is done in the WebRTC library which is shared with Chrome/Chromium/whatever else (and who had or have the same crash...): https://searchfox.org/mozilla-central/source/third_party/libwebrtc/modules/video_capture/windows

https://crash-stats.mozilla.org/report/index/7fd6b54d-f1e8-4b4f-a1fa-0eb4d0220619#allthreads Look for: Thread 70, Name: VideoCapture https://hg.mozilla.org/releases/mozilla-release/file/1ff2cec0bb36e389df1a209a9f882b443ed48495/third_party/libwebrtc/webrtc/modules/video_capture/windows/video_capture_ds.cc#l47

Crashes happen when we release our (MSCOM) reference to the IBaseFilter* we were holding. So we aren't freeing or unloading anything, we're signaling to Windows we are no longer using a capture device, i.e. reducing the reference count on it.

The code literally does

someIBaseFilterPtr->Release();
someIBaseFilterPtr = nullptr;

It's quite conceivable that if this is the last reference, the Windows DirectShow/COM machinery starts cleaning up but that isn't anything we have any control over!

As far as the WebRTC code knows, it's asking Windows for anything that supports CLSID_VideoInputDeviceCategory, gets a reference to it, and then drops the reference when it is no longer required (at this point OBS, and only OBS, sometimes crashes). It doesn't explicitly load or unload any DLL's, that's all OBS/DirectShow/Windows.

(This is potentially a different issue than was first reported here, which was a mis-sized buffer in OBS, and a different one the Chrome team complained about, which is frame callbacks being called even after the capture was stopped)

randomascii commented 2 years ago

Given that the DLL is unloaded when Firefox releases our MSCOM reference to IBaseFilter* that presumably means that when OBS starts up it needs to increment its own reference count while it has threads running, or something similar. I'm not a COM expert but the problem of COM objects with independently running threads must be a common issue with standard solutions.

From a bit of searching a few options cropped up: 1) Call LoadLibrary on yourself, and don't call UnloadLibrary until you are not executing any code from yourself anymore (tricky, since that final UnloadLibrary call would presumably be in the library) - https://devblogs.microsoft.com/oldnewthing/20191127-00/?p=103153 2) Increment the reference count for each thread - each reference should increment the reference count, which is obvious when you say it out loud: https://devblogs.microsoft.com/oldnewthing/20081103-00/?p=20353 3) Implement DLLCanUnloadNow - https://docs.microsoft.com/en-us/windows/win32/api/combaseapi/nf-combaseapi-dllcanunloadnow and https://pubs.opengroup.org/onlinepubs/009899899/CHP06GDC.HTM and https://www.ooportal.com/basic-com/module3/implementDllCanUnloadNow.php

I'm not a COM expert so I'm not sure how well these different techniques work. The tricky part is that if you increment your own reference count then you need to decrement it (or call UnloadLibrary to decrement the DLLs reference count) and that means that you are running your code that will cause your code to unload, which is racy.

I think that DLLCanUnloadNow is probably the simplest and safest solution but I'm not sure.

jp9000 commented 2 years ago

@gcp

I completely understand your defensiveness, and to be clear, I would like to point out that I did try to very carefully say "something in Firefox"; I was implying the Firefox process holding the filter. I should have stated that a bit more clearly. Indeed, it's completely possible maybe it could be Windows DirectShow/COM machinery code, however it does not really change the fact that a race condition has to exist somewhere outside of the virtual camera module code. I'll make some speculations a bit later about this.

And yes, as I stated it happens during the destruction of the filter, thus it would be in the Release() call of the final reference, where ever that may be. So the code you pointed out is likely where it could crash indeed if it was the last reference holder. However, to state the issue again, something somewhere else is unloading the library before that Release() call there completes, assuming it's the final reference. It really has to be specifically UnloadLibrary() somewhere. That was what we discovered. I am simply felt it was important to report that.

I already acknowledged the flaw that the virtual camera module has by using the construction/destruction rather than the start/stop calls, but the library being unloaded during destruction is something that I felt was still worth reporting here. If it seemed like I was trying to place any specific blame for this I apologize; that is not my intent, so please do not take it that way. That being said, that condition will still remain as a possibility regardless of whether I change the destruction logic or not, so again, I felt that it was very important that I note it here, because that is not something that the virtual camera module can control, and it will always be a latent race condition that exists somewhere.

Something, somewhere in a Firefox process that uses the filter, is unloading the virtual camera module library before is has a chance to finish destroying, invalidating memory being used during destruction, causing the crash. Whether that's Firefox itself, one of its dependencies, or something deeper in Microsoft, that was something I found to be the case.

As for the one the Chrome team complained about, that would likely be due to the the design flaw as well. I take responsibility for the design flaw there because I'm the one who wrote that. Sadly I made things a little bit difficult for myself, otherwise I would have fixed it last patch as I had intended rather than have delayed it for the next patch. I can only blame myself for that one.

I intend to fix the start/stop logic for the next patch, and while this may remedy the situation I'm reporting, it will not solve the underlying UnloadLibrary() race condition that exists somewhere outside of the virtual camera module in the process using it. It was still important that I report it. I'm just putting extra emphasis on this here because I want people to understand that writers of filter modules have absolutely no control over libraries being unloaded, except maybe by forcing an extra LoadLibrary reference to itself on library initialization, but that would have the negative effect of forcing the library to always be loaded in memory until the process using it shuts down.

I don't think that anyone using the DirectShow API is going to be handling modules directly, and now that I think about it, I would agree that it is unlikely that any of the people writing any of that code is overtly unloading any libraries themselves (I hope). I think it may just be a threading issue somewhere with COM/DirectShow. COM is very specific about threading, and how COM behaves with multiple threads is dependent upon how COM is initialized. DirectShow is also notoriously picky about threading as well. Your speculation that it could be within the Microsoft machinery somewhere may indeed be correct, although it could also be that the Release() call is occurring in a thread other than the thread in which the object was created/used in conjunction with COM being initialized in mode that doesn't support that. Similarly, it's possible that DirectShow might have its own thoughts on the threading there, but I'd be more likely to lean towards COM. So now that I think about, the thing I'd be most likely to point my finger at is just threading usage with either COM or DirectShow. (Have I mentioned that I absolutely hate DirectShow passionately?)

However, that's just my idle speculation based upon my experience with COM and DirectShow; I have not browsed the WebRTC/Firefox code enough to know whether that's the case. If COM was properly initialized in multithreading mode, then it may very well be a latent bug in Microsoft's code somewhere that I've only triggered by happenstance. But either way, it may be worth checking COM initialization code relative to camera filters, and checking to see what threads the filter Release() calls can occur in, and making sure that the threads where those things are occurring make sense.

@randomascii

I appreciate the feedback. DLLCanUnloadNow might be helpful, however I think that just fixing the construction/destruction logic in the virtual camera module is all that needs to be done to remedy it. If I am unable to spend adequate time on this fix, I will use that or some other simple remedy to ensure that the library can't be unloaded until filter destruction is complete.

notr1ch commented 2 years ago

I'm somewhat of the opposite opinion, we're clearly doing something wrong if every other DLL seems to behave fine. There's no "latent bug in Microsoft's code", we must have some code that is violating the requirements of what a COM DLL is supposed to do somewhere.

Looking over the code again, we do seem to have a DllCanUnloadNow function which returns S_OK as long as nothing has called VCamFactory::LockServer which seems immediately wrong. If we're telling DShow that it's OK to unload our DLL, then we shouldn't be surprised when DShow unloads our DLL prematurely.

gcp commented 2 years ago

Thanks for the follow-up. I was surprised by the diagnosis but going back I see that I missed that this was bug 1707060 not bug 1691902 as I was coming from. Unfortunately the crash rate for bug 1707060 is fairly high (150-400 crashes per day). I understand that at this point 3 separate problems are multiplexed in this GitHub issue.

https://bugzilla.mozilla.org/show_bug.cgi?id=1707060#c1 and https://bugzilla.mozilla.org/show_bug.cgi?id=1707060#c17 are super-careful to not claim OBS is the one with the bug for this specific one :-)

That said, I think the high crash rate is because OBS is the most popular capture solution. However, if we'd have a generic bug in the lifetime management of DirectShow capture filters, we should have much more or other crashes with webrtc::videocapturemodule::VideoCaptureDS::~VideoCaptureDS in the stack. I'll try to verify that this is not the case, which could then confirm that at least OBS is special (in an unfortunate way) here.

Regarding DllCanUnloadNow, we don't call CoFreeUnusedLibraries() directly but we do certainly call CoUninitialize() which would then call that (AFAIK).