[BUG] SkiaSharp can call WM_PAINT handler and random COM callbacks when creating or obtaning objects

kekekeks commented 4 years ago

   Avalonia.Win32.dll!Avalonia.Win32.WindowImpl.WndProc(System.IntPtr hWnd, uint msg, System.IntPtr wParam, System.IntPtr lParam) Line 30  C#
   [Native to Managed Transition]  
   [Managed to Native Transition]  
   System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOneNoCheck(int millisecondsTimeout)  Unknown
   System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOne(int millisecondsTimeout)  Unknown
   System.Private.CoreLib.dll!System.Threading.ReaderWriterLockSlim.WaitOnEvent(System.Threading.EventWaitHandle waitEvent, ref uint numWaiters, System.Threading.ReaderWriterLockSlim.TimeoutTracker timeout, System.Threading.ReaderWriterLockSlim.EnterLockType enterLockType)  Unknown
   System.Private.CoreLib.dll!System.Threading.ReaderWriterLockSlim.TryEnterWriteLockCore(System.Threading.ReaderWriterLockSlim.TimeoutTracker timeout)  Unknown
   SkiaSharp.dll!SkiaSharp.HandleDictionary.RegisterHandle(System.IntPtr handle, SkiaSharp.SKObject instance)  Unknown
   SkiaSharp.dll!SkiaSharp.SKObject.RegisterHandle(System.IntPtr handle, SkiaSharp.SKObject instance)  Unknown
   SkiaSharp.dll!SkiaSharp.SKObject.Handle.set(System.IntPtr value)  Unknown
   SkiaSharp.dll!SkiaSharp.SKObject.SKObject(System.IntPtr handle, bool owns)  Unknown
   SkiaSharp.dll!SkiaSharp.SKPath.SKPath(System.IntPtr handle, bool owns)  Unknown
   SkiaSharp.dll!SkiaSharp.SKPath.SKPath()  Unknown

This is caused by all locks in .NET being alertable locks that can still run the message pump when waiting for a lock.

This behavior might lead to unexpected behavior in the user code or even deadlocks, since user code doesn't expect to receive WM_PAINT while calling SKPath constructor.

mattleibow commented 4 years ago

Having a look at this implementation of a non-blocking dictionary: https://github.com/VSadov/NonBlocking

mattleibow commented 4 years ago

More info:

Using a basic lock or Monitor also causes this issue. See this repro:

https://gist.github.com/retran/b57e4db1a173048c2cee49ac6d523fc2

toptensoftware commented 3 years ago

I just hit upon this issue and unfortunately this makes SkiaSharp unusable for my app.

Perhaps I'm missing something here, but the only way currently to prevent this issue on Windows is to not use the STA threading model. Unfortunately this is not possible for many desktop apps because:

OLE functionality requires the STA apartment model
Anything that uses Drag/Drop uses OLE and therefore needs STA
The SHBrowseForFolder API function requires STA
Some clipboard functionality requires STA

While some of these issues can be worked around, my app loads third party plugins that are usually written in C++ and often assuming the host is running STA. They also often use the above described functionality.

It should be noted that any Windows app using SkiaSharp and STA is at risk of deadlocking - if the GC happens to run while the lock is held, then it can easily go re-entrant and deadlock.

All that's to say I really need a reliable fix for this. I can't help but think the easiest immediate fix might be to switch the Windows build to use PInvoke and a simple Win32 critical section for the lock?

I'm going to make this change because I really need it but would rather not maintain a separate branch. Is this something you'd consider including in the official build until a proper solution can be developed?

noseratio commented 3 years ago

@toptensoftware if you control the synchronization context in your desktop app, one other workaround without patching SkiaSharp might be to install a custom STA-friendly synchronization context. It should request wait notifications (via base.SetWaitNotificationRequired()) and override SynchronizationContext.Wait to perform real non-pumping waits (eg, via Win32 WaitForMultipleObjects, which doesn't pump). Here's why this may work.

It should be possible to do this just for the scope of a particular SkiaSharp call, with try/finally to temporarily install a custom non-pumping SynchronizationContext, call SkiaSharp API, then restore the original synchronization context.

mattleibow commented 3 years ago

@kekekeks will this also fix the issue for you?

kekekeks commented 3 years ago

We are currently switching to a non-pumping SynchronizationContext before doing any calls to SkiaSharp on the UI thread and when handing WM_SIZE and WM_PAINT. So the app is switching between SkiaSharp-safe and STA-safe modes. It's not an ideal solution, but works for us for now.

toptensoftware commented 3 years ago

I wasn't aware the locks uses the sync context - I'll look into switching that, but it really seems like a horrible hack. All that overhead for the few instructions it takes to put something into, or get something out of a dictionary.

Isn't the whole point of these locks being alertable in the STA is so you don't get a deadlock situation for COM method calls. Unless something inside one of those HandleDictionary methods is doing COM stuff that needs to callback into the STA that's not an issue. The non-alertable critical section seems a much safer and probably faster solution to me.

noseratio commented 3 years ago

I wasn't aware the locks uses the sync context - I'll look into switching that, but it really seems like a horrible hack. All that overhead for the few instructions it takes to put something into, or get something out of a dictionary.

@toptensoftware, as bad as this hack feels, I think it's the only way to safeguard yourself from reentrancy issues, if you're using an STA thread with WindowsFormsSynchronizationContext (WinForms) or DispatcherSynchronizationContext (WPF). This is how pumping in blocking WaitOne calls was designed to work in .NET Framework (to enable COM calls and callbacks). Back in the day, I ranted about it here and I don't think much has changed in .NET Core since then.

The fix you propose might be fixing it for one particular place, but are you sure nothing else is calling WaitOne across the board, explicitly or implicitly?

toptensoftware commented 3 years ago

You're right of course, but I think for SkiaSharp this is particularly troublesome because:

I can't imagine most developers expect calls to SkiaSharp to block in a way that's re-entrant. That might be naive but it's just not the way you think when writing drawing/rendering code.
It's especially hard to notice and track down. I've had this bug in my app for years, have seen evidence of it in a couple of crash reports, but never been able to figure it out until yesterday when I stumbled upon a reproducible case under load.
The fact that the standard wait implementation dispatches WM_PAINT messages results in the worst possible outcome... your painting code is likely to become re-entrant because it's painting. And when it does SkiaSharp can deadlock.
The fact that the STA, desktop GUI apps and SkiaSharp are all so closely related and used together means they really should all work together nicely out of the box.

To say every STA desktop Windows app that uses SkiaSharp needs a custom sync context with switching whenever using Skia is a big ask.

toptensoftware commented 3 years ago

A couple more thoughts on this. My app uses a custom UI toolkit so isn't using WinForms or WPF's sync context but does have its own which until just now used the default implementation of Wait (the problematic one). As an experiment I just changed it a non-alertable wait and the re-entrancy is indeed solved - but of course the STA is now broken.

The question now becomes how to reliably switch between the two modes. The low hanging fruit is to switch around WM_PAINT, WM_SIZE and perhaps a few other key places but I'm far less certain about the possibly hundreds of other places where I might be invoking SkaiSharp. There's the Skia objects that controls create during their construction, theme loading where fonts and images are loaded, temporary objects used during measurement and layout, offscreen rendering that isn't done during paint. The list goes on and on...

The scope of this issue is quite daunting once you understand its implications and I'd much rather be able to rely on SkiaSharp not going re-entrant and then not have to worry about it.

To put it another way, I'm far more worried about missing a SkiaSharp invocation than I am about: "nothing else is calling WaitOne across the board, explicitly or implicitly?"

noseratio commented 3 years ago

I agree, ideally SkiaSharp would benefit from using some lower level synchronization primitives which aren't exposed to pumping. I think that Monitor.Enter/Monitor.Exit (which is used by C# lock { ... }) should be fine, although I haven't verified it.

That said, many other things apart from Skia might still be using the .NET blocking calls, which still do that limited amount of pumping by default. You're probably using Skia on tight rendering loops, so you've become affected by this issue in an easily reproducible way. For many other scenarios, that could be a hard to track bug, which might still occasionally pop up and sporadically affect users.

Luckily, because you have your own synchronization context, you have full control over this. You still pump messages in your core event loop (the top-level GetMessage/DispatchMessage location), so you should be fine with hard OLE drag&drop and other STA requirements. Then, disabling pumping for short-lived blocking WaitOne calls inside UI event handler (or something like a timer callback) shouldn't be a problem. That worked well for me in the past.

However, even if we disable undesired pumping for managed code (preventing it for WaitOne via a custom synchronization context), there still could be some 3rd party native code calling something like CoWaitForMultipleHandles, or just starting its own a nested message loop. That was pretty common back in COM days. I once had to resort to WH_GETMESSAGE hook just to ignore WM_PAINT, WM_SIZE etc if they arrive on a nested message loop (i.e., via any place that calls GetMessage/PeekMessage, other than my core event loop).

toptensoftware commented 3 years ago

Hi Andrew, thanks for your insight on this - that's really valuable info. Message hooks to filter paint messages sounds like too much fun.

This is all a bit of a mess, but for understood reasons. I think we agree SkiaSharp could improve things with a non-pumping lock, or at least a way to plugin your own sync mechanism.

Unless, or until this can be fixed SkiaSharp I'm tempted to just disable the pumping wait entirely. I feel like I'm more exposed by that than anything else the app or any plugins might do. But I'm sure that'll bite me somewhere else :)

noseratio commented 3 years ago

No worries Brad, glad if it helped... one of those evergreen subtleties of Windows Desktop development with .NET :)

kekekeks commented 3 years ago

Unless, or until this can be fixed SkiaSharp I'm tempted to just disable the pumping wait entirely.

Doing that on STA thread breaks .NET itself in various ways. We've initially tried to always use a non-pumping context and discovered that it breaks APIs like GC.WaitForPendingFinalizers, those just deadlock.

toptensoftware commented 3 years ago

Hi @kekekeks, thanks for the info, good to know.

noseratio commented 3 years ago

Doing that on STA thread breaks .NET itself in various ways. We've initially tried to always use a non-pumping context and discovered that it breaks APIs like GC.WaitForPendingFinalizers, those just deadlock.

@kekekeks I'm curious if you create (or obtain via interop) any COM objects on that STA thread?

I remember having this issue with WaitForPendingFinalizers and RCW wrappers for MSHTML COM objects. I think, the solution back then was to enable pumping just for the scope of WaitForPendingFinalizers call, while disabling WM_PAINT and input messages processing, with WM_SETREDRAW and EnableWindow(false). As horrible as it sounds, that was effective to avoid reentrancy, but later we just got rid of WaitForPendingFinalizers.

mattleibow commented 3 years ago

I think that Monitor.Enter/Monitor.Exit (which is used by C# lock { ... }) should be fine, although I haven't verified it.

@noseratio I think using lock also had this issue: https://gist.github.com/retran/b57e4db1a173048c2cee49ac6d523fc2

noseratio commented 3 years ago

@noseratio I think using lock also had this issue: https://gist.github.com/retran/b57e4db1a173048c2cee49ac6d523fc2

@mattleibow I might be wrong but I think in this particular case it's caused by Thread.CurrentThread.Join(100), where they call it on the main UI thread. This call does indeed pump:

                try
                {
                    Debug.Assert(Thread.CurrentThread.ManagedThreadId == mainThreadId);
                    throwOnPaint = true;
                    Thread.CurrentThread.Join(100);
                    Lock();
                }

I think, Monitor.Enter/Monitor.Exit should be mapping to Win32 EnterCriticalSection/LeaveCriticalSection directly, maybe with some added "spin-before-sleep" improvement, but I can't find anything confirming that in the dotnet repo right away. I might try modifying the code you linked to verify this theory :)

noseratio commented 3 years ago

@mattleibow you were right! TIL, plain c# locks (ie., Monitor.Enter/Monitor.Exit) also pump on an STA thread 👀 https://gist.github.com/noseratio/0d93133b826339c7eaf49c9feec142e3#file-winformsapp-cs-L60

Now I wonder if there's any managed blocking synchronization primitive that doesn't pump in this case, besides Thread.Sleep...

Edited: and here's where the pumping magic is happening: https://github.com/dotnet/runtime/blob/0c0bd5a4066bc9b1a165a45e3553dfa14b87729b/src/coreclr/vm/threads.cpp#L3342

mono / SkiaSharp

[BUG] SkiaSharp can call WM_PAINT handler and random COM callbacks when creating or obtaning objects #1383