Interop using GetOrCreate to create a CanvasVirtualBitmap eventually hangs

benstevens48 commented 5 years ago

I'm trying to create a CanvasVirtualBitmap from a ID2D1ImageSourceFromWic using interop with C++/WinRT. In fact I am basically using the code that I added to the documentation at the end of the interop article here, with some extra code added at the beginning to create the WIC bitmap source from a stream. I'm calling this many times (using co_await resume_background() and co_return to wrap the function so it is running on a background thread), and the calls may be simultaneous. Eventually I manage to create a situation where the app hangs indefinitely. The hang even occurs if I don't do anything with the created CanvasVirtualBitmap. It's difficult to reproduce consistently. It's only the GetOrCreate part that causes the problem - If I comment out this bit it works Ok. It also works without problems using CanvasVirtualBitmap.LoadAsync (instead of interop). I had a look at the GetOrCreate code, and notice it does some locking, so I wondered if that could be the issue, although I couldn't see any potential problems.

benstevens48 commented 5 years ago

I've done some investigation and I'm pretty sure this is a bug caused by the lock in ResourceManager::GetOrCreate (ResourceManager.cpp). I built a modified version of Win2D, replacing this

    std::lock_guard<std::recursive_mutex> lock(m_mutex);

    // Do we already have a wrapper around this resource?
    auto it = m_resources.find(resourceIdentity.Get());

    if (it != m_resources.end())
    {
        wrapper = LockWeakRef<IInspectable>(it->second);
    }

with this

    std::unique_lock<std::recursive_mutex> lock(m_mutex);

    // Do we already have a wrapper around this resource?
    auto it = m_resources.find(resourceIdentity.Get());

    if (it != m_resources.end())
    {
        wrapper = LockWeakRef<IInspectable>(it->second);
    }
    lock.unlock();

and this seemed to fix the problem.

Note the unlock(). Without the unlock, the lock remains in place for the whole function. This is not good because the function creates resources so may do a significant amount of work. Creating these resources may involve other locks (like accessing the device context pool etc). It's not hard to imagine this might cause deadlock, and in any case it is not ideal because it reduces concurrency of interop resource creation.

I do acknowledge that the original lock does prevent a resource from being wrapped twice if GetOrCreate is called for the same resource from different threads at the same time. So I think the solution is to either change the code to what I've got above and say that calling GetOrCreate for the same resource from different threads at the same time when the resource hasn't already been wrapped may result in multiple wrapped copies, or introduce a more complicated locking system. E.g. something like having a map of resources being created, each with their own separate lock which is released when they have been created, and not locking on the main shared mutex during resource creation.

I really hope you can fix this bug as it basically means no interop operations (going from Direct2D to Win2D) can be used reliably in multi-threaded scenarios.

A final note - I am not 100% sure this is the underlying cause of the bug, but during testing I can consistently produce the error with the original Win2D code and cannot produce the error when using my modified code.

shawnhar commented 5 years ago

Hi,

Releasing the resource manager lock earlier is not correct because GetOrCreate absolutely needs to work in a threadsafe manner. The operation of checking for an existing wrapper, creating a new one if it does not already exist, and returning the wrapper, must be atomic, so it's correct that the entire thing takes place inside the lock.

In general, creating Win2D wrapper objects is extremely lightweight and I'm not sure how this could result in deadlocks. Virtual bitmaps do a little more work on construction than most wrappers (to determine bounds of the image) but it's not obvious to me why this would be a problem.

Are you able to debug the state of locks when your hang occurs, to see which threads are blocking and on what? (it might be helpful to locally build a debug flavor of Win2D for this kind of debugging, but even the release binaries in the NuGet package will be able to show some info about which threads are doing what)

benstevens48 commented 5 years ago

Hi Shawn,

Ok, I added OutputDebugString before and after the lock statement in the GetOrCreate function and also in the Remove function (in ResourceManager.cpp) and also just before the return statement in these functions saying 'Acquiring', 'Acquired' and 'Releasing' respectively, and I think this confirms that there is a deadlock situation going on. This is the output I get (note I also outputted the thread id after the acquiring strings)

GetOrCreate lock aquiring 21052
GetOrCreate lock aquired
Remove resource lock acquiring 19528
GetOrCreate lock aquiring 19820 (this then repeats with different thread ids).

The last two locks are never acquired and the app then hangs. It seems likely to me that this is a deadlock situation. I can imagine how this happens. The CanvasVirtualBitmap constructor makes a call to get the bounds of the image source, which perhaps results in an internal Direct2D lock.

Now imagine that a resource is being disposed at the same time, which calls the Remove method of ResourceManager. If it also has a Direct2D lock at the same time (either internal or an explicit ResourceLock), then this would result in the deadlock situation above.

I agree that there needs to be protection against wrapping the same resource twice, but it should not rely on the main shared mutex. The create function tries a lot of different functions and you need to be sure that none of these could result in deadlock, which I think in fact they probably can. Therefore I think what you should do is have a more complicated locking system. Something like this pseudocode

//Define CreatingResourceObject =  struct{recusive_mutex mutex, HResult error, ComPtr<IInspectable> result} or similar

map<IUnknown *, CreatingResourceObject> ResourceManager::creating_map;

ComPtr<IInspectable> ResourceManager::GetOrCreate(ICanvasDevice* device, IUnknown* resource, float dpi)
{
    ComPtr<IUnknown> resourceIdentity = AsUnknown(resource);
    ComPtr<IInspectable> wrapper;

    std::unique_lock<std::recursive_mutex> lock(m_mutex);

    // Do we already have a wrapper around this resource?
    auto it = m_resources.find(resourceIdentity.Get());

    if (it != m_resources.end())
    {
        wrapper = LockWeakRef<IInspectable>(it->second);
    }

    if(wrapper){
    lock.unlock();
    ValidateDevice(wrapper.Get(), device);
    ValidateDpi(wrapper.Get(), dpi);
    return wrapper;
}
auto creatingMatch = creatingmap.find(resourceIdentity.Get());
if(creatingMatch){
lock.unlock();
creatinglock = lock(creatingMatch.mutex);
return creatingMatch.result or throw error if creatingmatch.error
}
recursuive_mutex creatingmutex;
unique_lock creatinglock = lock(creatingmutex)
creatingresult = {creatingmutex, null, null}
creatingMap.add(resource, creatingresult)
lock.unlock();

    // Create a new wrapper instance?

        for (auto& tryCreateFunction : tryCreateFunctions)
        {
            if (tryCreateFunction(device, resource, dpi, &wrapper))
            {
                break;
            }
        }

        // Fail if we did not find a way to wrap this type.
        if (!wrapper)
        {
lock.lock()
creatingmap.remove(resource);
lock.unlock();
creatingresult.error=  Hresult(E_NOINTERFACE, Strings::ResourceManagerUnknownType)
creatinglock.unlock();
            ThrowHR(E_NOINTERFACE, Strings::ResourceManagerUnknownType);
        }

    // Validate that the object we got back reports the expected device and DPI.
try{
    ValidateDevice(wrapper.Get(), device);
    ValidateDpi(wrapper.Get(), dpi);
}catch{
lock.lock()
creatingmap.remove(resource);
lock.unlock();
creatingresult.error=  Hresult(E_NOINTERFACE, Strings::ResourceManagerUnknownType)
creatinglock.unlock();
}

lock.lock()
creatingmap.remove(resource);
lock.unlock();
creatingresult.result=  wrapper;
creatinglock.unlock();
    return wrapper;
}

Sorry, I just wrote that pseudo-code very quickly as an example! I hope you get the gist of it. Thanks for taking the time to look at this bug.

shawnhar commented 5 years ago

The behavior you are seeing does sound like a deadlock, but the design of holding a resource creator lock around the whole operation should not in theory lead to such things, because Win2D is a wrapper around D2D, which means that Win2D locks are always taken outside of any region that would hold a e D2D lock. Deadlock can only occur if there is lock inversion, which would require a Win2D lock to be taken /inside/ a region that held a D2D lock, but the design is for that to never occur.

Of course its entirely possible we got something wrong, but I'd like to get to the bottom of what that is before changing the lock design. This should be pretty easy to see in the debugger (including from a postmortem dump) after you hit the hang - it'll show what locks are held, what is trying to be acquired, and what callstack led to each of these.

benstevens48 commented 5 years ago

Hi Shawn,

Thanks for the debug suggestion. I took me a while to work out how to see the info, but in the end it was very helpful. I can now confirm it is a deadlock, and the inversion happens because of the explicit D2DResourceLock that you take in a few places that enters the D2D Factory critical section. How the deadlock happens is as follows.

Thread 1 calls ResourceManager::GetOrCreate and acquires a lock on ResourceManager::m_mutex, but doesn't get as far as the constructor for the CanvasVirtualBitmap.
Thread 2 calls an instance of CanvasSwapChain.CreateDrawingSession. This then enters the D2D critical section via D2DResourceLock lock(d2dDevice.Get()). It then calls CanvasDrawingSession::CreateNew, which in turn ends up calling ResourceManager::Add. This tries to lock on the ResourceManager::m_mutex, but Thread 1 has the lock so it is now waiting for Thread 1 to release this lock.
Thread 1 continues and eventually calls the CanvasVirtualBitmap interop constructor, which calls deviceContext->GetImageLocalBounds. However, all Direct2D calls on the factory are blocked until Thread 2 leaves the critical section, hence Thread 1 is now blocked.
Deadlock ensues!

To back up that this is what is happening, I found the following when debugging.

Several threads were waiting on the ResourceManager::m_mutex inside ResourceManager::GetOrCreate, but this is a side-effect I think.
One thread was waiting on ResourceManager::m_mutex inside ResourceManager::Remove, but this is a side-effect I think.
One thread was stuck in GetLocalBounds in the CanvasVirtualBitmap constructor after having called GetOrCreate - this is key
One thread was waiting on ResourceManager::m_mutex inside ResourceManager::Add, and further back in the callstack was CanvasSwapChain.CreateDrawingSession - this is key.

In my inexperienced opinion the solution is something like I described in the previous comment. Since every resource on both creation and deletion has to acquire a lock on ResourceManager::m_mutex, I would make sure this lock is held for as minimal time as possible. If you're talking about inversion, then I would say it's an inversion to acquire any other lock whilst holding this lock. You could also look at changing the locking behaviour of CanvasSwapChain.CreateDrawingSession if you think that's necessary or if you consider that to be an inversion (it is effectively holding a Win2D lock inside a D2D lock).

I hope this info helps!

shawnhar commented 5 years ago

Excellent debugging there! That's a convincing explanation, and gnarly problem.

I'm reluctant to make the resource manager locking significantly more complicated, although am having a hard time explaining why. Mostly I think I just have a bad reaction to the principle of complex locking systems! As you've found here, even very simple locks can lead to surprising and hard to identify corner cases, and in my experience increasing complexity is rarely the best way past that :-)

The core mistake here is that resource manager was written on the assumption that wrapping an existing D2D object was purely a Win2D object construction, which would never call back into D2D and thus acquire D2D locks inside the Win2D one. This is obviously not true for CanvasVirtualBitmap though!

I think my preferred approach will be to review the wrapper construction code to identify if there are other places that may have this problem. If it's a large set, there may be no other option than a more complex lock design, but if it's just CanvasVirtualBitmap it should be possible to rework that implementation to avoid D2D usage during wrapper construction.

We should also have tests to validate this for every resource type, although those may be a pain to write The test infrastructure Win2D uses isn't great at multithreaded test workloads, and mocking (which we rely on for most tests) wouldn't capture the important aspect of what locks D2D takes internally.

shawnhar commented 5 years ago

Also, understanding this root cause suggests a simple workaround: wrap your CanvasVirtualBitmap GetOrCreate calls inside a CanvasDevice.Lock. Hacky, but should unblock you to get this app in a shippable state without having to wait for a full Win2D bugfix.

benstevens48 commented 5 years ago

Hi Shawn, If you want a much simpler locking system that I think would also solve the problem, then something like this should do it (basically give the GetOrCreate function its own lock).

std::recursive_mutex ResourceManager::m_getOrCreateMutex; //Should be defined in the header as a static member.

ComPtr<IInspectable> ResourceManager::GetOrCreate(ICanvasDevice* device, IUnknown* resource, float dpi)
{
    ComPtr<IUnknown> resourceIdentity = AsUnknown(resource);
    ComPtr<IInspectable> wrapper;

    std::lock_guard<std::recursive_mutex> getOrCreateLock(m_getOrCreateMutex);
    std::unique_lock<std::recursive_mutex> lock(m_mutex); //Replaces old lock_guard so we can unlock it

    // Do we already have a wrapper around this resource?
    auto it = m_resources.find(resourceIdentity.Get());

    if (it != m_resources.end())
    {
        wrapper = LockWeakRef<IInspectable>(it->second);
    }

    lock.unlock();

    // Create a new wrapper instance?
    if (!wrapper)
    {
        for (auto& tryCreateFunction : tryCreateFunctions)
        {
            if (tryCreateFunction(device, resource, dpi, &wrapper))
            {
                break;
            }
        }

        // Fail if we did not find a way to wrap this type.
        if (!wrapper)
        {
            ThrowHR(E_NOINTERFACE, Strings::ResourceManagerUnknownType);
        }
    }

    // Validate that the object we got back reports the expected device and DPI.
    ValidateDevice(wrapper.Get(), device);
    ValidateDpi(wrapper.Get(), dpi);

    return wrapper;
}

(That's only two or three lines of code to change).

benstevens48 commented 5 years ago

Just an update to say I've been using your suggested workaround without issue. I've realised that actually my suggestion above is in conflict with that workaround, so maybe it's not such a good suggestion. So the best potential fixes are as follows.

Implement your workaround in Win2D (by acquiring a Direct2D lock (CanvasDevice.Lock) immediately on entering the GetOrCreate function and holding it for the duration of the function).
Avoid D2D usage during wrapper construction as you previously suggested.

1 is simple in principle, although might be tricky in the cases where the device parameter is not required, but not totally optimal for concurrency. 2 might be simple enough (I haven't checked) and would be better for concurrency but you would need to make to clear in the code for future maintainers not to use D2D during wrapper construction!

microsoft / Win2D

Interop using GetOrCreate to create a CanvasVirtualBitmap eventually hangs #663