Bug: Access violation when using winrt::resume_on_signal

fereeh commented 1 year ago

Version

No response

Summary

We ran into an access violation when using winrt::resume_on_signal. It was caused by a race condition in signal_awaiter::create_threadpool_wait. The method accesses *this after scheduling a "transfer" of the coroutine handle across threads using a threadpool wait. By the time m_state is accessed, a worker thread may have already resumed the coroutine, destroying the signal_awaiter and eventually the coroutine state.

Reproducible example

No response

Expected behavior

No response

Actual behavior

No response

Additional comments

No response

kennykerr commented 1 year ago

Keep in mind that a co_await expression is about async synchronization and not about async ownership. Its still the coroutine's responsibility to stabilize any references (including this) prior to suspension.

sylveon commented 1 year ago

The coroutine code in this case is correct - it's just co_await winrt::resume_on_signal(handle) (which is done in sample code mind you) which will only destroy the signal_awaiter once the wait is over.

The issue is that if the threadpool wait immediately fires, the coroutine execution resumes and destroys the signal_awaiter before reaching this code (which seems to assume the threadpool wait won't immediately fire): https://github.com/microsoft/cppwinrt/blob/297454ee285476f16bf11425bd60daf4593b66ee/strings/base_coroutine_threadpool.h#L405-L409

This seems like it could be a problem with the other awaiters that rely on SetThreadpoolXXX functions. Why not move this assignment to before SetThreadpoolXXX is called? I think @oldnewthing wrote this originally, he probably knows why it was done that way.

fereeh commented 1 year ago

Quoting cppreference for additional context:

Note that because the coroutine is fully suspended before entering awaiter.await_suspend(), that function is free to transfer the coroutine handle across threads, with no additional synchronization. For example, it can put it inside a callback, scheduled to run on a threadpool when async I/O operation completes. In that case, since the current coroutine may have been resumed and thus executed the awaiter object's destructor, all concurrently as await_suspend() continues its execution on the current thread, await_suspend() should treat *this as destroyed and not access it after the handle was published to other threads.

They have some sample code right below the quoted text that further illustrates the problem.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

sylveon commented 1 year ago

What's the opinions of maintainers on this? Is moving the assignment before SetThreadpoolXxx being called an acceptable solution?

kennykerr commented 1 year ago

It was simpler when I originally wrote it, and I'm not comfortable messing with it now. Unless @oldnewthing is available, I suggest you write your own resume_on_signal and move on. It's not hard to get it right if you keep things simple. 😊

oldnewthing commented 1 year ago

I'm busy right now but I'll try to find time to look at this in the next few weeks.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

sylveon commented 1 year ago

Still a problem

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

sylveon commented 1 year ago

The easiest way to reproduce this is to insert a sleep between SetThreadpoolX and the atomic access, to simulate a case where the threadpool thread fires before the atomic access runs. This is true of all resume_ functions

The result is that m_state is accessed after the temporary awaitable object is destroyed. This is easily confirmed by ASAN, and also gives a runtime crash in debug mode (because the object is memset'd to 0xDD). It "works" in release because the 0xDD memset doesn't happen.

I believe the solution would be to set m_state to pending before doing the threadpool wait. Something like this:

state expected = state::idle;
if (m_state.compare_exchange_strong(expected, state::pending, std::memory_order_release))
{
    WINRT_IMPL_SetThreadpoolWaitEx(m_wait.get(), m_handle, file_time, nullptr);
}
else
{
    // fire the callback immediately
    int64_t now = 0;
    WINRT_IMPL_SetThreadpoolWaitEx(m_wait.get(), WINRT_IMPL_GetCurrentProcess(), &now, nullptr);
}

I can PR it if this sounds like a good solution.

This use-after-free could also happen with winrt::impl::await_adapter, as it does suspending.exchange() after a registering a callback to async.Completed (some theorical IAsyncAction implementation could short-circuit and immediately call a callback being registered if the coroutine is already completed).

oldnewthing commented 1 year ago

I agree that this is a problem. Now I have to reverse-engineer how "cancellation v2 (#1246)" works.

sylveon commented 1 year ago

I have a solution that I'm about to open a PR for. This problem wasn't introduced by #1246 as far as I know, as the code was the same before. It got extracted into its own type by the PR, hence why it shows up in the git blame.

sylveon commented 1 year ago

Opened #1342

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

sylveon commented 1 year ago

I'm gonna look at the fixing the PR this week

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

fereeh commented 10 months ago

still a problem

sylveon commented 10 months ago

If you're able to build cppwinrt yourself, you could try #1342 to make sure it solves your problem

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

microsoft / cppwinrt