Closed fereeh closed 10 months ago
Keep in mind that a co_await expression is about async synchronization and not about async ownership. It's still the coroutine's responsibility to stabilize any references (including this) prior to suspension.
The coroutine code in this case is correct: it's just co_await winrt::resume_on_signal(handle) (which is what the sample code does, mind you), and that only destroys the signal_awaiter once the wait is over.
The issue is that if the threadpool wait fires immediately, the coroutine resumes and destroys the signal_awaiter before the suspending thread reaches this code (which seems to assume the threadpool wait won't fire immediately):
https://github.com/microsoft/cppwinrt/blob/297454ee285476f16bf11425bd60daf4593b66ee/strings/base_coroutine_threadpool.h#L405-L409
This seems like it could be a problem with the other awaiters that rely on SetThreadpoolXXX functions. Why not move this assignment to before SetThreadpoolXXX is called? I think @oldnewthing wrote this originally, so he probably knows why it was done that way.
Quoting cppreference for additional context:
Note that because the coroutine is fully suspended before entering awaiter.await_suspend(), that function is free to transfer the coroutine handle across threads, with no additional synchronization. For example, it can put it inside a callback, scheduled to run on a threadpool when async I/O operation completes. In that case, since the current coroutine may have been resumed and thus executed the awaiter object's destructor, all concurrently as await_suspend() continues its execution on the current thread, await_suspend() should treat *this as destroyed and not access it after the handle was published to other threads.
They have some sample code right below the quoted text that further illustrates the problem.
This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.
What's the maintainers' opinion on this? Would moving the assignment to before the SetThreadpoolXxx call be an acceptable solution?
It was simpler when I originally wrote it, and I'm not comfortable messing with it now. Unless @oldnewthing is available, I suggest you write your own resume_on_signal and move on. It's not hard to get it right if you keep things simple. 😊
I'm busy right now but I'll try to find time to look at this in the next few weeks.
Still a problem
The easiest way to reproduce this is to insert a sleep between SetThreadpoolX and the atomic access, to simulate the case where the threadpool thread fires before the atomic access runs. This applies to all of the resume_ functions.
The result is that m_state is accessed after the temporary awaitable object is destroyed. This is easily confirmed by ASAN, and also gives a runtime crash in debug mode (because the object is memset'd to 0xDD). It "works" in release because the 0xDD memset doesn't happen.
I believe the solution would be to set m_state to pending before doing the threadpool wait. Something like this:
    state expected = state::idle;
    if (m_state.compare_exchange_strong(expected, state::pending, std::memory_order_release))
    {
        WINRT_IMPL_SetThreadpoolWaitEx(m_wait.get(), m_handle, file_time, nullptr);
    }
    else
    {
        // fire the callback immediately
        int64_t now = 0;
        WINRT_IMPL_SetThreadpoolWaitEx(m_wait.get(), WINRT_IMPL_GetCurrentProcess(), &now, nullptr);
    }
I can PR it if this sounds like a good solution.
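For anyone who wants to see the ordering without Windows APIs, here is a portable sketch of the same idea. It is a model only, not the cppwinrt implementation: wait_model, start_wait and observed_state_when_fired are hypothetical names, and the arm callable is an assumed stand-in for the SetThreadpoolWaitEx call. The state is moved to pending first, and nothing belonging to the object is touched after the wait has been armed.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

enum class state { idle, pending, canceled };

// Hypothetical stand-in for the awaiter's shared state.
struct wait_model
{
    std::atomic<state> m_state{ state::idle };
};

// Hypothetical helper: `arm` models the call that arms the threadpool wait.
// It receives true for a normal wait, false if the wait should fire at once
// because the state was no longer idle.
template <typename Arm>
bool start_wait(wait_model& w, Arm arm)
{
    state expected = state::idle;

    // Publish "pending" BEFORE arming, mirroring the proposal above.
    bool const armed_normally = w.m_state.compare_exchange_strong(
        expected, state::pending, std::memory_order_release);

    // After this call the callback may already have run and destroyed `w`.
    arm(armed_normally);

    // Only a local is read here, never `w`.
    return armed_normally;
}

// Models a wait that fires immediately and destroys the object, the way a
// resumed coroutine destroys its temporary awaiter.
state observed_state_when_fired()
{
    auto w = std::make_unique<wait_model>();
    state seen = state::idle;

    start_wait(*w, [&](bool) {
        seen = w->m_state.load(std::memory_order_acquire);
        w.reset(); // the object is gone before start_wait returns
    });

    return seen;
}
```

In this model the callback already sees pending even though it runs before start_wait returns, and start_wait never dereferences the destroyed object.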
This use-after-free could also happen with winrt::impl::await_adapter, as it does suspending.exchange() after registering a callback to async.Completed (some theoretical IAsyncAction implementation could short-circuit and immediately call the callback being registered if the coroutine is already completed).
I agree that this is a problem. Now I have to reverse-engineer how "cancellation v2 (#1246)" works.
I have a solution that I'm about to open a PR for. This problem wasn't introduced by #1246 as far as I know, as the code was the same before. The PR extracted it into its own type, which is why it shows up in the git blame.
Opened #1342
I'm gonna look at fixing the PR this week
still a problem
If you're able to build cppwinrt yourself, you could try #1342 to make sure it solves your problem
Version
No response
Summary
We ran into an access violation when using winrt::resume_on_signal. It was caused by a race condition in signal_awaiter::create_threadpool_wait. The method accesses *this after scheduling a "transfer" of the coroutine handle across threads using a threadpool wait. By the time m_state is accessed, a worker thread may have already resumed the coroutine, destroying the signal_awaiter and eventually the coroutine state.
Reproducible example
No response
Expected behavior
No response
Actual behavior
No response
Additional comments
No response