Closed jacobsa closed 1 year ago
By the way, I should mention that I discovered this because tsan reports it as a data race. And I think it's correct: clang has introduced a data race by putting a write after the call to Register, by which time another thread could be using the coroutine frame.
@ChuanqiXu9
I'm not sure whether this is related to a known bug where coroutines currently fail to cache TLS variables correctly.
@jacobsa Could you build clang from source? If yes, could you test it again after applying https://reviews.llvm.org/D125291 and https://reviews.llvm.org/D127383?
@ChuanqiXu9: just saw your comment after writing this. I'll try that shortly, but it may take me some time because I've never done it before. In the meantime here is some information about the IR—can you tell whether it's related based on that?
Here is an IR dump after each optimization pass, made with -mllvm -print-after-all -mllvm -filter-print-funcs=_Z6FooBarv. It was made with a Google-internal build of clang based on db1978b674, and the build settings might be slightly different from the Compiler Explorer link above.
You can see that in the version on line 3669 we still have the correct control flow:
store i8 1, i8* %44, align 8, !dbg !913
%45 = getelementptr inbounds %struct.Awaiter, %struct.Awaiter* %5, i64 0, i32 0, !dbg !914
%46 = load %struct.SomeAwaitable*, %struct.SomeAwaitable** %45, align 8, !dbg !914
%47 = icmp eq %struct.SomeAwaitable* %46, null, !dbg !915, !nosanitize !82
br i1 %47, label %48, label %49, !dbg !915, !nosanitize !82
48: ; preds = %37
call void @llvm.ubsantrap(i8 22) #14, !nosanitize !82
unreachable, !nosanitize !82
49: ; preds = %37
%50 = call i8* @_ZN13SomeAwaitable8RegisterENSt3__u16coroutine_handleIvEE(%struct.SomeAwaitable* noundef nonnull align 1 dereferenceable(1) %46, i8* %43) #2, !dbg !915
call void @llvm.dbg.value(metadata i8* %50, metadata !909, metadata !DIExpression()) #2, !dbg !910
call void @llvm.dbg.value(metadata %"struct.std::__u::coroutine_handle"* undef, metadata !916, metadata !DIExpression()) #2, !dbg !920
%51 = icmp eq i8* %50, null, !dbg !923
br i1 %51, label %52, label %53, !dbg !924
52: ; preds = %49
store i8 0, i8* %44, align 8, !dbg !925
br label %53, !dbg !927
However the SROAPass on line 3877 eliminates the stores, turning them into a phi node that selects false or true depending on the result of Register, which is later used to decide whether to call DidntSuspend:
25: ; preds = %21
%26 = call i8* @_ZN13SomeAwaitable8RegisterENSt3__u16coroutine_handleIvEE(%struct.SomeAwaitable* noundef nonnull align 1 dereferenceable(1) %19, i8* %11) #2, !dbg !910
call void @llvm.dbg.value(metadata i8* %26, metadata !907, metadata !DIExpression()) #2, !dbg !908
call void @llvm.dbg.value(metadata %"struct.std::__u::coroutine_handle"* undef, metadata !911, metadata !DIExpression()) #2, !dbg !915
%27 = icmp eq i8* %26, null, !dbg !918
br i1 %27, label %28, label %29, !dbg !919
28: ; preds = %25
br label %29, !dbg !920
29: ; preds = %25, %28
%30 = phi i8 [ 0, %28 ], [ 1, %25 ], !dbg !908
%31 = phi i8* [ %11, %28 ], [ %26, %25 ], !dbg !908
call void @llvm.dbg.value(metadata %"struct.std::__u::coroutine_handle"* undef, metadata !922, metadata !DIExpression()), !dbg !925
%32 = call i8* @llvm.coro.subfn.addr(i8* %31, i8 0)
%33 = bitcast i8* %32 to void (i8*)*
call fastcc void %33(i8* %31) #2, !dbg !890
%34 = call i8 @llvm.coro.suspend(token %22, i1 false), !dbg !890
switch i8 %34, label %52 [
i8 0, label %35
i8 1, label %46
], !dbg !890
35: ; preds = %29, %15
%36 = phi i8 [ %20, %15 ], [ %30, %29 ], !dbg !927
call void @llvm.dbg.value(metadata %struct.Awaiter* undef, metadata !928, metadata !DIExpression()) #2, !dbg !931
%37 = icmp eq i8 %36, 0, !dbg !933
br i1 %37, label %38, label %39, !dbg !935
38: ; preds = %35
call void @_Z12DidntSuspendv() #2, !dbg !936
br label %39, !dbg !938
The lack of a store is preserved up through the version on line 3068:
%14 = call i8* @_ZN13SomeAwaitable8RegisterENSt3__u16coroutine_handleIvEE(%struct.SomeAwaitable* noundef nonnull align 1 dereferenceable(1) %2, i8* %11) #2, !dbg !841
call void @llvm.dbg.value(metadata i8* %14, metadata !838, metadata !DIExpression()) #2, !dbg !839
call void @llvm.dbg.value(metadata %"struct.std::__u::coroutine_handle"* undef, metadata !842, metadata !DIExpression()) #2, !dbg !846
%15 = icmp eq i8* %14, null, !dbg !849
%16 = select i1 %15, i8* %11, i8* %14, !dbg !850
%17 = call i8* @llvm.coro.subfn.addr(i8* %16, i8 0)
%18 = bitcast i8* %17 to void (i8*)*
call fastcc void %18(i8* %16) #2, !dbg !832
%19 = call i8 @llvm.coro.suspend(token %13, i1 false), !dbg !832
switch i8 %19, label %32 [
i8 0, label %20
i8 1, label %26
], !dbg !832
20: ; preds = %9
call void @llvm.dbg.value(metadata %struct.Awaiter* undef, metadata !851, metadata !DIExpression()) #2, !dbg !854
br i1 %15, label %21, label %22, !dbg !856
21: ; preds = %20
call void @_Z12DidntSuspendv() #2, !dbg !857
br label %22, !dbg !860
But then on line 6111 CoroSplitPass takes this and introduces the incorrect unconditional store after Register returns:
%27 = call i8* @_ZN13SomeAwaitable8RegisterENSt3__u16coroutine_handleIvEE(%struct.SomeAwaitable* noundef nonnull align 1 dereferenceable(1) %19, i8* %13) #2, !dbg !858
%28 = getelementptr inbounds %_Z6FooBarv.Frame, %_Z6FooBarv.Frame* %14, i32 0, i32 5, !dbg !850
store i8* %27, i8** %28, align 8, !dbg !850
I'd appreciate anybody's thoughts about what could be done to prevent this.
@ChuanqiXu9 okay yes, I can reproduce this at 91ab4d4231e5b7456d012776c5eeb69fa61ab994:
> ./bin/clang++ -std=c++20 -O1 -fno-exceptions -S -mllvm --x86-asm-syntax=intel ~/tmp/foo.cc
> grep -A 5 Register foo.s
call _ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE@PLT
mov qword ptr [rbx + 24], rax
test rax, rax
cmove rax, rbx
mov rdi, rax
call qword ptr [rax]
I applied https://reviews.llvm.org/D125291 and https://reviews.llvm.org/D127383 in their current state and rebuilt clang, and still get the same result. I guess that makes sense—there is no TLS here.
Oh, sorry for misleading.
I think I get the problem. Long story short, your analysis (and the analysis of tsan) is correct. This is a (potential) miscompile.
Here is the reason:
auto *handle = coro.begin();
bool suspended = false;
func(handle); // we don't know the body
// ... uses of suspended ...
The key issues here are that:
- suspended is escaped too, so we shouldn't sink it.
- suspended lives in a structure that handle refers to.

I think we need to introduce something like CoroutineAA to provide this information. I will try to look at it.
And it wouldn't be done in a few days, so you probably need some workaround. Maybe something like DO_NOT_OPTIMIZE(...)?
I am far from an expert here, but I wonder if SROAPass should be forbidden from making optimizations of this sort across an llvm.coro.suspend?
This is not an option to me. The key reason why Clang/LLVM want to construct coroutine frames is about the performance. And in fact, there were many such bugs about coroutines, which could be fixed in one shot if we disable the optimizations. So our strategy is always to fix the actual issues. As a heavy user and developer of coroutines, I believe it should be the right choice since the performance is a key reason why we chose C++.
Yeah, I didn't mean disabling optimizations altogether. Just recognizing that this particular optimization shouldn't be performed for objects that span an llvm.coro.suspend.
It's probably more complicated than I realize. Thanks for looking; I look forward to seeing what fix you come up with. :-)
I have recently run into the same issue using clang 14.0.6. My conclusion is that the await_suspend function is inlined into the coroutine function, and as a side-effect, (in your case) the to_resume variable is converted from a stack variable to a coroutine state variable, which makes it unsafe to check after you have tried to give the coroutine away. A work-around is to tag the await_suspend function with __attribute__((noinline)). gcc (11.2.1) does not seem to have this issue.
GCC has far fewer coroutine bugs than clang, since all the coroutine-related work in GCC is done in the frontend, while in clang the middle end gets involved to optimize coroutines further.
I am aware that the support for coroutines is much more limited in gcc. That is why I am experimenting with clang. I love the fact that clang is able to fully inline non-recursive synchronous generators. Here are some code snippets that might help pinpoint the underlying issue (hopefully the same one observed by @jacobsa).
this code does not trigger the issue:
auto schedule(Executor &executor) {
struct [[nodiscard]] awaiter {
Executor &executor;
awaiter(Executor &executor_in)
: executor(executor_in) {}
bool await_ready() const noexcept { return false; }
void await_suspend(std::coroutine_handle<> handle) {
struct ResumeTask : Executor::Task {
std::coroutine_handle<> handle;
ResumeTask(std::coroutine_handle<> handle_in)
: handle(handle_in) {}
void run() override { handle.resume(); }
};
Executor::Task::UP task = std::make_unique<ResumeTask>(handle);
task = executor.execute(std::move(task));
if (task) {
throw ScheduleFailedException("rejected by executor");
}
}
void await_resume() const noexcept {}
};
return awaiter(executor);
}
this code triggers the issue if the noinline tag is removed:
auto try_schedule(Executor &executor) {
struct [[nodiscard]] awaiter {
Executor &executor;
bool accepted;
awaiter(Executor &executor_in)
: executor(executor_in), accepted(true) {}
bool await_ready() const noexcept { return false; }
bool await_suspend(std::coroutine_handle<> handle) __attribute__((noinline)) {
struct ResumeTask : Executor::Task {
std::coroutine_handle<> handle;
ResumeTask(std::coroutine_handle<> handle_in)
: handle(handle_in) {}
void run() override { handle.resume(); }
};
Executor::Task::UP task = std::make_unique<ResumeTask>(handle);
task = executor.execute(std::move(task));
if (task) {
// need to start with accepted == true to avoid race
// with handle.resume() from executor thread before
// await_suspend has returned.
accepted = false;
return false;
} else {
return true;
}
}
[[nodiscard]] bool await_resume() const noexcept { return accepted; }
};
return awaiter(executor);
}
The issue (according to TSAN) is that the local task variable ends up in the coroutine frame in the second version (but apparently not in the first). This may be caused by its entanglement with the accepted frame variable. It might get tagged with 'needs to be stored in the state since it might be used after the coroutine is suspended'. But in reality the variable needs to perform a reverse-escape from the coroutine frame into the stack in order to live long enough to be checked after the coroutine state has been destroyed by another thread.
Bug report 59221 has a different cause than this one, so ignore the mention above.
Now I feel this is more like a TSan bug/defect than a compiler/optimizer bug/defect. Here are my reasons:
According to the ABI requirement, given a coroutine frame pointer void *frame, functions other than the corresponding resume/destroy functions are only allowed to access the first 2 fields (the addresses of the resume/destroy functions). And if the user can be sure about the corresponding promise type, they are also allowed to access the corresponding promise at the address frame + 0xf.
In other words, in this example the function Register shouldn't access the variable suspended through the coroutine handle h by any means.
@jacobsa @havardpe Does the explanation make sense for you? If yes, I think it may be better to file another issue to tell the TSan guys to address this.
Are you saying this hinges on the type erasure of std::coroutine_handle<>, so that SomeAwaitable::Register isn't allowed to access suspended because it doesn't know the type of the promise? If so I don't think that argument applies; the same code is generated even when we make the promise type explicit.
so that SomeAwaitable::Register isn't allowed to access suspended because it doesn't know the type of the promise?
No. I mean SomeAwaitable::Register isn't allowed to access suspended by any means, no matter whether it knows the promise or not. This is the ABI requirement for coroutines. The ABI for the coroutine frame is as follows:
coro_frame {
void (*__resume_)(), // address of resume function
void (*__destroy_)(), // address of the destroy function
promise_type promise, // the corresponding promise
... // other necessary part
}
So it is clear that users are allowed to access the resume function, the destroy function and the promise (if they know the type). But they are not allowed to access the other necessary part. In fact, users can't legally access the other necessary part at all, unless they do some mad address calculation and dereferencing, for example:
std::coroutine_handle<> h;
int a = (int)(*(h.address() + 1024));
or
std::coroutine_handle<> h;
memset(h.address(), 0, 1000);
the above code is clearly not legal.
So I want to say that users (or SomeAwaitable::Register in this example) shouldn't and can't access the other parts of the coroutine frame (such as suspended). As a consequence, it is valid for the compiler to sink the store past the call to SomeAwaitable::Register. The problem here is that TSan fails to know this. How does this explanation sound to you?
As far as I can see, both issues (write ordering, local variable placement) boil down to this:
To work well in a multi-threaded environment, a coroutine needs to support being resumed in another thread before the await_suspend function has returned in the thread handling its suspension.
@ChuanqiXu9 Ah, I see what you mean. I think @havardpe has it right. This issue isn't about the fact that SomeAwaitable::Register may access Awaiter::suspended (it doesn't in my example). It's about the fact that SomeAwaitable::Register may cause the coroutine to be destroyed on another thread, and so it's not safe for FooBar to access the coroutine frame if Register returns a null result to signal this.
The generated assembly code gets this wrong. It unconditionally writes to the coroutine frame after Register returns, despite the C++ only conditionally writing based on the result of Register:
const auto to_resume = awaitable.Register(h);
if (!to_resume) {
suspended = false;
return h;
}
call SomeAwaitable::Register(std::__n4861::coroutine_handle<void>)
mov qword ptr [rbx + 24], rax
This is not correct, because Register may have passed the coroutine handle off to another thread, which could have destroyed it by the time Register returns. This is a use-after-free bug. And it was introduced by the compiler: it's not present in the original C++ code.
My proof from the original post about the write of suspended happening before the read of suspended was only to show that the input C++ doesn't have a data race, i.e. this is not the result of undefined behavior.
To work well in a multi-threaded environment, a coroutine needs to support being resumed in another thread before the await_suspend function has returned in the thread handling its suspension.
If my reading is correct, you're saying the coroutine FooBar may be destroyed before the await_suspend function has returned, is that right? If yes, this is not valid, since the spec says it is UB if we resume/destroy a coroutine which is not suspended.
It's about the fact that SomeAwaitable::Register may cause the coroutine to be destroyed on another thread, and so it's not safe for FooBar to access the coroutine frame if Register returns a null result to signal this.
This is not correct, because Register may have passed the coroutine handle off to another thread, which could have destroyed it by the time Register returns.
So these two sentences look like UB to me if my understanding is right.
If my reading is correct, you're saying the coroutine FooBar may be destroyed before the await_suspend function has returned, is my reading correct?
This is the problem, yes.
If yes, this is not valid. Since the spec says it is an UB if we resume/destroy a coroutine which is not suspended.
Could you please cite the standard in more detail if you think this is UB? My understanding is that the coroutine is suspended once it enters await_suspend. (If not, what is the definition of "suspended"?)
I don't think there is any alternative to this. There is no way for the implementor to synchronize on await_suspend having fully returned before it's allowed to destroy on another thread. Note that this isn't just destroying by calling std::coroutine_handle<>::destroy—this could be "destroying" just due to resuming the coroutine on another thread and having it run to completion. This is fundamental to how coroutines work.
Could you please cite the standard in more detail if you think this is UB? My understanding is that the coroutine is suspended once it enters await_suspend. (If not, what is the definition of "suspended"?)
http://eel.is/c++draft/coroutine.handle.resumption says that we can't resume/destroy a non-suspended coroutine.
And about the definition of suspended, hmmmm, according to http://eel.is/c++draft/expr.await#5.1 you are right, since the spec says the coroutine is suspended if await_ready returns false. However, in the implementation, we assume the coroutine is suspended once it has exited from the stack (in other words, after the return of await_suspend).
Luckily we found another defect report. Let me see how to fix it... I feel like it is a pretty tough one...
I don't think there is any alternative to this. There is no way for the implementor to synchronize on await_suspend having fully returned before it's allowed to destroy on another thread. Note that this isn't just destroying by calling std::coroutine_handle<>::destroy—this could be "destroying" just due to resuming the coroutine on another thread and having it run to completion. This is fundamental in how coroutines work.
I got your point, although we usually do this like:
void await_suspend(coroutine_handle<> h) {
get_scheduler()->schedule(h);
return;
}
The reason why we don't use coroutine_handle as the return type is that we found we get 2 more memory accesses in that form. (Just sharing information; not suggesting you refactor your code.)
And about the definition of suspended, hmmmm, according to http://eel.is/c++draft/expr.await#5.1, you are right since the spec says the coroutine is suspended if await_ready returns false. However, in the implementation, we assume the coroutine is suspended if it is exited from the stack (in another word, after the return of await_suspend).
This sounds like it's the heart of the bug; thanks for finding it!
Also, I think you already agree with me, but I do want to drive home the point that clang's current definition doesn't make sense. Here's your example of more typical code again:
void await_suspend(coroutine_handle<> h) {
get_scheduler()->schedule(h);
return;
}
But this code also doesn't synchronize on await_suspend having returned. It's totally possible for the interleaving to look like this:
1. Thread 1: await_suspend calls Scheduler::schedule.
2. Thread 1: Scheduler::schedule puts the handle on some queue.
3. Thread 1: Scheduler::schedule returns.
4. Thread 2: a work loop takes the handle off the queue and resumes it.
5. Thread 1: await_suspend returns.

If the definition is "a coroutine is suspended once await_suspend returns", then this is UB because a non-suspended coroutine was resumed in step (4). It's impossible for this definition to work, because there is no hook to find out when await_suspend returns.
Yeah, you're right, and we've thought about this before too. Although our previous explanation was "it is OK from the compiler's perspective since there is no code left after get_scheduler()->schedule(h);", it is indeed UB strictly by the spec.
In fact I am surprised that the coroutine is considered suspended during the execution of await_suspend, since it is not suspended and is actually still running... I mean there is an inconsistency between the design model and the implementation model, and we can't fix it perfectly. We can only work around it by simulating or pretending that the coroutine is suspended in await_suspend. And I believe it will be a tough time: both users and implementors may take a relatively long time to encounter, locate and fix the bugs...
In fact I am surprised that the coroutine is considered suspended in the execution of await_suspend, since it is not suspended and it is still running actually... I mean the there is inconsistency between the design model and the implementation model.

I leave it to you to say how hard it is to fix the implementation, but I'll defend the standard, since I think its definition makes sense:
- As discussed above, it's the only definition that makes sense if the coroutine may be resumed/destroyed on another thread.
- The code in await_suspend is not part of the coroutine; it's part of the promise glue. By the time it runs, the code in the coroutine itself has stopped running.
I got your point. Your words make sense from the perspective of users.
I mean the there is inconsistency between the design model and the implementation model
Isn't this exactly why the LLVM model has the llvm.coro.save intrinsic? My understanding was that the LLVM code calls llvm.coro.save after await_ready returns false, but before llvm.coro.suspend is called. And already with the llvm.coro.save call, the coroutine is considered to be suspended and might be resumed on a different thread?
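A rough sketch of that ordering (hand-written, simplified IR; the exact intrinsic signatures vary across LLVM versions):

```llvm
; after await_ready() returned false:
%save = call token @llvm.coro.save(ptr null)              ; coroutine counts as suspended from here
%next = call ptr @await_suspend(ptr %awaiter, ptr %hdl)   ; may publish %hdl to another thread
; ... symmetric transfer to %next ...
%sp   = call i8 @llvm.coro.suspend(token %save, i1 false)
```

On this reading, any frame write the optimizer introduces between @llvm.coro.save and @llvm.coro.suspend is suspect once the handle has escaped, which is the pattern the CoroSplit output above shows.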
Oh, yeah, my bad. I forgot it. Thanks for mentioning it. We can't say we lack the semantics in the implementation model since we have @llvm.coro.save.
I believe I found another instance of this in real code (with -O0). The bad generated code looks like this:
call 0xef740 ; auto k3::internal::CoroutinePromise<void, void>::AwaitTransformCommon<k3::internal::CondVarAwaitable*>(k3::internal::CondVarAwaitable*)::Awaitable::await_suspend(std::__u::coroutine_handle<void>) at coroutine_promise.h:719
mov rdi, qword ptr [rbp - 0x140]
mov qword ptr [rdi], rax
call 0xef420 ; std::__u::coroutine_handle<void>::address at coroutine_handle.h:50
mov rdi, rax
mov rax, qword ptr [rax]
add rsp, 0x1f0
pop rbp
jmp rax
This does the following:
1. Calls await_suspend to figure out what coroutine to resume next.
2. Loads the address of a std::coroutine_handle variable from the stack. In reality this address is on the coroutine frame, i.e. rdi is loaded with an address 152 bytes after the coroutine frame's address. The frame is being used as scratch space.
3. Stores there the handle returned by await_suspend.
4. Loads it back again (this is -O0).
5. Tail-calls the resume function.

The problem is that the generated code is using the coroutine frame as temporary storage after await_suspend has returned. This causes memory corruption if another thread has already destroyed the coroutine frame, as it does in my real test that fails due to this bug.
Note that this generated code is incorrect even by the already-incorrect definition of "suspended" discussed above, because await_suspend has already returned but the coroutine frame is still being used for scratch space.
Any chance of getting a fix for this issue? :-)
I see. I'll try to increase the priority of this issue, but I can't promise an exact date.
No worries, thanks in advance for your efforts!
Oh, yeah, my bad. I forgot it. Thanks for mentioning it. We can't say we lack the semantic in the implementation model since we have @llvm.coro.save.
This is already called before await_suspend, as per:
Update: Given this is pretty complex, I sent another two issues, https://github.com/llvm/llvm-project/issues/64151 and https://github.com/llvm/llvm-project/issues/64152, to make things clear.
Also I think the use-after-free issue is a different issue from the original report.
@jacobsa I'm not sure if the later reported use-after-free matches the new issue https://github.com/llvm/llvm-project/issues/64152. Could you give a check?
Also I want to mention that what @llvm.coro.save actually does is update the index of the suspend points. So the following code may cause data races:
struct UpdatePromiseAwaiter {
  template <typename Promise>
  void await_suspend(std::coroutine_handle<Promise> handle) {
    // operations to update the promise...
    async_may_resume(handle);
    // operations to update the promise...
  }
};
...
co_await UpdatePromiseAwaiter{};
co_await UpdatePromiseAwaiter{};
When we're in the await_suspend of the first co_await, the call to async_may_resume() may resume the current coroutine from another thread, which will then reach the second co_await. Then when it executes await_suspend(), it is possible that the await_suspend of the first co_await is not over yet. Then we're in a data race.
I don't think #64152 captures my report here. This report is about the fact that clang unconditionally makes a write that should be conditional. The example you should be looking at is this, which has no use-after-free bugs until Clang introduces one:
// operations to update the promise...
if (!maybe_resume_async(handle)) {
// maybe_resume_async said it won't resume, so we continue to update the promise.
// operations to update the promise...
}
Please have a look at the details in this comment. The write to the coroutine frame should be made if and only if Register returns null, but Clang makes it unconditionally. That's the bug: it loses the condition.
Got it. So the newly reported use-after-free problem is a new consequence of the original reproducer; I thought it came from another reproducer. So it still falls under https://github.com/llvm/llvm-project/issues/64151.
Given it is hard to fix the issue properly and any workaround may hurt the performance (also such pattern is rare), maybe it would be much better to add a warning if:
That certainly wouldn't fix my problem, and I also don't think there is any motivation for it aside from pointing out that Clang has a known bug. The code here is 100% legal as far as I can tell; Clang is making an incorrect transformation. We don't add warnings for code that Clang miscompiles in other areas, right?
That certainly wouldn't fix my problem, ..., Clang is making an incorrect transformation.
Yeah, I didn't argue for that. I just think this may be helpful for other developers.
We don't add warnings for code that Clang miscompiles in other areas, right?
It is rare but we did. For example, when we were not able to handle module partitions, we emitted an error directly "sorry, we're not able to support module partitions.".
@ChuanqiXu9 I am trying to wrap my head around this issue and I don't understand how the issues you filed decompose into this. Could you help clarify this for me?
- 64151 is clearly problematic, but in the original example there are no local variables in the coroutine, and await_suspend, where the miscompile happens, is an ordinary function. Is there some inlining happening at LLVM IR level that makes suspended a local variable inside a coroutine?
- Like @jacobsa mentioned before, in #64152 the code seems to be broken whereas the example from this issue looks correct.
In the original example, the awaiter produced in co_await SomeAwaitable(); is a local temporary. So its member suspended is a local variable of the coroutine.
It seems like member variable accesses inside await_suspend are only safe to optimize prior to the leaking of the coroutine_handle passed into the function parameter. Once that parameter has escaped (via assignment to an external variable or passing as parameter to an external function), it must be assumed that the awaiter object could be destroyed, and it is no longer safe to reorder accesses to its fields.
The above is true if the awaiter is a local in the coroutine. I am unsure whether this is necessary for an awaitable with external lifetime that does not produce a temporary awaiter (as a result of operator co_await() or await_transform()). In this case, the awaitable will not be destroyed during await_suspend, as its lifetime is not tied to the coroutine. Nevertheless, if a developer is performing subsequent assignments to a variable before and after the coroutine_handle escape, it is likely that they do not expect those writes to be coalesced, and thus we should still block the same optimizations.
A proposed solution: When the address of the coroutine promise pointed to by the coroutine_handle function parameter escapes from await_suspend, it should be treated as a compiler fence.
@jacobsa it seems that a simple workaround would be to insert a std::atomic_signal_fence(std::memory_order_seq_cst) prior to the call to awaitable.Register(h). Obviously this is not ideal, but it should have the same effect as what I'm proposing here.
Edit: Or, rather than treating it as a fence, perhaps instead, when the coroutine promise address escapes from await_suspend, we should act as if the this pointer (of the awaiter) has also escaped. This may be a simpler/more correct way to implement a fix.
Thanks for your thoughts. I'm far from an expert on how the optimizer works, but both of those solutions sound sufficient to me. Certainly for awaiters with automatic storage duration in the coroutine frame it's true that the promise address escaping implies that the this pointer for the awaiter may no longer be usable.
(I've never used an awaiter that isn't local to the coroutine frame so I haven't thought about that case. But I can't see how this would give an incorrect answer for them; at worst it would be a missed optimization.)
@ChuanqiXu9 I am trying to wrap my head around this issue and I don't understand how the issues you filed decompose into this. Could you help clarify this for me?
64151 is clearly problematic, but in the original example there are no local variables in the coroutine, and `await_suspend`, where the miscompile happens, is an ordinary function. Is there some inlining happening at the LLVM IR level that makes `suspended` a local variable inside a coroutine?
IIRC, the question may be addressed by @tzcnt.
> A proposed solution: When the address of the coroutine promise pointed to by the coroutine_handle function parameter escapes from await_suspend, it should be treated as a compiler fence.
It looks like that method wouldn't work, personally. The problem in my eyes is as follows. There are some necessary concepts about LLVM IR; to make it easier to read, I'll use pseudo LLVM IR code with some C code. And please keep in mind that there is very limited C++ semantics in LLVM IR. (In fact, some LLVM developers consider the C++ semantics in LLVM IR to be a defect. And technically, there shouldn't be any C++-specific semantics in LLVM IR.)
At first, the generated code looks like:
```llvm
%awaiter = alloc %Awaiter ... ; alloc a local variable of the awaiter
...
%coro.handle = call ptr @llvm.coro.begin(...) ; %coro.handle points to the coroutine handle; note that we don't know its exact structure
...
%save = call token @llvm.coro.save()
%to_resume = call ptr awaiter::await_suspend(%awaiter, %coro.handle) ; the first argument is the `this` pointer
resume %to_resume
call i8 @llvm.coro.suspend()
...
```
All is fine here. Then after inlining, the code looks like:
```llvm
%awaiter = alloc %Awaiter ... ; alloc a local variable of the awaiter
...
%coro.handle = call ptr @llvm.coro.begin(...) ; %coro.handle points to the coroutine handle; note that we don't know its exact structure
...
%save = call token @llvm.coro.save()
store true to %awaiter->suspended
%to_resume = call ptr SomeAwaitable::Register(%awaiter->awaitable, %coro.handle)
%to_resume.cmp = compare %to_resume with nullptr
if (%to_resume.cmp == false) {
    store false to %awaiter->suspended
}
%continuation = %to_resume.cmp ? %to_resume : %coro.handle
resume %continuation
call i8 @llvm.coro.suspend()
await.resume:
if (%awaiter->suspended)
    DidntSuspend();
```
Then this code meets the SROA pass, which splits a struct into several scalar variables to find optimization opportunities:
```llvm
%suspended = alloc bool ... ; the original %awaiter->suspended
%awaitable = alloc type of awaitable ; the original %awaiter->awaitable
...
%coro.handle = call ptr @llvm.coro.begin(...) ; %coro.handle points to the coroutine handle; note that we don't know its exact structure
...
%save = call token @llvm.coro.save()
store true to %suspended
%to_resume = call ptr SomeAwaitable::Register(%awaitable, %coro.handle)
%to_resume.cmp = compare %to_resume with nullptr
if (%to_resume.cmp == false) {
    store false to %suspended
}
%continuation = %to_resume.cmp ? %to_resume : %coro.handle
resume %continuation
call i8 @llvm.coro.suspend()
await.resume:
if (%suspended)
    DidntSuspend();
```
Then the key point comes. The optimizer thought that the value of `%suspended` is the same as `%to_resume.cmp`, so it chose to erase the variable `%suspended`:
```llvm
%awaitable = alloc type of awaitable ; the original %awaiter->awaitable
...
%coro.handle = call ptr @llvm.coro.begin(...) ; %coro.handle points to the coroutine handle; note that we don't know its exact structure
...
%save = call token @llvm.coro.save()
%to_resume = call ptr SomeAwaitable::Register(%awaitable, %coro.handle)
%to_resume.cmp = compare %to_resume with nullptr
%continuation = %to_resume.cmp ? %to_resume : %coro.handle
resume %continuation
call i8 @llvm.coro.suspend()
await.resume:
if (%to_resume.cmp)
    DidntSuspend();
```
Finally, in the coroutine split pass, we see that the value `%to_resume.cmp` (or `%to_resume`, actually) is used across a suspend point, so the compiler will try to put the value `%to_resume.cmp` into the coroutine frame:
```llvm
...
%coro.handle = call ptr @llvm.coro.begin(...) ; %coro.handle points to the coroutine handle; note that we don't know its exact structure
...
%save = call token @llvm.coro.save()
%to_resume = call ptr SomeAwaitable::Register(%coro.handle->awaitable, %coro.handle)
%coro.handle->to_resume.cmp = compare %to_resume with nullptr ; unconditional write into the coroutine frame!
...
```
The above is the complete story of the problem. The reason I think a fence won't help is that, according to this process, the problem is not related to reordering.
> Edit: Or, rather than treating it as a fence, perhaps instead, when the coroutine promise address escapes from await_suspend, we should act as if the `this` pointer (of the awaiter) has also escaped. This may be a simpler/more correct way to implement a fix.
My instinctive reaction is that it would be hard to do this conditionally, while it sounds possible to do it unconditionally. That is, we wouldn't act as if the `this` pointer of the awaiter escaped only when the coroutine handle escapes in `await_suspend`; we would always act as if it escaped. But I need more time to investigate the impact (do some benchmarking) and whether it is possible to do it conditionally.
Besides that, from the above process, I think the simplest workaround for programmers may be:
I think the condition may be clear enough.
There seems to be some good discussion going on in LLVM Discourse about how this might be fixed, so I will discuss only workarounds here:
I tried many things to prevent this miscompilation, and only found 2 working ways:

- marking `await_suspend` as `__attribute__((noinline))`
- marking `suspended` as `volatile`

None of these things prevented the sinking of the store to `suspended` to after the call to `Register`:

- a `std::atomic_signal_fence(std::memory_order_seq_cst);` before `awaitable.Register(h)`
- a `std::atomic_thread_fence(std::memory_order_seq_cst);` before `awaitable.Register(h)`
- making `suspended` a `std::atomic<bool>`

Using `volatile` seems to generate shorter and better code than `noinline`, so that's what I'll be going with until a compiler fix arrives. I'm kind of shocked to find a use for this keyword in such a way in 2023!
Please consider backporting to 16.0.x.
@avikivity: 17 is the only currently maintained branch.
Failed to cherry-pick: c4672454743e942f148a1aff1e809dae73e464f6
https://github.com/llvm/llvm-project/actions/runs/5970891644
Please manually backport the fix and push it to your github fork. Once this is done, please add a comment like this:
/branch <user>/<repo>/<branch>
/branch ChuanqiXu9/llvm-project/Resolve56301
/pull-request llvm/llvm-project-release-prs#644
@ChuanqiXu9 thanks very much for working on this. Your commit does fix my reproducer from this thread, but unfortunately it doesn't fix the real example in my codebase mentioned in this comment. I've filed #65018 with a new, even simpler, reproducer. Could you please take a look if you have some time?
I've hit what I think is a miscompilation bug in clang, where a write is moved in an illegal way that introduces a data race and/or use of uninitialized memory. Here is a test case reduced from my real codebase (Compiler Explorer):
The idea is that the awaiter is implemented by calling a `Register` function in a foreign translation unit that decides what to do:

- If the coroutine should be resumed immediately, it returns a null handle to indicate this.
- If the coroutine will be resumed later, it returns some other handle to resume now, for symmetric control. (Maybe `std::noop_coroutine()`.)

Further, when we don't actually wind up suspending we need `await_resume` to do some follow-up work, in this case represented by calling the `DidntSuspend` function. So we use a `suspended` member to track whether we actually suspended. This is written before calling `Register`, and read after resuming.

The bug I see in my codebase is that the write of `true` to `suspended` is delayed until after the call to `Register`. In the reduced test case, we have something similar. Here is what Compiler Explorer gives me for clang with `-std=c++20 -O1 -fno-exceptions`:
The coroutine frame address is in `rbx`. After calling `Register`, the returned handle is stored into the coroutine frame at offset 24 and then resumed (or the original handle is resumed if it's empty), and later in `[clone .resume]` the handle in the frame at offset 24 is compared to zero to synthesize the `if (!suspended)` condition.

But it's not safe to store the returned handle in the coroutine frame unless it's zero: any other value indicates that `Register` took responsibility for the coroutine handle, and may have passed it off to another thread. So another thread may have called `destroy` on the handle by the time we get around to writing into it. Similarly, the other thread may already have resumed the coroutine and seen an uninitialized value at offset 24.

I think this is a miscompilation. Consider for example that `Register` may contain a critical section under a mutex that hands the coroutine handle off to another thread to resume, with a similar critical section in the other thread synchronizing with the first. (This is the situation in my codebase.) So we have:

- The write of `suspended` in `await_suspend` is sequenced before the call to `Register` below it in `await_suspend`.
- The call to `Register` synchronizes with the function on the other thread that resumes the coroutine.
- That synchronization is sequenced before resuming the coroutine handle.
- Resuming the coroutine handle is (I believe?) sequenced before the call to `await_resume` that reads `suspended`.
- Therefore the write of `suspended` inter-thread happens before the read of `suspended`.

So there was no data race before, but clang has introduced one by delaying the write to the coroutine frame.
For what it's worth, I spent some time dumping IR after optimization passes with my real codebase, and in that case this seemed to be related to an interaction between `SROAPass` and `CoroSplitPass`:

- Until `SROAPass`, the write was a simple store to the coroutine frame, before the call to `Register`.
- `SROAPass` eliminated the write altogether, turning it into phi nodes that plumbed the value directly into the branch. The value was plumbed from before the call to `Register` to after it.
- `CoroSplitPass` re-introduced a `store` instruction, after the call to `Register`.

I am far from an expert here, but I wonder if `SROAPass` should be forbidden from making optimizations of this sort across an `llvm.coro.suspend`?