llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[coroutines] Wrong behavior of await_suspend() #98740

Open ddvamp opened 2 months ago

ddvamp commented 2 months ago

There are three versions of await_suspend(). The standard does not prohibit resuming the coroutine through the handle from these functions. However, if you do this in the version that returns bool, strange behavior occurs. If false is returned, the coroutine is resumed not from the actual suspend point, but from the point at which it entered the current await_suspend() call. The same behavior is observed in gcc. I guess this may be the result of an incorrect optimization.

A minimal example:

bool await_suspend(auto h) {
  h.resume();
  return false;
}

https://godbolt.org/z/KM63dooae
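
A fuller self-contained sketch of the same pattern (the godbolt link has the actual program; the Task type here is illustrative only):

```cpp
#include <coroutine>
#include <iostream>

// Minimal task type, just enough to form a coroutine (illustrative only).
struct Task {
  struct promise_type {
    Task get_return_object() { return {}; }
    std::suspend_never initial_suspend() { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
};

// The awaiter from the report: resume the coroutine through the handle,
// then return false.
struct Awaiter {
  bool await_ready() { return false; }
  bool await_suspend(std::coroutine_handle<> h) {
    h.resume();    // runs the coroutine up to its next suspend point
    return false;  // per [expr.await]/5.1.2 the coroutine is then resumed
  }
  void await_resume() {}
};

Task Coro() {
  std::cout << "first\n";
  co_await Awaiter{};              // suspend point 1
  std::cout << "second\n";
  co_await std::suspend_always{};  // suspend point 2, reached via h.resume()
  std::cout << "third\n";
}

int main() { Coro(); }
```

The expectation is that after `return false` execution continues from suspend point 2, the point the coroutine is actually suspended at; clang and gcc instead continue from right after suspend point 1, as if the manual h.resume() had never happened.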

llvmbot commented 2 months ago

@llvm/issue-subscribers-coroutines

Author: Artyom Kolpakov (ddvamp)

Fulgen301 commented 2 months ago

The standard does not prohibit the resuming of coroutines through the handle from these functions.

The standard however does prohibit resuming an already resumed coroutine.

[expr.await]/5.1.2 says:

Otherwise, if the type of await-suspend is bool, await-suspend is evaluated, and the coroutine is resumed if the result is false.

You're returning false in await_suspend, which leads to the coroutine being resumed, but you already resumed the coroutine manually by calling h.resume(), so this is UB.

ddvamp commented 2 months ago

The standard however does prohibit resuming an already resumed coroutine.

The standard does not prohibit resuming an already resumed coroutine. The standard prohibits resuming a non-suspended coroutine ([dcl.fct.def.coroutine#8.sentence-3]). That is different.

You're returning false in await_suspend, which leads to the coroutine being resumed, but you already resumed the coroutine manually by calling h.resume(), so this is UB.

The h.resume() call resumes the coroutine. The coroutine runs to the next suspend point, suspends, and returns to await_suspend(); it is then resumed again when false is returned. Everything is fine (everything should be fine).

ChuanqiXu9 commented 2 months ago

Yes, this sounds like a bug indeed.

almania commented 1 month ago

This one makes me a bit uncomfortable, as it's just so hard to imagine a place where a bool-returning awaiter would want to resume a potentially later suspend point of the same coroutine. The symmetric form, sure, and that's an easy drop-in replacement for anyone in this situation. The bool form is much more commonly used as "do an interlocked operation that may result in the handle being consumed, potentially concurrently, or else resume locally", typically in async_mutexes and the like, whereas this hybrid (it's potentially been consumed somewhere, but if so, we know it's suspended again now, and we want to resume it again) is just so hard to imagine the utility of.
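
For concreteness, a sketch of the two shapes being contrasted here (the awaiter names and the AsyncMutex API are hypothetical):

```cpp
#include <coroutine>

// Symmetric-form drop-in for the OP's situation: instead of manually resuming
// h and then returning false, hand the handle back so the compiler resumes
// the coroutine from whatever suspend point it is actually at now.
struct SymmetricAwaiter {
  bool await_ready() { return false; }
  std::coroutine_handle<> await_suspend(std::coroutine_handle<> h) {
    h.resume();  // drive the coroutine to its next suspend point
    return h;    // symmetric transfer: resume it from that new suspend point
  }
  void await_resume() {}
};

// Stand-in for an async mutex, declarations only (assumption, for illustration).
struct AsyncMutex {
  // Hypothetical: atomically acquire the lock, or enqueue h as a waiter.
  // Returns true if the lock was acquired (h was not enqueued).
  bool lock_or_enqueue(std::coroutine_handle<> h);
};

// The common bool-form use: an interlocked operation that either hands the
// handle off (after which it may be resumed concurrently) or resumes locally.
struct LockAwaiter {
  AsyncMutex& m;
  bool await_ready() { return false; }
  bool await_suspend(std::coroutine_handle<> h) {
    if (m.lock_or_enqueue(h))
      return false;  // we own the lock, nothing consumed h: resume right here
    return true;     // h was enqueued and may be resumed concurrently: stay suspended
  }
  void await_resume() {}
};
```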

And the costs are large: what was once a fall-through to the next state turns into a "branch to the top of the function and re-evaluate the entire switch", and/or the cost of trying to exploit the fact that we know it at least hasn't rewound. It would rarely be provable that the coroutine hasn't been resumed, due to memory barriers and the like, and yet we already have the symmetric form wherever this is needed, making the utility/gain of bool over that form mostly moot.

I personally, truly, think this is more of an underspecification: that the bool form means "the coroutine is near-suspended; return true to treat it as suspended, or false if not". This is how it naturally fits into frameworks, and it is how compilers currently handle a false result from final_suspend().await_suspend (as handle.destroy(), rather than handle.resume(), which would be invalid).

But on a strict reading, you're absolutely right: it's broken on gcc/clang. Then again, compilers ought always to treat false from the final-suspend awaiter as UB, since it's illegal to resume a coroutine that's actually, properly, suspended at its final suspend point. But then why permit the bool form in a final-suspend awaiter at all, or anywhere really, if it provides as little information to the compiler as this workaround would require. Anyway...
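
A sketch of the final-suspend case mentioned above (hypothetical names; this only restates how current compilers treat the false result):

```cpp
#include <coroutine>

// Hypothetical bool-returning final awaiter.
struct FinalAwaiter {
  bool await_ready() noexcept { return false; }
  bool await_suspend(std::coroutine_handle<>) noexcept {
    // The coroutine is at its final suspend point, so "resume it" would be
    // invalid. Compilers currently treat false here as "the coroutine did
    // not suspend": control flows off the end of the coroutine and its state
    // is destroyed, effectively handle.destroy() rather than handle.resume().
    return false;
  }
  void await_resume() noexcept {}
};

// Used from a promise as: FinalAwaiter final_suspend() noexcept { return {}; }
```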

ArsenArsen commented 1 month ago

I agree with you - I think the function should be considered non-suspended; it is probably better to file a DR about this.

(edit: I meant 'should be considered non-suspended')

ddvamp commented 2 weeks ago

I tried to contribute a DR, but was told that this is not a DR but rather a proposal to change the standard wording. You have two options: remove the optimizations from the compiler (including the handling of returning false in final_suspend), or write a proposal. I guess neither is in your best interest. Or is it?

Good. Let's lay out what we're talking about. 1) We want to keep the optimization for returning false from await_suspend, so that resuming the coroutine through h before returning becomes UB. 2) We want to allow returning false from final_suspend, so that there is no UB there (the opposite direction: removing UB), OR to disallow the bool form in final_suspend in the standard. Maybe there are some other points where the compiler does not follow the standard?

Could you please describe how important these optimizations are? What will happen if they are removed? Will existing code break? Will there be problems with the ABI? Will performance degrade a lot? Will it affect other parts of the compiler implementation? The motivation should be strong enough to justify adding one more point of undefined behavior to the language.

ddvamp commented 2 weeks ago

I apologize very much. Not having figured out who the actual developers of clang are, I took third parties' opinions as the organization's opinion and used that to frame the further discussion. As was pointed out to me, the only message from a clang member was that this is most likely a bug in the compiler. I will wait for further messages from the actual clang developers. Once again, I'm very sorry.

yuxuanchen1997 commented 2 weeks ago

You are contributing to valuable discussions. As you mentioned, this could either be changed in the compiler (and we currently don't know the cost of that) or be addressed by a change in the standard wording. We can collectively decide which direction is in our best interest. So really, there is no need to apologize.

yuxuanchen1997 commented 2 weeks ago

@ChuanqiXu9 Correct me if I am wrong. As far as I can tell, this is not the result of a deliberate optimization. The unoptimized code generated by the frontend is already wrong if the standard intended this to be defined behavior. This is a snippet of the coroutine, before coro split, involving one of its suspend points:

await.suspend:                                    ; preds = %invoke.cont
  %8 = call token @llvm.coro.save(ptr null)
  %9 = call i1 @llvm.coro.await.suspend.bool(ptr %ref.tmp5, ptr %5, ptr @Coro() (.__await_suspend_wrapper__await))
  br i1 %9, label %await.suspend.bool, label %await.ready

await.suspend.bool:                               ; preds = %await.suspend
  %10 = call i8 @llvm.coro.suspend(token %8, i1 false)
  switch i8 %10, label %coro.ret [
    i8 0, label %await.ready
    i8 1, label %await.cleanup
  ]

This branch skips the entire suspend logic if await_suspend returns false, so the coroutine is not suspended either. The standard calls for suspend-then-resume behavior: whatever happens in between, including changes to the state of the coroutine, needs to be honored. We are currently just skipping the suspending part as if no suspend had happened.

ChuanqiXu9 commented 2 weeks ago

@ChuanqiXu9 Correct me if I am wrong. As far as I can tell, this is not the result of a deliberate optimization. The unoptimized code generated by the frontend is already wrong if the standard intended this to be defined behavior. This is a snippet of the coroutine, before coro split, involving one of its suspend points:

await.suspend:                                    ; preds = %invoke.cont
  %8 = call token @llvm.coro.save(ptr null)
  %9 = call i1 @llvm.coro.await.suspend.bool(ptr %ref.tmp5, ptr %5, ptr @Coro() (.__await_suspend_wrapper__await))
  br i1 %9, label %await.suspend.bool, label %await.ready

await.suspend.bool:                               ; preds = %await.suspend
  %10 = call i8 @llvm.coro.suspend(token %8, i1 false)
  switch i8 %10, label %coro.ret [
    i8 0, label %await.ready
    i8 1, label %await.cleanup
  ]

This branch skips the entire suspend logic if await_suspend returns false, so the coroutine is not suspended either. The standard calls for suspend-then-resume behavior: whatever happens in between, including changes to the state of the coroutine, needs to be honored. We are currently just skipping the suspending part as if no suspend had happened.

Thanks for the analysis! I think it is true that we never thought about such cases. As for how to fix it, my instinctive reaction is that when llvm.coro.await.suspend.bool returns false, we shouldn't go to %await.ready but to the dispatch block instead.

The tricky part here is that the dispatch block is generated in the CoroSplit pass, while we're talking about incorrect code in the frontend. So maybe we need to invent some markers to represent the dispatch block in the frontend. Alternatively, we could detect the call to @llvm.coro.await.suspend.bool plus the br pattern, but I am not a fan of pattern matching. It would be prettier if we could represent the dispatch block in the frontend. But after all, these are implementation details.

yuxuanchen1997 commented 2 weeks ago

Yeah. It's kinda unfortunate the LLVM presplit coroutine semantics seem to have overlooked this factor.

I was thinking about potentially introducing a new (noreturn) intrinsic to do an unconditional jump to the dispatch block and lower it during corosplit.

rnk commented 2 weeks ago

My understanding of coroutines is pretty shallow, but this feels like a DR to me, especially if GCC does the same thing. What does MSVC do in this case? Actually, it looks like it reroutes to the dispatch block when an await_suspend returns false: https://godbolt.org/z/fcqc9z8s9

So there is implementation divergence.

Thinking more about it, I think this is a spec bug. If a suspend point can re-enter a coroutine at the last resume point, that breaks all kinds of assumptions the compiler is supposed to be able to make about dataflow. Consider this modification of the original program: https://godbolt.org/z/dMWK88PYs

Handle Coro() {
    int v = 1;
    std::cout << co_await Awaiter{} << '\n';
    v = 2;
    co_await std::suspend_always{};
    v = 3;
    co_return v;
}

The specification is intended to permit the compiler to optimize things this way before looking at the suspension point checks:

Handle Coro() {
    // eliminate v entirely, eliminate all the dead stores
    std::cout << co_await Awaiter{} << '\n';
    co_await std::suspend_always{};
    co_return 3;
}

These are simple SSA optimizations (mem2reg). If this isn't a DR, then our presplit CFG isn't a proper model of actual program behavior. If we create edges from every "maybe suspend" point back to all other suspension points, that eliminates most of the value of having a presplit coroutine representation.

ChuanqiXu9 commented 2 weeks ago

My understanding of coroutines is pretty shallow, but this feels like a DR to me, especially if GCC does the same thing. What does MSVC do in this case? Actually, it looks like it reroutes to the dispatch block when an await_suspend returns false: https://godbolt.org/z/fcqc9z8s9

So there is implementation divergence.

Yeah, but not all implementation divergence is a defect. Still, given that the specification actually depends heavily on the implementation, let's try to discuss whether the current specification is good or not. If not, as mentioned in the above PR, we can try to write a paper for it. (Jens said clearly this may not be accepted as a DR, so if we want to change it, we need a paper.)

Thinking more about it, I think this is a spec bug. If a suspend point can re-enter a coroutine at the last resume point, that breaks all kinds of assumptions the compiler is supposed to be able to make about dataflow. Consider this modification of the original program: https://godbolt.org/z/dMWK88PYs

Handle Coro() {
    int v = 1;
    std::cout << co_await Awaiter{} << '\n';
    v = 2;
    co_await std::suspend_always{};
    v = 3;
    co_return v;
}

The specification is intended to permit the compiler to optimize things this way before looking at the suspension point checks:

Handle Coro() {
    // eliminate v entirely, eliminate all the dead stores
    std::cout << co_await Awaiter{} << '\n';
    co_await std::suspend_always{};
    co_return 3;
}

I feel we can still optimize this example, or do I misunderstand something? Or I guess you want to replace v = 2 and v = 3 with v++.

These are simple SSA optimizations (mem2reg). If this isn't a DR, then our presplit CFG isn't a proper model of actual program behavior. If we create edges from every "maybe suspend" point back to all other suspension points, that eliminates most of the value of having a presplit coroutine representation.

Besides the above example, that looks like a pretty good argument that the external edge from await.suspend.bool to the dispatch block may change the CFG. But I have two questions: (1) Is it really so serious that the new edge from await.suspend.bool to the dispatch block would make the current presplit model meaningless? I think that statement is overly strong. (2) And actually, from the user's perspective, I don't think that when await.suspend.bool returns false, the behavior should be different from await.suspend.handle returning the current handle. That said, we are already in such cases, and we should already have the problems if they really exist.

rnk commented 1 week ago

Regarding the questions: (1) Yes, introducing a dispatch block and adding edges from every resume/suspend point in and out of the dispatch block seems equivalent to dropping the presplit model altogether. It would effectively turn off most dataflow optimizations across suspend points. We could limit it to await.suspend.bool points, but that would be a pretty ugly special case. (2) I'm not sure I get it: if suspend.handle returns the current coroutine, is that not equivalent to resuming the current coroutine, i.e. symmetrically transferring to the current coroutine?

ChuanqiXu9 commented 1 week ago

(2) Then isn't that logically the same dataflow as jumping to the dispatch block of the current coroutine? Do I misunderstand anything?

(1) I still don't get the relationship between adding the new edge and dropping the presplit model. In the above design, the edge would only be added in the CoroSplit pass (since the dispatch block doesn't exist before the CoroSplit pass), and most dataflow optimization happens before that. So I think it won't block optimizations.

rnk commented 1 week ago

It sounds like you are proposing we add the extra edge during corosplit. I don't think it is sound to add edges later in the pipeline that aren't modeled in the presplit coroutine CFG, otherwise you can end up with optimizer-dependent behavior, as in this example: https://godbolt.org/z/EGfGv4PMd

The v++ operations are optimized away by early cleanup passes, so you won't get consistent behavior at O0 and O1, and I don't see how adding edges during corosplit can prevent that.
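
For reference, roughly the shape of that program (the godbolt link has the actual one; Handle and Awaiter are the same hypothetical types as in the earlier snippets):

```cpp
Handle Coro() {
  int v = 1;
  co_await Awaiter{};              // Awaiter may re-enter the coroutine via h.resume()
  v++;                             // these increments get folded to constants by
  co_await std::suspend_always{};  // early cleanup passes running on the presplit
  v++;                             // CFG, long before CoroSplit could add any edge
  co_return v;
}
```

Because those increments are already gone by the time CoroSplit runs, the observable result differs between -O0 and -O1, and adding edges during splitting comes too late to change that.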

ChuanqiXu9 commented 1 week ago

It sounds like you are proposing we add the extra edge during corosplit. I don't think it is sound to add edges later in the pipeline that aren't modeled in the presplit coroutine CFG, otherwise you can end up with optimizer-dependent behavior, as in this example: https://godbolt.org/z/EGfGv4PMd

But this example is a result of invalid behavior, isn't it? It is an example why the current behavior is bad.

I can't prove it is safe to add edges theoretically. My thought is, when the await.bool returns false, it should have the exactly the same behavior with when the await.handle returns the current coroutine. So if it is bad to do so when the await.bool returns false, it should be bad for the case that await.handle returns the current coroutine.

rnk commented 1 week ago

But this example is a result of invalid behavior, isn't it? It is an example why the current behavior is bad.

This example is meant to be a counter-example that cannot be "fixed" by adding edges during corosplit. The optimizations to turn v++ into constants occur on the pre-split representation, so adding edges during splitting won't affect them. This is meant to show how we cannot support resuming a coroutine during a maybe_suspend operation, and why the spec should label that as UB. The transforms that we want to do (mem2reg on v) are not sound if users find ways to re-enter the coroutine in ways that aren't consistent with the presplit CFG.

ChuanqiXu9 commented 1 week ago

This is meant to show how we cannot support resuming a coroutine during a maybe_suspend operation, and why the spec should label that as UB.

This is my core point. This is already the case today: we have await.handle, which may return the current coroutine, and the pattern of resuming a coroutine inside the await_suspend function may be widely used. So if it is really problematic, we should send a defect report (or a paper) to WG21. Would you like to do this?

rnk commented 1 week ago

WG21 engagement is far beyond what I have time for. :)

What's the actual desired behavior here? Maybe I don't fully understand what you're suggesting. I think this can be made to work if resuming during a maybe-suspend operation resumes at the pending maybe-suspend point. Similarly, a symmetric transfer operation to the currently executing coroutine should make forward progress rather than re-entering the last suspend point. We can fix that by updating the coro state integer earlier, right? I don't think it's a matter of adding edges.

ChuanqiXu9 commented 6 days ago

WG21 engagement is far beyond what I have time for. :)

Got it. I'll try to make a paper and add your name there if this is really problematic.

What's the actual desired behavior here?

I think that when await.bool returns false, it should have the same behavior as await.handle returning the current handle. That said, if we don't add the edges, we can have an implementation like:

define ... @coro_func(...) {
...
await.suspend:                                    ; preds = %invoke.cont
  %8 = call token @llvm.coro.save(ptr null)
  %9 = call i1 @llvm.coro.await.suspend.bool(ptr %ref.tmp5, ptr %5, ptr @Coro() (.__await_suspend_wrapper__await))
  br i1 %9, label %await.suspend.bool, label %await.bool.resuming

await.suspend.bool:                               ; preds = %await.suspend
  %10 = call i8 @llvm.coro.suspend(token %8, i1 false)
  switch i8 %10, label %coro.ret [
    i8 0, label %await.ready
    i8 1, label %await.cleanup
  ]

await.bool.resuming:
   call coro_func.resume(@frame)
   ret
}

We would update the coro index (probably the coro state integer you mentioned) in @llvm.coro.save(), before calling await_suspend (which is called inside @llvm.coro.await.suspend.bool). Then, when @llvm.coro.await.suspend.bool returns false, we would call the resume function directly. And we (the compiler) don't need to track the suspend point, since it is stored properly in the coro index.

I think this can be made to work if resuming during a maybe-suspend operation resumes at the pending maybe-suspend point.

I am not sure I follow your meaning, but this actually looks like the same thing I said above.

Similarly, a symmetric transfer operation to the currently executing coroutine should make forward progress rather than re-entering the last suspend point.

This is an optimization, right?

We can fix that by updating the coro state integer earlier, right?

IIUC, we actually update the coro state integer early enough.

I don't think it's a matter of adding edges.

I think that if the above pseudocode is fine, then it is fine to add the edges too, since they will have the same dataflow and control flow.