llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.91k stars 11.52k forks source link

wrong code due to incompatibility between C++20 coroutines and 32-bit Windows x86 ABI #59382

Open zygoloid opened 1 year ago

zygoloid commented 1 year ago

Testcase is a little large, see it on compiler explorer here.

The 32-bit Windows ABI passes objects of non-trivially-copyable class type by value on the stack. The C++20 coroutines feature can suspend execution after an argument to a function has finished being evaluated and before the function call happens. In this testcase, that happens here:

  consume_two(co_await make(), co_await make());

This code is therefore not lowerable: the language rules don't permit us to make a copy of the result of the first argument, the ABI requires that we construct it directly into a stack slot where consume_two will find it, and the function containing that stack slot returns between the object being constructed and the call to consume_two.

After the coroutine lowering pass runs, we have this IR:

define internal fastcc void @"?my_task@@YA?AUtask@@XZ.resume"(ptr noalias nonnull align 4 dereferenceable(72) %0) #3 personality ptr @__CxxFrameHandler3 !dbg !880 {
  ; ...
  %argmem.reload.addr = getelementptr inbounds %"?my_task@@YA?AUtask@@XZ.Frame", ptr %0, i32 0, i32 6, !dbg !884
  ; ...
  call void @"?consume_two@@YAXUNoisy@@0@Z"(ptr inalloca(<{ %struct.Noisy, %struct.Noisy }>) %argmem.reload.addr), !dbg !886

... which doesn't satisfy the rules for inalloca arguments. (The verifier doesn't complain about this, but presumably it should!)

It's not clear to me what the machine code we generate in this case actually does, and I don't have a Windows machine / emulator on which to test, but I don't see how it can be generating correct code!

We should decide what we want to do here. It seems like there are a few options:

MSVC appears to do something like the third option. Note that MSVC doesn't call await_resume on either awaitable until both are ready. While this behavior directly violates the C++ sequencing rules, which don't permit anything else to be interleaved between the await_suspend and await_resume calls on a coroutine, maybe that's the best we can do, and it would be compatible with MSVC.

llvmbot commented 1 year ago

@llvm/issue-subscribers-clang-codegen

llvmbot commented 1 year ago

@llvm/issue-subscribers-c-20

llvmbot commented 1 year ago

@llvm/issue-subscribers-coroutines

zygoloid commented 1 year ago

@GorNishanov @ilya-biryukov

iains commented 1 year ago

I wonder if this is not a symptom of a more general problem e.g.

result = co_await t + co_await t;

where the await_resume() for the coroutine has a side-effect that changes the operation of a subsequent invocation. (I have PR against the GCC impl. raised by Lewis for this specific case).

It is my current plan for GCC to ensure that, effectively we compute this like

result = co_await t ;
result += co_await t;

(and similar constraints for other cases, such as the function arguments in this PR).

edit: in the case above it assumes that the LHS and RHS are the same type (otherwise:

 temp = co_await t ;
 result = temp + co_await t ;

)

edit2: I have also over-simplified this for the sake of exposition - the proposed impl. treats the rewrite more like a GNU statement expression, since we need to handle lifetime extensions of temporaries where required to the end of the original statement.

If we say that this is invalid user code, it's not clear to me how we can diagnose it?

ChuanqiXu9 commented 1 year ago

BTW, it looks like that coroutines on Windows is going to be out of maintenance... This is the 4th bug report in the last 4 months: https://github.com/llvm/llvm-project/issues/58556, https://github.com/llvm/llvm-project/issues/58543, https://github.com/llvm/llvm-project/issues/56989

I didn't look deeply into them since I don't have a windows environment too.

h-vetinari commented 1 year ago

[...] We could reorder the evaluation of function arguments on this ABI so that all suspensions are performed before any arguments are constructed. Note that this is not conforming due to the C++17 evaluation order rules, but we already don't follow those rules in some cases because they're incompatible with callee-cleanup argument passing.

MSVC appears to do something like [this]. Note that MSVC doesn't call await_resume on either awaitable until both are ready. While this behavior directly violates the C++ sequencing rules, which don't permit anything else to be interleaved between the await_suspend and await_resume calls on a coroutine, maybe that's the best we can do, and it would be compatible with MSVC.

Is there a CWG issue for these inconsistencies resp. outright impossibility to conform? Does anyone even care (sufficiently) about the 32-bit ABI these days for this to really matter?

AaronBallman commented 1 month ago

CC @rnk @ilya-biryukov @zmodem

rnk commented 1 month ago

I think it is impossible to meet all these requirements:

It's feasible to compromise on 1, but it's a lot of work. It is so much work that we should put in a stopgap, we should probably declare coroutines unsupported for i*86-windows-msvc for the time being and emit a compiler error.

co_await introduces a control flow edge from the prologue into the middle of function call argument evaluation, and that edge will bypass the normal inalloca call stack setup. If some of the results of the co_await are non-trivially copyable, we will need to introduce copy constructor calls in the frontend, to evaluate all of the function arguments to temporary storage, allocate the call arguments, and copy construct them all into place. We can order the copy constructor calls any way we like to maintain the correct construction / destruction ordering, so it's feasible to meet requirements on argument order evaluation.

I think there may be other EH-related correctness reasons to implement this argument-copying mode.

AaronBallman commented 4 weeks ago

Thank you for the analysis @rnk!

It is so much work that we should put in a stopgap, we should probably declare coroutines unsupported for i*86-windows-msvc for the time being and emit a compiler error.

This is an unfortunate outcome, but I agree it's probably the right short-term solution for now.

CC @CaseyCarter @GorNishanov (feel free to CC anyone better) I'm not certain how important y'all find 32-bit support to be for coroutines; if this is going to cause significant pain for you, we could use some help on this.

Otherwise, I think we should:

1) Diagnose use of coroutines on 32-bit Windows targets as an error 2) No longer define __cpp_impl_coroutine on 32-bit Windows targets 3) Update our documentation (User's Manual) to explicitly say coroutines are not supported on 32-bit Windows targets 4) Put a release note under Potentially Breaking Changes so folks who were using these have some idea of why support was retracted.

lewissbaker commented 3 weeks ago

Note that MSVC doesn't call await_resume on either awaitable until both are ready. While this behavior directly violates the C++ sequencing rules, which don't permit anything else to be interleaved between the await_suspend and await_resume calls on a coroutine, maybe that's the best we can do, and it would be compatible with MSVC.

The problem with resequencing the order of await_suspend/await_resume evaluations is that it breaks certain use-cases like async-generators.

e.g. consider:

async_generator<X> gen() {
  co_yield X{10};
  co_yield X{20};
}

void consume_two(X, X);

task consume() {
  auto g = gen();
  consume_two(co_await g.next(), co_await g.next());
}

If you resequence the calls to await_resume() until after both co_await expressions, then the storage holding the value for the first co_yield will have been overwritten by the second co_yield expression and so both await_resume() call will try to read/copy/move the value of the second co_yield expression.

I think the more promising direction would be to relax the guaranteed copy-elision requirement for any call expressions where the argument expressions contain a suspend-point, to allow the compiler to store an intermediate result in the coroutine frame and then later move it if necessary into the right position.

This would be potentially surprising for users, though, as you could potentially end up with cases like:

struct immovable { immovable(immovable&&) = delete; };
void f(immovable, int);

task<int> get_int();

task<void> consumer() {
  f(immovable{}, 42); // ok
  f(immovable{}, co_await get_int()); // error: immovable requires a move-constructor
}

The second call expression contains a suspend-point in one of the arguments, so this then changes all by-value expressions to now require a move-constructor as copy-elision is no longer guaranteed.

There are similar cases where we introduce moves for parameters of coroutine functions passed-by-value to allow them to be moved into the coroutine-frame, so there is already some precedent for this sort of thing.

Maybe this is the best we can do if we want to make sure that platforms like windows x86 calling conventions can work with coroutines?

Alcaro commented 3 weeks ago

consume_two(co_await g.next(), co_await g.next());

Isn't argument evaluation order unspecified by the C++ spec, making that piece nonportable and inadvisable even without the coroutine?

My opinion: Please don't throw out the baby with the bathwater. Please pick a solution that keeps coroutines working in common cases, where all involved types are movable, or at least a solution where it's possible to extract the awaits to local variables.