Unsound idempotent rmw to load optimization in x86 and most likely other backends using lowerIdempotentRMWIntoFencedLoad #101551

Open gonzalobg opened 1 month ago

gonzalobg commented 1 month ago

The `lowerIdempotentRMWIntoFencedLoad` hook seems to allow a backend to implement the LLVM IR optimization that was shown to be unsound here: https://github.com/llvm/llvm-project/issues/56450#issuecomment-1183759339 .

The x86 backend (among others) performs it during `X86ISelLowering`, but the bug above showed that illegal outcomes, not allowed by x86 or Rust, were actually produced at runtime by this optimization on x86 (and probably most other backends).

I think we should remove `lowerIdempotentRMWIntoFencedLoad` from LLVM. We are providing an API that makes it easy for backends to implement an optimization that sounds completely reasonable but is known to be broken.

I ran into it while fixing something else. Here is a repro https://clang.godbolt.org/z/9Wj1W8Moq that shows this code:

int thread0() {
  y.store(1, rlx);
  atomic_thread_fence(rel);
  x.fetch_add(0, rlx);
  atomic_thread_fence(acq);
  int r0 = z.load(rlx);
  return r0;
}

int thread1() {
  z.store(1, rlx);
  atomic_thread_fence(rel);
  x.fetch_add(0, rlx);
  atomic_thread_fence(acq);
  int r1 = y.load(rlx);
  return r1;
}

which is guaranteed not to exhibit the r0 == 0 && r1 == 0 outcome, because one of the RMWs reads from the other and the RMWs, together with the surrounding fences, build a release/acquire pattern. Yet it is compiled into plain store buffering:

thread0():
        mov     dword ptr [rip + y], 1
        mov     eax, dword ptr [rip + z]
        ret

thread1():
        mov     dword ptr [rip + z], 1
        mov     eax, dword ptr [rip + y]
        ret

because:

- the idempotent rmw are optimized into loads (incorrect)
- the loads are dead, so they are removed (the generated code doesn't even access `x`...)

When piping that code into herd with x86 TSO:

X86 SB
{
}
 P0          | P1          ;
 MOV [y],$1  | MOV [z],$1  ;
 MOV EAX,[z] | MOV EAX,[y] ;
exists
(0:EAX=0 /\ 1:EAX=0)

herd confirms that the outcome r0 == 0 && r1 == 0 can be produced, that is, the generated binary can exhibit an outcome that the input C++ code forbids.
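
(Not part of the original report: below is a minimal sketch of a runtime harness around the two functions, assuming declarations matching the godbolt repro (`std::atomic<int> x, y, z` and the `rlx`/`rel`/`acq` aliases) and C++20 for `std::barrier`. Whether the forbidden outcome is actually observed depends on timing, but with the miscompiled store/load sequences above it is at least possible, whereas the source program forbids it.)

```c++
// Sketch of a runtime harness (not from the issue); assumes the
// declarations from the godbolt repro and C++20 (for std::barrier).
#include <atomic>
#include <barrier>
#include <cstdio>
#include <thread>

std::atomic<int> x, y, z;
constexpr auto rlx = std::memory_order_relaxed;
constexpr auto rel = std::memory_order_release;
constexpr auto acq = std::memory_order_acquire;
using std::atomic_thread_fence;

int thread0() {
  y.store(1, rlx);
  atomic_thread_fence(rel);
  x.fetch_add(0, rlx);
  atomic_thread_fence(acq);
  return z.load(rlx);
}

int thread1() {
  z.store(1, rlx);
  atomic_thread_fence(rel);
  x.fetch_add(0, rlx);
  atomic_thread_fence(acq);
  return y.load(rlx);
}

int main() {
  constexpr long kIters = 10'000'000;
  std::barrier sync(2);  // rendezvous so both threads race each iteration
  int r0 = 0, r1 = 0;
  long hits = 0;

  std::thread t0([&] {
    for (long i = 0; i < kIters; ++i) {
      x.store(0, rlx); y.store(0, rlx); z.store(0, rlx);  // reset before the race
      sync.arrive_and_wait();
      r0 = thread0();
      sync.arrive_and_wait();  // publishes r0 to the checking thread
    }
  });
  std::thread t1([&] {
    for (long i = 0; i < kIters; ++i) {
      sync.arrive_and_wait();
      r1 = thread1();
      sync.arrive_and_wait();
      if (r0 == 0 && r1 == 0) ++hits;  // outcome the source program forbids
    }
  });
  t0.join();
  t1.join();
  std::printf("forbidden r0 == 0 && r1 == 0 outcomes: %ld\n", hits);
}
```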

gonzalobg commented 1 month ago

I can't add labels, but this should at least be labeled with the x86 and AMDGPU backends, for visibility.

keinflue commented 1 month ago

Does this also cause the issue I reported in https://github.com/llvm/llvm-project/issues/100840?

gonzalobg commented 1 month ago

I think that the fix to this one (removing these optimizations) may fix that one, but that one uses volatile atomics, so I'm not 100% sure. I think it's good that these two are linked; we'll have to check whether both are indeed fixed, or whether something else is required beyond what's proposed here :/

llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-x86

Author: None (gonzalobg)

llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-amdgpu

Author: None (gonzalobg)

dmpots commented 1 month ago

@gonzalobg It looks like there is a typo in your godbolt repro in the description.

constexpr auto acq = memory_order_release;

I'm pretty sure this should be memory_order_acquire.
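
i.e., presumably the repro intends something like (sketch; modulo `std::` qualification):

```c++
constexpr auto rlx = std::memory_order_relaxed;
constexpr auto rel = std::memory_order_release;
constexpr auto acq = std::memory_order_acquire;  // was: memory_order_release (the typo)
```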

gonzalobg commented 1 month ago

@dmpots thanks, yes, that's correct! well spotted! I think I made the mistake when cleaning it up for posting, but didn't notice it because it did not impact the codegen :/

Have updated the link in the description from https://clang.godbolt.org/z/s54cGqozK to https://clang.godbolt.org/z/9Wj1W8Moq .

gonzalobg commented 1 month ago

Pinging code owners: @topperc (x86), @tstellar (amdgpu). Do you object to removing the generic hook that makes it easy to introduce this unsound optimization, and to removing the unsound optimization from the impacted backends?

arsenm commented 1 month ago

As someone who knows next to nothing about memory model details, is this fundamentally unsound, or an x86 implementation bug? I don't really follow why load + fence is wrong. X86's implementation emitted a fence, but then it got lost? If I codegen the example on AMDGPU, I get:

thread0:
  flat_store_dword
  flat_load_dword
  buffer_wbinvl1_vol
  flat_load_dword 

thread1:
  flat_store_dword
  flat_load_dword
  buffer_wbinvl1_vol
  flat_load_dword

So an access wasn't dropped anywhere?

cc @t-tye @Pierre-vh

gonzalobg commented 1 month ago

The key feature of read-modify-write operations is that, e.g., in this example, one will read from the other. That is, either the atomic add of thread 0 happens first, or the one of thread 1 happens first, but they can't both happen at the same time (because that would clobber/lose an addition). That property holds even if the value written to memory is the same one that was read (e.g. if they add zero).
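
(Not from the original comment, but a minimal sketch of that property with a non-zero addend, where a clobbered addition would be observable:)

```c++
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
  std::atomic<int> counter{0};

  auto bump = [&] {
    for (int i = 0; i < 1'000'000; ++i)
      counter.fetch_add(1, std::memory_order_relaxed);  // one indivisible read-modify-write
  };

  std::thread t0(bump), t1(bump);
  t0.join();
  t1.join();

  // Each fetch_add reads the value written immediately before it in the
  // modification order of `counter`, so no increment can be clobbered;
  // a load-add-store sequence could lose increments because two loads may
  // read the same value. fetch_add(0) has exactly the same ordering
  // property; it just happens to write back the value it read.
  std::printf("%d\n", counter.load());  // always 2000000
}
```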

The whole idea behind this optimization is that an atomic read-modify-write that writes back the same value it read is not really a write, and can be optimized into a load. But then we end up with two loads in this example, which can happen concurrently, and do so on any HW worth using. That is, this optimization fails to preserve a key semantic property of read-modify-writes.

That is what makes this transformation unsound in general for C, C++, LLVM IR, and all hardware ISAs I'm familiar with (PTX, ARM, x86, PPC).

That property is what guarantees that the program above never prints r0 == 0 && r1 == 0. Transforming the program to

int thread0() {
  y.store(1, rlx);
  atomic_thread_fence(rel);
  x.load(rlx); // optimized from x.fetch_add(0, rlx);
  atomic_thread_fence(acq);
  int r0 = z.load(rlx);
  return r0;
}

already suffices to allow the forbidden outcome: with thread 0 no longer writing to x, its load can read the initial value of x, in which case neither pair of fences synchronizes, and nothing forbids r0 == 0 && r1 == 0.

The rest is just collateral damage.

gonzalobg commented 1 month ago

These optimizations sound plausible, so it is not uncommon for them to be accidentally added to compilers, e.g. to GCC, and even to LLVM: https://github.com/llvm/llvm-project/issues/56450 (that one was at the LLVM IR -> LLVM IR level, while this bug is that the same transformation was also added via a backend hook...).

I think we should assume that, eventually, someone else will have the idea of performing this optimization and will try to implement it, and that it will be extremely unlikely for a reviewer to catch any of these issues, because they are very subtle. So as part of fixing this, we should try to help others by leaving some tests in the impacted backends, e.g., for the example in this bug, with a comment pointing at this and the other bug, so that hopefully they'll hit those.

That other bug has a runtime reproducer for this issue on x86, in which it breaks the implementation of a concurrency primitive. That is, these issues are not purely theoretical; it just takes a large group of people to nail them down to bugs like this one.

arsenm commented 1 month ago

> So as part of fixing this, we should try to help others by leaving some tests in the impacted backends, e.g., for the example in this bug, with a comment pointing at this and the other bug, so that hopefully they'll hit those.

Of course, tests should just be left with the new output.

RalfJung commented 1 month ago

Based on this comment, it seems like there are even more instances of this wrong transform? Or is that the same as what is already being discussed above?

arsenm commented 1 month ago

> Based on this comment, it seems like there are even more instances of this wrong transform? Or is that the same as what is already being discussed above?

It's the same thing

gonzalobg commented 2 weeks ago

@topperc ping

Pierre-vh commented 2 weeks ago

> As someone who knows next to nothing about memory model details, is this fundamentally unsound, or an x86 implementation bug? I don't really follow why load + fence is wrong. X86's implementation emitted a fence, but then it got lost? If I codegen the example on AMDGPU, I get:
>
> thread0:
>   flat_store_dword
>   flat_load_dword
>   buffer_wbinvl1_vol
>   flat_load_dword
>
> thread1:
>   flat_store_dword
>   flat_load_dword
>   buffer_wbinvl1_vol
>   flat_load_dword
>
> So an access wasn't dropped anywhere?
>
> cc @t-tye @Pierre-vh

I can't confidently answer this without spending more time on it, but my first impression is that removing the rmw is indeed incorrect, simply because you're eliminating an atomic write. Even if the write didn't update the value, I think it could still have paired with a fence to create happens-before (I always forget whether fences can pair with RMWs; I'm pretty sure they do).
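
For reference, fences do pair with RMWs in the C++ model: [atomics.fences] only requires atomic operations on the same object, and an RMW is both an atomic read and an atomic write. A minimal sketch (not from the thread; the names are illustrative):

```c++
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                // plain, non-atomic payload
std::atomic<int> flag{0};

void producer() {
  data = 42;
  std::atomic_thread_fence(std::memory_order_release);
  flag.fetch_add(1, std::memory_order_relaxed);  // relaxed RMW: still an atomic write
}

void consumer() {
  // Relaxed RMW: still an atomic read; once it observes the producer's write...
  while (flag.fetch_add(0, std::memory_order_relaxed) == 0) { /* spin */ }
  std::atomic_thread_fence(std::memory_order_acquire);
  // ...the release fence synchronizes with the acquire fence, so this is safe.
  assert(data == 42);
}

int main() {
  std::thread t1(consumer), t0(producer);
  t0.join();
  t1.join();
}
```

Replacing the consumer's RMW with a plain relaxed load would still synchronize here; the problem in this issue is the other direction, where the eliminated RMW was the *write* that the opposite thread's fence pairing relied on.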

nhaehnle commented 4 days ago

On the AMDGPU discussion: I suspect the optimization we're doing is de facto correct, even if it is dodgy as hell at the IR level. My reasoning is that for a relaxed atomic load we should generate a flat_load_dword that bypasses the relevant caches. So the generated flat_load_dword causes a transaction in the coherent cache that is for all purposes equivalent to the flat_atomic_add we would have generated without the optimization. There is a total order of accesses to this memory location.

gonzalobg commented 4 days ago

So in the original example (for AMDGPU here):

thread0:
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        v_mov_b32_e32 v6, 1
        flat_store_dword v[0:1], v6 sc0 sc1
        buffer_wbl2 sc0 sc1
        s_waitcnt vmcnt(0) lgkmcnt(0)
        flat_load_dword v0, v[2:3] sc0 sc1 ;; atomicrmw add 0
        s_waitcnt vmcnt(0) lgkmcnt(0)
        buffer_inv sc0 sc1
        flat_load_dword v0, v[4:5] sc0 sc1  ;; r0 in top comment example
        s_waitcnt vmcnt(0) lgkmcnt(0)
        s_setpc_b64 s[30:31]
thread1:
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        v_mov_b32_e32 v6, 1
        flat_store_dword v[4:5], v6 sc0 sc1
        buffer_wbl2 sc0 sc1
        s_waitcnt vmcnt(0) lgkmcnt(0)
        flat_load_dword v0, v[0:1] sc0 sc1 ;; atomicrmw add 0
        s_waitcnt vmcnt(0) lgkmcnt(0)
        buffer_inv sc0 sc1
        flat_load_dword v0, v[2:3] sc0 sc1 ;; r1 in top comment example
        s_waitcnt vmcnt(0) lgkmcnt(0)
        s_setpc_b64 s[30:31]

is the rationale that the two concurrent flat_load_dword instructions to the same memory location (annotated with atomicrmw add 0) are totally ordered (i.e. as if they were idempotent atomic read-modify-writes), and that this prevents the r0 == 0 && r1 == 0 outcome?

nhaehnle commented 3 days ago

Yes, that was the idea.

Though, now that I'm thinking about it again, I'm not so sure about this on the various MI chips that AMDGPU supports, because they do have coherency protocols that may behave differently for a load vs. an atomic_add. It is true for the graphics chips I'm used to because those all have a single cache for every (scope instance, memory location) pair.