gonzalobg opened this issue 1 month ago
I can't add labels, but these should include the x86 backend, and the AMDGPU backend, for visibility at least.
Does this also cause the issue I reported in https://github.com/llvm/llvm-project/issues/100840?
I think that the fix to this one (removing these optimizations) may fix that one, but that one uses volatile atomic, so I'm not 100% sure. I think it's good that these two are linked; we'll have to check whether both are indeed fixed, or whether something else is required beyond what's proposed here :/
@gonzalobg It looks like there is a typo in your godbolt repro in the description.
constexpr auto acq = memory_order_release;
I'm pretty sure this should be memory_order_acquire.
@dmpots thanks, yes, that's correct! well spotted! I think I made the mistake when cleaning it up for posting, but didn't notice it because it did not impact the codegen :/
Have updated the link in the description from https://clang.godbolt.org/z/s54cGqozK to https://clang.godbolt.org/z/9Wj1W8Moq .
Pinging code owners: @topperc (x86), @tstellar (amdgpu), do you object to removing the generic functionality that introduces these unsound optimizations, and to removing the unsound optimizations from the impacted backends?
As someone who knows next to nothing about memory model details, is this fundamentally unsound, or an x86 implementation bug? I don't really follow why load + fence is wrong. X86's implementation emitted a fence, but then it got lost? If I codegen the example on AMDGPU, I get:
thread0:
flat_store_dword
flat_load_dword
buffer_wbinvl1_vol
flat_load_dword
thread1:
flat_store_dword
flat_load_dword
buffer_wbinvl1_vol
flat_load_dword
So an access wasn't dropped anywhere?
cc @t-tye @Pierre-vh
The key feature of read-modify-write operations is that, e.g., in this example, one will read from the other. That is, either the atomic add of thread 0 happens first, or the one of thread 1 happens first, but they can't both happen at the same time (because that would clobber/lose an addition). That property holds even if the value written to memory is the same one that was read (e.g. if they add zero).
The whole idea behind this optimization is that an atomic read-modify-write that writes the same value it read is not really a write, and can be optimized into a load. But then we end up with two loads in this example, which can happen concurrently, and do so on any HW worth using. That is, this optimization fails to preserve a key semantic property of read-modify-writes.
That is what makes this transformation - in general - unsound for C, C++, LLVM IR, and all HW ISAs I'm familiar with (PTX, Arm, x86, PPC).
That property is what guarantees that the program above never prints r0 == 0 && r1 == 0.
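For reference, here is a sketch of the kind of program under discussion, reconstructed from the snippets quoted in this thread; the variable names x, y, z and the rlx/rel/acq aliases are taken from those snippets, thread1 is written by symmetry, and the actual godbolt repro may differ (e.g. it may take the atomics as pointer arguments, as the AMDGPU codegen further down suggests):
#include <atomic>
using namespace std;

atomic<int> x{0}, y{0}, z{0};
constexpr auto rlx = memory_order_relaxed;
constexpr auto rel = memory_order_release;
constexpr auto acq = memory_order_acquire;

int thread0() {
  y.store(1, rlx);
  atomic_thread_fence(rel);
  x.fetch_add(0, rlx);   // idempotent RMW on the shared atomic x
  atomic_thread_fence(acq);
  int r0 = z.load(rlx);
  return r0;
}

int thread1() {
  z.store(1, rlx);
  atomic_thread_fence(rel);
  x.fetch_add(0, rlx);   // totally ordered with thread0's RMW; the later one reads from the earlier one
  atomic_thread_fence(acq);
  int r1 = y.load(rlx);
  return r1;
}
// With both fetch_adds kept as real RMWs, r0 == 0 && r1 == 0 is forbidden:
// whichever RMW comes second in x's modification order reads the value written
// by the first, so the release fence before the first pairs with the acquire
// fence after the second, and the second thread's final load must then observe
// the first thread's initial store.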
Transforming the program to
int thread0() {
y.store(1, rlx);
atomic_thread_fence(rel);
x.load(rlx); // optimized from x.fetch_add(0, rlx);
atomic_thread_fence(acq);
int r0 = z.load(rlx);
return r0;
}
already suffices to allow the forbidden outcome: thread0 no longer writes to x, so there is an execution in which both thread1's RMW and thread0's load read the initial value of x, no pair of fences synchronizes, and both r0 and r1 read 0.
The rest is just collateral damage.
These optimizations sound plausible, so their being accidentally added to compilers is not uncommon, e.g., to GCC, and even to LLVM: https://github.com/llvm/llvm-project/issues/56450 (that was at the LLVM IR -> LLVM IR level, while this bug is that the same thing was also added via a backend hook).
I think we should assume that, eventually, someone else will have the idea of performing this optimization, will try to implement it, and that it will be extremely unlikely for a reviewer to catch any of these issues, because they are very subtle. So as part of fixing this, we should try to help others by leaving some tests in the impacted backends, e.g., for the example in this bug, with a comment pointing at this and the other bug, so that hopefully they'll hit those.
That other bug has a runtime reproducer for this bug in x86, in which it breaks the implementation of a concurrency primitive. That is, these issues are not purely theoretical. It just takes a large group of people to nail them down to issues like the one in this bug.
Of course, tests should just be left with the new output
Based on this comment, seems like there's even more instances of this wrong transform? Or is that the same as what is already being discussed above?
It's the same thing
@topperc ping
On whether this is fundamentally unsound or just an x86 implementation bug: I can't confidently answer that without spending more time on it, but my first impression is that removing the RMW is indeed incorrect, simply because you're eliminating an atomic write. Even if the write didn't change the value, I think it could still have paired with a fence to create happens-before (I always forget whether fences can pair with RMWs, but I'm pretty sure they do).
On the AMDGPU discussion: I suspect the optimization we're doing is de facto correct, even if it is dodgy as hell at the IR level. My reasoning is that for a relaxed atomic load we should generate a flat_load_dword that bypasses the relevant caches. So the generated flat_load_dword causes a transaction in the coherent cache that is for all purposes equivalent to the flat_atomic_add we would have generated without the optimization. There is a total order of accesses to this memory location.
So in the original example (for AMDGPU here):
thread0:
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
v_mov_b32_e32 v6, 1
flat_store_dword v[0:1], v6 sc0 sc1
buffer_wbl2 sc0 sc1
s_waitcnt vmcnt(0) lgkmcnt(0)
flat_load_dword v0, v[2:3] sc0 sc1 ;; atomicrmw add 0
s_waitcnt vmcnt(0) lgkmcnt(0)
buffer_inv sc0 sc1
flat_load_dword v0, v[4:5] sc0 sc1 ;; r0 in top comment example
s_waitcnt vmcnt(0) lgkmcnt(0)
s_setpc_b64 s[30:31]
thread1:
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
v_mov_b32_e32 v6, 1
flat_store_dword v[4:5], v6 sc0 sc1
buffer_wbl2 sc0 sc1
s_waitcnt vmcnt(0) lgkmcnt(0)
flat_load_dword v0, v[0:1] sc0 sc1 ;; atomicrmw add 0
s_waitcnt vmcnt(0) lgkmcnt(0)
buffer_inv sc0 sc1
flat_load_dword v0, v[2:3] sc0 sc1 ;; r1 in top comment example
s_waitcnt vmcnt(0) lgkmcnt(0)
s_setpc_b64 s[30:31]
is the rationale that the two concurrent flat_load_dword to the same memory location (annotated with atomicrmw add 0) are totally ordered (e.g. as if they were an idempotent atomic read-modify-write), and that this prevents the r0 == 0 && r1 == 0 outcome?
Yes, that was the idea.
Though, now that I'm thinking about it again, I'm not so sure about this on the various MI chips that AMDGPU supports, because they do have coherency protocols that may behave differently for a load vs. an atomic_add. It is true for the graphics chips I'm used to, because those all have a single cache for every (scope instance, memory location) pair.
The lowerIdempotentRMWIntoFencedLoad hook seems to allow a backend to implement the LLVM IR optimization that was shown to be unsound in https://github.com/llvm/llvm-project/issues/56450#issuecomment-1183759339 . The x86 backend (among others) performs it during X86ISelLowering, but the bug above showed that outcomes forbidden by x86 and by Rust were actually produced at runtime with this optimization on x86 (and probably on most other backends).
I think we should remove lowerIdempotentRMWIntoFencedLoad from LLVM. We are providing an API that makes it easy for backends to implement an optimization that sounds completely reasonable but is known to be broken.
I ran into this while fixing something else. Here is a repro, https://clang.godbolt.org/z/9Wj1W8Moq , showing that code which is guaranteed to never exhibit the r0 == 0 && r1 == 0 outcome (because one of the RMWs reads from the other, and the RMWs, with the fences around them, build a release/acquire pattern) is compiled into plain store buffering. Piping that generated code into herd with the x86 TSO model confirms that the outcome r0 == 0 && r1 == 0 can be produced, that is, the produced binary exhibits an outcome that the input C++ code forbids.
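To make the store-buffering claim concrete, here is a hedged sketch of what the compiled x86 code effectively behaves like; it assumes, as the discussion above suggests, that the relaxed fetch_add(0) was lowered to a plain load, and it relies on acquire/release thread fences compiling to no instruction on x86 (x, y, z, rlx as in the sketch earlier in the thread):
// Effective shape of the generated code (sketch, not the actual asm):
int thread0_as_compiled() {
  y.store(1, rlx);       // mov [y], 1   -- may linger in the store buffer
  (void)x.load(rlx);     // mov eax, [x] -- no lock prefix, no mfence
  int r0 = z.load(rlx);  // mov eax, [z] -- may complete before the store above is globally visible
  return r0;
}

int thread1_as_compiled() {
  z.store(1, rlx);       // mov [z], 1
  (void)x.load(rlx);     // mov eax, [x]
  int r1 = y.load(rlx);  // mov eax, [y]
  return r1;
}
// Under x86-TSO each thread's store may be delayed past its own later loads
// (store buffering), so r0 == 0 && r1 == 0 is observable - exactly the outcome
// the source program forbids, which is what the herd run confirms.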