Open lukeg101 opened 1 year ago
@llvm/issue-subscribers-backend-aarch64
Note that the SEQ_CST load might be in a function built for v8.4, while the STLXR is in another function built for v8.0, i.e. the bug exists in all LLVM versions. Atomics are part of the ABI: all possible combinations of atomic sequences you can emit with different options or encounter from other compilers (GCC!) must work correctly with each other.
CC @tmatheson-arm @lenary .
We don't emit patch releases for older versions of clang, so there isn't really anything we can do in that respect. But there's still an issue mixing code built for armv8.4 with code built for armv8.0, I think?
I was chatting with Wilco and we came up with a few scenarios regarding mixing code. Tomas and I can look into these, but for this PR there is one example that motivates fixing the above bug in trunk.
Consider a case where the function P0 above does a write and a read, but those accesses are in different functions. Those functions are compiled separately, i.e. for different architectures (e.g. the store is emitted as a CAS loop on v8 and the load is an LDP on v8.4). The ABI demands these implementations work together, but when the CPU runs P0 with the two implementations the accesses can still be reordered and the above bug observed.
Indeed. My point was that being locally correct is not sufficient: reordering of atomics will happen across functions that were built with different options or compilers.
It looks like this bug also exists in AArch32 but for all atomic sizes (v7 SEQ_CST load does not start with a DMB and thus can be reordered with a v8 STL/STLEX).
For comparison, on Arm64 Windows, we emit a barrier after sequentially consistent store/rmw atomic operations to avoid reordering.
As I received a few questions regarding the above, here is an AArch64 litmus test that demonstrates the fix. If we use LDAR ahead of LDP, we should restore sequential consistency:
AArch64 SB
{ uint64_t 0:X2 = y; int64_t x[2] = {0,0}; 0:X4 = x; 0:X5 = 1; 0:X6 = 1;
  uint64_t 1:X2 = x; int64_t y[2] = {0,0}; 1:X4 = y; 1:X5 = 1; 1:X6 = 1; }
 P0                         | P1                         ;
 DMB ISH                    | DMB ISH                    ;
 loop1: LDAXP X8, X9, [X4]  | loop2: LDAXP X8, X9, [X4]  ;
 STLXP W7, X5, X6, [X4]     | STLXP W7, X5, X6, [X4]     ;
 CBNZ W7, loop1             | CBNZ W7, loop2             ;
 LDAR X3, [X2]              | LDAR X3, [X2]              ;
 LDP X0, X1, [X2]           | LDP X0, X1, [X2]           ;
 DMB ISH                    | DMB ISH                    ;
exists (0:X0=0 /\ 0:X1=0 /\ 1:X0=0 /\ 1:X1=0)
The outcome in the exists clause (similar to the one in the Godbolt link) is forbidden by the AArch64 model. The use of LDAR constrains the execution between any CAS loop and the LDP so that the reordering is forbidden.
I do however note that if you use the recent LDAPR instruction, the outcome is allowed - a bug! (Don't use LDAPR.) This is because LDAPR constrains executions per location (i.e. x), whereas LDAR constrains executions for any locations (i.e. x and y above).
In summary, the fix should be: LDP; DMB -> LDAR; LDP; DMB
Thanks for the confirmation. LDAPR should indeed not be used with sequential consistency.
Note the most efficient sequence is LDAR; LDP; DMB ISHLD, since this imposes the fewest constraints on memory reordering (in particular, earlier relaxed stores can be safely reordered after this sequence).
After chatting with @Wilco1 I generated some concurrent litmus tests to test the compilation of concurrent 128-bit C/C++ programs using Telechat [1]. Consider the following litmus test (https://godbolt.org/z/hEcsc77oe):
If simulated using the rc11 memory model we get the following outcomes when run from an initial state where x, y, P0_r0, P1_r0 are zero initialised:
When compiled using clang -O3 -pthread -std=c11 -march=armv8.4-a, the 128-bit sequentially consistent (SC) load emits ldp; dmb ish and the SC store is implemented as a compare and swap loop using ldaxp, stlxp. ldp has no ordering constraint and can be reordered before the stlxp, allowing an outcome, when the AArch64 program is simulated under the AArch64 memory model, that is forbidden by the source program under the source model:
As far as I can tell, this affects any mix of LSE{2} (and above) uses of LDP with the compare and swap loop. I have observed this on clang versions 13, 14, 15 (albeit with a different implementation using sync), but not clang 16, as the caspal instruction has acquire-release semantics and does not allow the forbidden outcome.
My recommendation is to follow Wilco's fix in GCC and emit a barrier before the ldp, as in GCC [2].
[1] https://www.youtube.com/watch?v=xn4jtXOGfKg
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108891
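For readers unfamiliar with the store-buffering (SB) shape discussed in this issue, here is a rough C sketch of why the "both loads read zero" outcome is the interesting one. It is only a toy interleaving enumerator, not rc11, not the AArch64 model, and not the program from the Godbolt link. All function names are hypothetical. It assumes each thread performs exactly one store of 1 to its own location followed by one load of the other location, and it models the reordering of ldp before stlxp simply by swapping the two events in each thread.

```c
#include <stdbool.h>

/* Toy model: each thread has two events, a store of 1 to its own location
   and a load of the other thread's location. Names are hypothetical. */
enum { STORE, LOAD };

/* Execute one interleaving. order[] gives the thread id (0 or 1) of each
   step; prog[t][] gives the order in which thread t performs its events.
   Returns true when both threads loaded 0. */
static bool both_loads_read_zero(const int order[4], const int prog[2][2]) {
    int mem[2] = {0, 0};   /* mem[0] models x, mem[1] models y */
    int reg[2] = {-1, -1}; /* value loaded by P0 and P1 */
    int pc[2] = {0, 0};    /* next event index per thread */
    for (int i = 0; i < 4; i++) {
        int t = order[i];
        int ev = prog[t][pc[t]++];
        if (ev == STORE)
            mem[t] = 1;            /* P0 writes x, P1 writes y */
        else
            reg[t] = mem[1 - t];   /* P0 reads y, P1 reads x */
    }
    return reg[0] == 0 && reg[1] == 0;
}

/* Check all 6 interleavings of two threads with two events each. */
static bool outcome_reachable(const int prog[2][2]) {
    static const int orders[6][4] = {
        {0, 0, 1, 1}, {0, 1, 0, 1}, {0, 1, 1, 0},
        {1, 0, 0, 1}, {1, 0, 1, 0}, {1, 1, 0, 0}};
    for (int i = 0; i < 6; i++)
        if (both_loads_read_zero(orders[i], prog))
            return true;
    return false;
}

/* Store before load in each thread: what sequential consistency requires. */
bool sb_both_zero_in_program_order(void) {
    const int prog[2][2] = {{STORE, LOAD}, {STORE, LOAD}};
    return outcome_reachable(prog);
}

/* Load hoisted above the store in each thread: a crude stand-in for the
   ldp being reordered before the stlxp. */
bool sb_both_zero_load_hoisted(void) {
    const int prog[2][2] = {{LOAD, STORE}, {LOAD, STORE}};
    return outcome_reachable(prog);
}
```

In this toy model, sb_both_zero_in_program_order() returns false and sb_both_zero_load_hoisted() returns true. That mirrors the claim above: the outcome is forbidden when every store stays ahead of the following load, and becomes observable once the load may move before the store.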