llvm / llvm-project


[AArch64]: 128-bit seq_cst load can be reordered before prior RMW operations under LSE and above. #62652


lukeg101 commented 1 year ago

After chatting with @Wilco1, I generated some litmus tests to check the compilation of concurrent 128-bit C/C++ programs using Telechat [1]. Consider the following litmus test (https://godbolt.org/z/hEcsc77oe):

#include <stdatomic.h>

// globals
__int128_t* P0_r0;
__int128_t* P1_r0;
_Atomic __int128_t* x;
_Atomic __int128_t* y;

// Test body
void P0 () {
  atomic_fetch_add_explicit(x, 1, memory_order_seq_cst);
  __int128_t r0 = atomic_load_explicit(y, memory_order_seq_cst);
  *P0_r0 = r0;
}

void P1 () {
  atomic_store_explicit(y, 1, memory_order_seq_cst);
  __int128_t r0 = atomic_load_explicit(x, memory_order_seq_cst);
  *P1_r0 = r0;
}

// exists (0:r0=0 /\ 1:r0=0) <-- forbidden by rc11 memory model

If simulated under the rc11 memory model, we get the following outcomes when run from an initial state where x, y, P0_r0, and P1_r0 are zero-initialised:

{ 0:r0=0; 1:r0=1; }
{ 0:r0=1; 1:r0=0; }
{ 0:r0=1; 1:r0=1; }

When compiled using clang -O3 -pthread -std=c11 -march=armv8.4-a, the 128-bit sequentially consistent (SC) load is emitted as ldp; dmb ish, and the SC store is implemented as a compare-and-swap loop using ldaxp/stlxp. The ldp has no ordering constraint of its own and can be reordered before the stlxp, allowing an outcome, when the AArch64 program is simulated under the AArch64 memory model, that is forbidden for the source program under the source model:

{ 0:r0=0; 1:r0=0; } <- forbidden by rc11, a bug!
{ 0:r0=0; 1:r0=1; }
{ 0:r0=1; 1:r0=0; }
{ 0:r0=1; 1:r0=1; }
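For concreteness, here is a hand-written sketch of the two sequences in question for P0. This is illustrative rather than verbatim clang output, and the register assignments are invented:

// 128-bit SEQ_CST fetch_add, lowered as an exclusive-pair loop (x in x0):
rmw_loop:
  ldaxp  x8, x9, [x0]         // load-acquire exclusive of the old value
  adds   x10, x8, #1          // 128-bit increment: add 1 to the low half...
  cinc   x11, x9, hs          // ...then propagate the carry into the high half
  stlxp  w12, x10, x11, [x0]  // store-release exclusive of the new value
  cbnz   w12, rmw_loop        // retry if the exclusive monitor was lost

// 128-bit SEQ_CST load under LSE2 (y in x1):
  ldp    x4, x5, [x1]         // single-copy atomic, but no ordering on its own
  dmb    ish                  // the barrier only orders *later* accesses

The stlxp's release semantics order earlier accesses before it but place no constraint on later loads, so the ldp may be satisfied first.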

As far as I can tell, this affects any mix of LSE{2} (and above) uses of LDP with the compare-and-swap loop. I have observed this on clang versions 13, 14, and 15 (albeit with a different implementation using sync), but not on clang 16, as the caspal instruction has acquire-release semantics and does not allow the forbidden outcome.

My recommendation is to follow Wilco's fix in GCC and emit a barrier before the ldp [2].

[1] https://www.youtube.com/watch?v=xn4jtXOGfKg
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108891

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-aarch64

Wilco1 commented 1 year ago

Note that the SEQ_CST load might be in a function built for v8.4, while the STLXR is in another function built for v8.0 - i.e. the bug exists in all LLVM versions. Atomics are part of the ABI - all possible combinations of atomic sequences you can emit with different options or encounter from other compilers (GCC!) must work correctly with each other.

efriedma-quic commented 1 year ago

CC @tmatheson-arm @lenary.

We don't emit patch releases for older versions of clang, so there isn't really anything we can do in that respect. But there's still an issue mixing code built for armv8.4 with code built for armv8.0, I think?

lukeg101 commented 1 year ago

I was chatting with Wilco and we came up with a few scenarios regarding mixing code. Tomas and I can look into these, but for this issue there is one example that motivates fixing the above bug in trunk.

Consider a case where the function P0 above does a write and a read, but those accesses are in different functions. Those functions are compiled separately, i.e. for different architectures (e.g. the store is emitted as a CAS loop on v8.0 and the load as an LDP on v8.4); see the sketch below. The ABI demands these implementations work together, but when the CPU runs P0 with the two implementations, the accesses can still be reordered and the above bug observed.
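As a hypothetical illustration (function names and registers invented), the store and the load could live in separately compiled objects like this:

// store_tu.o, built for v8.0: the 128-bit SEQ_CST store becomes a loop
do_store:                      // pointer in x0, value in x2:x3
store_loop:
  ldaxp  x8, x9, [x0]          // exclusive read of the old value (discarded)
  stlxp  w4, x2, x3, [x0]      // store-release exclusive of the new value
  cbnz   w4, store_loop
  ret

// load_tu.o, built for v8.4 (LSE2): the 128-bit SEQ_CST load is a bare ldp
do_load:                       // pointer in x0
  ldp    x0, x1, [x0]
  dmb    ish
  ret

Both lowerings are individually valid, but when one thread calls do_store and then do_load, the ldp can still be satisfied before the stlxp completes.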

Wilco1 commented 1 year ago

Indeed. My point was that being locally correct is not sufficient: reordering of atomics will happen across functions that were built with different options or compilers.

It looks like this bug also exists on AArch32, but for all atomic sizes (the v7 SEQ_CST load does not start with a DMB and thus can be reordered with a v8 STL/STLEX).

efriedma-quic commented 1 year ago

For comparison, on Arm64 Windows, we emit a barrier after sequentially consistent store/rmw atomic operations to avoid reordering.

lukeg101 commented 1 year ago

As I received a few questions regarding the above, here is an AArch64 litmus test that demonstrates the fix. If we use an LDAR ahead of the LDP, we restore sequential consistency:

AArch64 SB

{ uint64_t 0:X2 = y; int64_t x[2] = {0,0}; 0:X4 = x; 0:X5 = 1; 0:X6 = 1;
  uint64_t 1:X2 = x; int64_t y[2] = {0,0}; 1:X4 = y; 1:X5 = 1; 1:X6 = 1; }

P0                        | P1                        ;
DMB ISH                   | DMB ISH                   ;
loop1: LDAXP X8, X9, [X4] | loop2: LDAXP X8, X9, [X4] ;
STLXP W7, X5, X6, [X4]    | STLXP W7, X5, X6, [X4]    ;
CBNZ W7, loop1            | CBNZ W7, loop2            ;
LDAR X3, [X2]             | LDAR X3, [X2]             ;
LDP X0, X1, [X2]          | LDP X0, X1, [X2]          ;
DMB ISH                   | DMB ISH                   ;

exists (0:X0=0 /\ 0:X1=0 /\ 1:X0=0 /\ 1:X1=0)

The outcome in the exists clause (similar to the one in the Godbolt link) is forbidden by the AArch64 model.

The use of LDAR constrains the execution between any CAS loop and the LDP so that the reordering is forbidden.

I do, however, note that if you use the more recent LDAPR instruction, the outcome is allowed - a bug! (Don't use LDAPR here.) This is because LDAPR constrains executions per location (i.e. x), whereas LDAR constrains executions across locations (i.e. x and y above).

In summary, the fix should be: LDP; DMB -> LDAR; LDP; DMB

Wilco1 commented 1 year ago

Thanks for the confirmation. Indeed, LDAPR should not be used with sequential consistency.

Note that the most efficient sequence is LDAR; LDP; DMB ISHLD, since this imposes the fewest constraints on memory reordering (in particular, earlier relaxed stores can safely be reordered after this sequence).
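To make that concrete, here is a hedged sketch of that proposed lowering for a 128-bit SEQ_CST load; the register choices are illustrative, not actual compiler output:

// Proposed 128-bit SEQ_CST load under LSE2 (address in x2):
  ldar   x8, [x2]      // RCsc acquire: cannot be satisfied before an earlier
                       // stlr/stlxp, unlike a plain ldp (or an ldapr)
  ldp    x0, x1, [x2]  // the actual single-copy-atomic 128-bit load
  dmb    ishld         // load-only barrier: later accesses stay after the
                       // load, while earlier relaxed stores may still sink
                       // past the whole sequence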