[Armv7-a]: Sequentially Consistent Load Allows Reordering of Prior Store when Implementations are Mixed

lukeg101 commented 1 year ago

Consider the following litmus test that has buggy behaviour:

C test

{ x = 0; y = 0 }

P0 (atomic_int *x, atomic_int *y) {
  atomic_fetch_add_explicit(x,1,memory_order_seq_cst);
  int32_t r0 = atomic_load_explicit(y,memory_order_seq_cst);
}

void P1 (atomic_int *x, atomic_int *y) {
  atomic_store_explicit(y,1,memory_order_seq_cst);
  int32_t r0 = atomic_load_explicit(x,memory_order_seq_cst);
}

exists 0:r0 = 0 /\ 1:r0 = 0

where 'P0:r0 = 0' means thread P0, local variable r0 has value 0

When simulating this test under the C/C++ model from its initial state, the outcome of execution in the exists clause is forbidden by the source model. The allowed outcomes are:

{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }

When compiling P1, to target armv8-a cortex-a15 (to dmb;str;dmb;ldr) using clang trunk (https://godbolt.org/z/v6cso7zq5), compiling the load of y on P0 to target armv7-a cortex-a15 (ldr;dmb) using clang trunk, and compiling the fetch_add of x on P0 for armv8-a cortex-m55 (also ldrex;strex;cbnz loop). The compiled program has the following outcomes when simulated under the aarch32 model:

{ P0:R3=0; P1:R0=0; }     <--- forbidden by source, bug!
{ P0:R3=0; P1:R0=1; }
{ P0:R3=1; P1:R0=0; }
{ P0:R3=1; P1:R0=1; }

which occurs since the sequentially consistent atomic load of y on P0 is compiled to ldr;dmb which does not have a leading dmb fence, and can thus be reordered before the STLEX on P0. Allowing the buggy outcome forbidden by the source.

ARM test

{ int x = 0; int y = 0;
  uint32_t %P0_x=x; uint32_t %P0_y = y;
  uint32_t %P1_x=x; uint32_t %P1_y = y;
  1:R2 = 1; }

P0                       | P1;
label: ldaex r1, [%P0_x] | dmb ish;
  add r1, r1, #1         | str r2, [%P1_y] ;
  stlex r2, r1, [%P0_x]  | dmb ish;
  cbnz r2, label         | ldr r0, [%P1_x];
  ldr r3, [%P0_y]        | dmb ish;
  DMB ISH                |;

exists (0:r3=0 /\ 1:r0=0)

To fix this, add a dmb before the ldr when compiling that load: dmb;ldr;dmb to forbid the buggy behaviour:

ARM test

{ int x = 0; int y = 0;
  uint32_t %P0_x=x; uint32_t %P0_y = y;
  uint32_t %P1_x=x; uint32_t %P1_y = y;
  1:R2 = 1; }

P0                       | P1;
label: ldaex r1, [%P0_x] | dmb ish;
  add r1, r1, #1         | str r2, [%P1_y] ;
  stlex r2, r1, [%P0_x]  | dmb ish;
  cbnz r2, label         | ldr r0, [%P1_x];
  DMB ISH                | ;
  ldr r3, [%P0_y]        | dmb ish;
  DMB ISH                | ;

exists (0:r3=0 /\ 1:r0=0)

which no longer allows the buggy outcome under the aarch32 model:

{ P0:R3=0; P1:R0=1; }
{ P0:R3=1; P1:R0=0; }
{ P0:R3=1; P1:R0=1; }

This bug would not have been caught in normal execution, but only when multiple implementations are mixed together.

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-aarch64

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-arm

lukeg101 commented 1 year ago

@efriedma-quic This is an example of one of those tricky bugs that occur when atomics implementations are mixed

efriedma-quic commented 1 year ago

This looks like it's basically the same as https://reviews.llvm.org/D141748, except for 32-bit.

Wilco1 commented 1 year ago

Atomics are part of the ABI. The v7 and v8 model have to be compatible because we don't disallow such mixing. Note you can also get mixing between different compilers, compiler versions, ARM/Thumb-1/Thumb-2.

Wilco1 commented 1 year ago

This looks like it's basically the same as https://reviews.llvm.org/D141748, except for 32-bit.

It also happens in AArch64: https://godbolt.org/z/ovjWG3dsK (as reported in https://github.com/llvm/llvm-project/issues/62652)

lukeg101 commented 1 month ago

My colleagues have pointed out that a simpler test can induce this bug, consider the store buffering test below without RMW ops:

C test

{ x = 0; y = 0 }

P0 (atomic_int *x, atomic_int *y) {
  atomic_store_explicit(x,1,memory_order_seq_cst);
  int32_t r0 = atomic_load_explicit(y,memory_order_seq_cst);
}

void P1 (atomic_int *x, atomic_int *y) {
  atomic_store_explicit(y,1,memory_order_seq_cst);
  int32_t r0 = atomic_load_explicit(x,memory_order_seq_cst);
}

exists 0:r0 = 0 /\ 1:r0 = 0

which has the following outcomes under the C/C++ model:

{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }

If the store on P0 is compiled for Armv8 and the load for Armv7 (https://godbolt.org/z/Td4TThTan) then we get the following test:

ARM fig7

{ x = 0; y = 0;
  uint64_t %P0_x = x; uint64_t %P0_y = y;
  uint64_t %P1_x = x; uint64_t %P1_y = y; }

P0                       | P1;
  MOV R1, #1             | MOV R1, #1;
                         | DMB ISH;
  STL  R1, [%P0_x]       | STR R1, [%P1_y];
                         | DMB ISH;
  LDR R0, [%P0_y]        | LDR R0, [%P1_x];
  DMB ISH                | DMB ISH;

exists (P0:R0 = 0 /\ P1:R0 = 0)

which has the following outcomes under either the Armv7 model or the Armv8 (AArch32) model:

{ P0:R0=0; P1:R0=0; } <-- additional outcome, bug!
{ P0:R0=0; P1:R0=1; }
{ P0:R0=1; P1:R0=0; }
{ P0:R0=1; P1:R0=1; }

Once again this is because the effects of executing the LDR can reorder before the STL instruction, since the LDR does not have ordering semantics with respect to the STL.

To fix this, add a fence before the LDR instruction.

llvm / llvm-project

[Armv7-a]: Sequentially Consistent Load Allows Reordering of Prior Store when Implementations are Mixed #65541