clang vs gcc difference in armv6m vs armv7m inline atomic codegen

m-gupta commented 1 year ago

Our firmware teams have been looking into evaluating clang for their products. They recently reported one difference between Clang and GCC for armv6m related to inline atomic vs libcall.

$ cat atomic_load.c
typedef int atomic_t;
typedef atomic_t atomic_val_t;

atomic_val_t atomic_get(const atomic_t *target)
{
        return __atomic_load_n(target, __ATOMIC_SEQ_CST);
}

GCC generated code:

$ arm-none-eabi-gcc  -o atomic_load_cb.o atomic_load.c -c -O1 -march=armv6-m
00000000 <atomic_get>:
       0: bf f3 5b 8f  dmb ish
       4: 00 68        ldr r0, [r0]
       6: bf f3 5b 8f  dmb ish
       a: 70 47        bx lr

Clang generated code:

$ clang --target=arm-none-eabi -o atomic_load_cb.o atomic_load.c -c
-O1 -march=armv6-m

00000000 <atomic_get>:
       0: 80 b5        push {r7, lr}
       2: 00 af        add r7, sp, #0
       4: 05 21        movs r1, #5
       6: ff f7 fe ff  bl 0x6 <atomic_get+0x6>    @ imm = #-4
00000006:  R_ARM_THM_CALL __atomic_load_4 // call instead of a load
       a: 80 bd        pop {r7, pc}

No libcall is generated for armv7-m. This difference is coming from following LLVM code: https://github.com/llvm/llvm-project/blob/main/clang/lib/Basic/Targets/ARM.cpp#L138

  bool ShouldUseInlineAtomic =
      (ArchISA == llvm::ARM::ISAKind::ARM && ArchVersion >= 6) ||
      (ArchISA == llvm::ARM::ISAKind::THUMB && ArchVersion >= 7);

Is it correct to restrict it for thumb >=7 (v6-m IIUC is thumb2 only)?

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-arm

efriedma-quic commented 1 year ago

See also #58184. See also https://llvm.org/docs/Atomics.html#libcalls-atomic

armv6m does not have ldrex/strex (armv7m does). In general, on targets that don't have atomic cmpxchg, all atomic operations need to go through __atomic_* libcalls, so any required locking is self-consistent.

The alternative here is that we assume that you're targeting an operating system that provides lock-free atomic emulation, via __sync_val_compare_and_swap_4 etc.. Apparently gcc does this. What environment are you using? Does it provide those routines?

m-gupta commented 1 year ago

@thughes for the details.

thughes commented 1 year ago

What environment are you using? Does it provide those routines?

We're building for a baremetal environment: https://issuetracker.google.com/254710459.

__sync_val_compare_and_swap_4

Should this be in compiler-rt? Or implemented in the OS that is being used?

It doesn't appear to be in either of those:

arm-none-eabi-nm /usr/lib64/clang/15.0.0/lib/baremetal/libclang_rt.builtins-armv6m.a | grep sync

~/chromiumos/src/platform/ec $ git grep '__sync_val'

It also looks like there is some confusion around naming: https://llvm.org/docs/Atomics.html#libcalls-atomic. In the example in the first comment, the assumption is that __atomic_load_n is a builtin coming from the compiler per the gcc "standard": https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html.

efriedma-quic commented 1 year ago

Should this be in compiler-rt? Or implemented in the OS that is being used?

LLVM doesn't provide either atomic or sync routines for armv6m because it's impossible to implement them without making strong assumptions about your environment. It requires either messing with the interrupt mask (i.e. delaying the delivery of interrupts), or making the interrupt handler handle preemption of atomic operations in some special way, or assuming atomics don't actually need to be atomic.

We could consider giving up on the "don't mess with the interrupt mask" constraint on armv6m. Maybe behind a compiler-rt CMake option. I might be overestimating the number of people that would care about that.

It also looks like there is some confusion around naming

You're looking at the wrong gcc documentation. The correct link is https://gcc.gnu.org/wiki/Atomic/GCCMM/LIbrary . (The documentation you're looking at describes the C builtins, which have confusingly similar names, but aren't the same thing.)

thughes commented 1 year ago

The documentation you're looking at describes the C builtins, which have confusingly similar names, but aren't the same thing

Yes, I am confused. I assumed that when writing code (as in the first comment and http://b/254710459) that calls __atomic_load_n(target, __ATOMIC_SEQ_CST), that I'm calling the C builtins. Is that not the case?

efriedma-quic commented 1 year ago

Yes, when you're writing code, you're calling the C builtins. Depending on the target, those get translated to either inline code, __sync_* calls, or __atomic_* calls (general-purpose library routines that back the C builtins).

The __atomic_* calls are normally implemented in some shared library, like libatomic on Linux. compiler-rt has a static implementation suitable for embedded, but it's off by default because we don't want to make assumptions about the user's environment. It also won't compile on armv6m because it requires lock-free atromics.

thughes commented 1 year ago

We could consider giving up on the "don't mess with the interrupt mask" constraint on armv6m. Maybe behind a compiler-rt CMake option. I might be overestimating the number of people that would care about that.

This sounds like a reasonable approach.

Looking at the code generated by gcc, do we even need to disable interrupts for __atomic_load_n?

00000000 <atomic_get>:
       0: bf f3 5b 8f  dmb ish
       4: 00 68        ldr r0, [r0]
       6: bf f3 5b 8f  dmb ish
       a: 70 47        bx lr

I think the assumption is that there are no multicore armv6m processors.

efriedma-quic commented 1 year ago

If you assume that cmpxchg for a given width is "lock-free" (has a native instruction, or implemented by messing with the interrupt mask, or using a special interruptible sequence like Linux supports, or something similar), then load and store can be lowered to plain ldr/str plus whatever barriers are necessary.

The dmb+ldr+dmb sequence is exactly what we use on armv7. You can skip the dmb instructions if you assume single-core.

thughes commented 1 year ago

I guess we would need to disable interrupts for accesses to values > 4 bytes since those loads require multiple instructions.

Looks like gcc doesn't have an implementation for __atomic_load_8:

typedef unsigned long long atomic_t;
typedef atomic_t atomic_val_t;

static inline atomic_val_t atomic_get(const atomic_t *target)
{
        return __atomic_load_n(target, __ATOMIC_SEQ_CST);
}

int main(int argc, const char* argv[])
{
        atomic_t val = 42;
        atomic_val_t ret = atomic_get(&val);
        return 0;
}

arm-none-eabi-gcc  -o atomic_load_cb.o atomic_load_64.c  -O1 -march=armv6-m

/usr/x86_64-pc-linux-gnu/arm-none-eabi/binutils-bin/2.36.1/ld.bfd.real: /usr/lib/gcc/arm-none-eabi/10.2.0/../../../../arm-none-eabi/lib/thumb/libc.a(lib_a-exit.o): in function `exit':
exit.c:(.text.exit+0x1a): undefined reference to `_exit'
/usr/x86_64-pc-linux-gnu/arm-none-eabi/binutils-bin/2.36.1/ld.bfd.real: /tmp/ccGoP5Ed.o: in function `main':
atomic_load_64.c:(.text+0x10): undefined reference to `__atomic_load_8'
collect2: error: ld returned 1 exit status

In my case I'd be happy if clang behaved the same way that gcc behaves.

RossSmyth commented 1 year ago

Using lock-free atomics on armv6m is supported in LLVM since b1b1086973d5 landed. Rust uses this feature by default, but I'm not sure what the incantation for using it in Clang is. Below shows the code output of Rust versus Clang with my guess on the CLI flags, but it does not emit what I would expect (the same as the Rust output). But those in this issue should be happy to hear that this has support in LLVM somewhere.

https://godbolt.org/z/x33nqdbez

But if this available to be used from Clang in some way, this issue can probably be closed as solved.

lukeg101 commented 9 months ago

Hi @RossSmyth can you point to a place where v6 atomics are used by the Rust folks? the patch you link states they are supported only if the user also provides the implementation

nikic commented 9 months ago

@RossSmyth It looks like clang is explicitly generating a call to __atomic_load_4, so LLVM's atomic lowering never even comes into play: https://godbolt.org/z/vKa6K5KeY No idea why Clang does this.

RossSmyth commented 9 months ago

@lukeg101 I am unsure where you are reading that, but it says that the user is responsible for generating any __sync implementations. Which Rust does by not generating any and erroring on operations that would require that (CAS, RMW, ...). thumbv6 loads and stores can be (are? They are on Cortex M0 at least) atomic, so no calls are needed. Rust sets the feature here https://github.com/rust-lang/rust/blob/afe67fa2ef1bb0dcf9077d55cdd1c64a9969c9cf/compiler/rustc_target/src/spec/thumbv6m_none_eabi.rs#L18

@nikic Ah, thanks for the insight! I didn't think to look at it. I don't care enough to dive in to Clang to fix it since GCC is good enough for my purposes.

RossSmyth commented 9 months ago

My guess is it is this line or something https://github.com/llvm/llvm-project/blob/8fd02d540fbbec1edef4e426e75521bf71509020/clang/lib/Basic/Targets/ARM.cpp#L141-L148

RossSmyth commented 1 month ago

Good new! Something has changed in a good way

ARM's Clang fork emits lock-free atomics by default (my guess is this patch) :)
Upstream Clang still does not emit them by default :(
BUT it is possible to pass -Xclang -target-feature -Xclang +atomics-32 and it works correctly :)

https://godbolt.org/z/E3nzvbajP

It is not perfect yet, so I would not close this issue quite yet. Clang still issues a warning saying it is not lock-free, even when calls to libatomic are not emitted. It would also be correct to emit lock-free loads & stores by default.

After hunting around, I believe it was #73176 that did this?

llvm / llvm-project

clang vs gcc difference in armv6m vs armv7m inline atomic codegen #58603