llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

C11 atomics always yielding library call with `-mcpu=cortex-m0plus` #42948

Open maribu opened 5 years ago

maribu commented 5 years ago
Bugzilla Link 43603
Version 9.0
OS other
CC @DougGregor,@efriedma-quic,@zygoloid

Extended Description

Using this minimum example code:

#include <stdatomic.h>

void foo(void);
extern atomic_int_least16_t bar;

void foo(void)
{
    atomic_store_explicit(&bar, 0x1337, memory_order_relaxed);
}

Compiling with clang -mcpu=cortex-m3 -mlittle-endian -mthumb -mfloat-abi=soft -target arm-none-eabi -Weverything -Werror -c -o test.o test.c works fine.

Compiling with clang -mcpu=cortex-m0plus -mlittle-endian -mthumb -mfloat-abi=soft -target arm-none-eabi -Weverything -Werror -c -o test.o test.c yields:

test.c:8:2: error: large atomic operation may incur significant performance
      penalty [-Werror,-Watomic-alignment]
        atomic_store_explicit(&bar, 0x1337, memory_order_relaxed);
        ^
/usr/lib/clang/9.0.0/include/stdatomic.h:118:31: note: expanded from macro
      'atomic_store_explicit'
#define atomic_store_explicit __c11_atomic_store
                              ^
1 error generated.

However, compiling with arm-none-eabi-gcc -mcpu=cortex-m0plus -mlittle-endian -mthumb -mfloat-abi=soft -Wall -Wextra -pedantic -Werror -c -o test.o test.c works fine and generates:

arm-none-eabi-objdump -d test.o

test.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <foo>:
   0:   b580        push    {r7, lr}
   2:   b082        sub sp, #8
   4:   af00        add r7, sp, #0
   6:   4b07        ldr r3, [pc, #28]    ; (24 <foo+0x24>)
   8:   607b        str r3, [r7, #4]
   a:   1cbb        adds    r3, r7, #2
   c:   4a06        ldr r2, [pc, #24]    ; (28 <foo+0x28>)
   e:   801a        strh    r2, [r3, #0]
  10:   1cbb        adds    r3, r7, #2
  12:   2200        movs    r2, #0
  14:   5e9b        ldrsh   r3, [r3, r2]
  16:   b29a        uxth    r2, r3
  18:   687b        ldr r3, [r7, #4]
  1a:   801a        strh    r2, [r3, #0]
  1c:   46c0        nop         ; (mov r8, r8)
  1e:   46bd        mov sp, r7
  20:   b002        add sp, #8
  22:   bd80        pop {r7, pc}
  24:   00000000    .word   0x00000000
  28:   00001337    .word   0x00001337

Therefore, I assume that the atomic store can be implemented without resorting to a library call to __atomic_store_2().

maribu commented 4 years ago

In case my arguments for simply assuming that __sync_fetch_and_add_2() and all its friends are not interruptible on bare-metal targets are not convincing: I'd very happily take a compiler flag to tell clang that the library functions do indeed disable interrupts.

Btw (but a bit off topic): if there were a flag to allow clang to just emit code that disables interrupts before the atomic operations and restores the previous interrupt state afterwards, that could also be very interesting for embedded developers. It would allow the compiler to group a number of atomic operations that can only be implemented by disabling interrupts, and the ROM size would shrink a bit, as interrupts would be disabled and restored less often. A flag to limit the number of atomic ops to group would be needed, though, to bound the worst-case interrupt latency for those wanting hard real-time capabilities.

maribu commented 4 years ago

The problem with arm-none-eabi is that we don't really know much about where the code is actually going to run.

Mind the -mcpu=cortex-m0plus flag that was given during compilation - so it is already known that the code currently running can simply disable interrupts, as there is no "user mode" vs. "kernel mode" on that platform. This means that, to the best of my knowledge, the most efficient implementation of __sync_fetch_and_add_2() and friends is to simply disable interrupts for the duration of the function. And also: GCC already enforces that the implementation has to disable interrupts, as otherwise concurrent lock-free operations might interfere with the operations implemented by __sync_fetch_and_add_2() and friends.

So: why not just also require that implementations of __sync_fetch_and_add_2() and friends disable interrupts on single-threaded, single-core embedded CPUs (no MMU, no kernel/user mode split)?

efriedma-quic commented 4 years ago

If a lock-free sequence does not exist for all possible atomic operations of a given width, the compiler must generate a call to __atomic_* for all atomic operations of that width. Otherwise, the locking mechanism in libatomic breaks.

The problem with arm-none-eabi is that we don't really know much about where the code is actually going to run. For arm-pc-linux-eabi, specifically, __sync_val_compare_and_swap_N where N=1,2,4,8 is implemented using a special kernel-assisted sequence. But we don't really want to make the same assumption for every bare-metal target.

Maybe we could control it with a flag.

maribu commented 4 years ago

Thanks for the reply.

The Cortex-M0+ is an embedded processor (single-threaded, single-core CPU), in which the only source of concurrency is interrupts. To my understanding, in the given context atomic_store_explicit(&bar, 0x1337, memory_order_relaxed); is already correctly implemented if a single machine instruction implements the store, as no ISR can interrupt in the middle of that instruction.

Resorting to __sync_fetch_and_add_2 as GCC does for atomic_fetch_add_explicit(&bar, 0x1337, memory_order_relaxed); is also correct to my understanding, as two machine instructions are needed to implement that.

(Well, you can technically make an arbitrary sequence "atomic" by disabling interrupts, but I don't think that really counts.)

This is exactly how __sync_fetch_and_add_2() is implemented in the context of RIOT 1. But the use of __atomic_store_2() should not be required for properly aligned stores - so why is it still used?

efriedma-quic commented 4 years ago

I think that's a gcc bug.

If you write something like the following with gcc:

void foo2(void)
{
    atomic_fetch_add_explicit(&bar, 0x1337, memory_order_relaxed);
}

It produces the following:

foo2:
        push    {r4, lr}
        ldr     r1, .L5
        ldr     r0, .L5+4
        bl      __sync_fetch_and_add_2
        pop     {r4, pc}

Meaning, it's assuming that there's some lock-free atomic sequence that can implement __sync_fetch_and_add_2. But there is no such sequence in a baremetal environment; cortex-m0 doesn't have the ll/sc instructions. (Well, you can technically make an arbitrary sequence "atomic" by disabling interrupts, but I don't think that really counts.)