Open m-gupta opened 1 year ago
@llvm/issue-subscribers-backend-arm
See also #58184. See also https://llvm.org/docs/Atomics.html#libcalls-atomic
armv6m does not have ldrex/strex (armv7m does). In general, on targets that don't have atomic cmpxchg, all atomic operations need to go through `__atomic_*` libcalls, so any required locking is self-consistent.
The alternative here is that we assume that you're targeting an operating system that provides lock-free atomic emulation, via `__sync_val_compare_and_swap_4` etc. Apparently gcc does this. What environment are you using? Does it provide those routines?
@thughes for the details.
> What environment are you using? Does it provide those routines?
We're building for a baremetal environment: https://issuetracker.google.com/254710459.
Should `__sync_val_compare_and_swap_4` be in compiler-rt? Or implemented in the OS that is being used?
It doesn't appear to be in either of those:

```
arm-none-eabi-nm /usr/lib64/clang/15.0.0/lib/baremetal/libclang_rt.builtins-armv6m.a | grep sync
~/chromiumos/src/platform/ec $ git grep '__sync_val'
```
It also looks like there is some confusion around naming: https://llvm.org/docs/Atomics.html#libcalls-atomic. In the example in the first comment, the assumption is that `__atomic_load_n` is a builtin coming from the compiler per the gcc "standard": https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html.
> Should this be in compiler-rt? Or implemented in the OS that is being used?
LLVM doesn't provide either atomic or sync routines for armv6m because it's impossible to implement them without making strong assumptions about your environment. It requires either messing with the interrupt mask (i.e. delaying the delivery of interrupts), or making the interrupt handler handle preemption of atomic operations in some special way, or assuming atomics don't actually need to be atomic.
We could consider giving up on the "don't mess with the interrupt mask" constraint on armv6m. Maybe behind a compiler-rt CMake option. I might be overestimating the number of people that would care about that.
> It also looks like there is some confusion around naming
You're looking at the wrong gcc documentation. The correct link is https://gcc.gnu.org/wiki/Atomic/GCCMM/LIbrary . (The documentation you're looking at describes the C builtins, which have confusingly similar names, but aren't the same thing.)
> The documentation you're looking at describes the C builtins, which have confusingly similar names, but aren't the same thing
Yes, I am confused. I assumed that when writing code (as in the first comment and http://b/254710459) that calls `__atomic_load_n(target, __ATOMIC_SEQ_CST)`, I'm calling the C builtins. Is that not the case?
Yes, when you're writing code, you're calling the C builtins. Depending on the target, those get translated to either inline code, `__sync_*` calls, or `__atomic_*` calls (general-purpose library routines that back the C builtins).
The `__atomic_*` calls are normally implemented in some shared library, like libatomic on Linux. compiler-rt has a static implementation suitable for embedded, but it's off by default because we don't want to make assumptions about the user's environment. It also won't compile on armv6m because it requires lock-free atomics.
> We could consider giving up on the "don't mess with the interrupt mask" constraint on armv6m. Maybe behind a compiler-rt CMake option. I might be overestimating the number of people that would care about that.
This sounds like a reasonable approach.
Looking at the code generated by gcc, do we even need to disable interrupts for `__atomic_load_n`?
```asm
00000000 <atomic_get>:
   0: bf f3 5b 8f   dmb   ish
   4: 00 68         ldr   r0, [r0]
   6: bf f3 5b 8f   dmb   ish
   a: 70 47         bx    lr
```
I think the assumption is that there are no multicore armv6m processors.
If you assume that cmpxchg for a given width is "lock-free" (has a native instruction, or implemented by messing with the interrupt mask, or using a special interruptible sequence like Linux supports, or something similar), then load and store can be lowered to plain ldr/str plus whatever barriers are necessary.
The dmb+ldr+dmb sequence is exactly what we use on armv7. You can skip the dmb instructions if you assume single-core.
I guess we would need to disable interrupts for accesses to values > 4 bytes since those loads require multiple instructions.
Looks like gcc doesn't have an implementation for `__atomic_load_8`:
```c
typedef unsigned long long atomic_t;
typedef atomic_t atomic_val_t;

static inline atomic_val_t atomic_get(const atomic_t *target)
{
	return __atomic_load_n(target, __ATOMIC_SEQ_CST);
}

int main(int argc, const char *argv[])
{
	atomic_t val = 42;
	atomic_val_t ret = atomic_get(&val);
	return 0;
}
```
```
$ arm-none-eabi-gcc -o atomic_load_cb.o atomic_load_64.c -O1 -march=armv6-m
/usr/x86_64-pc-linux-gnu/arm-none-eabi/binutils-bin/2.36.1/ld.bfd.real: /usr/lib/gcc/arm-none-eabi/10.2.0/../../../../arm-none-eabi/lib/thumb/libc.a(lib_a-exit.o): in function `exit':
exit.c:(.text.exit+0x1a): undefined reference to `_exit'
/usr/x86_64-pc-linux-gnu/arm-none-eabi/binutils-bin/2.36.1/ld.bfd.real: /tmp/ccGoP5Ed.o: in function `main':
atomic_load_64.c:(.text+0x10): undefined reference to `__atomic_load_8'
collect2: error: ld returned 1 exit status
```
In my case I'd be happy if `clang` behaved the same way that `gcc` behaves.
Using lock-free atomics on armv6m has been supported in LLVM since b1b1086973d5 landed. Rust uses this feature by default, but I'm not sure what the incantation for using it in Clang is. The link below shows Rust's output versus Clang's with my guess at the CLI flags; Clang does not emit what I would expect (the same as the Rust output). But those following this issue should be happy to hear that LLVM supports this somewhere.
https://godbolt.org/z/x33nqdbez
But if this is available to be used from Clang in some way, this issue can probably be closed as solved.
Hi @RossSmyth, can you point to a place where v6 atomics are used by the Rust folks? The patch you link states they are supported only if the user also provides the implementation.
@RossSmyth It looks like clang is explicitly generating a call to `__atomic_load_4`, so LLVM's atomic lowering never even comes into play: https://godbolt.org/z/vKa6K5KeY No idea why Clang does this.
@lukeg101 I am unsure where you are reading that, but it says that the user is responsible for providing any `__sync_*` implementations. Rust does so by not generating any and erroring on operations that would require them (CAS, RMW, ...). thumbv6 loads and stores can be (and on Cortex-M0 at least, are) atomic, so no calls are needed. Rust sets the feature here: https://github.com/rust-lang/rust/blob/afe67fa2ef1bb0dcf9077d55cdd1c64a9969c9cf/compiler/rustc_target/src/spec/thumbv6m_none_eabi.rs#L18
@nikic Ah, thanks for the insight! I didn't think to look at it. I don't care enough to dive in to Clang to fix it since GCC is good enough for my purposes.
My guess is it is this line or something https://github.com/llvm/llvm-project/blob/8fd02d540fbbec1edef4e426e75521bf71509020/clang/lib/Basic/Targets/ARM.cpp#L141-L148
Good news! Something has changed for the better. Passing `-Xclang -target-feature -Xclang +atomics-32` makes it work correctly :) https://godbolt.org/z/E3nzvbajP
It is not perfect, so I would not close this issue quite yet. Clang still issues a warning saying it is not lock-free, even when calls to libatomic are not emitted. It would also be correct to emit lock-free loads & stores by default.
After hunting around, I believe it was #73176 that did this?
Our firmware teams have been looking into evaluating clang for their products. They recently reported one difference between Clang and GCC for armv6m related to inline atomics vs. libcalls.
GCC generated code:
Clang generated code:
No libcall is generated for armv7-m. This difference comes from the following LLVM code: https://github.com/llvm/llvm-project/blob/main/clang/lib/Basic/Targets/ARM.cpp#L138
Is it correct to restrict it to thumb >= 7 (v6-m, IIUC, is thumb2 only)?