Amanieu commented 2 years ago

Some embedded targets (thumbv6m, riscv32i) don't support general atomic operations (compare_exchange, fetch_add, etc) but do support atomic load and store operations. We previously supported this on thumbv6m (but not riscv32i, see #98333) by cfging out most of the methods on Atomic* except for load and store. However recent changes to LLVM (#99595) cause it to emit calls to libatomic for load and store (it already does this for the other atomic operations).

It seems that LLVM's support for lowering atomic loads and stores on ARM was an accident that is being reverted. However the reason for this revert is that the directly lowered load/store cannot interoperate correctly with the other atomic operations which are lowered to libcalls. This concern doesn't apply to Rust since we don't expose emulated (read: not lock free) atomics, so there is no interoperability concern (except maybe for FFI with C that uses atomics?).

I see 2 ways we can move forward:

Continue supporting atomic load/store on such targets by lowering them in rustc to a volatile load/store. This will avoid breakage in the embeded Rust ecosystem.
Remove support for atomic load/store on targets that don't support full atomics. Users are expected to switch to using volatile load/store instead.

jamesmunns commented 2 years ago

Just wanted to add my $0.02, option two, "Remove support for atomic load/store on targets that don't support full atomics", would be a breaking change to (at least) one Tier 2 target: thumbv6m-none-eabi.

It would break a non-trivial number of embedded projects (there are lots of Cortex-M0+ cores out there, including the very popular RP2040), and might not be picked up by a crater run, as crate may not exercise specific targets during the run.

jamesmunns commented 2 years ago

Also just as a note, we might want to check in that armv5te still builds with these relevant changes. It has "native" load/stores, but no CAS operations. These are provided by an OS-level polyfill for the linux target.

It's likely their load/stores will also be affected by any change here.

Amanieu commented 2 years ago

AFAIK armv5te should not be affected by this change, but we should double check with the upgrade to LLVM 15.

adamgreig commented 2 years ago

Currently the core::ptr docs say:

All accesses performed by functions in this module are non-atomic in the sense of atomic operations used to synchronize between threads. This means it is undefined behavior to perform two concurrent accesses to the same location from different threads unless both accesses only read from memory. Notice that this explicitly includes read_volatile and write_volatile: Volatile accesses cannot be used for inter-thread synchronization.

One downside of requiring users on these platforms to use volatile ops instead is that (in Rust) volatile is a property of an access (so a static UnsafeCell/static mut could still be accessed with a non-volatile operation) while the Atomic types only have atomic operations (via AtomicU8, AtomicBool, etc), and I don't think there's any appetite for VolatileU32 or similar.

There is the atomic-polyfill crate which is quite popular in embedded Rust and provides CAS operations on thumbv6 etc while directly forwarding core::sync::atomic on other platforms. If option 2 was chosen, probably this crate could help smooth out a transition.

taiki-e commented 2 years ago

the atomic-polyfill crate

That is unsound on multi-core systems. (See https://github.com/tokio-rs/bytes/issues/461#issuecomment-1072028462 for an approach to handle such cases soundly.)

adamgreig commented 2 years ago

That is unsound on multi-core systems.

I thought it used the critical-section crate, which requires users to provide a critical section implementation for their system - so on a multi-core system you need to provide something with the appropriate guarantees, and it won't compile if no implementation is provided. It doesn't simply disable all interrupts on the CM0 core to obtain the CAS locks.

Edit: The currently-released critical-section 0.2.7 does by default provide a "disable all interrupts" critical section for thumbv platforms, which is unsound on multi-core, though users on multi-core systems can still provide their own implementation which is sound. The next release of critical-section removes all the built-in implementations entirely, and it's expected most platform support crates will provide an implementation instead, which should be sound for their respective platform. In any event I don't think it gets in the way of having atomic-polyfill provide replacement Atomic types that can do load/store on thumbv6 if need be.

apiraino commented 2 years ago

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-critical +T-compiler

taiki-e commented 2 years ago

requires users to provide a critical section implementation for their system - so on a multi-core system you need to provide something with the appropriate guarantees, and it won't compile if no implementation is provided

Note that https://github.com/embassy-rs/critical-section/commit/a3cdfd8e27a8fd823492c1c04ab28dd1be82f1f3 has not been released yet. The releases available on crates.io do not work this way.

And even if it were released, atomic-polyfill is still unsound on multi-core systems because it mixes native atomic load/store and critical-section.

taiki-e commented 2 years ago

FWIW, my portable-atomic crate implements the equivalent of the atomic types that was be removed in https://github.com/rust-lang/rust/pull/99595 (by using inline assembly). It also provides equivalents on other platforms where atomic load/store is not provided (riscv without A-extension, msp430, avr).

RalfJung commented 2 years ago

Remove support for atomic load/store on targets that don't support full atomics. Users are expected to switch to using volatile load/store instead.

That seems in conflict with us telling people for years now that volatile operations are not atomic. And they aren't, so this strategy risks subtle miscompilations.

taiki-e commented 2 years ago

using volatile load/store

If we implement this in the compiler (i.e., the first way that Amanieu mentioned), I think it is safe to use volatile load/store -- If there are tests to ensure that these operations are lowered properly by LLVM, we can guarantee compiler that abusing UB of using non-atomic volatile load/store as atomic load/store will never be shipped.

RalfJung commented 2 years ago

If we implement this in the compiler (i.e., the first way that Amanieu mentioned), I think it is safe to use volatile load/store

If we do this after LLVM, yes. But if we tell LLVM that these are (implicitly non-atomic) volatile accesses, it might optimize accordingly. Checking how LLVM lowers them is not enough, it does not take into account the effect of optimizations. So at that point inline assembly would be the safer choice.

nikic commented 2 years ago

I think the ideal solution would be to add a new target feature to LLVM, which makes the promise that no atomic CAS will be used, and thus allow lowering atomic load/store without libcalls (I've commented to that effect here: https://reviews.llvm.org/D120026#3674333). That would restore the status quo.

We should probably also document the special assumptions made by this target (or any other with "only load/store atomics"), which should boil down to:

Linking against non-Rust code using atomics is not supported (and may be silently unsound).
Atomic CAS can only be safely emulated on single-core systems.
To support atomic CAS on multi-core systems, a custom target spec that disables all atomics should be used and all atomic operations should go through a polyfill (with a shared lock) instead.

nikic commented 2 years ago

Dropping regression label and downgrading priority as the change is reverted now. We still need a solution for the LLVM upgrade of course.

Quoting Eli's response from https://reviews.llvm.org/D120026#3674604:

I see two possible approaches to restore the functionality you want.

One, you could add a target feature that says "I have 32-bit (or 64-bit) atomics". That would override the normal rule for setMaxAtomicSizeInBitsSupported. The target environment would be expected to provide whatever _sync* functions are necessary. If you don't actually use any atomic operations that require CAS, you won't see any calls, so everything works even if the functions don't actually exist. (Or it's possible to write an operating system that implements lock-free CAS on Cortex-M0, the same way the Linux kernel implements atomics on old ARM cores.)

Two, you could relax the constraint that atomic load/store are required to be atomic with respect to cmpxchg. We can add a value to syncscope to represent this. This would allow mixing code that uses load-store atomics with code that requires "real" atomics.

That said, I'm not sure how you use load-store atomics in practice. I mean, I guess you can use Dekker's alogorithm, or some kinds of lock-free queues, but that seems a bit exotic. Or do you use some sort of lock implemented by disabling interrupts?

I think option one is what we want, and it seems like a pretty elegant solution. With a custom target spec one could even set atomic_cas: true, as long as one provides the necessary __sync libcalls.

I am kinda curious about the last question though -- what do people actually use pure load/store atomics for?

Amanieu commented 2 years ago

I believe load/store atomics are mainly used for synchronization with an interrupt handler. Somewhat like volatile sig_atomic_t in C, hence my suggestion to lower it to volatile load/store.

adamgreig commented 2 years ago

I think almost all use cases are indeed synchronisation with interrupt contexts, but then again that's pretty much the only concurrency present in many bare-metal thumbv6-m systems. Usually multi-core Cortex-M0 systems (like RP2040) have to use some hardware-specific primitive for cross-core synchronisation, which is outside the scope of atomics.

Many uses are probably in application firmware which doesn't tend to be as easy to find on GitHub, but there are a couple of examples in widely used libraries:

heapless' SPSC queue is a lock-free queue using just atomic load/store (no CAS): https://github.com/japaric/heapless/blob/master/src/spsc.rs
bbqueue has a separate implementation for cortex-m0 that uses a disable-all-interrupts critical section (notably only sound on single-core) alongside atomic load/store operations to emulate CAS

I expect another significant use case is that atomics provide a safe interior-mutable type for a static, so if you want to coordinate a bool flag or an int or something with an interrupt, you need to use a static, and an Atomic* type lets you easily load/store it without using unsafe or a mutex or anything else. Since all the word-sized operations are atomic anyway you could just do this with a regular static u32, but that would need a static mut or an UnsafeCell etc. This is often in a context where it's not incorrect to only use load/store, for example a timer interrupt incrementing a timestamp which many other contexts read from (only one writer, many readers).

I'll see if I can find any other actual application examples too.

Lokathor commented 2 years ago

If you're looking for examples: on the GBA (ARMv4T / ARM7TDMI) I do exactly what you're describing with UnsafeCell so that the interrupt handler can communicate with the main program (gba_cell.rs). Currently this is technically unsound to mix with non-volatile accesses (though in practice there's never been an issue).

luojia65 commented 2 years ago

In RISC-V the ISA (micro architecture implementation) would provide different levels of atomic instructions, but different address range in internal bus would choose to support them or not. In this way when I develop RustSBI, I use inline assembly directly to avoid problems on atomic operations. I don't know if this is the case we should consider on when it comes to atomic loads and stores under generic instruction architecture.

jannic commented 2 years ago

I am kinda curious about the last question though -- what do people actually use pure load/store atomics for?

There is also code like https://github.com/rp-rs/rp-hal/blob/main/rp2040-hal/src/rom_data.rs#L80-L92, where a data race would not really matter as the value written is always the same. I think in this case the AtomicU16 is only necessary because otherwise the concurrent access would be technically UB.

Or more general, code that doesn't care if it reads the old or the new value of some memory which is written by a different thread (ie. other core or interrupt). AFAICT on thumbv6-m that means that at the assembler level, simple (aligned) loads / stores up to 32 bit are sufficient, but at the rust level, without Atomic those would still be UB.

Nassiel commented 2 years ago

I am kinda curious about the last question though -- what do people actually use pure load/store atomics for?

In heap and global allocators is very frequent to use it. Being able to ensure that read and write are linearly executed is critical, while CAS is only necessary against interruptions in the case of single core chps, something "easily" fixable.

That's why I guess, many single cores don't implement natively the CAS and, I'd say all, multi-core CPUs have CAS, due to being impossible to ensure interruption disabling in all the cores at the same time.

If I understood the case right, Rust should be aligned with LLVM and the decision about ARM, first, to be coherent and secondly because I think most of us agree that this is the correct/expected behaviour if your CPU supports atomics, even if it's partially, to have atomics. Thanks for reading my 0,02€ :moneybag:

jannic commented 2 years ago

That's why I guess, many single cores don't implement natively the CAS and, I'd say all, multi-core CPUs have CAS, due to being impossible to ensure interruption disabling in all the cores at the same time.

rp2040 is multi-core and doesn't have CAS. Yes, it's strange.

Dirbaio commented 2 years ago

I am kinda curious about the last question though -- what do people actually use pure load/store atomics for?

I have a few! :)

AtomicPtr for storing pointers to async tasks to wake them. register is called from futures when polled, wake is called from interrupt context to wake them. (Embassy "compresses" wakers to just 1 ptr, instead of vtable+data ptrs)

https://github.com/embassy-rs/embassy/blob/84cffc751ac0a38e6736959e68d441b99138d6d0/embassy/src/waitqueue/waker.rs#L72

Global state in HALs, to share data between main and interrupt.

https://github.com/embassy-rs/embassy/blob/db685c04049449ac3e4f256f2e7e26dad550d94c/embassy-nrf/src/time_driver.rs#L115

Global state in firmware binaries, like "battery level", "is network up?", "is mains power present?", that fits into a bool or u32. Mainly to share it between tasks, for example to blink a status LED if network's down, or turn off a keypad backlight if mains power is out and we're on battery. Firmware binaries are usually closed source, so this use case is underrepresented in code you can find on GitHub. Libraries tend to be opensource but using global state in libs is a bad practice, while it's totally fine in firmwares (you only have one battery after all!).

My opinion is:

We should keep supporting atomic load/store on targets without CAS.
- It is a very useful feature, and very widely used.
- Targets with load/store but no CAS atomics are widespread and they'll continue to be so in the future. New chips with these targets keep being released. RP2040 (chip used in the Raspberry Pi Pico) is thumbv6m, released 2021. ESP32C3 is riscv32imc, released 2021 too.
"Use volatile" is not a good solution.
- Atomic ops can get optimized in ways that volatile can't. IIUC x.store(1); x.store(2) can be optimized to x.store(2) with atomics, not with volatile.
- It's unclear if it's sound. It's technically a data race. It might be sound in single-core thumbv6m if atomics are lowered to LDR/STR. What about multicore targets, what about targets with per-core dcaches? Maybe you need full memory fences to make that sound, which are slower than fine-grained atomics on targets like x86.
- Can't write portable code safely. For example, use volatile read/write on a u32, then port the code to a 16 bit target. It compiles but it's no longer atomic. Vs if you use real atomics you're protected by core "knowing" the 16bit target can't do atomic u32 load/store, build will fail and you can fix it.
- It needs unsafe.

RalfJung commented 2 years ago

Being able to ensure that read and write are linearly executed is critical,

Atomic reads and writes are not necessarily linearly executed. For example, two atomic 'relaxed' loads can be reordered. This is covered by the as-if rule; reordering them does not introduce any new behavior that was not already possible before. Atomic != Volatile.

If you make all your atomic accesses SeqCst, then they will be linearly executed as far as the program can see. But that's still not the same as volatile; if you look at program behavior with a debugger or on the memory bus, things can still happen in arbitrary order.

nikic commented 2 years ago

I've put up https://reviews.llvm.org/D130480 to add a target feature.

Nassiel commented 2 years ago

Being able to ensure that read and write are linearly executed is critical,

Atomic reads and writes are not necessarily linearly executed. For example, two atomic 'relaxed' loads can be reordered. This is covered by the as-if rule; reordering them does not introduce any new behavior that was not already possible before. Atomic != Volatile.

Sorry I should have written using the proper word: linearizability

RalfJung commented 2 years ago

What linearizability even means when using weak memory (anything but SeqCst) is a complicated subject under active research.

Lokathor commented 2 years ago

@nikic question: does that new feature also allow 16 and 8 bit (non-CAS) atomics on ARM?

Amanieu commented 2 years ago

I've put up reviews.llvm.org/D130480 to add a target feature.

This only solves the issue for ARM, but not other targets like RISC-V which also want load/store-only atomics (#98333). From a quick search of the target specs, AVR and MSP430 might also be affected.

workingjubilee commented 2 years ago

I would concur that whatever solution is arrived at, it should not be "we now ask Rust programmers who are not themselves writing core to use volatile_* to express atomicity". We should not confound volatile access with atomic access, as an operation can be separately volatile and atomic (and thus potentially we could need atomic_volatile_load in the future).

As far as the matter of Rust atomic interop with C atomics over FFI: That's still unsafe, and if we need to explain that there is actually another constraint on certain platforms, that doesn't seem to interfere with Rust code much. I suspect in practice a lot of Rust embedded code is "pure Rust". While Rust interop with C via FFI is part of the Rust language's story, we can afford to be a comparatively "bad citizen" in cases where our program is more likely to have already occupied the entire RAM and ROM anyways.

Dirbaio commented 2 years ago

Mixing C+Rust in embedded is somewhat common: writing Rust code on top of a C RTOS, or using C libraries (in both source or precompiled blob form) from chip vendors.

However sending pointers to atomics across the FFI boundary is not common I'd say. I think it should be sound to mix C+Rust atomics in the same binary, as long as only Rust touches Rust atomics and only C touches C atomics, right? If that's the case then IMO it's fine if they're incompatible.

Somewhat related: perhaps we could add an opt-in target feature to tell rustc to make atomics always available, lowering them to libcalls on archs without hardware support? This would have 2 advantages:

Atomics would become FFI-compatible across C/Rust.
Forking libs to make them use crates like portable-atomic or atomic-polyfill wouldn't be needed anymore. You could use any lib that uses core::sync::atomic on any arch, by enabling the target feature then supplying the __atomic_* symbols.

bjorn3 commented 2 years ago

perhaps we could add an opt-in target feature to tell rustc to make atomics always available, lowering them to libcalls on archs without hardware support?

There is code that depends on atomics being non-blocking. For example the futex based mutex implementation in libstd would deadlock if the atomic libcalls were to use libstd mutexes. Or because they are mapping the same atomics in two different processes and expect them to be atomic across process boundaries, which usage of locks doesn't guarantee.

Atomics would become FFI-compatible across C/Rust.

This is not true. There are targets where atomics have less alignment than the equivalent integers (for example 64bit atomics on x86). However in Rust atomics always have at least the alignment of the equivalent integers as necessary for the stable get_mut method. The unstable from_mut has the opposite requirement. That is the rust atomics must not be more aligned than the equivalent integers. Together this means the alignment between the two must match. I also don't think there is a guarantee _Atomic types in C only contain a value and not for example a lock.

nikic commented 2 years ago

I've put up reviews.llvm.org/D130480 to add a target feature.

This only solves the issue for ARM, but not other targets like RISC-V which also want load/store-only atomics (#98333). From a quick search of the target specs, AVR and MSP430 might also be affected.

It should be possible to implement similar support for other targets. I've put up a patch for RISCV: https://reviews.llvm.org/D130621

taiki-e commented 2 years ago

@Dirbaio

perhaps we could add an opt-in target feature to tell rustc to make atomics always available, lowering them to libcalls on archs without hardware support?

I have tried something similar to this in the past in portable-atomic. However, there was a problem when atomic builtins were implemented in a way that was incompatible with the guarantees provided by portable-atomic. (It was a case similar to the one bjorn3 mentioned.) There was also an unresolved issue about how to handle natively unsupported atomic types (e.g., 64-bit atomics on no-std or old arm targets. Whether providing lock-based fallback as usual or generating libcalls for all of them). (Currently, that patch is stalled, but if you or anyone has any ideas on this, I would like to support this in portable-atomic.)

@bjorn3

There are targets where atomics have less alignment than the equivalent integers (for example 64bit atomics on x86).

Did you mean "greater alignment than"? Atomic types always have the same alignment as size, but integers do not. (On most targets, this only happens with integers greater than usize (e.g., x86's u64, x86_64's u128, powerpc64's u128), but IIRC, it also happens with usize/{i,u}16 on avr.)

nikic commented 2 years ago

RISCV +forced-atomics support has landed, are there any other targets apart from ARM/RISCV where we would need something like this?

Amanieu commented 2 years ago

RISCV +forced-atomics support has landed, are there any other targets apart from ARM/RISCV where we would need something like this?

MSP430 and AVR.

Nassiel commented 2 years ago

RISCV +forced-atomics support has landed, are there any other targets apart from ARM/RISCV where we would need something like this?

Xtensa.

MabezDev commented 2 years ago

RISCV +forced-atomics support has landed, are there any other targets apart from ARM/RISCV where we would need something like this?

Xtensa.

The Xtensa backend we are maintaining here is unfortunately not upstream. Patches are stuck in review (If anyone with LLVM knowledge would like to help review here are the currently submitted patched: https://github.com/espressif/llvm-project/issues/4#issuecomment-568783525).

I will let my colleagues know about this change when they rebase next, Xtensa targets like the ESP32-S2 would need to take advantage of this feature.

cr1901 commented 2 years ago

MSP430 and AVR

I have said in the past that "MSP430 supports atomics because it has RMW insns besides load and store that can modify memory without being interrupted". The latter part is true, but the former isn't quite correct (because actually returning the old value requires an additional initial read). So in this case, MSP430 should do what armv6 and rv32i are doing.

Right now, MSP430 doesn't lower atomic operations correctly in LLVM (long story); the msp430-atomic crate directly uses inline assembly to implement a subset of rust's atomic intrinsics. However, I think a patch for lowering atomic loads/stores in line w/ armv6 and rv32i should be accepted. I'm actively looking into this.

Mark-Simulacrum commented 2 years ago

Dropping 1.64 milestone per https://github.com/rust-lang/rust/issues/99668#issuecomment-1193401083.

@nikic, does the LLVM 15 upgrade make this a regression in 1.65, now that https://github.com/rust-lang/rust/pull/99464 has landed?

nikic commented 2 years ago

@nikic, does the LLVM 15 upgrade make this a regression in 1.65, now that #99464 has landed?

The regression has been addressed, so this issue is now about allowing load/store atomics on additional targets.

Lokathor commented 2 years ago

load/store appears to still not work with ARMv4T, though it should

See https://github.com/rust-lang/rust/issues/101300

pnkfelix commented 1 year ago

T-compiler reviewed this as part of 2022 Q3 P-high review

What is current situation here? It seems like there are still unresolved questions about what we want to tell people to do;

but also, I see that @nikic added target changes on the LLVM side for ARM and RISC-V: https://reviews.llvm.org/D130480 and https://reviews.llvm.org/D130621 do we need to be leveraging the +atomic-32 and +forced-atomics features in some fashion within the rustc interface to LLVM in order to make progress here?

nikic commented 1 year ago

We already use +atomic-32 for relevant ARM/Thumb targets.

Once we update to LLVM 16, we can consider using +forced-atomics on RISC-V (it would allow landing the change from https://github.com/rust-lang/rust/pull/98333).

I think at this point in time, the only thing that can be done here is adding upstream LLVM support for +forced-atomics to more targets, like MSP430 and AVR mentioned above.

Lokathor commented 1 year ago

101300 is still open. The TLDR of that issue is that, at least on some targets, the fake atomic access has extremely bad performance, to the point that it might drive people to use inline assembly instead.

cr1901 commented 1 year ago

@nikic I have no bandwidth at this time to add LLVM support for +forced-atomics or lowering atomics for MSP430, but I'm not against the idea and neither is the maintainer last I checked.

If this is only simple thing that just drops some bits and lowers atomic stuff to normal loads / stores, then probably it could be fine

Right now, MSP430 doesn't lower atomic operations correctly in LLVM (long story)

This is still true.

pkubaj commented 1 year ago

powerpc (32-bit) is also affected.

pnkfelix commented 1 year ago

Discussed at 2023 Q1 P-high review

My current understanding is that there is no one objecting to adding support to +forced-atomics for the targets that need it, but no one is taking up the task of actually doing so for all such targets.

I'm not clear on who would be best to take ownership of that, but maybe that doesn't matter yet; the first order of business is to collectively agree that is our strategy going forward.

@Amanieu @nikic am I right in inferring that you're both aligned on the plan of leveraging +forced-atomics where we need it, and the main work items here are adding support for it upstream with LLVM? (And I guess maybe there are some efficiency issues with the current support, as flagged e.g. in #101300 ?)

nikic commented 1 year ago

We have +forced-atomics support for RISCV in LLVM 16, so we could give adding that to our target specs (and enabling atomic load/store) a try.

I believe the only real concern is that it makes Rust's atomics ABI-incompatible with C's atomics (unless the C code also uses +forced-atomics), but I think people value having atomics in Rust at all much more than being able to work with atomics across an ABI boundary.

cr1901 commented 1 year ago

I believe the only real concern is that it makes Rust's atomics ABI-incompatible with C's atomics (unless the C code also uses +forced-atomics)

I've been thinking about this; does anyone have a concrete example of where calling C code using libatomic from Rust code using Rust atomics is unsound/memory-unsafe (single core and multi core case )? Or is it more a "precaution that interop will subtly break"?

I can't visualize an example after thinking about it, because I already can't think of a way to share Rust atomics and use as if they were C _Atomic_s; Rust uses repr(C) on a newtype, and C _Atomics are not guaranteed to be the same size as the underlying type (do gcc and clang accommodate this?).

asl commented 1 year ago

We have +forced-atomics support for RISCV in LLVM 16, so we could give adding that to our target specs (and enabling atomic load/store) a try.

FWIW, I'm planning to find some spare time to address some atomic weirdness on MSP430 in LLVM.

rust-lang / rust

How should we expose atomic load/store on targets that don't support full atomics #99668

101300 is still open. The TLDR of that issue is that, at least on some targets, the fake atomic access has extremely bad performance, to the point that it might drive people to use inline assembly instead.