Closed: alexcrichton closed this issue 2 years ago
Then why not simply add `std::sync::atomic` to the list of modules that require `#[cfg]` trickery? This seems like the best solution at this point.
This doesn't block the portability lint: the lint can simply check that you have wrapped your atomic usage in `#[cfg(target_has_atomic)]`.
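A minimal sketch of that `#[cfg]` trickery (hypothetical code, not from this thread): gate the atomic fast path on native 64-bit atomic support and fall back to a `Mutex` elsewhere.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical sketch of the #[cfg] trickery under discussion (not code from
// this thread): the atomic fast path only exists on targets with native
// 64-bit atomics, and a Mutex<u64> stands in elsewhere.
#[cfg(target_has_atomic = "64")]
fn bump(counter: &AtomicU64) -> u64 {
    // Returns the previous value, like fetch_add does.
    counter.fetch_add(1, Ordering::Relaxed)
}

#[cfg(not(target_has_atomic = "64"))]
fn bump(counter: &std::sync::Mutex<u64>) -> u64 {
    let mut n = counter.lock().unwrap();
    let old = *n;
    *n += 1;
    old
}
```

Note that `target_has_atomic` was still unstable at the time of this comment, which is exactly what the rest of the thread is about.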
I don't know if this applies to anyone else, but as a user, I'm primarily interested in `AtomicU32`/`AtomicI32`, because there are lots of APIs that involve 32-bit atomic values on 64-bit platforms. If every platform has 32-bit CAS and therefore can support these types acceptably, couldn't they be stabilized immediately? :-)
If every platform also has smaller CAS, or if it's deemed acceptable to synthesize smaller atomics using "oversize" CAS, the smaller atomics could be stabilized immediately as well.
Basically it seems like the only truly non-portable case might be `AtomicI64`/`AtomicU64`, so perhaps only those types really need to wait for the portability lint to be sorted out. And since all the platforms I care about are 64-bit (I'll never run my proprietary code on 32-bit), I won't miss them because I can just use `AtomicUsize`/`AtomicIsize` instead.
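The "oversize" CAS idea can be sketched safely when the containing word is entirely ours, which sidesteps the out-of-bounds concerns raised later in the thread. The function below is illustrative, not a proposed std API: it keeps an 8-bit counter in the low byte of an `AtomicU32` and updates it with a 32-bit CAS loop.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Illustrative sketch (not a proposed std API): keep an 8-bit counter in the
// low byte of an AtomicU32 that we own entirely, and update it with an
// "oversize" 32-bit CAS loop. The upper three bytes are preserved.
fn fetch_add_u8(word: &AtomicU32, val: u8) -> u8 {
    let mut cur = word.load(Ordering::Relaxed);
    loop {
        let old_byte = (cur & 0xFF) as u8;
        let new = (cur & !0xFF) | u32::from(old_byte.wrapping_add(val));
        match word.compare_exchange_weak(cur, new, Ordering::SeqCst, Ordering::Relaxed) {
            Ok(_) => return old_byte,
            Err(actual) => cur = actual, // another thread won the race; retry
        }
    }
}
```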
It's one possible solution, yeah, to simply say types in `std::sync::atomic` are not platform-agnostic.
I wanted to get a grasp on what concrete portability story we're talking about, and I wasn't aware of any analysis done here recently, so I ran the compiler over a bunch of targets to see what instructions the various sized `swap` operations generated, and collected the results in a table.
The points of note are:

- Some targets don't have `swap`. IIRC it's just load/store and other more simplistic operations there.
- All types look to be available on all other platforms tested.
- This doesn't cover architectures like s390x, sparc, wasm, probably some arm variant, etc.
For targets that are "practically up there in their level of support" that's pretty bleak...
I think I would personally push back against simply saying the types are stable as-is today. We have no precedent for this sort of type with varying support across platforms (at least of this prominence and this level of support) being in libstd without a clear warning about portability.
We already got the portability problem wrong with SIMD, which now has tons and tons of warnings about how platform-specific it is, and this represents yet another portability hazard if it's in such a prominent place as `std::sync::atomic`.
I'm ok with the solution of moving these types to `std::arch` myself, however, as that has clear warnings about portability and is, I feel, the best we can do at this time.
@willmo empirically it looks like `AtomicU32` is indeed supported everywhere I tested, at least!
Regarding your question about ARM architectures: armv5te and thumbv6 targets don't support atomics, except that armv5te emulates them with Linux kernel support.
I disagree with your "(BUG)" comments: the ll/sc loop is the standard way of performing sub-word atomic operations on those platforms. It is just a more complicated version of the ll/sc loop used for word-sized atomic operations.
In short there are really only 3 categories of targets:

- targets that support all atomic operations and sizes,
- targets that only support some atomic operations or sizes,
- targets that don't support atomics at all.

With the last category, we already have a stabilized precedent for variations in support for `std::sync::atomic`: thumbv6 doesn't support atomics at all. Also I feel that moving atomic types to `std::arch::$arch` will actually make code less portable. Code using integer atomics will now have to be specialized for every architecture:
```rust
#[cfg(target_arch = "x86")]
use std::arch::x86::atomic::AtomicU32;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::atomic::AtomicU32;
// Oops, now this code will only work on x86 despite the fact that it would work
// just as well on ARM, PowerPC, MIPS, etc.
```
And even then, there is still varying atomic support within an architecture. This is particularly true for ARM, but it is also the case on x86_64: `AtomicU128` is supported on all x86_64 chips except the earliest ones from AMD.
In conclusion, I don't feel that moving atomic types to `std::arch` actually solves any problems, and instead introduces new ones. I feel that the current (unstable) situation of having atomic types conditionally available depending on the target is the best approach to take. Look at it this way: if a crate is found not to compile on some architecture due to missing atomic support, an issue will be opened on GitHub and the problem will be quickly solved.
Oh sorry yeah by "BUG" I meant that it didn't follow what I assumed to be our contract, that we only provide atomic types which match exactly with the architecture in question, excluding the fact that any smaller atomic operations can be implemented in terms of larger ones. It's fine for that to be a separable question, I don't mean for it to get in the way.
It's true that atomics on ARM are sort of odd! I'm not sure what to really do about that. That being said most of the platforms that don't have atomics are pretty low down on the platform support tiers, so we could relegate them to "unresolved questions" like targets without floats rather than having them block other designs.
It's true that moving these into `std::arch` would require special code for every architecture. A crate on crates.io, however, could reexport a portable interface which does all the multiplexing and has documented fallbacks, or options for what fallbacks should do on unsupported platforms.
I personally disagree that `std::arch` doesn't solve any problems, but I do agree that it creates an ergonomic barrier to using the types. I feel it clearly signals that these operations aren't 100% portable, as most of the rest of the standard library already is. These are already somewhat niche types, so I don't think their ergonomics are as important as those of `AtomicUsize` and friends.
Using `std::arch`, in my mind, is basically entirely centered around that portability signal.
From this discussion I conclude that there are basically three tiers of platform support for any atomic operation: native support, loop emulation, and mutex emulation.
Two directly conflicting goals are a) portability and b) protecting the programmer from a potential performance footgun (using emulated atomics over a more efficient solution).
I think these very different goals deserve different treatment. While an application targeted at a machine that only offers 32-bit atomics is almost always better off using 32-bit atomics instead of loop-emulated 16-bit atomics (potentially wasting some memory, though), I'd much rather have a library fall back to loop emulation than get a compile error; the slight performance impact is just not worth the despair of having to dive into a foreign codebase in order to fix portability errors.
Mutex emulation is less obvious considering that it can slow down an application by several orders of magnitude. But I would argue that from a portability standpoint ("I'm trying to use someone else's code on my platform") even this is generally acceptable.
Similar to how I don't get linter warnings for dependency crates, this is how I think the compiler should react to different kinds of atomic calls:
Compile type | Native support | Loop emulation | Mutex emulation |
---|---|---|---|
Local | fine | warning | warning |
Dependency | fine | fine | warning |
This gives me a heads up if a crate I'm depending on is going to be severely slower while avoiding accidental performance problems from my own code.
I generally agree with your sentiment about tiers of atomic support. But note that some popular RISC platforms, such as 32-bit ARM, require all atomics to be implemented using load-linked/store-conditional loops. I would be wary of linting on those, so it feels to me that such a lint should be allow-by-default.
Even on x86, some atomic operations are implemented using a `cmpxchg` loop, e.g. `fetch_and`. Implementing atomic operations using a loop is completely normal and should be treated the same way as native support.
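For illustration, here is roughly what such a compare-exchange loop for `fetch_and` looks like when written by hand (a sketch, not the actual std or backend implementation):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Sketch of how an RMW operation like fetch_and can be built from a
// compare-exchange loop, roughly mirroring what the backend emits on targets
// without a native instruction for it. Not the actual std implementation.
fn fetch_and_via_cas(a: &AtomicU32, val: u32) -> u32 {
    let mut current = a.load(Ordering::Relaxed);
    loop {
        match a.compare_exchange_weak(
            current,
            current & val,
            Ordering::SeqCst,
            Ordering::Relaxed,
        ) {
            Ok(old) => return old,           // old value, as fetch_and returns
            Err(actual) => current = actual, // lost a race; retry with new value
        }
    }
}
```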
The performance isn't actually the reason why we make a distinction between so called "lock-free" atomics and ones emulated with a mutex. If you are using atomics to share data between main code and a signal/interrupt handler then you must use lock-free atomics, otherwise your code may deadlock. This can happen if the interrupt happens while the mutex used for atomic emulation is locked.
Maybe a bit of extra terminology could help here. When atomics are implemented using CAS or LL/SC loops, they are lock-free but not wait-free.
In simple terms, lock-freedom means no single thread can block every other thread if (1) it only interacts with them via atomics and (2) the atomics are not used in a loop to implement a higher-level lock.
Wait-freedom, in contrast, means that a thread cannot be infinitely delayed by other threads hammering the atomic in a loop. This is obviously not true of an atomic operation that is implemented via a CAS or LL/SC loop.
Since we cannot provide wait-free atomics on some platforms, we may want to clarify in the documentation that atomics are only guaranteed to be lock-free, not wait-free.
If [`#[cfg(accessible(...))]`](https://github.com/rust-lang/rfcs/pull/2523) is added, I think it will cover all use cases of `#[cfg(target_has_atomic = "x")]`, as users can then check for the presence of the atomic types directly.
Then one use case I could think of, if smaller atomics are emulated and therefore always present but `target_has_atomic = "x"` is only true when there is native support, is using it to check for wait-freeness.
But from Amanieu's comment above, that would only tell whether `store()` is wait-free, so people would need to check that the specific operations they need are wait-free.
Now, `accessible(...)` hasn't even passed the RFC stage yet, but it seems like a much cleaner solution than exposing a `target_has_xxx` attribute for each kind of target feature.
Stabilizing the types without either `accessible(...)` or `target_has_atomic = "x"` being available would still be useful, as people can use `target_arch = "x"`, which, while less portable, offers much stronger guarantees.
I think that documenting that atomics are at least lock-free and possibly wait-free should be enough to move forward. An optional documentation feature that would make this pretty damn perfect: also documenting which platforms/atomic sizes are wait-free.
In general, I completely agree with @Amanieu that CAS loop (AKA lock-free but not wait-free) atomics would be considered "native atomic support" by basically everyone. But going above-and-beyond by documenting this should settle any concerns.
I'm nominating this for discussion at the next libs triage meeting, but to try to make progress on this discussion I'd like to separate out a few points. If others have thoughts on these (or other points), please let me know!
- `AtomicXYY` types, if they exist in libstd, guarantee they are lock-free.
- `AtomicXYY` types do not guarantee they are wait-free, and I think this is where we are today.
- It's unclear (to me at least) what to do about "emulation" of small size atomics using larger-sized atomics. For example emulating `AtomicU8` with `AtomicU32` operations.
  - It's unclear whether it's valid to emulate `AtomicU8` in terms of `AtomicU32` (in terms of LLVM guarantees and whatnot). This may only be a valid thing for LLVM's backend code generator to generate.
  - It's unclear whether we want `AtomicU8` to be implemented in terms of `AtomicU32`.
- Where should these types live? Some in `std::sync::atomic`? Others in `std::arch`? All in `std::sync::atomic`?

The main two options for placement of these types are:

- All placed in `std::arch` and exposed as they're available. This likely wouldn't stabilize the `target_has_atomic` cfg directive.
- All placed in `std::sync::atomic`, and the `target_has_atomic` cfg directive is also stabilized.

The second question is whether or not to stabilize the emulated atomics, and that's just a question of whether the APIs are stabilized or not.
`AtomicXYY` types do not guarantee they are wait-free, and I think this is where we are today
This is fine; keep in mind that C++11 atomics (which we are based on) make no guarantees about wait-freedom either.
- It's unclear (to me at least) what to do about "emulation" of small size atomics using larger-sized atomics. For example emulating `AtomicU8` with `AtomicU32` operations.
I think it's fine to defer this issue. Currently none of the built-in targets make use of the `min-atomic-width` attribute. This was only added in #38579 to support the out-of-tree OR1K target, and even then it should be possible to implement an emulation for those in compiler-builtins. In any case, this doesn't block stabilization as you mentioned at the end.
Ok we've discussed this in a recent @rust-lang/libs triage, and the conclusion was that the proposal to stabilize all these types as-is is probably the way to go. The stabilization would be coupled with documentation updates indicating that these aren't as portable as, say, `Add for u8`, but they're available on most platforms. Additionally it was concluded that stabilizing smaller-size atomics for platforms that only have larger-size atomics was fine to do.
I believe this is generally the trend of this thread anyway, so I'm going to open a dedicated thread and FCP this for stable.
Ok for those following along here, I've opened a formal proposal for stabilization at https://github.com/rust-lang/rust/issues/56753, feedback of course is always welcome!
smaller-size atomics for platforms that only have larger-size atomics
Just to be sure: the encoding of smaller-sized atomics in terms of larger-sized atomics happens in LLVM, as part of LLVM lowering to machine-specific IR or so? I maintain that doing this at the level of Rust, MIR or LLVM IR is illegal because of potential out-of-bounds accesses, and we shouldn't do it.
@RalfJung correct, that's what convinced me personally that we can't do this on crates.io, which means if we want it at all we need it in the standard library (via LLVM intrinsics). I think we want it, so I'm convinced to put it into libstd :)
@RalfJung This lowering is done either within LLVM, or through a function in compiler_builtins.
The latter is currently only used on armv5te-unknown-linux-gnu, and uses this code. It could be argued that this is UB since `intrinsics::atomic_load_unordered` could be used to read out-of-bounds data; however, this is guaranteed not to fault because it doesn't cross a page boundary.
@alexcrichton makes sense!
@Amanieu
It could be argued that this is UB
And the argument would be correct :)
is guaranteed not to fault because it doesn't cross a page boundary.
And as in the last N cases we have had this argument (and as I am sure you are aware, but not everybody else might be), that doesn't change anything about this being UB when we are talking about code written in Rust, MIR or LLVM IR. ;) (I am beginning to feel sorry for being so annoying about this, but LLVM is way too smart and getting smarter every day, so I am actively worried that such arguments will blow up in our face some day.)
Is this a pattern supported/intended by LLVM? Is there advice from the LLVM devs for how to do this?
Is there any chance of LLVM ever inlining those compiler-builtins functions? Actually even having them in the same translation unit could be enough to cause problems, because LLVM could infer attributes on the functions to propagate information about what they do out to use sites.
One safer alternative would be to use inline assembly to implement such operations, that would most likely exclude any way for LLVM to notice that there are out-of-bounds accesses. But I am not sure if that's an option here.
The code is more-or-less based on the GCC implementation, which gets away with a normal atomic load.
Would changing the load to a volatile atomic load help in this case?
To add to what @RalfJung is saying, @Amanieu: just because it works at the hardware layer doesn't mean it isn't UB in LLVM's IR. For example this function:
```llvm
define i8 @bar() {
start:
  %a = alloca i8
  store i8 0, i8* %a
  %b = call i8 @foo(i8* %a)
  ret i8 %b
}

define internal i8 @foo(i8*) {
start:
  %b = getelementptr i8, i8* %0, i32 1
  %a = load i8, i8* %b
  ret i8 %a
}
```
is sort of a simplistic view, but it's guaranteed never to fault because the out-of-bounds load will just load some byte of the return address on the call stack or something weird like that. When optimized, however, it yields:
```llvm
define i8 @bar() local_unnamed_addr #0 {
start:
  ret i8 undef
}
```
(showing that this is undefined behavior)
LLVM can't automatically deduce that all instances of this pattern are undefined behavior; in isolation, `foo` optimizes just fine. That's why compiler-builtins happens to work: we're forcing LLVM to have less knowledge about the inputs, so it just-so-happens that it can't deduce that undefined behavior is happening.
All that's just to say that I think @RalfJung is totally correct here: a crates.io-based implementation of smaller-sized atomics in terms of larger-sized atomics is just a segfault waiting to happen. LLVM may not even detect that it's UB today, but it's definitely UB at the LLVM IR layer (and probably the Rust layer) to read out of bounds of objects. Why exactly it's UB, or what exactly happens, is always up for grabs, which is why it works most of the time. But this is fundamentally why we need LLVM's backend to do the lowering: the IR passes need to see that we're just modifying/loading one byte, not the bytes around it.
The operations that @Amanieu wants to perform cannot be performed by a programming language generating LLVM-IR directly. Inline assembly appears to be the only way to perform these right now, so we could still expose them I think (@RalfJung ? I don't know whether compiler-builtins would work too).
In the meantime, I think it would be better to open an issue in the LLVM bugzilla about this, explaining why these operations are useful, why the LLVM-IR generated for them has undefined behavior, and how that requires us to use inline assembly (or modify compiler-builtins) instead. We should ask: what should we do? Should we use inline assembly / our own compiler built-ins ? Will LLVM expose intrinsics to allow these safely? etc.
It might be worth mentioning that this is not the only situation in which we need to perform reads out-of-bounds (see https://github.com/rust-rfcs/unsafe-code-guidelines/issues/2).
Would changing the load to a volatile atomic load help in this case?
No. Volatile reads in practice have some positive effects on racy reads (but LLVM may change those rules any time as we are relying on de-facto behavior here). It doesn't change anything about the requirement that accesses must be in-bounds.
The proper way to fix this is (as @gnzlbg mentioned) to add an attribute to LLVM that can be set on reads/writes and that indicates that the access may be partially out-of-bounds. Then we need a matching intrinsic in Rust, and methods such as `read_out_of_bounds` and `write_out_of_bounds` on pointers. Considering we need this for concurrency, we'd also need to think about how to expose atomic out-of-bounds accesses in Rust. Anything else (anything just arguing based on page boundaries but not informing LLVM) will remain a hack. Given that this seems to be a useful pattern, I absolutely think we should lobby for LLVM to add such an attribute!
just because it works at the hardware layer doesn't mean it isn't UB in LLVM's IR. For example this function
Thanks for the example, I'll link to this when such discussions come up again in the future. :)
That's why compiler-builtins happens to work, we're forcing LLVM to have less knowledge about the inputs so it just-so-happens that it can't deduce that undefined behavior is happening.
That sounds way less confident than I had hoped...
When does compiler-builtins get linked with the real program? Is there a chance that LTO might inline compiler-builtins functions (which then would mean LLVM could deduce the UB)?
When does compiler-builtins get linked with the real program? Is there a chance that LTO might inline compiler-builtins functions (which then would mean LLVM could deduce the UB)?
As rtlib calls are only inserted at the SelectionDAG layer, while LTO still operates on LLVM IR, I don't believe there is any possibility of these getting inlined.
@nikic is correct, we explicitly don't LTO compiler-builtins as well (it's a very special crate). In that sense there's no worry for inlining compiler-builtins intrinsics.
Okay. I can live with that. We should keep it in mind though for the future, if/when compiler-builtins treatment ever changes.
So, yeah, I agree we should go forward with such "emulated" small-int atomics implemented via LLVM lowering or compiler-builtins.
For those following this thread, the stabilization proposal is now in FCP
Can I ask a question (just curious)? It seems that constants like `ATOMIC_I64_INIT` are marked as stable since 1.34 and deprecated since 1.34 at the same time. Why stabilize something that is deprecated? It may be just my opinion, but I think that getting a new stable feature that is deprecated from the beginning is rather strange...
Nice catch! I think we should just remove those constants.
That's convincing to me, @macpp -- opened https://github.com/rust-lang/rust/issues/58089 to track it.
This is listed as the tracking issue for `cfg_target_has_atomic`, which is still unstable. Should this be reopened?
Yep - reopened.
Removing T-Libs since this is a pure language feature.
Is there any progress on this? Can anyone explain what `cfg_target_has_atomic` is blocked on? I.e. what are the questions we need to resolve before stabilizing?
`AtomicU32` was stabilized in 1.34.0
I have one objection to the way `target_has_atomic = "cas"` works. I would prefer if we split this into two separate `cfg`s:

- `target_has_atomic = "8"/"16"/"32"/"64"/"128"`: This indicates the largest width that the target can atomically CAS (which implies support for all atomic operations).
- `target_has_atomic_load_store = "8"/"16"/"32"/"64"/"128"`: This indicates the largest width that the target can support loading or storing atomically (but may not support CAS).

(bikeshed: maybe a slightly shorter name `target_has_atomic_ldst`)
Is CAS the only operation that we'd need to call out that way (e.g. are there any platforms we care about that have atomic load/store but not swap)?
It seems like we should be able to stabilize target_has_atomic itself though with @Amanieu's definition.
thumbv6 has load, store, but no swap or cas
Does thumbv6 have any kind of read-modify-write instruction? Maybe presence or absence of atomic RMW instructions could be the right discrimination criterion...
No, thumbv6 has nothing of the sort. Perhaps a better name would be `#[cfg(target_has_atomic = "rmw")]`, but that still doesn't really capture the swap operation.
Why? Swap reads the old value, replaces it with the new one, and writes that in a single atomic transaction, so it is RMW in my book.
cc #65214
Why? Swap reads the old value, replaces it with the new one, and writes that in a single atomic transaction, so it is RMW in my book.
Yeah, fair point.
Does thumbv6 have any kind of read-modify-write instruction? Maybe presence or absence of atomic RMW instructions could be the right discrimination criterion...
I think a CAS cfg is correct because all the other RMW operations can be implemented with it, but having one RMW operation like swap doesn't allow you to implement the rest. So, for targets that just have a swap, fetch_add, etc., but not CAS, we might need more cfgs, but I don't think it would add enough value to be worth it.
Good point! I think we can agree on the following conclusion:

- If we find hardware which supports e.g. test-and-set but not CAS, then we may want to support it as well with a finer-grained cfg (e.g. `target_has_atomic_test_and_set`), because test-and-set is all you need to implement a mutex, and a mutex is all you need to emulate any other atomic instruction in a blocking manner.
- Since Rust atomics are guaranteed to be at least lock-free, this substitution cannot be done silently by std and must be performed manually on the user's side. Therefore, it is not transparent and must be exposed by a cfg, if and when the situation arises.

All this is conditional on the existence of hardware which has some atomic RMW instructions, but none with infinite consensus number. I'm not personally aware of any, but embedded chips and legacy hardware are full of surprises so it's best to keep that door open at the syntax level.
I believe that the syntax proposed by @Amanieu (`target_has_atomic` vs `target_has_atomic_load_store`) does so, therefore I'm happy with it.
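The "test-and-set is all you need for a mutex" step can be sketched with a spinlock built from a single `swap` (illustrative code, not from the thread):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative spinlock: the only atomic primitive used is swap, i.e. a
// test-and-set. This is the sense in which test-and-set is "all you need"
// for a (blocking) mutex.
struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    fn lock(&self) {
        // Test-and-set: atomically write `true` and observe the old value.
        // If it was already `true`, someone else holds the lock; spin.
        while self.locked.swap(true, Ordering::Acquire) {
            std::hint::spin_loop();
        }
    }

    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```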
If we find hardware which supports e.g. test-and-set but not CAS, then we may want to support it as well with a finer-grained cfg (e.g. target_has_atomic_test_and_set), because test-and-set is all you need to implement a mutex
However, "normal" sequentially consistent loads and stores are also sufficient to implement a Mutex using Dekker's algorithm (for 2 CPUs) or Peterson's algorithm (for any number of CPUs). Now I wonder, how does that fit in?
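For reference, Peterson's algorithm really does need nothing beyond sequentially consistent loads and stores; a minimal two-thread sketch (illustrative busy-waiting code, not from the thread) looks like:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Illustrative sketch of Peterson's algorithm for exactly two threads
// (ids 0 and 1): mutual exclusion from SeqCst loads and stores alone,
// with no RMW instruction required.
struct Peterson {
    flag: [AtomicBool; 2],
    turn: AtomicUsize,
}

impl Peterson {
    const fn new() -> Self {
        Peterson {
            flag: [AtomicBool::new(false), AtomicBool::new(false)],
            turn: AtomicUsize::new(0),
        }
    }

    fn lock(&self, id: usize) {
        let other = 1 - id;
        self.flag[id].store(true, Ordering::SeqCst); // announce intent
        self.turn.store(other, Ordering::SeqCst);    // yield priority
        // Wait while the other thread wants the lock and has priority.
        while self.flag[other].load(Ordering::SeqCst)
            && self.turn.load(Ordering::SeqCst) == other
        {
            std::hint::spin_loop();
        }
    }

    fn unlock(&self, id: usize) {
        self.flag[id].store(false, Ordering::SeqCst);
    }
}
```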
All this is conditional on the existence of hardware which has some atomic RMW instructions, but none with infinite consensus number. I'm not personally aware of any, but embedded chips and legacy hardware are full of surprises so it's best to keep that door open at the syntax level.
Some old ARM chips (~ARMv5) only have an atomic SWP instruction and nothing else. However, neither GCC nor LLVM actually use this instruction for atomics, so atomics are unsupported on these architectures.
IMO we should follow the same general policy: only support atomic operations if all of them are supported (which essentially boils down to whether CAS is supported since you can use it to emulate the others).
Having access to limited atomic operations might still be useful for some niche applications (eg. on ARM7TDMI, which is still somewhat widespread), so I think it would be unfortunate if these use cases are prevented by a matter of policy.
If we find hardware which supports e.g. test-and-set but not CAS, then we may want to support it as well with a finer-grained cfg (e.g. target_has_atomic_test_and_set), because test-and-set is all you need to implement a mutex
However, "normal" sequentially consistent loads and stores are also sufficient to implement a Mutex using Dekker's algorithm (for 2 CPUs) or Peterson's algorithm (for any number of CPUs). Now I wonder, how does that fit in?
Given that a hardware architecture that cares so little about concurrency that it does not even expose a test-and-set instruction is unlikely to provide the required memory barriers for SeqCst ordering, I don't think that these algorithms are applicable outside of very constrained embedded scenarios where the target hardware is exactly known and hardware portability is not desired at all.
Even thumbv6 has fully working loads and stores, despite it not having anything more sophisticated than that (no swap, CAS, or anything else). These are still sufficient for implementing things like SPSC queues.
Thumbv6 is also used in multicore processors, often alongside a more powerful Cortex-M3/M4 core (which is thumbv7 and does have CAS, etc.). This means that implementing a Mutex using one of the algorithms Ralf linked above might actually make sense on these MCUs. Manufacturers of these MCUs also provide peripherals that implement synchronization primitives, but these are often specific to the MCU family and don't exist on others.
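A sketch of the kind of SPSC queue that loads and stores alone make possible (hypothetical code, not from this thread): each index has exactly one writer, so no swap or CAS is needed.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical minimal SPSC ring buffer: the producer is the only writer of
// `head` and the consumer the only writer of `tail`, so plain atomic loads
// and stores suffice; no swap/CAS is needed. One slot is kept empty to
// distinguish "full" from "empty".
struct Spsc<T, const N: usize> {
    buf: [UnsafeCell<Option<T>>; N],
    head: AtomicUsize, // next slot to write; written only by the producer
    tail: AtomicUsize, // next slot to read; written only by the consumer
}

// Safe for exactly one producer thread and one consumer thread.
unsafe impl<T: Send, const N: usize> Sync for Spsc<T, N> {}

impl<T, const N: usize> Spsc<T, N> {
    fn new() -> Self {
        Spsc {
            buf: std::array::from_fn(|_| UnsafeCell::new(None)),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    fn push(&self, v: T) -> Result<(), T> {
        let head = self.head.load(Ordering::Relaxed);
        let next = (head + 1) % N;
        if next == self.tail.load(Ordering::Acquire) {
            return Err(v); // full
        }
        unsafe { *self.buf[head].get() = Some(v) };
        self.head.store(next, Ordering::Release); // publish the slot
        Ok(())
    }

    fn pop(&self) -> Option<T> {
        let tail = self.tail.load(Ordering::Relaxed);
        if tail == self.head.load(Ordering::Acquire) {
            return None; // empty
        }
        let v = unsafe { (*self.buf[tail].get()).take() };
        self.tail.store((tail + 1) % N, Ordering::Release); // free the slot
        v
    }
}
```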
Tracking https://github.com/rust-lang/rfcs/pull/1543