rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Tracking issue for #[cfg(target_has_atomic = ...)] #32976

Closed alexcrichton closed 2 years ago

alexcrichton commented 8 years ago

Tracking https://github.com/rust-lang/rfcs/pull/1543

Amanieu commented 5 years ago

Then why not simply add std::sync::atomic to the list of modules that require #[cfg] trickery? This seems like the best solution at this point.

This doesn't block the portability lint: the lint can simply check that you have wrapped your atomic usage in #[cfg(target_has_atomic)].

willmo commented 5 years ago

I don't know if this applies to anyone else, but as a user, I'm primarily interested in AtomicU32/AtomicI32, because there are lots of APIs that involve 32-bit atomic values on 64-bit platforms. If every platform has 32-bit CAS and therefore can support these types acceptably, couldn't they be stabilized immediately? :-)

If every platform also has smaller CAS, or if it's deemed acceptable to synthesize smaller atomics using "oversize" CAS, the smaller atomics could be stabilized immediately as well.

Basically it seems like the only truly non-portable case might be AtomicI64/AtomicU64, so perhaps only those types really need to wait for the portability lint to be sorted out. And since all the platforms I care about are 64-bit (I'll never run my proprietary code on 32-bit), I won't miss them because I can just use AtomicUsize/AtomicIsize instead.

alexcrichton commented 5 years ago

It's one possible solution, yeah, to simply say types in std::sync::atomic are not platform-agnostic.

I wanted to get a grasp on what concrete portability story we're talking about, and I wasn't aware of any analysis done here recently, so I've run the compiler over a bunch of targets to see what swap instructions are generated at the various sizes, and got this table:

| Target | `AtomicU8::swap` | `AtomicU16::swap` | `AtomicU32::swap` | `AtomicU64::swap` |
|------|------|------|------|------|
| `x86_64-unknown-linux-gnu` | `xchgb` | `xchgw` | `xchgl` | `xchgq` |
| `x86_64-apple-darwin` | `xchgb` | `xchgw` | `xchgl` | `xchgq` |
| `i686-unknown-linux-gnu` | `xchgb` | `xchgw` | `xchgl` | `cmpxchg8b` |
| `i586-unknown-linux-gnu` | `xchgb` | `xchgw` | `xchgl` | `cmpxchg8b` |
| `arm-unknown-linux-gnueabi` | `ldrexb` | `ldrexh` | `ldrex` | `ldrexd` |
| `arm-unknown-linux-gnueabihf` | `ldrexb` | `ldrexh` | `ldrex` | `ldrexd` |
| `armv7-unknown-linux-gnueabihf` | `ldrexb` | `ldrexh` | `ldrex` | `ldrexd` |
| `mips-unknown-linux-gnu` | `ll`/`sc` (BUG) | `ll`/`sc` (BUG) | `ll`/`sc` | N/A |
| `mips64-unknown-linux-gnuabi64` | `ll`/`sc` (BUG) | `ll`/`sc` (BUG) | `ll`/`sc` | `lld`/`scd` |
| `powerpc-unknown-linux-gnu` | ?? (BUG?) | ?? (BUG?) | `lwarx`/`stwcx` | N/A |
| `powerpc64-unknown-linux-gnu` | ?? (BUG?) | ?? (BUG?) | `lwarx`/`stwcx` | `ldarx`/`stdcx` |
| `aarch64-unknown-linux-gnu` | `ldxrb` | `ldxrh` | `ldxr` | `ldxr` |
| `thumbv6m-none-eabi` | (no swap) | (no swap) | (no swap) | N/A |
| `thumbv7m-none-eabi` | `ldm` | `ldm` | `ldmda` | N/A |
| `thumbv7em-none-eabi` | `ldrexb` | `ldrexh` | `ldrex` | N/A |

The points of note are:

  • All types look to be available on all other platforms tested. This doesn't cover architectures like s390x, sparc, wasm, probably some ARM variants, etc.

  • For targets that are "practically up there in their level of support" that's pretty bleak...


I think I would personally push back against simply saying the types are stable as-is today. We have no precedent for this sort of type, with varying support across platforms (at least at this prominence and this level of support), being in libstd without a clear warning about portability.

We've already gotten the portability problem wrong once with SIMD, which has tons and tons of warnings about how platform-specific it is, and this represents yet another portability hazard if it sits in such a prominent place as std::sync::atomic.

I'm ok with the solution of moving these types to std::arch myself, however, as that has clear warnings about portability and is, I feel, the best we can do at this time.

@willmo empirically it looks like AtomicU32 is indeed supported everywhere I tested at least!

Amanieu commented 5 years ago

Regarding your question about ARM architectures: armv5te and thumbv6 targets don't support atomics, except that armv5te emulates them with Linux kernel support.

I disagree with your "(BUG)" comments: the ll/sc loop is the standard way of performing sub-word atomic operations on those platforms. It is just a more complicated version of the ll/sc loop used for word-sized atomic operations.

In short there are really only 3 categories for targets:

With the last category, we already have a stabilized precedent for variations in support for std::sync::atomic: thumbv6 doesn't support atomics at all. Also, I feel that moving atomic types to std::arch::$arch will actually make code less portable. Code using integer atomics will now have to be specialized for every architecture:

#[cfg(target_arch = "x86")]
use std::arch::x86::atomic::AtomicU32;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::atomic::AtomicU32;
// Oops: now this code will only work on x86, despite the fact that it would
// work just as well on ARM, PowerPC, MIPS, etc.

And even then, there is still varying atomic support within an architecture. This is particularly true for ARM, but it is also the case on x86_64: AtomicU128 is supported on all x86_64 chips except the earliest ones from AMD.

In conclusion, I don't feel that moving atomic types to std::arch actually solves any problems, and instead introduces new ones. I feel that the current (unstable) situation of having atomic types conditionally available depending on the target is the best approach to take. Look at it this way: if a crate is found not to compile on some architecture due to missing atomic support, an issue will be opened on Github and the problem will be quickly solved.

alexcrichton commented 5 years ago

Oh sorry yeah by "BUG" I meant that it didn't follow what I assumed to be our contract, that we only provide atomic types which match exactly with the architecture in question, excluding the fact that any smaller atomic operations can be implemented in terms of larger ones. It's fine for that to be a separable question, I don't mean for it to get in the way.

It's true that atomics on ARM are sort of odd! I'm not sure what to really do about that. That being said most of the platforms that don't have atomics are pretty low down on the platform support tiers, so we could relegate them to "unresolved questions" like targets without floats rather than having them block other designs.

It's true that moving these into std::arch would have to have special code per every architecture. A crate on crates.io, however, could reexport a portable interface which does all the multiplexing and has documented fallbacks or options for what fallbacks should do on unsupported platforms.

I personally disagree that std::arch doesn't solve any problems, but I do agree that it creates an ergonomic barrier to using the types. I feel it clearly signals that these operations aren't 100% portable the way most of the rest of the standard library is. These are already somewhat niche types, so I don't think their ergonomics are as important as those of AtomicUsize and friends.

Using std::arch, in my mind, is basically entirely centered around:

main-- commented 5 years ago

From this discussion I conclude that there are basically three tiers of platform support for any atomic operation:

Two directly conflicting goals are a) portability and b) protecting the programmer from a potential performance footgun (using emulated atomics over a more efficient solution).

I think these very different goals deserve different treatment: while an application targeted at a machine that only offers 32-bit atomics is almost always better off using 32-bit atomics instead of loop-emulated 16-bit atomics (potentially wasting some memory, though), I'd much rather have a library fall back to loop emulation than get a compile error; the slight performance impact is just not worth the despair of having to dive into a foreign codebase in order to fix portability errors.

Mutex emulation is less obvious considering that it can slow down an application by several orders of magnitude. But I would argue that from a portability standpoint ("I'm trying to use someone else's code on my platform") even this is generally acceptable.

Similar to how I don't get linter warnings for dependency crates, this is how I think the compiler should react to different kinds of atomic calls:

| Compile type | Native support | Loop emulation | Mutex emulation |
|------|------|------|------|
| Local | fine | warning | warning |
| Dependency | fine | fine | warning |

This gives me a heads up if a crate I'm depending on is going to be severely slower while avoiding accidental performance problems from my own code.

HadrienG2 commented 5 years ago

I generally agree with your sentiment about tiers of atomic support. But note that some popular RISC platforms, such as 32-bit ARM, require all atomics to be implemented using load-linked/store-conditional loops. I would be wary of linting on those, so it feels to me that such a lint should be allow-by-default.

Amanieu commented 5 years ago

Even on x86, some atomic operations are implemented using a cmpxchg loop, e.g. fetch_and. Implementing atomic operations using a loop is completely normal and should be treated the same way as native support.

The performance isn't actually the reason why we make a distinction between so called "lock-free" atomics and ones emulated with a mutex. If you are using atomics to share data between main code and a signal/interrupt handler then you must use lock-free atomics, otherwise your code may deadlock. This can happen if the interrupt happens while the mutex used for atomic emulation is locked.

HadrienG2 commented 5 years ago

Maybe a bit of extra terminology could help here. When atomics are implemented using CAS or LL/SC loops, they are lock-free but not wait-free.

In simple terms, lock-freedom means no single thread can block every other thread, provided that (1) it only interacts with them via atomics and (2) the atomics are not used in a loop to implement a higher-level lock.

Wait-freedom, in contrast, means that a thread cannot be infinitely delayed by other threads hammering the atomic in a loop. This is obviously not true of an atomic operation that is implemented via a CAS or LL/SC loop.

Since we cannot provide wait-free atomics on some platforms, we may want to clarify in the documentation that atomics are only guaranteed to be lock-free, not wait-free.

tormol commented 5 years ago

If [`#[cfg(accessible(...))]`](https://github.com/rust-lang/rfcs/pull/2523) is added, I think it will cover all use cases of #[cfg(target_has_atomic="x")], as users can then check for the presence of the atomic types directly.

One use case I could think of, if smaller atomics are emulated and therefore always present but target_has_atomic="x" is only true when there is native support, is using it to check for wait-freedom. But from Amanieu's comment above, that would only tell you whether store() is wait-free, so people will need to check that the specific operations they need are wait-free.

Now, accessible(...) hasn't even passed the RFC stage yet, but it seems like a much cleaner solution than exposing a target_has_xxx attribute for each kind of target feature. Stabilizing the types without either accessible(...) or target_has_atomic="x" being available would still be useful, as people can use target_arch="x", which, while less portable, offers much stronger guarantees.

Valloric commented 5 years ago

I think that documenting that atomics are at least lock-free and possibly wait-free should be enough to move forward. An optional documentation feature that would make this pretty damn perfect: also documenting which platforms/atomic sizes are wait-free.

In general, I completely agree with @Amanieu that CAS loop (AKA lock-free but not wait-free) atomics would be considered "native atomic support" by basically everyone. But going above-and-beyond by documenting this should settle any concerns.

alexcrichton commented 5 years ago

I'm nominating this for discussion at the next libs triage meeting, but to try to make progress on this discussion I'd like to separate out a few points. If others have thoughts on these (or other points), please let me know!

The main two options for placement of these types are:

The second point is whether or not to stabilize the emulated atomics, and that's just a question of whether the APIs are stabilized or not.

Amanieu commented 5 years ago
  • AtomicXYY types do not guarantee they are wait-free, and I think this is where we are today

This is fine, keep in mind that C++11 atomics (which we are based on) makes no guarantees about wait-freedom either.

  • It's unclear (to me at least) what to do about "emulation" of small size atomics using larger-sized atomics. For example emulating AtomicU8 with AtomicU32 operations.

I think it's fine to defer this issue. Currently none of the built-in targets make use of the min-atomic-width attribute. This was only added in #38579 to support the out-of-tree OR1K target, and even then it should be possible to implement an emulation for those in compiler-builtins. In any case, this doesn't block stabilization as you mentioned at the end.

alexcrichton commented 5 years ago

Ok we've discussed this in a recent @rust-lang/libs triage, and the conclusion was that the proposal to stabilize all these types as-is is probably the way to go. The stabilization would be coupled with documentation updates indicating that these aren't as portable as, say, Add for u8, but they're available on most platforms. Additionally it was concluded that stabilizing smaller-size atomics for platforms that only have larger-size atomics was fine to do.

I believe this is generally the trend of this thread anyway, so I'm going to open a dedicated thread and FCP this for stable

alexcrichton commented 5 years ago

Ok for those following along here, I've opened a formal proposal for stabilization at https://github.com/rust-lang/rust/issues/56753, feedback of course is always welcome!

RalfJung commented 5 years ago

smaller-size atomics for platforms that only have larger-size atomics

Just to be sure, the encoding of smaller-sized atomics in terms of larger-sized atomics is done by LLVM, as part of lowering to machine-specific IR or so? I maintain that doing this at the level of Rust, MIR or LLVM IR is illegal because of potential out-of-bounds accesses, and we shouldn't do it.

alexcrichton commented 5 years ago

@RalfJung correct, that's what convinced me personally that we can't do this on crates.io, which means if we want it at all we need it in the standard library (via LLVM intrinsics). I think we want it, so I'm convinced to put it into libstd :)

Amanieu commented 5 years ago

@RalfJung This lowering is done either within LLVM, or through a function in compiler_builtins.

The latter is currently only used on armv5te-unknown-linux-gnu, and uses this code. It could be argued that this is UB since intrinsics::atomic_load_unordered could be used to read out-of-bounds data; however, the access is guaranteed not to fault because it doesn't cross a page boundary.

RalfJung commented 5 years ago

@alexcrichton makes sense!

@Amanieu

It could be argued that this is UB

And the argument would be correct :)

is guaranteed not to fault because it doesn't cross a page boundary.

And as in the last N cases we have had this argument (and as I am sure you are aware, but not everybody else might be), that doesn't change anything about this being UB when we are talking about code written in Rust, MIR or LLVM IR. ;) (I am beginning to feel sorry for being so annoying about this, but LLVM is way too smart and getting smarter every day, so I am actively worried that such arguments will blow up in our faces some day.)

Is this a pattern supported/intended by LLVM? Is there advice from the LLVM devs for how to do this?

Is there any chance of LLVM ever inlining those compiler-builtins functions? Actually even having them in the same translation unit could be enough to cause problems, because LLVM could infer attributes on the functions to propagate information about what they do out to use sites.

One safer alternative would be to use inline assembly to implement such operations, that would most likely exclude any way for LLVM to notice that there are out-of-bounds accesses. But I am not sure if that's an option here.

Amanieu commented 5 years ago

The code is more-or-less based on the GCC implementation, which gets away with a normal atomic load.

Would changing the load to a volatile atomic load help in this case?

alexcrichton commented 5 years ago

To add to what @RalfJung is saying, @Amanieu: just because it works at the hardware layer doesn't mean it isn't UB in LLVM's IR. For example this function:

define i8 @bar() {
start:
  %a = alloca i8
  store i8 0, i8* %a
  %b = call i8 @foo(i8* %a)
  ret i8 %b
}

define internal i8 @foo(i8*) {
start:
  %b = getelementptr i8, i8* %0, i32 1
  %a = load i8, i8* %b
  ret i8 %a
}

is sort of a simplistic view, but it's guaranteed to never fault because the out-of-bounds load will just load some byte of the return address on the call stack or something weird like that. When optimized, however, it yields:

define i8 @bar() local_unnamed_addr #0 {
start:
  ret i8 undef
}

(showing that this is undefined behavior)

LLVM can't automatically deduce that all instances of this pattern are undefined behavior; in isolation, foo optimizes just fine. That's why compiler-builtins happens to work: we're forcing LLVM to have less knowledge about the inputs, so it just-so-happens that it can't deduce that undefined behavior is happening.

All that's just to say that I think @RalfJung is totally correct here: a crates.io-based implementation of smaller-sized atomics on top of larger-sized atomics is just a segfault waiting to happen. LLVM may not even detect it's UB today, but it's definitely UB at the LLVM IR layer (and probably the Rust layer) to read out of bounds of objects. Why exactly it's UB, or what exactly happens, is always up for grabs, which is why it works most of the time; but this is fundamentally why we need LLVM's backend to do the lowering: the IR passes need to see that we're just modifying/loading one byte, not the bytes around it.

gnzlbg commented 5 years ago

The operations that @Amanieu wants to perform cannot be performed by a programming language generating LLVM-IR directly. Inline assembly appears to be the only way to perform these right now, so we could still expose them I think (@RalfJung ? I don't know whether compiler-builtins would work too).

In the meantime, I think it would be better to open an issue in the LLVM bugzilla about this, explaining why these operations are useful, why the LLVM-IR generated for them has undefined behavior, and how that requires us to use inline assembly (or modify compiler-builtins) instead. We should ask: what should we do? Should we use inline assembly / our own compiler built-ins? Will LLVM expose intrinsics to allow these safely? etc.

It might be worth mentioning that this is not the only situation in which we need to perform reads out-of-bounds (see https://github.com/rust-rfcs/unsafe-code-guidelines/issues/2).

RalfJung commented 5 years ago

Would changing the load to a volatile atomic load help in this case?

No. Volatile reads in practice have some positive effects on racy reads (but LLVM may change those rules any time as we are relying on de-facto behavior here). It doesn't change anything about the requirement that accesses must be in-bounds.

The proper way to fix this is (as @gnzlbg mentioned) to add an attribute to LLVM that can be set on reads/writes and that indicates that the access may be partially out-of-bounds. Then we need a matching intrinsic in Rust, and methods such as read_out_of_bounds and write_out_of_bounds on pointers. Considering we need this for concurrency, we'd also need to think about how to expose atomic out-of-bounds accesses in Rust. Anything else (anything just arguing based on page boundaries but not informing LLVM) will remain a hack. Given that this seems to be a useful pattern, I absolutely think we should lobby for LLVM to add such an attribute!

just because it works at the hardware layer doesn't mean it's UB in LLVM's IR. For example this function

Thanks for the example, I'll link to this when such discussions come up again in the future. :)

That's why compiler-builtins happens to work, we're forcing LLVM to have less knowledge about the inputs so it just-so-happens that it can't deduce that undefined behavior is happening.

That sounds way less confident than I had hoped...

When does compiler-builtins get linked with the real program? Is there a chance that LTO might inline compiler-builtins functions (which then would mean LLVM could deduce the UB)?

nikic commented 5 years ago

When does compiler-builtins get linked with the real program? Is there a chance that LTO might inline compiler-builtins functions (which then would mean LLVM could deduce the UB)?

As rtlib calls are only inserted at the SelectionDAG layer, while LTO still operates on LLVM IR, I don't believe there is any possibility of these getting inlined.

alexcrichton commented 5 years ago

@nikic is correct, we explicitly don't LTO compiler-builtins as well (it's a very special crate). In that sense there's no worry for inlining compiler-builtins intrinsics.

RalfJung commented 5 years ago

Okay. I can live with that. We should keep it in mind though for the future, if/when compiler-builtins treatment ever changes.

So, yeah, I agree we should go forward with such "emulated" small-int atomics implemented via LLVM lowering or compiler-builtins.

alexcrichton commented 5 years ago

For those following this thread, the stabilization proposal is now in FCP

macpp commented 5 years ago

Can I ask a question (just curious)? It seems that constants like ATOMIC_I64_INIT are marked as stable since 1.34 and deprecated since 1.34 at the same time. Why stabilize something that is deprecated? It may be just my opinion, but I think that getting a new stable feature that is deprecated from the beginning is rather strange...

Amanieu commented 5 years ago

Nice catch! I think we should just remove those constants.

scottmcm commented 5 years ago

That's convincing to me, @macpp -- opened https://github.com/rust-lang/rust/issues/58089 to track it.

jonas-schievink commented 5 years ago

This is listed as the tracking issue for cfg_target_has_atomic, which is still unstable. Should this be reopened?

sfackler commented 5 years ago

Yep - reopened.

Centril commented 5 years ago

Removing T-Libs since this is a pure language feature.

LukasKalbertodt commented 5 years ago

Is there any progress on this? Can anyone explain what cfg_target_has_atomic is blocked on? I.e. what are questions we need to resolve before stabilizing?

asomers commented 5 years ago

AtomicU32 was stabilized for 1.34.0

Amanieu commented 5 years ago

I have one objection to the way target_has_atomic = "cas" works. I would prefer if we split this into two separate cfgs:

(bikeshed: maybe a slightly shorter name target_has_atomic_ldst)

sfackler commented 5 years ago

Is CAS the only operation that we'd need to call out that way (e.g. are there any platforms we care about that have atomic load/store but not swap)?

It seems like we should be able to stabilize target_has_atomic itself though with @Amanieu's definition.

jonas-schievink commented 5 years ago

thumbv6 has load, store, but no swap or cas

HadrienG2 commented 5 years ago

Does thumbv6 have any kind of read-modify-write instruction? Maybe presence or absence of atomic RMW instructions could be the right discrimination criterion...

jonas-schievink commented 5 years ago

No, thumbv6 has nothing of the sort. Perhaps a better name would be #[cfg(target_has_atomic = "rmw")], but that still doesn't really capture the swap operation.

HadrienG2 commented 5 years ago

Why? Swap reads the old value, replaces it with the new one, and writes that in a single atomic transaction, so it is RMW in my book.

Amanieu commented 5 years ago

cc #65214

jonas-schievink commented 5 years ago

Why? Swap reads the old value, replaces it with the new one, and writes that in a single atomic transaction, so it is RMW in my book.

Yeah, fair point.

parched commented 5 years ago

Does thumbv6 have any kind of read-modify-write instruction? Maybe presence or absence of atomic RMW instructions could be the right discrimination criterion...

I think a CAS cfg is correct because all the other RMW operations can be implemented with it, whereas having one RMW operation like swap doesn't allow you to implement the rest. So for targets that just have swap, fetch_add, etc., but not CAS, we might need more cfgs, but I don't think that would add enough value to be worth it.

HadrienG2 commented 5 years ago

Good point! I think we can agree on the following conclusion:

Since Rust atomics are guaranteed to be at least lock-free, this substitution cannot be done silently by std and must be performed manually on the user's side. Therefore, it is not transparent and must be exposed by a cfg, if and when the situation arises.

All this is conditional on the existence of hardware which has some atomic RMW instructions, but none with infinite consensus number. I'm not personally aware of any, but embedded chips and legacy hardware are full of surprises so it's best to keep that door open at the syntax level.

I believe that the syntax proposed by @Amanieu (target_has_atomic vs target_has_atomic_load_store) does so, therefore I'm happy with it.

RalfJung commented 5 years ago

If we find hardware which supports e.g. test-and-set but not CAS, then we may want support it as well with a finer-grained cfg (e.g. target_has_atomic_test_and_set), because test-and-set is all you need to implement a mutex

However, "normal" sequentially consistent loads and stores are also sufficient to implement a Mutex using Dekker's algorithm (for 2 CPUs) or Peterson's algorithm (for any number of CPUs). Now I wonder, how does that fit in?

Amanieu commented 5 years ago

All this is conditional on the existence of hardware which has some atomic RMW instructions, but none with infinite consensus number. I'm not personally aware of any, but embedded chips and legacy hardware are full of surprises so it's best to keep that door open at the syntax level.

Some old ARM chips (~ARMv5) only have an atomic SWP instruction and nothing else. However, neither GCC nor LLVM actually uses this instruction for atomics, so atomics are unsupported on these architectures.

IMO we should follow the same general policy: only support atomic operations if all of them are supported (which essentially boils down to whether CAS is supported since you can use it to emulate the others).

jonas-schievink commented 5 years ago

Having access to limited atomic operations might still be useful for some niche applications (eg. on ARM7TDMI, which is still somewhat widespread), so I think it would be unfortunate if these use cases are prevented by a matter of policy.

HadrienG2 commented 5 years ago

If we find hardware which supports e.g. test-and-set but not CAS, then we may want support it as well with a finer-grained cfg (e.g. target_has_atomic_test_and_set), because test-and-set is all you need to implement a mutex

However, "normal" sequentially consistent loads and stores are also sufficient to implement a Mutex using Dekker's algorithm (for 2 CPUs) or Peterson's algorithm (for any number of CPUs). Now I wonder, how does that fit in?

Given that...

  1. the number of cores of the target CPU is rarely known at compile time, which is pretty much a prerequisite for efficiently implementing those algorithms,
  2. if a hardware architecture cares so little about concurrency that it does not even expose a test-and-set instruction, it is unlikely to provide the required memory barriers for SeqCst ordering,

I don't think that these algorithms are applicable outside of very constrained embedded scenarios where the target hardware is exactly known and hardware portability is not desired at all.

jonas-schievink commented 5 years ago

if a hardware architecture cares so little about concurrency that it does not even expose a test-and-set instruction, it is unlikely to provide the required memory barriers for SeqCst ordering,

Even thumbv6 has fully working loads and stores, despite it not having anything more sophisticated than that (no swap, CAS, or anything else). These are still sufficient for implementing things like SPSC queues.

Thumbv6 is also used in multicore processors, often alongside a more powerful Cortex-M3/M4 core (which is thumbv7 and does have CAS, etc.). This means that implementing a Mutex using one of the algorithms Ralf mentioned above might actually make sense on these MCUs. Manufacturers of these MCUs also provide peripherals with hardware synchronization primitives, but these are often specific to the MCU family and don't exist on others.