rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do
https://rust-lang.github.io/unsafe-code-guidelines

SIMD + Data races #209

Closed Diggsey closed 1 year ago

Diggsey commented 5 years ago

The Rustonomicon defines a data race like so:

  • two or more threads concurrently accessing a location of memory
  • one of them is a write
  • one of them is unsynchronized

AIUI, all data races are automatically UB.

However, there is useful behaviour whose obvious implementation is forbidden in Rust under this definition. One example is an atomic write combined with an unsynchronised read:

(assuming thread 2 doesn't care whether it observes the changes from thread 1, or even if it observes partial changes)
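
Going by the concurrent hash map mentioned later in the thread, the elided example presumably had roughly this shape (illustrative reconstruction; the names and layout are made up):

use std::sync::atomic::{AtomicU8, Ordering};

// One hash byte per slot, as in a SwissTable-style probe group.
static HASHES: [AtomicU8; 16] = [const { AtomicU8::new(0) }; 16];

// Thread 1: publish an entry by storing its hash byte (an atomic write).
fn publish(slot: usize, hash: u8) {
    HASHES[slot].store(hash, Ordering::Release);
}

// Thread 2: the obvious probe reads all 16 bytes with one plain,
// unsynchronized wide read. It races with `publish`, so under the
// definition above it is a data race and therefore UB.
fn probe() -> [u8; 16] {
    unsafe { std::ptr::read(HASHES.as_ptr().cast::<[u8; 16]>()) }
}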

For most cases, you can work around this by just doing a relaxed atomic read instead. The data race goes away and you just have to worry about normal race conditions. As a bonus, you avoid seeing partial results.
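
Sticking with the probe sketch above, the workaround would look something like this:

use std::sync::atomic::{AtomicU8, Ordering};

// Per-element Relaxed loads instead of one plain wide read: no data race
// and no torn bytes; only ordinary race conditions remain.
fn probe_relaxed(hashes: &[AtomicU8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (o, h) in out.iter_mut().zip(hashes.iter()) {
        *o = h.load(Ordering::Relaxed);
    }
    out
}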

However, if you want to do a SIMD or other "wide" read then it becomes impossible in Rust. You would have to use assembly to avoid introducing UB.

My question is: if one wanted to support this in Rust, what would be a valid way to do it? Is this even OK to do at the assembly level? We can't introduce atomic versions of SIMD types, because SIMD is inherently not atomic. Would there need to be "volatile" versions of all SIMD operations whose only difference is that they don't introduce UB?
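
For reference, the closest thing expressible today is probably a volatile read, sketched below. The caveats: ptr::read_volatile does not make a racy access defined under the current rules, and a 16-byte volatile read may still legally be split into smaller accesses, so this is a sketch of intent rather than a sanctioned fix.

use std::ptr;

unsafe fn volatile_wide_read(p: *const [u8; 16]) -> [u8; 16] {
    // Volatile keeps the access from being elided or duplicated, but it
    // neither defines the race nor guarantees a single SIMD instruction.
    ptr::read_volatile(p)
}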

gnzlbg commented 5 years ago

Data races are defined in terms of the memory model: in this case, the Rust memory model. When writing assembly, that memory model is irrelevant, because you're writing code for actual hardware, not an abstract machine.

If you plan to call this assembly code from Rust, then the Rust memory model dictates how the Rust code around the assembly is optimized, and a data-race will cause a mis-optimization.

If it's unknown code, then the optimizer is not allowed to assume anything about it.

However, the optimizer does assume that unknown code won't introduce undefined behavior in Rust.

However, it's rather unsatisfying to hope that the optimiser is doing the right thing (hence why the SIMD instructions exist at all!), and I could definitely see cases where you might want to have, e.g., an "Acquire" memory ordering on the operation. That would be very problematic, because I don't see how LLVM could ever optimise a sequence of "acquire" atomic loads into a SIMD load: the atomic loads would have slightly stronger guarantees.

If the SIMD load doesn't need to be atomic, then you have many options, none of which guarantees that the exact code you want will be generated. If you also want guarantees about which specific instruction is generated, that's what asm! is for, so with appropriate synchronization (compiler fences or the right constraints) you might be able to use that; a sketch follows.
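
A minimal sketch of that asm! route on x86-64, using today's asm! syntax (the name racy_wide_load is illustrative): the wide load happens inside the asm block, where the Rust memory model doesn't look, and compiler fences keep the surrounding Rust accesses from being reordered across it.

use std::arch::asm;
use std::sync::atomic::{compiler_fence, Ordering};

#[cfg(target_arch = "x86_64")]
unsafe fn racy_wide_load(src: *const u8) -> [u8; 16] {
    let mut out = [0u8; 16];
    compiler_fence(Ordering::SeqCst);
    asm!(
        "movdqu xmm0, [{src}]",   // one 128-bit load (SSE2 is x86-64 baseline)
        "movdqu [{dst}], xmm0",   // spill into a local buffer
        src = in(reg) src,
        dst = in(reg) out.as_mut_ptr(),
        out("xmm0") _,
        options(nostack),
    );
    compiler_fence(Ordering::SeqCst);
    out
}

(The fences are belt and braces: an asm! block without options(nomem) is already treated as a memory clobber.)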

If the SIMD load needs to be atomic, then there are no options, because the hardware doesn't support atomic SIMD loads, so there is no machine code that the compiler could reasonably generate for that operation.

RalfJung commented 5 years ago

If the SIMD load needs to be atomic, then there are no options, because the hardware doesn't support atomic SIMD loads, so there is no machine code that the compiler could reasonably generate for that operation.

Well on x86 every load is acquire, so SIMD acquire loads are a thing. But for other platforms, I don't know. I mean, does hardware even specify that e.g. on ARM, a 256-bit-wide SIMD load is actually fully cache coherent and cannot observe partial writes?

Diggsey commented 5 years ago

If the SIMD load needs to be atomic, then there are no options, because the hardware doesn't support atomic SIMD loads

It's really unfortunate that we use the word "atomic" for this because operations like a SIMD load are not atomic (in the regular way the word is used), but they could be equivalent to a "composite atomic operation". It's even more unfortunate because memory ordering can apply to the compound operation as a whole, but not to the atomic parts.

I feel like it should be possible to express this to LLVM, eg. have "atomic memcpy" take a memory ordering that applies to the operation as a whole, but where the "atoms" are unordered relative to each other.
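
In Rust terms, the shape of such an operation might be something like the following (a purely hypothetical signature; nothing like it exists today):

use std::sync::atomic::Ordering;

// Hypothetical: per-element accesses are atomic but unordered relative
// to one another, while `ordering` applies to the copy as a whole.
unsafe fn atomic_memcpy<T: Copy>(dst: *mut T, src: *const T, count: usize, ordering: Ordering) {
    let _ = (dst, src, count, ordering);
    unimplemented!("sketch of a proposed operation, not an implementation")
}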

gnzlbg commented 5 years ago

Well on x86 every load is acquire, so SIMD acquire loads are a thing.

@RalfJung see this comment. While that happens to be how we observe many x86 CPUs implementing this, nothing guarantees that this is the case. Also, loads from unaligned memory addresses tear on x86, so not every load behaves like that =/

But for other platforms, I don't know. I mean, does hardware even specify that e.g. on ARM, a 256bit-wide SIMD load is actually fully cache coherent and cannot observe partial writes?

ARM doesn't have hardware support for 256-bit SIMD so a generic (non-platform-specific) intrinsic for implementing such loads cannot guarantee an atomic access in general.

@Diggsey

I feel like it should be possible to express this to LLVM, eg. have "atomic memcpy" take a memory ordering that applies to the operation as a whole, but where the "atoms" are unordered relative to each other.

I think this is one of the most promising alternatives discussed. It looks like a generally useful intrinsic to have that's not just tied to SIMD, and if LLVM doesn't emit optimal code for it, it looks like it would be simpler to fix that for that intrinsic than to, e.g., try to implement optimizations that coalesce multiple atomic accesses into a single SIMD load or similar.

RalfJung commented 5 years ago

ARM doesn't have hardware support for 256-bit SIMD so a generic (non-platform-specific) intrinsic for implementing such loads cannot guarantee an atomic access in general.

Well then take the largest bitwidth they have, same question.

I feel like it should be possible to express this to LLVM, eg. have "atomic memcpy" take a memory ordering that applies to the operation as a whole, but where the "atoms" are unordered relative to each other.

I can't make sense of a "compound atomic acquire" operation that has an "ordering as a whole". What is that supposed to be? Acquire is defined via the reads-from relation, but if this SIMD access does two reads there's two reads-from edges; should it synchronize via both... or what?

gnzlbg commented 5 years ago

Well then take the largest bitwidth they have, same question.

For x86 your question was answered with a "No, SIMD acquire loads are not a thing". So if there is some architecture for which such loads are a thing, that would be an architecture-dependent feature anyway. For ARM in particular, I have no idea what precise semantics the ISA guarantees for 128-bit or 64-bit SIMD loads (cc @parched). [These](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vld1q_u64%20()) are the docs I currently use, and I don't think they say.

I can't make sense of a "compound atomic acquire" operation that has an "ordering as a whole".

I expected that to refer to what LLVM says:

the copy between buffers uses a sequence of unordered atomic load/store operations that are a positive integer multiple of the element_size in size.

[...] This intrinsic does not provide any additional ordering guarantees over those provided by a set of unordered loads from the source location and stores to the destination.

Diggsey commented 5 years ago

I can't make sense of a "compound atomic acquire" operation that has an "ordering as a whole". What is that supposed to be?

I can try to explain with an example (imagine we have two adjacent memory locations, A, B initialized to zero).

Thread 1 stores 1 to A and then stores 1 to B with Release ordering; thread 2 loads B with Acquire ordering and then reads A. In sketch form (illustrative Rust reconstructing the elided snippets):
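
use std::sync::atomic::{AtomicU32, Ordering};

// Four adjacent u32 slots: index 0 plays the role of A, index 1 of B
// (padded to 16 bytes so the SIMD variant below can load the whole group).
static WORDS: [AtomicU32; 4] = [
    AtomicU32::new(0), AtomicU32::new(0),
    AtomicU32::new(0), AtomicU32::new(0),
];

// Thread 1
fn thread1() {
    WORDS[0].store(1, Ordering::Relaxed); // A = 1
    WORDS[1].store(1, Ordering::Release); // B = 1
}

// Thread 2
fn thread2() -> (u32, u32) {
    let b = WORDS[1].load(Ordering::Acquire); // if this observes 1 ...
    let a = WORDS[0].load(Ordering::Relaxed); // ... this must observe 1
    (a, b)
}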

Now if thread 2 observes that B = 1, then it must also observe that A = 1.

This means that we cannot optimize thread 2 to this:
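
// Reconstructed sketch of the elided snippet, continuing the example
// above: one 128-bit load covering the whole group (x86-64, SSE2
// baseline) in place of thread 2's two ordered loads.
#[cfg(target_arch = "x86_64")]
unsafe fn thread2_simd() -> (u32, u32) {
    use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_storeu_si128};
    let v = _mm_loadu_si128(WORDS.as_ptr() as *const __m128i);
    let mut lanes = [0u32; 4];
    _mm_storeu_si128(lanes.as_mut_ptr() as *mut __m128i, v);
    (lanes[0], lanes[1]) // seeing lanes[1] == 1 no longer implies lanes[0] == 1
}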

Because SIMD does not provide that guarantee.

Acquire is defined via the reads-from relation, but if this SIMD access does two reads there's two reads-from edges; should it synchronize via both... or what?

There should be an edge from the "last effective read", but we don't specify what order (within a single thread) those reads happen in.

HadrienG2 commented 5 years ago

@Diggsey I think a way to formalize the compound Acquire/Release semantics that you have in mind without breaking SIMD support is to say that code doing a compound Acquire load synchronizes with code doing a compound Release store only if it observes all the writes from said compound store.
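
One way to write that down (my reading of the proposal, not anything specified anywhere):

% A compound Acquire load L synchronizes-with a compound Release store S
% iff every atom of L that overlaps S reads from the corresponding atom:
L \;\mathrm{sw}\; S \iff \forall i.\; S_i \xrightarrow{\mathrm{rf}} L_i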

(Not sure if that definition can be made to work with SeqCst though.)

Diggsey commented 5 years ago

@HadrienG2 that's a better way of explaining it, although I think I had in mind that it would synchronise if it observes any of the writes from the compound store, as that is closer to what the hardware guarantees.

HadrienG2 commented 5 years ago

@Diggsey I would be wary of claiming that this is guaranteed to always work on hardware architectures with a weak memory model. In my opinion, a definition based on "all the writes" is safer in this respect (since "all writes" is a superset of "any write"), and it doesn't actually forbid much on the developer's and optimizer's side.

Diggsey commented 5 years ago

I did not intend to claim that; I was talking about the specific case of SIMD on x86.

In my opinion, a definition based on "all the writes" is safer in this respect, and doesn't actually forbid much on the developer's and optimizer's side.

I think that doesn't quite work because the writer may not be using SIMD. The writer may write a single atomic u32, and I would want a SIMD-based reader to acquire the changes from that writer if it sees the u32 that was written as part of the SIMD value. (Assuming everything is aligned correctly)

In my example of the concurrent hash map, if I observe any of the hashes to match the hash I'm looking for, and then I try to read the corresponding value, I would want to guarantee that the value had already been written. (Assuming the writer writes the value first, and then stores the hash)

HadrienG2 commented 5 years ago

(Continuation of my previous post which raced with your comment being published)

Even if it could be made to work, a definition based on "any write" could preclude useful optimizations on a hardware with a weak memory model.

Consider a hypothetical but suspiciously ARM-like 32-bit weak memory processor which is being asked to perform a 256-bit atomic memcpy with Acquire/Release synchronization. With an "all the writes must have been observed in order to synchronize" definition, this pseudo-assembly is legal:

// Thread 1
st.relaxed r1 -> [target]
st.relaxed r2 -> [target+4]
st.relaxed r3 -> [target+8]
st.relaxed r4 -> [target+12]
st.relaxed r5 -> [target+16]
st.relaxed r6 -> [target+20]
st.relaxed r7 -> [target+24]
st.release r8 -> [target+28]

// Thread 2
ld.acquire [target+28] -> r8
ld.relaxed [target] -> r1
ld.relaxed [target+4] -> r2
ld.relaxed [target+8] -> r3
ld.relaxed [target+12] -> r4
ld.relaxed [target+16] -> r5
ld.relaxed [target+20] -> r6
ld.relaxed [target+24] -> r7

With an "observing any write is enough to synchronize" definition, however, this compilation is unsound, and all individual operations must have acquire or release ordering, which can be much more costly in terms of memory barriers when doing large atomic memcpys.


(Written after your post)

In my opinion, a definition based on "all the writes" is safer in this respect, and doesn't actually forbid much on the developer's and optimizer's side.

I think that doesn't quite work because the writer may not be using SIMD. The writer may write a single atomic u32, and I would want a SIMD-based reader to acquire the changes from that writer if it sees the u32 that was written as part of the SIMD value. (Assuming everything is aligned correctly)

Oooh, that's nasty. IIRC, the situation on this front is that LLVM tried to specify mixed-size atomics or volatile accesses (a writer and reader using operations of different width), but C++11 doesn't, and it seems we aren't quite ready to accept LLVM-ish major extensions to the C++11 memory model in Rust at this point in time.

That being said, your use case could be made to work under an "all writes must have been observed" definition, if we specified atomics of heterogeneous width to work such that if all u32 writes are Release and the compound load is Acquire, then the compound load synchronizes with every individual u32 Release store that it observed.

What would not be allowed, but is not relevant to your use case as far as I understand, is the reverse scenario of an u32 Acquire load synchronizing with a partial view of a 256-bit Release store.

RalfJung commented 5 years ago

For x86 your question was answered with a "No, SIMD acquire loads are not a thing".

Fair.

I expected that to refer to what LLVM says :

That's just element-wise atomic; it doesn't need anything new to be explained. Just a loop.

There should be an edge from the "last effective read", but we don't specify what order (within a single thread) those reads happen in.

That sounds like a new beast in the world of weak memory concurrency. Not sure if it is implementable in hardware, but it's not something LLVM exposes to my knowledge, so there are some way more fundamental questions to be answered here first. Also, everything I said here applies.

I'd rather we restricted ourselves to what actually seems feasible instead of designing some novel SIMD extensions for LLVM.^^

I think that doesn't quite work because the writer may not be using SIMD. The writer may write a single atomic u32, and I would want a SIMD-based reader to acquire the changes from that writer if it sees the u32 that was written as part of the SIMD value. (Assuming everything is aligned correctly)

Ah, now that sounds somewhat familiar. There are some models for mixed-size atomics. But AFAIK (a) all of them are operational, no axiomatic models exist (which is good IMO, as mentioned elsewhere I strongly prefer operational models), but (b) they are all hardware-level models, not language-level models. That means they can be defined in terms of syntactic data dependencies and also do not have to think about some of the crazy weak behaviors that only optimizations can introduce. You can find some of these papers here.

comex commented 5 years ago

If you want an acquire SIMD load, how about just doing a relaxed/unordered SIMD load followed by an acquire fence (std::sync::atomic::fence)? Similarly, if you want a release store, do a relaxed/unordered store preceded by a release fence.

On x86, both acquire and release fences expand to literally nothing (they're just a compiler fence). To rephrase that from another perspective, SIMD load instructions are at least as strong as a series of relaxed loads (in unspecified order) followed by an acquire fence, and SIMD store instructions are at least as strong as a release fence followed by a series of relaxed stores. Well, either that, or those instructions are unsound to use in an atomic context at all, but I don't think that's true.
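
A sketch of that pattern, assuming an x86-64 target (Rust has no relaxed SIMD load, so a plain intrinsic load stands in; under the current model the racy load itself is still formally UB, so this shows the intended lowering rather than a blessed source-level idiom):

use std::arch::x86_64::{__m128i, _mm_loadu_si128};
use std::sync::atomic::{fence, Ordering};

// `_mm_loadu_si128` is baseline SSE2 on x86-64.
unsafe fn wide_load_then_acquire(p: *const __m128i) -> __m128i {
    let v = _mm_loadu_si128(p); // the wide, plain (non-atomic) load
    fence(Ordering::Acquire);   // compiles to no fence instruction on x86
    v
}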

HadrienG2 commented 5 years ago

The problem is that on x86 (and probably on other archs as well), SIMD loads and stores are architecturally not guaranteed to be atomic in the sense of being indivisible, and are provably not atomic on some CPU models. They are best modeled as an unordered stream of atomic load or store operations, LLVM atomic-memcpy style.

comex commented 5 years ago

Indeed. I'm just saying that they do have ordering "as a whole" (with respect to other instructions), which can be modeled by adding a fence before/after that stream of atomic store/load operations.

RalfJung commented 1 year ago

Triage:

So there's not really anything left to be tracked here; new issues should be opened for the potentially missing features.