rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do
https://rust-lang.github.io/unsafe-code-guidelines

SIMD + Data races #209

Closed Diggsey closed 1 year ago

Diggsey commented 5 years ago

The Rustonomicon defines a data race like so:

  • two or more threads concurrently accessing a location of memory
  • one of them is a write
  • one of them is unsynchronized

AIUI, all data races are automatically UB.

However, there is useful behaviour whose obvious implementation is forbidden in Rust with this definition. One example of this is with an atomic write, but an unsynchronised read:
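For instance (a minimal sketch; the exact types are illustrative):

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

static CELL: AtomicU64 = AtomicU64::new(0);

fn main() {
    // Thread 1: atomic (relaxed) write.
    let writer = thread::spawn(|| CELL.store(42, Ordering::Relaxed));

    // Thread 2: unsynchronised plain read of the same location --
    // a data race by the definition above, hence UB.
    let reader = thread::spawn(|| unsafe {
        let p = &CELL as *const AtomicU64 as *const u64;
        let _ = p.read();
    });

    writer.join().unwrap();
    reader.join().unwrap();
}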

(assuming thread 2 doesn't care whether it observes the changes from thread 1, or even if it observes partial changes)

For most cases, you can work around this by just doing a relaxed atomic read instead. The data race goes away and you just have to worry about normal race conditions. As a bonus, you avoid seeing partial results.
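A sketch of that workaround, assuming the shared location is an AtomicU64:

use std::sync::atomic::{AtomicU64, Ordering};

// The racy plain read becomes a relaxed atomic load: no data race,
// no torn value; only ordinary race conditions remain.
fn read_shared(cell: &AtomicU64) -> u64 {
    cell.load(Ordering::Relaxed)
}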

However, if you want to do a SIMD or other "wide" read then it becomes impossible in Rust. You would have to use assembly to avoid introducing UB.

My question is: if one wanted to support this in Rust, what would be a valid way to do it? Is this even OK to do at the assembly level? We can't introduce atomic versions of SIMD types, because SIMD is inherently not atomic. Would there need to be "volatile" versions of all SIMD operations whose only difference is that they don't introduce UB?

Lokathor commented 5 years ago

1) Volatile doesn't affect whether an operation is thread-safe or not. As far as I know, Rust doesn't currently have atomic-volatile reads; you only get one or the other.
2) I believe the data race definition is an LLVM-level rule that gets propagated up into Rust, so there's no way for Rust to just set aside the rule.

Diggsey commented 5 years ago

Yeah, I used "volatile" for lack of a better word. Atomic would be incorrect because the results are not and cannot be atomic.

I believe the data race definition is an LLVM-level rule that gets propagated up into Rust, so there's no way for Rust to just set aside the rule.

Ah yes, that's a problem. However, speculative reads are allowed in LLVM (they just return undef), so if they implement the "freeze" operation, you could combine the two (do a speculative read and then freeze the result) to read non-atomically from a racy memory location without introducing UB.

Lokathor commented 5 years ago

You're not the first to want a "freeze" operation ;3

comex commented 5 years ago

In theory, LLVM's atomic memcpy can express the desired semantics, and a memcpy of constant size equal to a SIMD register width could then be optimized to a SIMD load. But LLVM doesn't know how to do that optimization.

Volatile loads will work in practice, and I claim (to some dispute) that it is correct to rely on them working; see also this thread about the interaction between atomics and volatile. But they have stronger semantics than necessary – e.g. they can't be elided if unused.

RalfJung commented 5 years ago

Volatile loads will work in practice, and I claim (to some dispute) that it is correct to rely on them working

Specifically, part of this dispute is that LLVM does not actually say that they are guaranteed to work that way, and LLVM devs were not very sympathetic to adding such a guarantee to their language. We cannot reasonably rely on things not guaranteed to us by LLVM.

But yes it seems like what you want is a relaxed SIMD load here, one that's permitted to tear but not subject to the UB rules. That's the "right semantics" here, I think we all agree?

gnzlbg commented 5 years ago

However, if you want to do a SIMD or other "wide" read then it becomes impossible in Rust. You would have to use assembly to avoid introducing UB.

Which SIMD reads are you talking about? IIRC, on stable Rust, the aligned load / store core::arch::x86_64 intrinsics all synchronize with Release / Acquire and are atomic because they are specified to map to some x86_64 instruction, and the semantics of that instruction hold. If LLVM is not taking their ordering into account, then we might want to open an LLVM bug.

Diggsey commented 5 years ago

But yes it seems like what you want is a relaxed SIMD load here, one that's permitted to tear but not subject to the UB rules. That's the "right semantics" here, I think we all agree?

Sounds right to me.

Which SIMD reads are you talking about? IIRC, on stable Rust, the aligned load / store core::arch::x86_64 intrinsics all synchronize with Release / Acquire and are atomic because they are specified to map to some x86_64 instruction, and the semantics of that instruction hold. If LLVM is not taking their ordering into account, then we might want to open an LLVM bug.

I am talking about those intrinsics, but to be clear: there's no actual bug here, I'm just trying to work out what I can rely on. What you're saying sounds plausible, but it's super unclear to me that this is actually guaranteed, and we may need to relax the wording around what constitutes a data race, or else define these intrinsics as being "synchronised".

RalfJung commented 5 years ago

the aligned load / store core::arch::x86_64 intrinsics all synchronize with Release / Acquire and are atomic because they are specified to map to some x86_64 instruction, and the semantics of that instruction hold.

No, that is certainly not the case. These all count as non-atomic accesses unless explicitly stated otherwise. The only time you may use x86-level reasoning for anything is volatile. I don't think you can even rely on two of these intrinsics synchronizing with each other as that would seriously constrain reordering in the optimizer -- usually the opposite of what you want with SIMD.

If LLVM is not taking their ordering into account, then we might want to open an LLVM bug.

I see no bug here. Only and exclusively atomic operations synchronize with anything. These operations are not atomic, so the compiler can reorder, duplicate and eliminate them however it wants.

they are specified to map to some x86_64 instruction

So are they volatile? Because that is the only mechanism I know to actually get HW-level semantics into the surface language in some sense. For these things, I think the x86 instruction is mostly meant as demonstration to avoid having to spell out SIMD in the abstract machine -- but from what I learned, the entire purpose of using these intrinsics is to make the compiler optimize them, which immediately implies they are not volatile and thus do not guarantee to map to anything.

hanna-kruppe commented 5 years ago

@gnzlbg That's just plain wrong.

Many std::arch intrinsics map to plain old LLVM IR constructs which have all the usual rules, including loads and stores for intrinsics such as _mm_load_ps. This is also consistent with how other compilers (C and C++) handle such intrinsics.

It is naively appealing to think of intrinsics as "mapping to some hardware instruction", but in reality intrinsics are and should be subject to compiler optimizations, and have behavior that balances optimization power with allowing programmers to achieve good codegen. They provide access to some hardware capabilities but do not automatically import all aspects of the underlying hardware into the language semantics. That is often not even reasonably possible (e.g., recall the now-deprecated intrinsics for reading and writing EFLAGS). Guaranteeing specific sequences of binary code is the domain of inline assembly, not of intrinsics.

gnzlbg commented 5 years ago

These operations are not atomic, so the compiler can reorder, duplicate and eliminate them however it wants.

The Intel spec for _mm_load_ps says that it generates a movaps instruction, which is an atomic instruction (no tearing), and also specifies its semantics as:

dst[127:0] := MEM[mem_addr+127:mem_addr]

which performs a single atomic read of 16 bytes from memory.

So AFAICT, _mm_load_ps is at least an atomic load with an Unordered ordering, and it would be unsound for LLVM to transform it into a load that tears.

So are they volatile?

The spec does not mention "volatile" anywhere nor if these intrinsics are ordered with respect to other intrinsics or not.

Many std::arch intrinsics map to plain old LLVM IR constructs which have all the usual rules, including loads and stores for intrinsics such as _mm_load_ps.

Sure, AFAICT the implementation of _mm_load_ps is incorrect, because the current implementation technically allows the load to tear, even though currently that does not happen in practice.

hanna-kruppe commented 5 years ago

The Intel spec for _mm_load_ps says that it generates a movaps instruction, which is an atomic instruction (no tearing), and also specifies its semantics as:

dst[127:0] := MEM[mem_addr+127:mem_addr]

which performs a single atomic read of 16 bytes from memory.

So AFAICT, _mm_load_ps is at least an atomic load with an Unordered ordering, and it would be unsound for LLVM to transform it into a load that tears.

If this is a correct reading of Intel's documentation, then not even ICC implements this behavior. A strict "intrinsic used => instruction occurs in binary" reading is plainly false, and even if you retreat to some kind of behavioral as-if rule (difficult to define but let's assume you manage) you'll find discrepancies such as _mm_load_ps(NULL) loading just one byte from address 0 into al and then calling abort() even though that is certainly not how movaps from address 0 behaves.

The simplest explanation for this and many other observations is that you're interpreting intrinsics in an overly literal way. "Generate this specific instruction and graft whatever it does in hardware onto the program semantics" is almost never a reasonable approach for an optimizing compiler. It's always some middle ground between preserving ability to reason about the program and being able to generate code that makes use of the hardware functional units that the programmer wants to program. This is not some radical new idea but, as I've said and illustrated before, how intrinsics work even in the C compilers that set all the precedents we're trying to emulate in Rust.

RalfJung commented 5 years ago

So AFAICT, _mm_load_ps is at least an atomic load with an Unordered ordering, and it would be unsound for LLVM to transform it into a load that tears.

Would it also be unsound for LLVM to reorder two _mm_load_ps to different locations? Or to remove a dead load? Or to do any other optimization that would be illegal on volatile? If the answer to any of these questions is "yes", then the semantics of _mm_load_ps is very far from that of movaps. All of these transformations can change the observable behavior of an assembly program (when observed by other assembly programs or IO registers).

I think you are just experiencing a not-very-carefully written spec here. The spec works when looking at compiler backends, but by applying it to the surface language you are basically pretending the Rust memory model is the same as that of the hardware, when that is clearly not the case.

gnzlbg commented 5 years ago

Would it also be unsound for LLVM to reorder two _mm_load_ps to different locations? Or to remove a dead load? Or to do any other optimization that would be illegal on volatile?

Those are good questions, the spec doesn't say.

you'll find discrepancies such as _mm_load_ps(NULL) loading just one byte from address 0 into al and then calling abort() even though that is certainly not how movaps from address 0 behaves.

For _mm_load_ps(NULL), the Intel spec also provides an attempt at "operational semantics" for that intrinsic ("reads 16 bytes atomically from address 0"), and this also does not match the codegen that gets produced.

I see your point that probably the spec only provides movaps as an example of one instruction sequence that might get generated, but I doubt that the operational semantics provided in the spec for the intrinsics are just examples as well, and for _mm_load_ps I read those as performing a single atomic load.

and even if you retreat to some kind of behavioral as-if rule (difficult to define but let's assume you manage)

FWIW, for the intrinsics we do not guarantee that the same instruction sequence that Intel prescribes is generated (even though many users open issues related to that), but we do guarantee that whatever we generate is semantically equivalent to whatever the Intel documentation specifies (EDIT: whatever that might be).

RalfJung commented 5 years ago

for _mm_load_ps I read those as performing a single atomic load.

The load is likely guaranteed to not tear but that does not mean it is atomic in the sense of the C11 memory model.

but we do guarantee that whatever we generate is semantically equivalent to whatever the Intel documentation specifies (EDIT: whatever that might be).

Crucially, this is using equivalence on the Rust Abstract Machine, not the x86 spec. And for the Rust Abstract Machine, the fact that these instructions are not "atomic" in the sense of the concurrency memory model makes a huge difference.

gnzlbg commented 5 years ago

The load is likely guaranteed to not tear but that does not mean it is atomic in the sense of the C11 memory model.

FWIW by atomic I mean in the LLVM memory model sense, where the load would be atomic with at least an unordered ordering.

RalfJung commented 5 years ago

I see no reason why it should be treated like that. And I see a lot of reasons it should not be treated like that, as it will inhibit optimizations.

gnzlbg commented 5 years ago

I see no reason why it should be treated like that.

So IIUC, what you are saying is that we do not / cannot guarantee that _mm_load_ps does not tear?

RalfJung commented 5 years ago

I said nothing like that. Tearing is not even a concept in the Rust Abstract Machine -- there is no way, on that level, to observe whether an access tore or not.

The way I see it, these intrinsics are hints to the compiler to please use certain efficient HW instructions and registers to achieve the specified effect. But the only guaranteed part is that, e.g., this thing will load 128 bits (or whatever it is) of memory into the return value somehow. If the compiler thinks it can do that more efficiently than by emitting movaps (such as doing const-prop because it knows the resulting value), it can do that. If the register is spilled and the current thread does not write to the affected memory, the compiler is free to re-load the thing into the register again later (which assumes absence of data races). And so on, just like for "normal" loads.

comex commented 5 years ago

movaps is not guaranteed to be atomic at the ISA level. It is atomic in practice on most x86 CPUs, as a side effect of how they implement memory caching, but the link describes a CPU that does allow it to tear. Only cmpxchg16b can perform guaranteed-atomic 128-bit accesses.

gnzlbg commented 5 years ago

The way I see it, these intrinsics are hints to the compiler to please use certain efficient HW instructions and registers to achieve the specified effect. But the only guaranteed part is that, e.g., this thing will load 128 bits (or whatever it is) of memory into the return value somehow.

These intrinsics must be more than hints because as you mention we do guarantee some semantics for them (e.g. loading 16 bytes from memory), and right now, we have RFC'ed that if we expose the intrinsic we guarantee whatever the vendor spec guarantees for them. For the cases in which the intrinsic does not currently work or cannot work according to the vendor spec (e.g. the cases @rkruppe mentioned above: EFLAGS, MXCSR, etc.), we just don't provide the intrinsic. Unsafe code is allowed to and often does rely on the guaranteed semantics of the intrinsics for correctness, e.g., there is quite a bit of stable Rust crypto code out there relying on the AES intrinsics actually using the CPU crypto instructions - if those were "just hints" those intrinsics would be useless and all that code would become insecure. (EDIT: another example is how the ARM intrinsics are being used as a replacement for inline assembly, where the intrinsic semantics are "do a volatile write of value X to register Y".)

If the compiler thinks it can do that more efficiently than by emitting movaps (such as doing const-prop because it knows the resulting value), it can do that.

For the semantics that we do guarantee, currently, we consider it correct for the compiler to emit any instruction sequence that preserves those semantics and is at least as efficient as the instruction sequence "recommended" (or specified, depending on how you read the spec) by the vendor. That allows const propagation, but it wouldn't allow, e.g., breaking the semantics of the program by emitting a store instead of a load, or by replacing an atomic load by two smaller loads iff the vendor specifies that the load must be atomic, etc.

If the register is spilled and the current thread does not write to the affected memory, the compiler is free to re-load the thing into the register again later (which assumes absence of data races)

Sure, iff the vendor intrinsic does not synchronize in any way, then that would be a valid program transformation.

HadrienG2 commented 5 years ago

@gnzlbg IMO, there is a certain threshold in desire for machine control where intrinsics shouldn't be used, and (possibly inline) assembly should be used instead. But I'm not skilled enough at language lawyering to tell where exactly this point lies.

To give some context, in my professional community (scientific computation), intrinsics are typically used to perform vectorization by hand when the compiler's optimizer is not smart enough to do it on its own. In this context, defining intrinsics as mere syntactic sugar for inline assembly, which provides little more than automatic register allocation, would be unhelpful, because it would have the effect of killing almost all other compiler optimizations, forcing even more hardware-specific micro-optimization work to be done by hand.

For example, if compilers cannot be trusted to optimize a SIMD store to a memory location immediately followed by a SIMD load from the same location into nothingness, then some SIMD abstractions like linear algebra libraries cannot be made as performant as reasonably well-written custom assembly at all, at least not without resorting to horrible code generation hacks like C++'s expression templates.
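For instance (an illustrative sketch; the function and scratch buffer are hypothetical):

use core::arch::x86_64::{__m128, _mm_add_ps, _mm_load_ps, _mm_store_ps};

// An abstraction boundary that round-trips a vector through memory.
// If intrinsics had volatile-like semantics, the compiler could not
// fold the store/load pair away and keep the value in a register.
// `scratch` must be 16-byte aligned for the aligned store/load.
unsafe fn add_through_memory(a: __m128, b: __m128, scratch: *mut f32) -> __m128 {
    _mm_store_ps(scratch, a);
    let a2 = _mm_load_ps(scratch); // expected to fold to just `a`
    _mm_add_ps(a2, b)
}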

In this respect, I'm more sympathetic to the view expressed by @RalfJung here, although I will readily admit that many current compilers treat intrinsics in a very cautious fashion that is more aligned with the view that you are expressing here.

Of course, if the vendor intrinsic spec is very narrow-minded and forces very specific hardware semantics upon the compiler, that's a different problem, and then compliant compilers have to behave as you describe...

gnzlbg commented 5 years ago

@HadrienG2 It all boils down to what the particular intrinsic guarantees to its users. For example, the embedded WG got team blessing for adding intrinsics to core::arch whose guaranteed semantics are "emits ASM instruction X operating on registers Y and Z in a volatile way" with the objective of being able to write embedded Rust on stable without having to wait for inline assembly (e.g. instead of writing inline assembly, you just call intrinsics... which AFAICT doesn't necessarily work, but if you only need one instruction that's fine). Saying that such semantics are a "hint" and that the compiler is allowed to generate a different instruction that operates on different registers makes them all useless.

There are dozens of such intrinsics exposed in core::arch::arm/aarch64 (e.g. see this module: https://github.com/rust-lang/stdarch/tree/master/crates/core_arch/src/acle).

HadrienG2 commented 5 years ago

Speaking personally, I see the existence of those intrinsics as a hint that the need for inline assembly is becoming more pressing in this particular community... But I'm well aware that finite person-power and design tradeoffs are a thing :wink:

Lokathor commented 5 years ago

I think there's a need for both "do this exactly" and also "just do a semantic effect, I don't care how". And there's also a need to be more clear about which is which.

HadrienG2 commented 5 years ago

@Lokathor Agreed ! I think inline assembly (once stabilized) should become the tool of choice for "do this exactly", freeing up most intrinsics to just "do something that has similar semantics".

Lokathor commented 5 years ago

(Who needs inline assembly when you can just link an extern object file into the build? ;3 )

HadrienG2 commented 5 years ago

(It allows extra optimizations for code around the inline assembly snippet, e.g. the compiler may not need to spill all registers to memory and reload them after the fact. Also, there are times where you don't want to have a CALL instruction around, such as when aiming for maximal performance or doing very low level stuff that messes with the active thread's stack)

comex commented 5 years ago

Unsafe code is allowed to and often does rely on the guaranteed semantics of the intrinsics for correctness, e.g., there is quite a bit of stable Rust crypto code out there relying on the AES intrinsics actually using the CPU crypto instructions - if those were "just hints" those intrinsics would be useless and all that code would become insecure.

Why, because of timing attacks? What if the compiler (for some reason) emitted a code sequence that did not use the CPU crypto instructions but was guaranteed to be constant-time?

gnzlbg commented 5 years ago

@comex

Yes, the AES-NI instructions guarantee constant-time. Depending on how we interpret the spec and the hint/guarantee, the compiler might need to generate the exact same instruction in a "volatile way", might be able to generate a different instruction sequence that's also constant time (e.g. maybe just doing const prop), or it might be ok to generate an instruction sequence that is not constant time.

FWIW I don't think one can use Rust to write constant time code at all, but these intrinsics are available, people are using them, and they are reading the same unclear docs that we are.

elichai commented 5 years ago

This is an interesting and important discussion which I don't want to derail, but @gnzlbg why don't you think one can write constant-time code in Rust? Why do you think it's so much harder than in C? Thanks.

gnzlbg commented 5 years ago

@elichai The moment you have two intrinsics calls, like:

fn foo(x: ...) -> ... {
    let y = constant_time_intrinsic0(x);
    constant_time_intrinsic1(y)
}

even if they are both guaranteed to be constant-time, we provide no guarantees about foo itself being constant-time, and for example, it would be valid to transform foo to this:

fn foo(x: ...) -> ... {
    let y = constant_time_intrinsic0(x);
    if some_flag { some_register = 0; }
    constant_time_intrinsic1(y)
}

which isn't constant-time. For this same reason, you can't really write code that's guaranteed to be constant-time in C either. None of this stops people from trying, and if they only need a single constant time operation and have one intrinsic that provides it with that guarantee (e.g. much like an inline asm block) they can even succeed.

elichai commented 5 years ago

@gnzlbg are you suggesting that inline asm can guarantee constant timeness?

About generally "guaranteeing", I agree with you: all of the ANDs, XORs, etc. that people do is just "best effort", both in Rust and in C, and most good libraries keep inspecting the generated assembly to see if it has changed.

gnzlbg commented 5 years ago

@elichai yes, that is what I was suggesting. The different forms of inline assembly that we have (global_asm!, asm!, ...) can be used to write assembly that is inserted into the binary more or less "as is".

RalfJung commented 5 years ago

These intrinsics must be more than hints because as you mention we do guarantee some semantics for them (e.g. loading 16 bytes from memory)

But that can also be achieved through 16 separate loads. The hint part is to use SSE instead of something more naive. Generally, except for performance, all SSE operations can be expressed with lower-level operations, right? Just do everything N times. Using the intrinsics (a) compactly expresses "do this for every element of the vector" as a single instruction (having these semantics is the guaranteed part), and (b) hints at the compiler to use a particular hardware instruction to realize these Rust Abstract Machine semantics.
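For example (a sketch; the scalar function is just for illustration), the guaranteed semantics of something like _mm_add_ps could equally be written as four scalar additions:

// Same Rust-Abstract-Machine semantics as an SSE vector add; the
// intrinsic additionally hints "please use a single addps for this".
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}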

That allows const propagation, but it wouldn't allow, e.g., breaking the semantics of the program by emitting a store instead of a load, or by replacing an atomic load by two smaller loads iff the vendor specifies that the load must be atomic, etc.

Given the lack of clarity in the vendor spec and the precedent in existing compilers, I think we can treat it as a given that these are not specified to be atomic.

e.g., there is quite a bit of stable Rust crypto code out there relying on the AES intrinsics actually using the CPU crypto instructions - if those were "just hints" those intrinsics would be useless and all that code would become insecure

Why insecure? Are you referring to timing side-channels? Those are outside the scope of the Rust Abstract Machine, I am afraid.

we consider it correct for the compiler to emit any instruction sequence that preserves those semantics and is at least as efficient as the instruction sequence "recommended"

"efficiency" or performance is not a guarantee that we are making anywhere. We just promise to try hard to not screw that up, but sometimes we do -- and when we do, that's a bug, but not a spec violation.

but it wouldn't allow, e.g., breaking the semantics of the program by emitting a store instead of a load,

WTF.... yeah nobody suggested that. This discussion is getting quite exhausting I have to say.

the embedded WG got team blessing for adding intrinsics to core::arch whose guaranteed semantics are "emits ASM instruction X operating on registers Y and Z in a volatile way" with the objective of being able to write embedded Rust on stable without having to wait for inline assembly

volatile is a totally different game as mentioned above. Though this does sound like the lang team and preferably the UCG should be involved.

Also, please stop moving the goalposts; we were talking about SIMD intrinsics and that mm_load one in particular -- what is true for SIMD intrinsics does not have to be true for all intrinsics (obviously); constant-time concerns and whatever we did for some embedded ARM needs are not very related.

So coming back on-topic -- both precedent in other compilers and the way these SIMD intrinsics are used indicate that there is no guarantee at all that any particular hardware instruction is executed. Instead, they merely make it easier for the compiler to recognize certain patterns. For many intrinsics, I hope you agree that volatile-style guarantees are very much not desirable, as they inhibit optimizations. In fact, when I was introduced to these intrinsics, the main argument for them was that they are better than inline assembly because they don't kill optimizations! Hence I am quite surprised when @gnzlbg says they want to specify these intrinsics as basically sugar for inline assembly (and I am specifically referring to things like _mm_load_ps). That would, from all I know, make them useless.

Oh and can we please stop de-railing into constant-time execution? That's a huge subject on its own, but not very related to SIMD at all.

gnzlbg commented 5 years ago

Hence I am quite surprised when @gnzlbg says they want to specify these intrinsics as basically sugar for inline assembly

Not at all, but whatever.

But yes it seems like what you want is a relaxed SIMD load here, one that's permitted to tear but not subject to the UB rules.

Do we currently guarantee that data-race-free programs never observe partial writes?

RalfJung commented 5 years ago

Not at all, but whatever.

Well, you suggested specifying them as some hardware instruction. If that's not the same as using inline assembly, then what is it?

Do we currently guarantee that data-race-free programs never observe partial writes?

Partial writes do not exist as a concept in the Rust Abstract Machine. So, yes.

gnzlbg commented 5 years ago

Is the operation you are proposing atomic?

RalfJung commented 5 years ago

Since you seem to use "atomic" in a different sense than me, please specify the question.

"Atomic" in surface Rust refers to these operations. SIMD is not atomic in the sense of the C++ or Rust concurrency memory models. What this means is data races are UB.

RalfJung commented 5 years ago

In particular, an operation cannot be just "atomic"; every atomic operation comes with an ordering parameter indicating how it behaves. So a concrete operation is either "sequentially consistent atomic" or "release atomic" or so, but never just "atomic".

SIMD instructions do not have such a parameter. So they cannot reasonably be atomic. (We could say they are all using some fixed ordering, say, they could all be "relaxed atomic", but that would just raise the question why the ordering cannot be picked by the user.)
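To illustrate the contrast (a sketch):

use core::arch::x86_64::{__m128, _mm_load_ps};
use std::sync::atomic::{AtomicU32, Ordering};

// An atomic operation always takes an explicit ordering parameter...
fn atomic_read(x: &AtomicU32) -> u32 {
    x.load(Ordering::Acquire)
}

// ...while a SIMD load has no such parameter to supply.
// (`p` must be 16-byte aligned.)
unsafe fn simd_read(p: *const f32) -> __m128 {
    _mm_load_ps(p)
}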

gnzlbg commented 5 years ago

Since you seem to use "atomic" in a different sense than me, please specify the question.

Could you maybe provide precise semantics for the operation that you are proposing instead?

In your proposal, you left out the word atomic, and used "relaxed with tearing" instead. I really have no idea what semantics you want to give it.

RalfJung commented 5 years ago

In your proposal, you left out the word atomic, and used "relaxed with tearing" instead. I really have no idea what semantics you want to give it.

That particular comment was proposing a new, not-yet-existing operation to solve the OP's problem. Aren't we still in the business of trying to clarify the semantics of the currently existing operations, or are we in agreement on that? No SIMD operation that currently exists (that I know of) is relaxed or otherwise atomic (assuming my definition of "atomic" as above).

Could you maybe instead provide precise semantics for the operation that you are proposing instead?

In my vocabulary, "precise" requires a mathematical formal definition. So I'm afraid I cannot, as we first need a precise definition of what the state (in particular, the memory) of the Rust Abstract Machine looks like.

But I can try to clarify, once I understood your question.

gnzlbg commented 5 years ago

That particular comment was proposing a new, not-yet-existing operation to solve the OP's problem.

That's the only thing I'm asking about: What are the semantics of the operation you are proposing?

Aren't we still in the business of trying to clarify the semantics of the currently existing operations, or are we in agreement on that?

I'm not in agreement with that, but resolving that issue is orthogonal to solving @Diggsey's problem.

RalfJung commented 5 years ago

I'm not in agreement with that, but resolving that issue is orthogonal to solving @Diggsey's problem.

I do not see how it is orthogonal -- from what you said here, you seem to suggest that the existing intrinsics already solve their problem.

What are the semantics of the operation you are proposing?

Basically the same as atomic memcpy, but with a hint to the compiler to please use a vector load. And "atomic memcpy" has the semantics of copying the elements in a loop using relaxed accesses. (This is the Rust semantics. In LLVM the accesses are "unordered atomic", which is weaker than "relaxed atomic", but Rust currently does not have "unordered".)
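A sketch of those semantics (the function is hypothetical; the hope is that a compiler could lower it to a single vector load):

use std::sync::atomic::{AtomicU8, Ordering};

// "Atomic memcpy": copy `len` bytes using per-byte relaxed loads.
// Each byte is read race-free, but the copy as a whole may tear.
unsafe fn relaxed_memcpy(dst: *mut u8, src: *const AtomicU8, len: usize) {
    for i in 0..len {
        let byte = (*src.add(i)).load(Ordering::Relaxed);
        dst.add(i).write(byte);
    }
}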

That said, we could also reasonably expect LLVM to use SIMD for that by itself, and report it as an LLVM bug if it doesn't.

gnzlbg commented 5 years ago

@Diggsey

For most cases, you can work around this by just doing a relaxed atomic read instead. The data race goes away and you just have to worry about normal race conditions. As a bonus, you avoid seeing partial results.

However, if you want to do a SIMD or other "wide" read then it becomes impossible in Rust. You would have to use assembly to avoid introducing UB.

I'm not sure if this is what you meant, but if you use assembly to introduce a data-race, that's still a data-race.

My question is: if one wanted to support this in Rust, what would be a valid way to do it? Is this even OK to do at the assembly level? We can't introduce atomic versions of SIMD types, because SIMD is inherently not atomic.

What do you want to achieve?

For example, today, you can write

use core::arch::x86_64::__m128;
use core::mem::transmute;
use core::sync::atomic::{AtomicU64, Ordering::Relaxed};

#[repr(align(16))]
pub struct Atomics(AtomicU64, AtomicU64);

pub extern "C" fn foo(x: &Atomics) -> __m128 {
    unsafe { transmute(
        [x.0.load(Relaxed), x.1.load(Relaxed)]
    )}
}

and LLVM not emitting a single movaps for this looks like an LLVM bug (https://llvm.godbolt.org/z/57zkef). This extends to 256-bit and 512-bit vectors using vmovaps instead. IIUC what @comex proposes here might be a more robust way for enabling this optimization (EDIT: which LLVM also doesn't do, yet), but if you only care about relaxed loads, something like the code above should work fine once the bug is fixed.

If instead what your algorithm needs is to truly perform a single 128-bit load into a SIMD register, then as @comex mentioned you can use core::arch::x86_64::cmpxchg16b for that; however, there is no way to do the same for 256-bit and 512-bit vectors. This isn't a Rust limitation, but a hardware limitation.
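A sketch of the cmpxchg16b approach (note that it issues a store, so the memory must be writable and 16-byte aligned, and it is much slower than a plain movaps):

use core::arch::x86_64::cmpxchg16b;
use std::sync::atomic::Ordering;

// Atomically read 16 bytes by "exchanging" the value with itself:
// if *p == 0 the instruction stores 0 back (a no-op); otherwise the
// exchange fails. Either way, the current value is returned.
#[target_feature(enable = "cmpxchg16b")]
unsafe fn atomic_load_128(p: *mut u128) -> u128 {
    cmpxchg16b(p, 0, 0, Ordering::Relaxed, Ordering::Relaxed)
}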

Would there need to be "volatile" versions of all SIMD operations whose only difference is that they don't introduce UB?

If your program has a data-race, volatile doesn't fix it.

HadrienG2 commented 5 years ago

I'm not sure if this is what you meant, but if you use assembly to introduce a data-race, that's still a data-race.

I think the point is that data races have reasonably well-defined semantics at the hardware layer, even though they are undefined at the Rust Abstract Machine layer. So a data race between two snippets of assembly code (inline or otherwise) should probably be allowed.

(Which, if we take it just a bit further, also means that data races involving only volatile reads and writes should probably be allowed as well, given that volatile means "just defer to hardware semantics". But then volatile would have hardware-defined data race semantics, which is not terribly helpful in the context of a portable high-level program.)

RalfJung commented 5 years ago

So a data race between two snippets of assembly code (inline or otherwise) should probably be allowed.

I could imagine a way to specify inline assembly and/or volatile that makes this work, but I have no idea if LLVM implements that. In particular for volatile, I am fairly sure that LLVM is free to reorder non-volatile operations wrt. volatile operations; that demonstrates that data races are not allowed.

HadrienG2 commented 5 years ago

This is taking us dangerously close to #152 territory, so to avoid derailing the present thread further I will reply there.

gnzlbg commented 5 years ago

@HadrienG2

I think the point is that data races have reasonably well-defined semantics at the hardware layer, even though they are undefined at the Rust Abstract Machine layer. So a data race between two snippets of assembly code (inline or otherwise) should probably be allowed.

Unknown code is not allowed to introduce errors in the Rust abstract machine. For example, if you pass a &T to some unknown code, and the code writes through it, that would be an error. (*)

Now, if Rust passes a reference to some Atomic_ to two different threads of execution, each of which uses unknown code to perform an "unsynchronized" access, that would also be an error in the Rust abstract machine. If we were to allow that, you could just implement an intrinsic that allows performing unsynchronized accesses on top of unknown code.

So I think that a data race between unknown code is UB and we should not allow that.

Now, "calling" into an inline assembly block isn't just like calling into an FFI function, because for the inline assembly block we currently can specify things like "clobbers all memory" (is that a data-race?), is "volatile", etc. that we can't specify for FFI functions (yet - one can always open an inline assembly block that does an FFI call..). If among other things no loads and stores of any kind can be re-order around certain calls into unknown code, those calls "synchronize" in certain ways. We could, e.g., have a clobber/constraint or function attribute that says that certain unknown code performs an atomic read/write of a particular variable with a particular ordering. AFAIK such a thing doesn't exist in LLVM but at the language level we could do that.

(*) There are trickier cases, like say you pass a &mut bool to unknown code, and that code writes 42 to it, but before returning back to Rust, it writes 0 to it. I think this is also an error in the Rust abstract machine, and if a Rust binary using unknown code were running on top of something valgrind-like, we would be able to tell.

HadrienG2 commented 5 years ago

@gnzlbg Please correct me if I'm wrong, but this sounds like a continuation of the "volatile and data race" topics that I've brought back to #152 , so I'll reply there with a full quote.

Diggsey commented 5 years ago

I'm not sure if this is what you meant, but if you use assembly to introduce a data-race, that's still a data-race.

Data races are defined in terms of the memory model: in this case, the Rust memory model. When writing assembly, that memory model is irrelevant, because you're writing code for actual hardware, not an abstract machine.

What do you want to achieve? ...

I was originally interested in whether the approach used by hashbrown to efficiently scan hashmap buckets could be used to implement a concurrent hashmap (coincidentally, the exact use-case was one where elements would never be removed from the hashmap, simplifying things considerably).

However, I think this question is relevant even without such a concrete example: I mean it's clear from the amount of disagreement here that this area could do with some clarification. Everybody seems to have different expectations.

If your program has a data-race, volatile doesn't fix it.

I feel like this is rehashing the discussion with @comex and @Lokathor

Now, if Rust passes a reference to some Atomic_ to two different threads of execution, each of which uses unknown code to perform an "unsynchronized" access, that would also be an error in the Rust abstract machine. If we were to allow that, you could just implement an intrinsic that allows performing unsynchronized accesses on top of unknown code.

So I think that a data race between unknown code is UB and we should not allow that.

This doesn't make much sense. How is it an error in the Rust abstract machine, when it's not running on the Rust abstract machine - it could be a language with an entirely different memory model? If it's unknown code, then the optimizer is not allowed to assume anything about it.

Anyway, to get back on topic, it sounds like either the atomic memcpy or the atomic load/transmute option might be able to do the right thing assuming LLVM gets the requisite optimisations... It would also need to figure out that the "destination" of the memcpy can be promoted to a register, and doesn't actually need to be written back to memory.

However, it's rather unsatisfying to hope that the optimiser does the right thing (which is why the SIMD intrinsics exist at all!), and I could definitely see cases where you might want to have e.g. an "Acquire" memory ordering on the operation. That would be very problematic, because I don't see how LLVM could ever optimise a sequence of "acquire" atomic loads into a SIMD load, since the atomic loads would have slightly stronger guarantees.