What about: volatile, concurrency, and interaction with untrusted threads

rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do

https://rust-lang.github.io/unsafe-code-guidelines

Apache License 2.0

666 stars 58 forks source link

What about: volatile, concurrency, and interaction with untrusted threads #152

Open hsivonen opened 5 years ago

hsivonen commented 5 years ago

Designate outside-of-program changes to memory accessed by volatile as non-UB

Context: An internals thread.

Use case: Rust is used to write privileged code (host services provided by a runtime environment to a JITed language, OS kernel providing syscall services to userland code, or a hypervisor providing emulated devices to a guest system) that needs to access memory that also unprivileged code can access and the unprivileged code can have multiple threads such that while unprivileged thread A has requested a service from the host such that the host service is running logically on A's thread of execution, a separate unprivileged thread of execution B could, if it is behaving badly, concurrently access the same memory from another CPU core. The unprivileged thread of execution must not be allowed to cause the privileged code written in Rust to experience UB. (It's fine for the unprivileged code to cause itself to experience UB within the bounds of its sandbox.)

The memory model itself is a whole-program model, so it doesn't apply, since in order to provide the guarantees it pledges to our thread, we must pledge the absence of data races from other threads of execution, which we can't do in this case. Hence, we need a way to access memory that is outside the memory model in the sense that there could exist an adversarial additional thread of execution that doesn't adhere to the DRF requirement. We're not trying to communicate with that thread of execution. The issue is just not letting it cause security bugs on us.

The C++ paper P1152R0 "Deprecating volatile" gives this use case as the very first item on its list of legitimate uses of volatile in C and C++. This makes sense, since if volatile works when external changes are caused by memory-mapped IO (the use case documented for std::ptr::read_volatile and the original use case motivating the existence of volatile in C), given the codegen for volatile and codegen for relaxed atomics on architectures presently supported by Rust, it makes sense for it to also work also when external changes are caused by a rogue thread of execution.

Yet, the documentation says: "a race between a read_volatile and any write operation to the same location is undefined behavior". I believe it's unnecessary and harmful to designate this as UB and it would be sufficient to merely say that the values returned by read_volatile are unpredictable in that case. This makes sense in the light of an IO-like view of volatile: You need to be prepared to receive any byte from an IO stream, so not knowing at compile time what you are going to get does not have to be program-destroying UB if you are prepared to receive value not predicted at compile time.

I suggest that a) the documentation be changed not to designate concurrent external modification of memory locations that a Rust program only accesses as volatile to be UB and b) to state in the Unsafe Code Guidelines that it's legitimate to use volatile accesses to access memory that a thread of execution external to the Rust program might change concurrently. That is, while you may read garbage, the optimizer won't assume that two volatile reads from the same location yield the same value and won't invent reads from memory locations written to using volatile writes (i.e. the memory locations are considered shared and, therefore, ineligible to be used as spill space by the compiler).

Replies to the thread linked to above indicate that this should already be the case despite the documentation suggesting otherwise.

Also see https://github.com/rust-lang/unsafe-code-guidelines/issues/152#issuecomment-506027424 which tries to summarize the discussion-until-then a bit.

RalfJung commented 5 years ago

I finally got around to read the original thread on IRLO. There is one important aspect there that I don't think has been mentioned explicitly here yet: all this talk about volatile only comes up as a kludge to work around bad LLVM codegen for relaxed/unordered accesses. This is not the first time that problem came up.

As @comex pointed out, there exists an intrinsic in LLVM that is a much better match for this: llvm.memcpy.element.unordered.atomic is atomic (no data race problems), but it allows optimizations such as removing redundant adjacent stores or loads, and it allows tearing. Exposing that intrinsic in Rust is discussed at https://github.com/rust-lang/rust/issues/58599.

Unfortunately, its codegen is so bad currently that it is not actually useful. We should be aware that everything we are doing here with volatile is just a work-around for bad LLVM codegen, and the real fix is to improve that codegen!

hsivonen commented 5 years ago

What's "invisibility" here? Do you mean "indivisibility"? I am confused now though, do you want or do you not want indivisibility? If you want it, relaxed seems fine, if you don't want it then the outcome of the discussion about it doesn't matter...

"Invisibility" is a typo. I meant "indivisibility". I don't want indivisibility, because I'm not trying to communicate with the other thread. Relaxed in LLVM provides lock-based indivisibility for SIMD.

There is one important aspect there that I don't think has been mentioned explicitly here yet: all this talk about volatile only comes up as a kludge to work around bad LLVM codegen for relaxed/unordered accesses.

I acknowledged this in my previous comment, though with the typo. Still, as far as the topic of dealing with outside memory model things go, volatile in C++ seems to try to be for that purpose while with atomics one is relying on the compiler not being able to tell that the other thread of execution might not adhere to the rules.

hsivonen commented 5 years ago

They seem to be guaranteed by this C++ proposal, which I believe to try to capture the behavior of the LLVM intrinsics that Rust builds upon here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1382r0.pdf

My only issue with that paper is that its item "i" says that certain volatile accesses don't constitute data races, which implies there are others that might. I'd be happier if volatile accesses categorically didn't constitute data races. (Other types of accesses to the same memory locations could still constitute data races resulting in asymmetric UB, but asymmetric UB is exactly what I'm looking for here.)

There exists a new version of the paper: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1382r1.pdf . It modifies the definition of "data race" to carve out exclusions of UB for volatile loads:

The execution of a program contains a data race if it contains two potentially concurrent conflicting actions,neither of which is a volatile load,at least one of which is not atomic or volatile,and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.

RalfJung commented 5 years ago

I acknowledged this in my previous comment, though with the typo. Still, as far as the topic of dealing with outside memory model things go, volatile in C++ seems to try to be for that purpose while with atomics one is relying on the compiler not being able to tell that the other thread of execution might not adhere to the rules.

I think it's incorrect to say that this change in C++ accounts for "the other thread not adhering to the rules". UB is still a program-global concept in C++. When you do a volatile write that races with a non-volatile non-atomic access in the untrusted sandbox, you still have UB. None of what we are talking about here changes the complexity of specifying the behavior of a sandbox.

Sure, it lets you do a read that can never cause UB (assuming it is in-bounds and so on). But that's not making any measurable progress towards answering the question "what if the other thread divides by 0".

at least one of which is not atomic or volatile

Wait so they make volatile-write-volatile-write races defined? oO

comex commented 5 years ago

I think it's incorrect to say that this change in C++ accounts for "the other thread not adhering to the rules". UB is still a program-global concept in C++. When you do a volatile write that races with a non-volatile non-atomic access in the untrusted sandbox, you still have UB. None of what we are talking about here changes the complexity of specifying the behavior of a sandbox.

I personally agree that atomic but non-volatile accesses to shared memory are a viable approach, or at least a viable starting point. But there are some potential roadblocks:

To quote cppreference:

The C++ standard recommends (but does not require) that lock-free atomic operations are also address-free, that is, suitable for communication between processes using shared memory.

But there don't seem to be any actual implementations where they're not.
In theory, the compiler could perform insane optimizations that assume atomics aren't used with shared memory. For example: "there are no atomic loads in this entire program that could potentially alias with this atomic store, so drop the store". Are such optimizations permissible? I think the answer is "nobody knows", since as you mention, inter-process concurrency is out of scope for the C++ spec.
But at least some C++ spec people think volatile is needed for shared memory. From JF Bastien's No Sane Compiler Would Optimize Atomics:

Atomic operations are unsuitable to express that memory locations can be externally modified. Indeed, volatile (or volatile atomic) should be used in these circumstances.

Similarly, the 'Deprecating volatile' paper linked in the OP, by the same author, states that "volatile is the right way to avoid time-of-check time-of-use (ToCToU) races" for "shared memory with untrusted code".

comex commented 5 years ago

(My previous post mentions shared memory. Admittedly, the OP's use case isn't exactly shared memory, but (a) shared memory is a common use case and we need a story for it; (b) the OP's use case looks a lot like shared memory, in the sense that both boil down to "the untrusted program can do arbitrary things to a designated memory region, but cannot affect the rest of the trusted program's memory, not even if the untrusted program internally executes UB".)

RalfJung commented 5 years ago

In theory, the compiler could perform insane optimizations that assume atomics aren't used with shared memory. For example: "there are no atomic loads in this entire program that could potentially alias with this atomic store, so drop the store". Are such optimizations permissible? I think the answer is "nobody knows", since as you mention, inter-process concurrency is out of scope for the C++ spec.

I am pretty sure that "de-atomization" of atomic accesses is allowed, if the compiler can show that that does not introduce a race. But that only works if the compiler can "see" all code that touches a location. Which, with a sandbox, it never can.

A correct modular compiler (compiling parts of a program) has to assume there are other parts, and these parts may do anything that a legal C++ program may do. So arguing based on "other modules linked with this could do X" is not playing "optimization chicken", this is actually a proper way of arguing about the boundaries of what a compiler may do.

The hard part here is the thing I said as if it was completely natural: the other parts may do anything a legal C++ program may do. Fixing C++ as the language here can easily become a problem, but saying "any language" is not a thing I know how to make precise. ;) Gladly, for the concurrency part of the discussion here, I think C++ is good enough.

But at least some C++ spec people think volatile is needed for shared memory. From JF Bastien's No Sane Compiler Would Optimize Atomics:

I was very confused here for a second, in my vocabulary a concurrent program with two threads accessing the same lock is "shared memory". It is very common to say "shared memory" to distinguish from "message passing".

But I think you mean "shared across process boundaries", in contrast to "shared across thread boundaries". It seems strange to me that this would make any difference. In particular, I would be interested what they mean by

A sufficiently advanced compiler, performing some of the optimizations described above, can seriously harm code which uses shared memory naïvely.

That would basically mean assuming that memory returned by mmap is private to the process. Sure a compiler could treat mmap like malloc, but I doubt the spec would back that.

On the other hand, they clearly have though about this a lot.^^ I just see no arguments that actually support their claim.

The optimizations they describe, such as load-forwarding for atomics, are fine for memory shared accros thread boundaries (otherwise they would be incorrect), and they are just as fine for memory shared across process boundaries. I have yet to see an example that would be harmful here, short of compilers treating mmap like malloc.

comex commented 5 years ago

But I think you mean "shared across process boundaries", in contrast to "shared across thread boundaries".

(Yes, that's the sense in which I meant it. This sense is also a common usage, e.g. on Unix, Windows.)

I need to think some more about the rest of your post.

RalfJung commented 5 years ago

(Yes, that's the sense in which I meant it. This sense is also a common usage, e.g. on Unix, Windows.)

Yeah I am aware, I just never realized before that I used the same term in two different ways.^^ So far it was always clear from context for me.

Wikipedia:

In computer science, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. [...}

Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory.

RalfJung commented 5 years ago

Btw, I have since then learned that LLVM developers do not consider write-write races UB. They consider them to basically leave the memory as undef (uninitialized).

To my knowledge, there is no theoretical/formal work at all studying the properties of this memory model.

HadrienG2 commented 5 years ago

Replying to @RalfJung replying to me in #209.

I think the point is that data races have reasonably well-defined semantics at the hardware layer, even though they are undefined at the Rust Abstract Machine layer. So a data race between two snippets of assembly code (inline or otherwise) should probably be allowed.

(Which, if we take it just a bit further, also means that data races involving only volatile reads and writes should probably be allowed as well, given that volatile means "just defer to hardware semantics". But then volatile would have hardware-defined data race semantics, which is not terribly helpful in the context of a portable high-level program.)

I could imagine a way to specify inline assembly and/or volatile that makes this work, but I have no idea if LLVM implements that. In particular for volatile, I am fairly sure that LLVM is free to reorder non-volatile operations wrt. volatile operations; that demonstrates that data races are not allowed.

Just to clarify, I was talking about a scenario where the memory region where the data race occurs is solely accessed via volatile accesses and raw pointers (no Rust references nor non-volatile loads and stores to that specific memory location).

Do you maintain that in this situation, LLVM reordering non-volatile operations on other memory locations around racey volatile operations is an obstacle to making volatile data races hardware-defined (where "hardware-defined" means "they do whatever hardware loads and stores do in the event of a data race", including tearing writes and reads).

RalfJung commented 5 years ago

Do you maintain that in this situation, LLVM reordering non-volatile operations on other memory locations around racey volatile operations is an obstacle to making volatile data races hardware-defined

For now I still maintain that. So here's an example, and please let me know if this is in-scope for how you want to make hardware-level memory models apply to volatile operations: We have two threads doing the classical message-passing via release-acquire (DATA and FLAG are global variables and initially 0, the rest are local):

// Thread 1
DATA = 42;
volatile_write(FLAG, 1);

// Thread 2:
while volatile_read(FLAG) != 1 {} // wait for FLAG
println!("{}", DATA);

If we are on x86, hardware semantics definitely says that all reads and writes have at least release-acquire semantics. So the FLAG read in thread 2 syncs-with the write in thread 1. Thus if we fully applied the hardware memory model, we'd expect the program to always output 42.

And yet if LLVM reorders the two stores in thread 1, that's not what happens.

IOW, the issue with concurrency is not just the races itself, but also how making a certain observation in one read instruction affects possible observations on subsequent operations, including subsequent non-atomic operations (and similar, how some write implicitly "publish" non-atomic writes that happened before). Or to put it differently, "synchronizes with something" is a possible side-effect that a function/operation can have, and that LLVM (I think) assumes volatile operations do not have (because that effect would preclude reorderings). And at that point it doesn't matter if the hardware says that there is synchronization.

HadrienG2 commented 5 years ago

That program would remain UB under the volatile data race semantics that I have in mind. Informally, my rationale would be that...

DATA is a accessed using non-volatile loads and stores, and therefore subjected to regular Rust semantics (not hardware ones).
Rust semantics demand that Acquire and Release (or stronger) memory orderings be used in order to synchronize access to variables.
Volatile loads and stores do not count as Acquire and Release barriers at the Rust layer, no matter what hardware says at its own layer.

Formalizing this, however, may prove to be particularly difficult. Maybe it could be worded like this: "Volatile accesses defer to hardware semantics, including when it comes to how data races are handled. When hardware specifies that loads and stores entail certain memory barriers, however, it should be noted that those memory barriers only apply to surrounding volatile accesses, and not to non-volatile accesses."

HadrienG2 commented 5 years ago

But to clarify, this variant of your program would be considered valid, and on x86 (but not on Arm, POWER, Alpha...), it would be guaranteed to systematically print "42":

// Thread 1
volatile_write(DATA, 42);
volatile_write(FLAG, 1);

// Thread 2:
while volatile_read(FLAG) != 1 {} // wait for FLAG
println!("{}", volatile_read(DATA));

HadrienG2 commented 5 years ago

Synchronization between the hardware memory model and the Rust memory model, if ever desired, can be implemented today by using compiler fences to prevent the kind of undesirable reordering that LLVM likes to engage in. I'm not sure if more is needed.

gnzlbg commented 5 years ago

Formalizing this, however, may prove to be particularly difficult

Why isn't "volatile does not synchronize." enough ?

Also, the claim that "volatile reads / writes do whatever the hardware does" is kind of useless IMO because we do not guarantee to which hardware instructions these reads and writes lower to, and LLVM doesn't either. In particular, LLVM does not guarantee that, e.g., a volatile read / write performs a single memory operation on hardware. So something like this:

// Initialization
Data = 0

// Thread 1
volatile_write(Data, MAX)

// Thread 2
value = volatile_read(Data)
if value == 0 {
  // ...
} else {
  assert!(value, MAX); // Can FAIL
}

can fail because the volatile_write and volatile_read can be a sequence of writes or reads on hardware, so the volatile_read can observe a value different from 0 or MAX.

joshlf commented 5 years ago

Given that a) it seems like volatile operations might (depending on what we end up deciding for the Rust spec) provide stronger guarantees than are needed for the inter-process shared memory use case and that, b) volatile operations might (again depending on what we end up deciding for the Rust spec) end up being defined as architecture-dependent, an alternative could be to introduce new machinery (perhaps in the form of functions in core, perhaps in the form of new wrapper types a la Cell, etc) that provide only the weakest guarantees necessary to prevent UB in the case of a trusted process operating on memory that is concurrently modified by an untrusted process? The semantics of such machinery would be platform-agnostic, and we could define those semantics today, and then let the debates about the exact semantics of volatile and about Rust's memory model in general continue to play out. We could adjust the implementation details of this machinery as those discussions progress without needing to change the machinery's API or spec.

Does that sound like something folks would be interested in? If so, I can open a new issue in this repo to discuss it.

HadrienG2 commented 5 years ago

@gnzlbg Actually, LLVM does guarantee what you need here:

Platforms may rely on volatile loads and stores of natively supported data width to be executed as single instruction. For example, in C this holds for an l-value of volatile primitive type with native hardware support, but not necessarily for aggregate types. The frontend upholds these expectations, which are intentionally unspecified in the IR. The rules above ensure that IR transformations do not violate the frontend’s contract with the language.

And the reason is that although the C language has AFAIK never guaranteed that about volatile, all low-level C code in existence relies on it, and compilers therefore cannot change it without causing massive havoc. The LLVM/clang community just managed to build the Linux kernel recently, I don't think they would be happy about breaking that particular build again so soon.

I also wish that volatile operations of non-natively-supported width would be a compilation error, but I guess we must be bugward-compatible with C here...

Why isn't "volatile does not synchronize." enough ?

Well, volatile does synchronize other volatile accesses, if your hardware says it does. Therefore, I think "volatile does not synchronize" would be too strong.

@joshlf That totally sounds like something I could get behind! The less people need to defer to hardware semantics with volatile the better, IMO. Making two unrelated memory models (hardware's and rust's) cohabitate is a very delicate exercise that should only be needed for very low level use cases like hardware driver development.

gnzlbg commented 5 years ago

@HadrienG2

Actually, LLVM does guarantee what you need here:

No, it does not guarantee that the following code works, and well, it doesn't work (this just sets DATA in my example to be an u128 on x64): https://rust.godbolt.org/z/Pc61d- . The problem is that Rust supports volatile operations for types that do not have native hardware support.

I also wish that volatile operations of non-natively-supported width would be a compilation error, but I guess we must be bugward-compatible with C here...

What for? We could just deprecate the read_volatile/write_volatile methods and provide ones that are guaranteed to work, just like we do for atomics.

Well, volatile does synchronize other volatile accesses, if your hardware says it does.

My hardware doesn't really have to say anything about volatile accesses, Rust does. Deferring to hardware semantics, but without guaranteeing how many instructions, which instructions, whether they are always the same instructions, what the native widths are, etc. is IMO not very useful. I think it would be more useful to try to establish "happens-before" relations for volatile accesses with respect to other accesses depending on target and width or similar, and require the backend to preserve them, but I don't know if it can be done.

HadrienG2 commented 5 years ago

My hardware doesn't really have to say anything about volatile accesses, Rust does. Deferring to hardware semantics, but without guaranteeing how many instructions, which instructions, whether they are always the same instructions, what the native widths are, etc. is IMO not very useful.

Well, as long as you only use volatile accesses of natively supported width, Rust and LLVM guarantee that each of your volatile loads will translate into exactly one hardware load of the same width, and each of your volatile stores will translate into exactly one hardware store of the same width.

The spec hole which allows large volatile accesses to translate into multiple hardware instructions is definitely not great, and should ideally be closed. But volatile in its current C-like state is workable enough that people have been successfully building C hardware drivers relying on MMIO using volatile for decades. And MMIO can be really picky about the number and width of memory accesses...

But let's not go on too much about this specific topic, that's what #33 is for.

I think it would be more useful to try to establish "happens-before" relations for volatile accesses with respect to other accesses depending on target and width or similar, and require the backend to preserve them, but I don't know if it can be done.

We already have the required synchronization primitive for enforcing a certain order between volatile and non-volatile ops at the assembly level. When combined with volatile accesses that are sufficiently small to be guaranteed to reduce to individual hardware loads and stores, I think that might actually be good enough for what you have in mind.

gnzlbg commented 5 years ago

Well, as long as you only use volatile accesses of natively supported width, Rust and LLVM guarantee that each of your volatile loads will translate into exactly one hardware load of the same width, and each of your volatile stores will translate into exactly one hardware store of the same width.

Where does Rust guarantee this?

We already have the required synchronization primitive for enforcing a certain order between volatile and non-volatile ops at the assembly level. When combined with volatile accesses that are sufficiently small to be guaranteed to reduce to individual hardware loads and stores, I think that might actually be good enough for what you have in mind.

Yes, I think this and an appropriate guarantee for how volatile synchronizes with volatile might be enough.

HadrienG2 commented 5 years ago

@gnzlbg

@HadrienG2

I think the point is that data races have reasonably well-defined semantics at the hardware layer, even though they are undefined at the Rust Abstract Machine layer. So a data race between two snippets of assembly code (inline or otherwise) should probably be allowed.

Unknown code is not allowed to introduce errors in the Rust abstract machine. For example, if you pass a &T to some unknown code, and the code writes through it, that would be an error. (*)

Now, if Rust passes a &T to some Atomic_ to two different threads of execution, each of which uses unknown code to perform "unsynchronized" access, that would also be an error in the Rust abstract machine. If we were to allow that, you could just implement an intrinsic that allows performing unsynchronized accesses on top of unknown code.

So I think that a data race between unknown code is UB and we should not allow that.

Now, "calling" into an inline assembly block isn't just like calling into an FFI function, because for the inline assembly block we currently can specify things like "clobbers all memory" (is that a data-race?), is "volatile", etc. that we can't specify for FFI functions (yet - one can always open an inline assembly block that does an FFI call..). If among other things no loads and stores of any kind can be re-order around certain calls into unknown code, those calls "synchronize" in certain ways. We could, e.g., have a clobber/constraint or function attribute that says that certain unknown code performs an atomic read/write of a particular variable with a particular ordering. AFAIK such a thing doesn't exist in LLVM but at the language level we could do that.

(*) There are trickier cases, like say you pass a &mut bool to unknown code, and that code writes 42 to it, but before returning back to Rust, it writes 0 to it. I think this is also an error in the Rust abstract machine, and if a Rust binary using unknown code were running on top of something valgrind-like, we would be able to tell.

In every programming language that has either of volatile, inline assembly, or FFI to native code, the abstract machine has a sizeable hole punctured into it through which one can observe hardware semantics that differ from the abstract machine semantics. Unfortunate as it can be when it comes to devising robust and portable memory models, this is a fact that we need to live with.

It is, in my opinion, not a tenable position to say that such hardware-level operations may not engage in any behavior that is forbidden by the Rust abstract machine. After all, why should they be bound by the rules of Rust, if they are not written in Rust, and do stuff which is not observable to Rust? Therefore, we must instead clarify what exactly is the boundary which hardware-level operations must not cross in order to avoid interfering with the Rust abstract machine, or in other words when exactly hardware-level operations become observable to Rust.

Here's a highly imperfect 5-minutes attempt at this that we can iterate on:

"Hardware-level memory operations are said to interact with the Rust abstract machine as soon as they modify memory which is accessible via Rust references, accessed by Rust code using non-volatile pointer operations, or subjected to data validity invariants other than that of being initialized.

Any hardware-level memory operation which interacts with the Rust abstract machine must do so in a manner which is compatible with the rules of said abstract machine, or else the program's behavior is undefined."

From this perspective:

Using assembly or volatile to write to an &T (where T has no interior mutability) is not okay, because an &T reference exists, which binds the assembly code to the reference aliasing rules of the Rust abstract machine.
Using assembly or volatile to write 42 to an *mut bool is not okay, because bool is subjected to a data validity invariant other than initialized-ness, and this data validity invariant binds code which manipulates rust bools to the data validity rules of the Rust abstract machine.
Using volatile loads and stores to engage in a data race on a *mut u8 which is only accessed via volatiles loads and stores and has no Rust references to it is okay, because this memory location does not interact with the Rust abstract machine and memory operations affecting it are therefore not subjected to the rules of said abstract machine.

Interestingly enough, engaging in a data race by attempting to simultaneously store true and false into an *mut bool _ via volatile operations may or may not be UB, in a hardware-dependent fashion, because these operations are bound to the rules of the Rust abstract machine by bool's data validity invariant, and whether they actually follow these rules in the event of a data race is hardware dependent if volatile means "defer to hardware". That's what you get for introducing hardware-specified memory operations in a language with a hardware-agnostic memory model...

HadrienG2 commented 5 years ago

Well, as long as you only use volatile accesses of natively supported width, Rust and LLVM guarantee that each of your volatile loads will translate into exactly one hardware load of the same width, and each of your volatile stores will translate into exactly one hardware store of the same width.

Where does Rust guarantee this?

Actually, Rust does provide this guarantee if you look hard enough at the current definition:

Rust does not currently have a rigorously and formally defined memory model, so the precise semantics of what "volatile" means here is subject to change over time. That being said, the semantics will almost always end up pretty similar to C11's definition of volatile.

The compiler shouldn't change the relative order or number of volatile memory operations. However, volatile memory operations on zero-sized types (e.g., if a zero-sized type is passed to write_volatile) are noops and may be ignored.

If the Rust compiler cannot change the number of volatile memory operations, then it cannot turn a 32-bit volatile operation into two 16-bit volatile operations. Only LLVM may engage in this sort of things (and on its side, it guarantees that it will only do so, if you ask a large enough volatile op).

Arguably, the wording above even makes LLVM's splitting of large volatile ops illegal according to Rust's rules ;) But I don't think that will be enough to convince the lang team to deprecate oversized volatile ops in the name of informal spec violation.

We already have the required synchronization primitive for enforcing a certain order between volatile and non-volatile ops at the assembly level. When combined with volatile accesses that are sufficiently small to be guaranteed to reduce to individual hardware loads and stores, I think that might actually be good enough for what you have in mind.

Yes, I think this and an appropriate guarantee for how volatile synchronizes with volatile might be enough.

Is "the compiler shouldn't change the relative order or number of volatile memory operations" good enough as a guarantee of how volatile synchronizes with volatile?

gnzlbg commented 5 years ago

Do you consider "the compiler shouldn't change the relative order or number of volatile memory operations" to be an appropriate guarantee for how volatile synchronizes with volatile?

I'm not sure if I would say that this is a "synchronizes with"-kind of guarantee, since "synchronizes with" is often understood to mean that also other operations are not reordered across these, but that's not what this guarantee says (I've also used the "synchronizes with" expression for volatile, so sorry about the confusion).

There is an ongoing internal thread about adding an Ordering::Unordered to the atomics. Since this guarantee would only hold for "types with native width", which are the only widths for which we provide the target-dependent Atomic_ types, this feels a bit like adding an Ordering::Volatile to the atomics as well (just using different syntax if we do so via read_volatile/write_volatile).

HadrienG2 commented 5 years ago

Note that as the name suggests, Ordering::Unordered allows the compiler to reorder memory operations to a single memory location. Although potentially usable for synchronization, this is too permissive for several use cases where volatile is used today, such as MMIO.

Volatile accesses are more like SeqCst with each other, Unordered with the rest (but they are actually even stronger than SeqCst because e.g. duplicate store elimination is not allowed).

HadrienG2 commented 5 years ago

Okay, I think I have a better idea than the one presented above. One that will require less language spec churn, and that we can easily adopt today without pressuring LLVM to clarify their semantics.

In a nutshell, the idea is that LLVM's non-atomic volatile operations have unnecessarily weird semantics that are almost never actually desired, and that we should therefore...

Expose (possibly partial) support for LLVM's atomic volatile operations.
Soft-deprecate our current non-atomic volatile operations by modifying the documentation of ptr::read_volatile and ptr::write_volatile to strongly suggest using Relaxed atomic volatile operations instead.

Doing so will have the benefit of...

Matching hardware-level data race semantics, which is what we want if volatile is to mean "defer to hardware semantics".
Eliminating the current footgun where a large volatile operations can be split into multiple hardware operations, replacing it with a compilation error which is what is usually desired.

In the future, atomic volatile semantics could also be used in other areas of the language memory model that interact with the hardware memory model, such as inline assembly.

If that sounds interesting, read on for details.

First of all, let's take a step back and ponder what volatile operations mean to the programmer and why they are typically used.

A stream of sufficiently narrow volatile accesses is guaranteed to translate into an identical stream of hardware loads and stores at the hardware assembly level, without any splitting, merging, elimination or reordering of the volatile operations themselves (though non-atomic operations around them remain subjected to the usual optimization rules). If appropriate precautions are taken (no Rust references, no non-volatile accesses), it is also possible to guarantee that no extra access will be added by the compiler to the corresponding memory location.

Volatile accesses are typically used in two circumstances:

To handle MMIO use cases where hardware on the memory bus must be accessed using very specific patterns at the assembly level, and hardware memory semantics are therefore desired as opposed to Rust memory semantics. See #33 for more discussion.
To handle shared memory use cases where the Rust compiler must assume the possibility of arbitrary external accesses to certain regions of memory, or arbitrary side-effects from memory loads and stores originating from virtual memory tricks, which are not part of the program under study. This makes most memory access optimizations unsound.

It could be argued that volatile is only really appropriate for the first use case, and a little bit too strong for the second use case. But in the interest of getting a reasonable solution out of the door quicker, requesting that volatile be used for shared memory accesses for now seems like a reasonable stance. It's totally possible to introduce an abstraction that's better suited to the need of shared memory communication, such as atomic memcpy, in a later iteration.

If volatile is to mean "defer to hardware semantics", which everyone seems to agree on so far, then volatile should also defer to hardware semantics when it comes to data races. This is especially sensible when one considers that MMIO, which is the killer use case for volatile, is essentially a data race between the CPU and other hardware on the computer's memory bus.

However, LLVM's volatile (which is essentially inherited from C) doesn't do this. Instead, it specifies data races between volatile accesses as UB. In fact, LLVM's volatile does not even guarantee the single property that is most important to memory-mapped IO, which is that every volatile access will translate into one single hardware load or store instruction of identical width.

Specifying volatile data races as UB is overly pessimistic with respect to the memory model of every real hardware that I am aware of, and therefore an impediment to the intended "defer to hardware memory model" semantics of volatile. It is also a security issue when volatile accesses are used for shared-memory interactions with untrusted code, which is the topic of this issue.

I argue that instead, exclusive use LLVM's atomic volatile with Relaxed ordering, which on cache-coherent hardware has almost exactly the same semantics as today's volatile, but without the "data race are UB" and "overly wide memory accesse can be split without a compilation error" issues, is a better match for real hardware semantics, and therefore more desirable where volatile is to mean "defer to hardware semantics".

(Actually, Unordered semantics would probably be just fine too for volatile's purposes, but there are good reasons to avoid exposing these in stable Rust for now.)

It could be countered that the security issue remains for interaction with untrusted code, as said code could purposely use non-atomic non-volatile writes to shared memory in order to trigger undefined behavior in the trusted code. Fully addressing this will require fixing LLVM's volatile semantics so that e.g. volatile reads of data with no validity invariants cannot return undef. But on all current hardware and assuming absence of LTO between trusted and untrusted code (which could go wrong in so many other ways), I believe that this undefined behavior currently cannot result in any observable consequence (unwanted compilation output, etc). So we can probably live with it for now, and just wait for LLVM to fix their memory model.

When I mentioned this idea on Zulip, I received several interesting comments:

@comex pointed out that although real cache-coherent hardware does not have "data race are UB" semantics and provides Relaxed ordering for free, this may not be true of simulated hardware like the WASM and JVM virtual machines. Further, for all we know, a crazy hardware manufacturer might decide to release non-cache-coherent hardware where regular loads and stores do not have Relaxed semantics someday.
- Personally, I see this as an argument in favor of keeping the current read_volatile and write_volatile instructions around for specialized use cases, but not as an argument against soft-deprecating these for all classic use cases of volatile in C.
@comex and @gnzlbg also discussed the fact that non-Relaxed atomic ordering on shared memory will only work if all binaries involved agree on a common ASM translation of the required memory barriers. We can handle this in several way:
- Only stabilize Relaxed atomic volatile initially, giving ourselves room to add more later.
- Only stabilize non-Relaxed atomic volatile for communication between Rust programs using the same version and configuration of rustc, i.e. make it part of repr(Rust).
- Allow non-Relaxed atomic volatile in a more permissive setting that most closely matches what is actually guaranteed to work, such as when everyone agrees on a single ABI/compiler/libc (request for prior art: how does linking of C programs using non-Relaxed atomic ordering actually work?).

Other thoughts? Sounds like a plan?

RalfJung commented 5 years ago

But to clarify, this variant of your program would be considered valid, and on x86 (but not on Arm, POWER, Alpha...), it would be guaranteed to systematically print "42":

(Example that use volatile for everything.)

Yeah that makes sense I think.

Synchronization between the hardware memory model and the Rust memory model, if ever desired, can be implemented today by using compiler fences to prevent the kind of undesirable reordering that LLVM likes to engage in. I'm not sure if more is needed.

AFAIK we don't have a proper model of compiler fences, so I am not sure how to make that work formally.

Platforms may rely on volatile loads and stores of natively supported data width to be executed as single instruction.

The trouble is, Rust doesn't expose what the "natively supported data width" is. @gnzlbg's proposal before was to replace the existing volatile intrinsics by ones that are guaranteed not to tear, similar to the atomic intrinsics. At that point, volatile becomes basically just another "atomic ordering", and the arbitrarily-sized volatile operations can be implemented with a loop. AFAIK that was our consensus for the best way forward. We just don't have anyone to write that up...

Given that a) it seems like volatile operations might (depending on what we end up deciding for the Rust spec) provide stronger guarantees than are needed for the inter-process shared memory use case and that, b) volatile operations might (again depending on what we end up deciding for the Rust spec) end up being defined as architecture-dependent, an alternative could be to introduce new machinery (perhaps in the form of functions in core, perhaps in the form of new wrapper types a la Cell, etc) that provide only the weakest guarantees necessary to prevent UB in the case of a trusted process operating on memory that is concurrently modified by an untrusted process?

The thing, is, I think volatile is pretty much the weakest thing you need. Interaction with foreign untrusted code, formally speaking, has to be done on the hardware/assembly level as we cannot trust that code to uphold the invariants of the Rust Abstract Machine.

One big concern here is reading undef, but even if we added a dedicated intrinsic to Rust, we still couldn't prevent that -- LLVM quite simply does not provide the required primitives to avoid undef here.

I think it would be more useful to try to establish "happens-before" relations for volatile accesses with respect to other accesses depending on target and width or similar, and require the backend to preserve them, but I don't know if it can be done.

Well, if we had an operational model of concurrency (something like this -- an excellent paper btw that I can only recommend to anyone interested in learning more about weak memory concurrency), we could bound the effect of volatile operations in terms of which part of the state they are allowed to change. Like, in terms of that paper's terminology they can do whatever on the location that is being accessed, but they can not affect the thread-local release or acquire view. That's basically a formal way of saying "they do not synchronize".

I don't know how to do this with an axiomatic, whole-program model like the C++11 one.

In every programming language that has either of volatile, inline assembly, or FFI to native code, the abstract machine has a sizeable hole punctured into it through which one can observe hardware semantics that differ from the abstract machine semantics. Unfortunate as it can be when it comes to devising robust and portable memory models, this is a fact that we need to live with.

Actually IMO it's more the other way around: the inline assembly / FFI can observe the "inner workings" of how the compiler implements the abstract machine in terms of lower-level primitives. And whatever it does, it is required to maintain the abstractions on which this implementation rests. But there cannot be a hole in the machine abstraction; if there is, all optimizations just have to be removed as they are incorrect.

It is, in my opinion, not a tenable position to say that such hardware-level operations may not engage in any behavior that is forbidden by the Rust abstract machine. After all, why should they be bound by the rules of Rust, if they are not written in Rust, and do stuff which is not observable to Rust?

The "not observable" part is the key here. For anything that is observable to Rust, these operations must follow the abstract machine's rules.

Expose (possibly partial) support for LLVM's atomic volatile operations.

Soft-deprecate our current non-atomic volatile operations by modifying the documentation of ptr::read_volatile and ptr::write_volatile to strongly suggest using Relaxed atomic volatile operations instead.

Yes please! I've said this before somewhere. There is no reason to even have non-atomic volatile operations, IMO. I think we actually should just make the existing volatile intrinsics be elementwise atomic as well. As I said above:

Let's make Volatile another concurrent ordering -- so these operations will all be concurrent (relaxed, I suppose) and also volatile.
Remove the non-atomic volatile intrinsics.
Implement the existing volatile functions in terms of a loop doing element-wise atomic-volatile accesses.

Matching hardware-level data race semantics, which is what we want if volatile is to mean "defer to hardware semantics".

We have to be extremely careful with the wording here to exclude cases of mixed "normal" and volatile accesses, as well as making sure that volatile accesses can never be used to synchronize non-volatile accesses.

I argue that instead, exclusive use LLVM's atomic volatile with Relaxed ordering [...] (Actually, Unordered semantics would probably be just fine too for volatile's purposes, but there are good reasons to avoid exposing these in stable Rust for now.)

We are in full agreement. :)

It could be countered that the security issue remains for interaction with untrusted code, as said code could purposely use non-atomic non-volatile writes to shared memory in order to trigger undefined behavior in the trusted code.

Untrusted code will need to be sandboxed somehow anyway. And formally speaking, one can show (though it's hard and the formal methods for that are in early stages) that a proper sandbox will "contain" UB within the sandbox.

@comex and @gnzlbg also discussed the fact that non-Relaxed atomic ordering on shared memory will only work if all binaries involved agree on a common ASM translation of the required memory barriers. We can handle this in several way:

This is true for so much more than just atomics though, e.g. calling convention.

HadrienG2 commented 5 years ago

@RalfJung I think we are in general agreement here when it comes to semantics, so maybe this is getting close to reaching the pre-RFC stage. If you agree, then I can do the initial writeup on internals for the purpose of subsequent iteration.

One question which must be resolved before producing a full RFC, however, is how this should be integrated in the std API. Volatile cannot be just another AtomicXyz memory ordering, because the AtomicXyz API is based on &self and the mere existence of such a reference is incompatible with volatile operations being unobservable to the Rust abstract machine. That's basically the reason why @japaric's nice volatile cell type is unsound under Rust's current rules.

A possible solution would be to have something like AtomicXyz::load_volatile(self_: *const Self, o: Ordering). Or perhaps the ordering-less variant AtomicXyz::load_volatile(self_: *const Self), that's a possible point for RFC discussion. For shared memory use cases, I think being able to specify atomic orderings on volatile ops would be useful.

Such an API is admittedly clunky without allowing pointers as self types, but people might be satisfied with that for now. And if Rust ever gets pointer self types, removing the trailing dash in self_ (which is not compatibility-breaking for a Rust API) will be all it takes to make the proposed API use them.

EDIT: @gnzlbg pointed out that with arbitrary_self_types, this kind of API can already be built. So I guess we could do the right thing from the start, and block stabilization of this feature on stabilization of the "pointer" part of arbitrary_self_types.

Also, regarding this point...

There is no reason to even have non-atomic volatile operations, IMO.

@comex was worried about the possibility of a non-cache-coherent platform (real or VM) where basic loads and stores would not have Relaxed semantics. It seemed to me that this could be a justification for keeping non-atomic volatile operations as long as we do not have weaker atomic orderings that match those platforms.

Of course, an alternative course of action would be to say that Rust's volatile only supports platforms with Relaxed or stronger native loads and stores, and wait for people coming with ~~pitchforks~~ currently unforeseen use cases to add the required extra atomic orderings when there is a clear need for it.

PS: Thanks for pointing me back to this thread, I had forgotten about it and consequently had to re-reach some of its conclusions on my own.

I disagree with the idea proposed there of offering an arch-specific solution, though. It seems to me that it should be possible to solve this problem not just for x86, but also for other archs, without much more extra work. And from a software engineering / compiler maintainability perspective, that's definitely a lot more satisfying.

HadrienG2 commented 5 years ago

Courtesy of @gnzlbg, I now have an example of what LLVM's non-atomic volatile might still be useful for.

NVidia GPUs, which can be programmed in Rust via the ill-loved nvptx target, have atomic operations whose semantics are weaker than LLVM's Relaxed. Therefore, they cannot support atomic volatile operations without extending Rust's memory model to allow weaker-than-Relaxed memory ordering. However, they can totally support "non-atomic" volatile memory operations today.

Basically, this is a practical case of the non-cache-coherent hardware whose existence @comex was wondering about.

RalfJung commented 5 years ago

Volatile cannot be just another AtomicXyz memory ordering, because the AtomicXyz API is based on &self and the mere existence of such a reference is incompatible with volatile operations being unobservable to the Rust abstract machine. That's basically the reason why @japaric's nice volatile cell type is unsound under Rust's current rules.

Oh dang. I mean that's mostly an orthogonal issue but you are right. Notice however that this issue even plagues non-volatile atomics: https://github.com/rust-lang/rust/issues/55005.

NVidia GPUs, which can be programmed in Rust via the ill-loved nvptx target, have atomic operations whose semantics are weaker than LLVM's Relaxed. Therefore, they cannot support atomic volatile operations without extending Rust's memory model to allow weaker-than-Relaxed memory ordering. However, they can totally support "non-atomic" volatile memory operations today.

Uff, I hate hardware. :( Why can't we run our software directly on the subliminal fabric of mathematics...?

However, even then the current volatile ABI is ill-suited because it doesn't specify tearing. So at least in principle, you'd want an API "somewhat like" a NonAtomicVolatile ordering, where the primitive operations all don't tear (and then we can have volatile equivalents of element-wise atomic memcpy). Maybe, after all, we should just have Volatile* types in libcore matching the Atomic* types, and they have some kind of enum to switch between atomic (relaxed) and non-atomic, or so?

bjorn3 commented 5 years ago

Why can't we run our software directly on the subliminal fabric of mathematics...?

Maybe because that would cause a major disruption in space time? :)

Lokathor commented 5 years ago

I actually have a sound crate for volatile work, voladdress, and it actually works properly because it doesn't take &T an MMIO location.

It doesn't prevent tearing though, that's still up to you.

HadrienG2 commented 5 years ago

@RalfJung I could get behind either volatile ops on AtomicXyz or dedicated Volatile types, there are arguments in favor of both. Since the semantics are the same, I think this part can be left up to the output of the Great RFC Bikeshedding.

So, shall I get started with that?

@Lokathor Cool! Thanks for the prior art!

RalfJung commented 5 years ago

So, shall I get started with that?

Yes please. :) So happy to see some movement here. :D

What's your plan of action? In the UCG we usually write up documents that are not even RFCs, but more explaining the current behavior of the language and mapping out some decisions that need to be made. Then we try to get consensus for that within the WG.

But in this case maybe it is indeed better to try and write a pre-RFC in this repo and get WG consesus on that, similar to https://github.com/rust-lang/unsafe-code-guidelines/pull/197.

Lokathor commented 5 years ago

Please take care to remember that some platforms need volatile but don't otherwise have atomics, so there needs to be some answer there too.

HadrienG2 commented 5 years ago

Well, the tricky bit here is that this is proposing a new language feature, which I think is a first in the unsafe code WG. Usually, this group is more focused on formalizing the semantics of language features that already exist... So I'm not sure if there is an existing example to take inspiration from. Process guidance is much welcome!

I guess I see two possibilities right now:

Keep this in the Unsafe Code WG for now, submit an RFC PR in the spirit of https://github.com/rust-lang/unsafe-code-guidelines/pull/197 , or...
Go right away for the "language feature" pipeline, which AFAIK is typically opening an pre-RFC thread on internals, bikeshedding and collecting edge cases, and submitting something to rust-lang/rfcs once things start to calm down.

Ultimately, I suspect that which one I pick does not really matter. The important part is getting an RFC-style text in shape and getting some review on it, and that part can be sorted out while we figure out what the right submission process should be.

HadrienG2 commented 5 years ago

In any case, here's an insomnia-induced rough draft, which I'll proofread, cross-check and improve upon later on. I can submit it to any of the aforementioned submission channels for further discussion and refinement... Just tell me where.

RalfJung commented 5 years ago

Process guidance is much welcome!

We are mostly making up the process as we go here, so...^^

Either of your proposals seems fine to me. If you want feedback from a small group of people first, make it a PR here; if you want to go immediately for IRLO and get feedback from everyone there that's fine as well.

GitHub is pretty bad for discussions but OTOH one can attach a comment to a particular line in the RFC which is nice... tradeoffs everywhere.^^

HadrienG2 commented 5 years ago

I kind of like the idea of starting with a smaller group. Crowds make a lot of noise, it's best to only request their opinion when the proposal is already in a pretty solid state to avoid getting an obvious defect being pointed out / commented on 100 times. Hence I'll submit this here first ;)

RalfJung commented 5 years ago

Possibly related: https://github.com/asajeffrey/shared-data

asajeffrey commented 5 years ago

Also https://github.com/elast0ny/shared_memory-rs/ and https://github.com/standard-ai/shmem.

asajeffrey commented 5 years ago

The really nasty issue that I can't see how to work around is if the shared memory file gets truncated so access results in a SIGBUS. c.f. https://github.com/asajeffrey/shared-data/issues/7

gnzlbg commented 5 years ago

The really nasty issue that I can't see how to work around is if the shared memory file gets truncated so access results in a SIGBUS. c.f. asajeffrey/shared-data#7

How does this happen?

AFAICT this "use-after-free" can only happen if you have a data-race, e.g., one thread-of-execution truncates the shared memory without synchronizing with another thread-of-execution reading from the file, which is already UB.

asajeffrey commented 5 years ago

The problem is that the other process may not be written in Rust, it's arbitrary other code. The Rust code uses read_volatile and write_volatile to access the memory so (assuming attackers can't write poison) race conditions aren't an issue, but sadly truncation is.

asajeffrey commented 5 years ago

An interesting question raised in https://github.com/standard-ai/interprocess-traits/issues/3#issuecomment-551362224 is volatile padding. Even if &AtomicUsize and &AtomicU8 are safe to take pointing into shared memory, is &(AtomicU8, AtomicUsize) safe? The issue being that there's some padding in between the two atomics, and the padding might be written by another process. I don't think this is a safety issue, that padding is allowed to include poison, and so it's OK for it to be volatile, but other people might have other opinions!

gnzlbg commented 5 years ago

The problem is that the other process may not be written in Rust, it's arbitrary other code.

This doesn't matter, the data-race is already UB.

The Rust code uses read_volatile and write_volatile to access the memory so (assuming attackers can't write poison) race conditions aren't an issue, but sadly truncation is.

The code uses read_volatile and write_volatile on the file data, which is a data-race, but thinking of the file as a Box<[T]>, it also access the [T]::{ptr, len} fields without any synchronization, which is also a data-race.

While you can use something like #212 to avoid the data-races on the accesses to the elements of the [T], you can't avoid the data-race on the [T]::{len, ptr} fields with #212. You need to synchronize the access to the file "somehow". There are many options for that, and none is great, so your best bet is to just not let attackers spawn or take control of processes in your system.

The issue being that there's some padding in between the two atomics, and the padding might be written by another process.

Padding doesn't matter, that looks fine.

Ekleog commented 5 years ago

@gnzlbg I was convinced that writing to the padding of non-repr(C) structs would lead to (technical-)UB if fields of the struct were read after that without rewriting the whole struct first, was I wrong? (things like “the padding of the struct might be reused for storing a wrapping enum discriminant,” and so on)

asajeffrey commented 5 years ago

@gnzlbg you appear to be assuming that a Vec is being stored in shared memory, it's not it's a SharedVec https://github.com/asajeffrey/shared-data/blob/master/src/shared_vec.rs which stores the len in an AtomicUsize, not a usize.

gnzlbg commented 5 years ago

you appear to be assuming that a Vec is being stored in shared memory,

Since you didn't mention precisely what you were doing, I was assuming that you were just using a file descriptor for a shared memory segment (e.g. the result of memfd_create) and such a file descriptor is like a Vec, in that it stores a len and a pointer, and you can mmap it, ftruncate it, etc.

it's not it's a SharedVec https://github.com/asajeffrey/shared-data/blob/master/src/shared_vec.rs which stores the len in an AtomicUsize, not a usize.

Isn't that length process local ? The length that matters is the one that gets put in shared memory and can be accessed by the different processes.

asajeffrey commented 5 years ago

There interesting case is when the Vec is in shared memory, in particular the length is shared.

Previous Next