What about: volatile, concurrency, and interaction with untrusted threads

rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do

https://rust-lang.github.io/unsafe-code-guidelines

Apache License 2.0

670 stars 58 forks source link

What about: volatile, concurrency, and interaction with untrusted threads #152

Open hsivonen opened 5 years ago

hsivonen commented 5 years ago

Designate outside-of-program changes to memory accessed by volatile as non-UB

Context: An internals thread.

Use case: Rust is used to write privileged code (host services provided by a runtime environment to a JITed language, OS kernel providing syscall services to userland code, or a hypervisor providing emulated devices to a guest system) that needs to access memory that also unprivileged code can access and the unprivileged code can have multiple threads such that while unprivileged thread A has requested a service from the host such that the host service is running logically on A's thread of execution, a separate unprivileged thread of execution B could, if it is behaving badly, concurrently access the same memory from another CPU core. The unprivileged thread of execution must not be allowed to cause the privileged code written in Rust to experience UB. (It's fine for the unprivileged code to cause itself to experience UB within the bounds of its sandbox.)

The memory model itself is a whole-program model, so it doesn't apply, since in order to provide the guarantees it pledges to our thread, we must pledge the absence of data races from other threads of execution, which we can't do in this case. Hence, we need a way to access memory that is outside the memory model in the sense that there could exist an adversarial additional thread of execution that doesn't adhere to the DRF requirement. We're not trying to communicate with that thread of execution. The issue is just not letting it cause security bugs on us.

The C++ paper P1152R0 "Deprecating volatile" gives this use case as the very first item on its list of legitimate uses of volatile in C and C++. This makes sense, since if volatile works when external changes are caused by memory-mapped IO (the use case documented for std::ptr::read_volatile and the original use case motivating the existence of volatile in C), given the codegen for volatile and codegen for relaxed atomics on architectures presently supported by Rust, it makes sense for it to also work also when external changes are caused by a rogue thread of execution.

Yet, the documentation says: "a race between a read_volatile and any write operation to the same location is undefined behavior". I believe it's unnecessary and harmful to designate this as UB and it would be sufficient to merely say that the values returned by read_volatile are unpredictable in that case. This makes sense in the light of an IO-like view of volatile: You need to be prepared to receive any byte from an IO stream, so not knowing at compile time what you are going to get does not have to be program-destroying UB if you are prepared to receive value not predicted at compile time.

I suggest that a) the documentation be changed not to designate concurrent external modification of memory locations that a Rust program only accesses as volatile to be UB and b) to state in the Unsafe Code Guidelines that it's legitimate to use volatile accesses to access memory that a thread of execution external to the Rust program might change concurrently. That is, while you may read garbage, the optimizer won't assume that two volatile reads from the same location yield the same value and won't invent reads from memory locations written to using volatile writes (i.e. the memory locations are considered shared and, therefore, ineligible to be used as spill space by the compiler).

Replies to the thread linked to above indicate that this should already be the case despite the documentation suggesting otherwise.

Also see https://github.com/rust-lang/unsafe-code-guidelines/issues/152#issuecomment-506027424 which tries to summarize the discussion-until-then a bit.

gnzlbg commented 5 years ago

I don't think we need to distinguish between internal and external threads of execution. A thread of execution is just that, and the concurrent semantics of volatile should be specified for those in general.

Volatile loads and stores are not guaranteed to be atomic (and synchronize in any particular way), that is, using them to concurrently read/write to some memory could introduce a data-race.

Suppose you have an [u16] in shared memory, and two threads of execution modify it by writing either 0x00 or 0xff to it. If you use properly synchronized memory accesses, you are guaranteed to always read either 0x00 or 0xff from the array in any of the processes. If you use volatile loads / stores, one thread of execution could read 0xf0 or 0x0f. This can easily result in mis-optimizations even if the two threads of execution are run in different processes, e.g. if the program uses a type that properly expresses these semantics (e.g. a NonZero-like type), hint::assume annotations, etc.

Now, when people use volatile loads / stores to access IO registers, they exploit the platform-specific knowledge that the volatile load / store is atomic for that particular register or memory access size on that particular architecture.

For example, when reading a temperature from a sensor by using a volatile load on a 32-bit register, the hardware often guarantees atomicity for memory loads, making it impossible to observe a partially modified result. Here you can get away with a volatile load because the toolchain and the hardware conspire together to avoid a data-race.

We could document that whether volatile loads and stores are atomic and synchronize in any particular way for a particular memory addresses, memory access sizes, target, target features, target cpu, etc. is unspecified. That is, implementations are not required to document this behavior, and the volatile loads and stores are not guaranteed to be atomic. If misuse introduces a data-race, the behavior is undefined.

AFAICT, what we cannot do is guarantee that what you are trying to accomplish always works (for all Ts, on all targets, etc.).

HadrienG2 commented 5 years ago

I think one additional thing which @hsivonen needs is for data races to produce sane effects ("garbage data" and that's it) when the race occurs on a "bag of bits" data type which has neither padding nor invalid bit patterns, such as u8, u16...

It's a topic that comes up frequently in unsafe code discussions though, so perhaps there's already an UCG topic on this kind of matter that we can study for prior art.

HadrienG2 commented 5 years ago

A related issue, which is mostly of theoretical interest in multi-process scenarios as current implementations don't really have another option than to do the right thing, is that an adversarial thread writing invalid data into the shared memory space (e.g. mem::uninitialized::<u8>()) should not trigger UB in a privileged thread reading from it with proper precautions, given again a restriction to data types for which all bit patterns are valid.

comex commented 5 years ago

I think volatile should be documented and specified along these general lines:

Unlike almost everything else in the language, volatile defers to hardware semantics. volatile_load means "emit exactly one load instruction which cannot be removed or reordered, and don't make assumptions about the loaded value", but the meaning of a "load instruction" is implementation-defined (and in practice architecture-dependent).
Accordingly, any properties involving atomicity, memory ordering, or handling of uninitialized data are guaranteed iff the underlying architecture guarantees them. Rust does not require its supported architectures to make any such guarantees; an architecture could, e.g., make racing volatile accesses undefined behavior, although no current Rust architectures do so. On the other hand, when compiling for an architecture that does provide guarantees, the compiler will not perform optimizations that break code that relies on those guarantees.

[What about miri?]

comex commented 5 years ago

(Sorry for double post -)

But I also think we should have a better answer for shared memory than volatile.

In particular, as discussed in the internals thread, we may want to guarantee that the UB caused by races between atomic and non-atomic accesses, if the accesses are in different processes, only affects the process performing the non-atomic access. In other words, you can safely use atomic accesses on shared memory even if your communication partner might be malicious – at least with regards to that particular source of UB. That seems like a reasonable guarantee.

On the other hand, there are other, more plausible ways that an architecture could hypothetically break this sort of IPC. For example, it could give each byte of memory an "initializedness" status, such that if process A writes uninitialized data to memory and process B reads it, process B gets an uninitialized value and traps if it tries to use it. (Note that Itanium does not do this; it tracks initializedness for registers, but not for memory.)

HadrienG2 commented 5 years ago

In an ideal world, there would be a way for the untrusting thread to state "I know that I might be reading uninitialized or otherwise fishy memory, please let me do so and return arbitrary bytes on incorrect usage".

Kind of like the ptr::freeze that was proposed there, but with transactional semantics to account for the racy nature of the situation, i.e. fn(*mut T) -> T.

However, I have no idea how to make that actually interact correctly with emulations of hardware that can track use of uninitialized memory but provide no way for software to announce voluntary access to uninitialized memory, such as valgrind.

gnzlbg commented 5 years ago

"emit exactly one load instruction which cannot be removed or reordered, and don't make assumptions about the loaded value", but the meaning of a "load instruction" is implementation-defined (and in practice architecture-dependent).

I find that using "emit exactly one load instruction" to denote potentially many actual hardware load instructions confusing.

comex commented 5 years ago

Feel free to improve the wording :)

I mean of course that there is one load instruction per volatile_load call.

Though even that is slightly imprecise. The compiler can still inline or otherwise duplicate code, in which case a given binary could contain multiple locations corresponding to a single call to volatile_load in the source code, each of which would have its own load instruction.

Perhaps it's better to say: Any given execution of volatile_load will be performed by executing a single load instruction.

gnzlbg commented 5 years ago

@comex

I mean of course that there is one load instruction per volatile_load call.

This:

#[no_mangle] pub unsafe fn foo(x: *mut [u8; 4096]) -> [u8; 4096] { x.read_volatile() }

generates ~16000 instructions on godbolt. I don't know of any architecture in which this could actually be lowered to exactly one load instruction.

petrochenkov commented 5 years ago

[u8; 4096]

I assume @comex would expect this to be written as a loop. If this model is followed, it would actually be nice to report a warning/error from the backend if the read_volatile/write_volatile cannot be lowered into a single load/store with the given size and alignment.

gnzlbg commented 5 years ago

@petrochenkov How would this be implemented?

No matter how I look at this, I see a lot of issues.

If we guarantee that (*mut T).read_volatile() either lowers to a single instruction or compilation fails, unsafe code will rely on that for safety, so this would need to be a guaranteed compilation error.

One issue is that we can only emit this error at monomorphization time. I'm not sure how we could fix that.

Another issue is that this would be a breaking change, but I suppose that we could either add new APIs and deprecate the old ones, or somehow emit this error only in newer editions (these operations are almost intrinsics).

I wonder how the implementation would look like. AFAIK only the LLVM target backend can know to how many instructions the load lowers to, and requiring this kind of cooperation from all LLVM backends (and Cranelift) seems unrealistic. I suppose we could generates "tests" during compilation, in which we count the instructions, but that seems brittle.

Then I wonder how this could work on Wasm SharedArrayBuffer. Even if we lower a volatile load to a single WASM instruction, the machine code generator might lower that into multiple loads depending on the host, and if that's the case, there is no way for it to report an error.

comex commented 5 years ago

@gnzlbg

Good point – volatile accesses of types that don't correspond to machine register types are somewhat ill-defined, AFAIK. But, e.g.

#[no_mangle] pub unsafe fn foo(x: *mut u64) -> u64 { x.read_volatile() }

should definitely be guaranteed to be a single load instruction on x86-64; same goes for smaller integer types.

Of course there's limited room to make decisions here since we're largely at the mercy of LLVM, but I'd say the rule is roughly "if the 'obvious' way to translate this load is with a single instruction, it has to be a single instruction".

Personally, I'd prefer if volatile produced a compile error in other cases, but that ship has sailed.

The rule gets less clear when SIMD gets involved. x86-64 can perform 128-bit loads via SIMD, and currently, a volatile_load of packed_simd::f32x4 does indeed expand to such an instruction:

    movaps  (%rdi), %xmm0

On the other hand, a volatile_load of u128 expands to two 64-bit regular loads:

    movq    (%rdi), %rax
    movq    8(%rdi), %rdx

I'd say this is defensible, because even if the architecture has some way to perform a 128-bit load, that's not the same as there being an 'obvious' way to load a 128-bit integer. On the other hand, with f32x4 we are explicitly requesting SIMD, so it's arguably reasonable to guarantee that it uses an appropriate SIMD instruction.

But in any case, it doesn't really matter whether a SIMD load is guaranteed, since movaps is not guaranteed to be atomic at an architectural level, so the difference between the two is not really observable.

comex commented 5 years ago

Then I wonder how this could work on Wasm SharedArrayBuffer. Even if we lower a volatile load to a single WASM instruction, the machine code generator might lower that into multiple loads depending on the host, and if that's the case, there is no way for it to report an error.

I'd say that the 'architecture' here is Wasm, not whatever it's ultimately compiled into. Just as x86-64 has 128-bit load instructions that aren't guaranteed to be atomic, Wasm apparently doesn't guarantee that loads of any size are atomic, unless you use the special atomic instructions. But that's fine; it's already established that the set of guarantees provided depends on the architecture.

However, Wasm is arguably a motivating use case for exposing additional intrinsics for loads and stores marked both atomic and volatile (which LLVM supports).

petrochenkov commented 5 years ago

AFAIK only the LLVM target backend can know to how many instructions the load lowers to, and requiring this kind of cooperation from all LLVM backends (and Cranelift) seems unrealistic.

That's exactly what I meant by "from the backend" - a target-specific LLVM backend, before that point no one knows about instruction specifics. I don't know whether LLVM has necessary infrastructure or not, I'm just saying that having it would be useful. (This certainly shouldn't be any kind of language-level guarantee, only a lint-like diagnostic, even if it's highly reliable.)

comex commented 5 years ago

I'm a little confused what the argument is about, but to be clear – for MMIO to work correctly, volatile loads and stores of basic integer types (specifically, ones that can fit into a register) must be done using a single load/store instruction of the correct width. So the behavior with those types needs to be a language-level guarantee. That should probably also apply to #[repr(transparent)] wrappers around those types.

The behavior of volatile with other types, as I said, is relatively ill-defined. In most cases it does seem fairly likely to be a mistake – e.g. if the [u8; 4096] example were in real code, it would be unlikely that the author actually meant to generate that giant function that copies the bytes one-by-one. Or if, say, someone used a struct to represent a block of MMIO registers, they might accidentally write my_struct.volatile_load().field instead of (&raw my_struct.field).volatile_load(). So it might make sense to produce some sort of diagnostic. On the other hand, I could imagine volatile accesses with large types being done intentionally for the shared-memory use case.

Lokathor commented 5 years ago

(Speaking as one of the two "rust on the GBA" devs) Integer types: yes. Transparent types: absolutely must also be yes. For MMIO to be an approachable issue you have to be able to newtype all the integers involved.

Centril commented 5 years ago

Unlike almost everything else in the language, volatile defers to hardware semantics. volatile_load means "emit exactly one load instruction which cannot be removed or reordered, and don't make assumptions about the loaded value", but the meaning of a "load instruction" is implementation-defined (and in practice architecture-dependent).

It seems to me that if you want to provide functionality that is specified as deferring to hardware semantics, and which is inherently architecture-dependent, then it is better to provide these through an explicitly architecture-dependent mechanism instead of calling it "implementation defined". For this, we have the core::arch module. If a few architectures provide the same guarantee for the same family of functions, then an interface can be built over that in a third-party crate.

hsivonen commented 5 years ago

Volatile loads and stores are not guaranteed to be atomic (and synchronize in any particular way), that is, using them to concurrently read/write to some memory could introduce a data-race.

It's unclear to me if you are using "atomic" and "data-race" colloquially or as specific memory model terms. For my use case, I don't need colloquial atomic: that is, I don't need indivisibility. In particular, I want to do lockless unaligned SIMD loads/stores, and I don't care if they tear in the presence of an adversarial second thread. However, I need "atomic" in the memory model sense that colloquially there is a data race but it must not be a "data race" for the purpose of "data races are UB": Just like relaxed atomics race in practice but that race is defined not to constitute a "data race" for the purpose of "data races are UB".

Suppose you have an [u16] in shared memory, and two threads of execution modify it by writing either 0x00 or 0xff to it. If you use properly synchronized memory accesses, you are guaranteed to always read either 0x00 or 0xff from the array in any of the processes. If you use volatile loads / stores, one thread of execution could read 0xf0 or 0x0f.

This is fine for my use case. My use case needs to read or write sensible values only in the single-threaded scenario. If there's another thread, the other thread is an error on the part of the unprivileged code and adversarial from the point of view of the privileged code, at which point I'm fine with the unprivileged code UBing itself and getting garbage results from the host service. I don't want it to UB the privileged host service code: the privileged code may see garbage values that are even inconsistent garbage between two reads from the same memory location, but there must not be optimizations that would introduce security bugs that were not in the source code if source code had no security bugs if every load behaved like a call to a random number generator. (An example of a compiler-introduced security bug would be an elision bound checks on the assumption that two loads from the same memory location yield consistent values.)

This can easily result in mis-optimizations even if the two threads of execution are run in different processes, e.g. if the program uses a type that properly expresses these semantics (e.g. a NonZero-like type), hint::assume annotations, etc.

For clarity, I only intend to read types for which all bit patterns are valid values (and only on architectures that cannot track "uninitialized" bytes in RAM and yield some bit patterns for memory locations that are uninitialized for the purpose of high-level language semantics), and the thing I'm asking for is being able to turn off optimizations that could be dangerous in the presence of an adversarial other thread. AFAICT, this means that 1) the compiler must not use memory locations that the get written as spill space (i.e. must not invent reads that expect to read back previous writes intact) and 2) if the compiler generates two loads from the same memory location (either on its own initiative or because the source code showed two loads), the compiler must not assume that the two loads yield mutually-consistent values (i.e. the compiler must not optimize on the assumption that values read from the same memory location are mutually consistent).

gnzlbg commented 5 years ago

@comex

I'm a little confused what the argument is about, but to be clear – for MMIO to work correctly, volatile loads and stores of basic integer types (specifically, ones that can fit into a register

Volatile reads and writes are generic over T so while we could guarantee more things for concrete Ts, the semantics being argued about in the OP are the generic ones that hold for all T.

Also, you just showed that even though x86 has 128-bit registers with atomic instructions, volatile reads and writes to u128 are not atomic.

So we can't say that "If T is an integer and it fits in a register in the target, volatile reads / writes are atomic".

In a 32-bit architecture, reads / writes to 64-bit integers might not be atomic, in a 16-bit architecture, reads / writes to 32-bit integers might not be atomic, etc. At best, because we only support platforms with CHAR_BITS == 8, we might be able to guarantee that 8-bit volatile reads/writes are atomic and relaxed everywhere.

@petrochenkov

(This certainly shouldn't be any kind of language-level guarantee, only a lint-like diagnostic, even if it's highly reliable.)

The unsafe code guidelines specify what guarantees is unsafe code allowed to rely on. That is, if you write generic unsafe code, can it rely on volatile reads and writes of T being atomic and relaxed ? What if T = u64 ? AFAIK the answer to both question is "No, unsafe code cannot rely on that", so I don't know how we could guarantee something that users are not allowed to rely on. I still don't know how we can write anything better than (this):

We could document that whether volatile loads and stores are atomic and synchronize in any particular way (for a particular memory addresses, memory access sizes, target, target features, target cpu, etc.) is unspecified. That is, implementations are not required to document this behavior, and the volatile loads and stores are not guaranteed to be atomic. If misuse introduces a data-race, the behavior is undefined.

This allows users that check what the backend does for a particular architecture to rely on that information, and if they mess up, the behavior is undefined.

@comex mentions that we could guarantee that this works for "integer that fit in a register", but they showed above that this is not true, e.g., on x86, where u128 fits on many x86 registers (up to 512-bit wide), yet volatile loads and stores to u128 are not atomic relaxed.

AFAICT, on a 32-bit arch 64-bit volatile loads/stores might not be atomic; on a16-bit arch, 32-bit loads/stores might not be atomic either, etc. Maybe at best we can guarantee that 8-bit wide volatile loads and stores are always atomic, by stating that we only will ever support platforms where this is the case, and if some platform does not satisfy this, we'll never support it.

Please do suggest specific text about the guarantees that unsafe code is allowed to always rely on when working with volatile loads / stores. Talking about a concrete snippet of wording is IMO easier than talking on the "abstract", because one can more easily show counter-examples that prove the wording incorrect (e.g. *mut [u8; 4096], u128, etc. ).

gnzlbg commented 5 years ago

For my use case, I don't need colloquial atomic: that is, I don't need indivisibility. [...] This is fine for my use case [u16 example].

If we make the behavior unspecified, and you know that u8 volatile loads/stores are always "relaxed atomic" in the platforms you are targeting, you can probably just use those. AFAIK the unaligned SIMD vector load is not atomic on x86 (they generate a couple of uops), but you can use it to read [u8; 16] bytes from a *mut u8, where the platform guarantees you that it won't give you partially modified bytes.

hsivonen commented 5 years ago

If we make the behavior unspecified, and you know that u8 volatile loads/stores are always "relaxed atomic" in the platforms you are targeting, you can probably just use those.

Is there a concrete need to make them all the way "unspecified" as opposed to "may return unpredictable values if the memory locations are concurrently written to"?

AFAIK the unaligned SIMD vector load is not atomic on x86 (they generate a couple of uops), but you can use it to read [u8; 16] bytes from a *mut u8, where the platform guarantees you that it won't give you partially modified bytes.

I'm OK with receiving partially modified bytes. I'm just not OK with optimizations that would introduce security bugs in that case. As soon as there's a second thread writing to the memory that I'm reading, I no longer care about what values I read and only care about not having a security bug.

hsivonen commented 5 years ago

Please do suggest specific text about the guarantees that unsafe code is allowed to always rely on when working with volatile loads / stores.

Would the following be true given what LLVM provides (and has to keep providing for real-world C use cases)?

My use case needs:

The compiler will not assume that it can read back a value from a location that is being written to by a volatile write (it will not use locations written to as volatile as spill space and will assume the memory may have been externally modified).
The compiler will not assume that two loads from a locations read by a volatile read yield the same value (it will assume the memory location may have been externally modified).
The compiler will not generate synchronization for volatile reads and writes (no implicit locks).

My use case doesn't need, but I gather the original purpose of volatile does:

For volatile reads and writes of aligned integer types of the size of usize or smaller, the load or store is indivisible (other cases may tear).
For volatile reads and writes of aligned integer types of the size of usize or smaller, each read or write visible in the source code results in exactly one load or store instruction generated without widening or narrowing (no compiler-invented extra loads or stores and no elimination of dead stores or repeated loads).
The compiler will not reorder volatile operations relative to each other, but will not prevent the CPU from performing the kind of reorderings that the CPU is permitted to perform for plain loads and stores.

gnzlbg commented 5 years ago

Is there a concrete need to make them all the way "unspecified" as opposed to "may return unpredictable values if the memory locations are concurrently written to"?

Unspecified just means that we don't specify what happens. How are "unpredictable values" any more specific than that? AFAICT "unpredictable" allows any value. Trying to be more specific here would probably require introducing a new atomic memory ordering weaker than relaxed (e.g. with support for word tearing due to concurrent writes, and specifying which values each word is allowed to take).

My use case needs:

AFAIK all of these are guaranteed by the current specification of read/write volatile in Rust.

My use case doesn't need,

AFAIK the first two are not guaranteed, don't know about the third one. I don't know if "usize or smaller" is something that we could guarantee (e.g. CHERI has 128-bit wide usize).

RalfJung commented 5 years ago

(I don't have time to read this exploding thread fully now, hopefully I'll get to it tonight. But please keep the discussion here focused on the interaction of volatile accesses and concurrency. Things like tearing and specifying the semantics of volatile while avoiding low-level concepts such as "load instructions" already have a topic at https://github.com/rust-lang/unsafe-code-guidelines/issues/33, let's not duplicate that discussion.)

comex commented 5 years ago

@Centril

It seems to me that if you want to provide functionality that is specified as deferring to hardware semantics, and which is inherently architecture-dependent, then it is better to provide these through an explicitly architecture-dependent mechanism instead of calling it "implementation defined". For this, we have the core::arch module. If a few architectures provide the same guarantee for the same family of functions, then an interface can be built over that in a third-party crate.

If volatile didn't exist already, I might agree with you. But read_volatile and write_volatile are stable. They are explicitly intended for interacting with MMIO registers, and existing code uses them for that purpose. Correctly interacting with MMIO requires the single-instruction guarantee for applicable integer types, so Rust must provide that guarantee, or something equivalent to it.

It could perhaps be worded in terms of a single "memory access" rather than a single "instruction", but I don't see much difference; neither of those concepts can be defined without some reference to a machine model.

@gnzlbg

I still don't know how we can write anything better than this: [..] @comex mentions that we could guarantee that this works for "integer that fit in a register", but they showed above that this is not true, e.g., on x86, where u128 fits on many x86 registers (up to 512-bit wide), yet volatile loads and stores to u128 are not atomic relaxed.

I believe that the correct definition is inherently architecture-specific. Better than "size of a register" is "size of a general-purpose register": that works for most architectures, but not all, e.g. Wasm doesn't even have a concept of a register.

But it should be possible to establish rules like: "On x86_64, calls to volatile_load and volatile_store with integer types from 8 to 64 bits are guaranteed to perform a single access at the architectural level, of the correct size."

This certainly shouldn't be unspecified, and I don't think it should even be implementation-defined per se, in the sense that some alternate backend could decide to behave differently. It should be required for any Rust implementation targeting x86_64. (At least, unless their interpretation of "x86_64" is something so weird and nonstandard that the rule somehow wouldn't make sense for that implementation.)

gnzlbg commented 5 years ago

They are explicitly intended for interacting with MMIO registers, and existing code uses them for that purpose. Correctly interacting with MMIO requires the single-instruction guarantee for applicable integer types, so Rust must provide that guarantee, or something equivalent to it.

Does any backend guarantee this (LLVM, GCC, or Cranelift)?

This certainly shouldn't be unspecified, and I don't think it should even be implementation-defined per se, in the sense that some alternate backend could decide to behave differently. It should be required for any Rust implementation targeting x86_64. (At least, unless their interpretation of "x86_64" is something so weird and nonstandard that the rule somehow wouldn't make sense for that implementation.)

Allowing the behavior to change across implementations (targets, other backends, other toolchains) and being required to document the behavior reads like the definition of implementation defined.

"On x86_64, calls to volatile_load and volatile_store with integer types from 8 to 64 bits are guaranteed to perform a single access at the architectural level, of the correct size."

Is this true for all x86_64 hardware? EDIT: I think so, but note that this is true independently of whether the load is volatile or not.

hsivonen commented 5 years ago

Unspecified just means that we don't specify what happens. How are "unpredictable values" any more specific than that? AFAICT "unpredictable" allows any value.

I'm trying to capture that the values may be weird but there is no UB. I'm not sure what the implications of "unspecified", which I believe to be a special term, are.

AFAIK the first two are not guaranteed, don't know about the third one.

They seem to be guaranteed by this C++ proposal, which I believe to try to capture the behavior of the LLVM intrinsics that Rust builds upon here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1382r0.pdf

My only issue with that paper is that its item "i" says that certain volatile accesses don't constitute data races, which implies there are others that might. I'd be happier if volatile accesses categorically didn't constitute data races. (Other types of accesses to the same memory locations could still constitute data races resulting in asymmetric UB, but asymmetric UB is exactly what I'm looking for here.)

Lokathor commented 5 years ago

They are explicitly intended for interacting with MMIO registers, and existing code uses them for that purpose. Correctly interacting with MMIO requires the single-instruction guarantee for applicable integer types, so Rust must provide that guarantee, or something equivalent to it.

Does any backend guarantee this (LLVM, GCC, or Cranelift)?

Yes, LLVM does at least.

IR-level volatile loads and stores cannot safely be optimized into llvm.memcpy or llvm.memmove intrinsics even when those intrinsics are flagged volatile. Likewise, the backend should never split or merge target-legal volatile load/store instructions.

And then there's another paragraph that makes it extra clear that they intend you to be allowed to execute native width loads and stores as a single volatile instruction

Rationale Platforms may rely on volatile loads and stores of natively supported data width to be executed as single instruction. For example, in C this holds for an l-value of volatile primitive type with native hardware support, but not necessarily for aggregate types. The frontend upholds these expectations, which are intentionally unspecified in the IR. The rules above ensure that IR transformations do not violate the frontend’s contract with the language.

rpjohnst commented 5 years ago

@Centril

It seems to me that if you want to provide functionality that is specified as deferring to hardware semantics, and which is inherently architecture-dependent, then it is better to provide these through an explicitly architecture-dependent mechanism instead of calling it "implementation defined". For this, we have the core::arch module. If a few architectures provide the same guarantee for the same family of functions, then an interface can be built over that in a third-party crate.

On top of @comex's point that this is a no-go for stability reasons, volatile isn't really arch-specific anyway. This sounds like a similar problem to guaranteed copy/move elision- we care about something that's maybe outside the abstract machine, that normally the optimizer could "as-if" away.

In this case I don't think it's too difficult to fix that, because the reason for volatile is that loads and stores can have arbitrary side effects. The implementation ("please actually emit this, don't reorder it, assume it can mutate shared state, etc") is the same thing you need for any other opaque side-effecting operation- calling a function through FFI or an unknown function pointer, etc.

There's certainly a lot of wiggle room there around which optimizations you want to enable vs which actual side effects volatile operations will perform, but it hardly belongs in core::arch.

gnzlbg commented 5 years ago

@Lokathor the text itself does not guarantee that volatile load / stores lower to a single atomic instruction, and instead only says:

A volatile load or store may have additional target-specific semantics.

When the "Rationale" section says that "volatile load / stores of primitive types with native hardware support lower to a single instruction" it is providing a guarantee that the text above does not provide. It still does not say that this instruction needs to be atomic (e.g. cannot tear), and I asked in #llvm@freenode today and was told that LLVM does not allow frontends to query whether a type is a primitive type with native hardware support for a particular target.

@rpjohnst

There's certainly a lot of wiggle room there around which optimizations you want to enable vs which actual side effects volatile operations will perform, but it hardly belongs in core::arch.

With @Centril's solution, we could add to core::arch:

arch::wasm: no intrinsics
arch::x86: only volatile load/stores for u8/16/32
arch::x86_64 volatile load/store for u8/16/32/64 without tearing, and volatile load/store u128/256/512 with tearing (using different descriptive names),
etc.

These intrinsics could guarantee that only a single instruction is emitted, whether the load/stores are atomic or they can tear, etc. Unsafe code could then safely rely on these guarantees, and we could provide implementations of these that are independent from what LLVM lowers volatile load/stores to (e.g. using inline assembly).

The generic ptr::read_volatile<T>/write_volatile<T> intrinsics would first require generic guarantees that hold for all Ts, and then platform specific guarantees that only hold for some targets and some Ts, and users would at best need to read the docs to figure out these guarantees (and at worst, read the backend, inspect the generated code, etc.).

comex commented 5 years ago

These intrinsics could guarantee that only a single instruction is emitted, whether the load/stores are atomic or they can tear, etc. Unsafe code could then safely rely on these guarantees, and we could provide implementations of these that are independent from what LLVM lowers volatile load/stores to (e.g. using inline assembly).

I can get behind this if the existing read_volatile and write_volatile are then changed to use those intrinsics when available. I just don't want to leave existing users in a grey area of "this will keep working with the LLVM backend, but we're not going to specify anything about it, so future Rust backends could do whatever they want".

I was going to say this could be implemented using specialization, but that wouldn't work for #[repr(transparent)] wrappers around basic integer types (which apparently people already use and expect to work). So it would really have to be done in the compiler.

On the other hand, just because LLVM doesn't have a way to query the list of supported primitive types doesn't mean rustc can't make assumptions about it. In practice it should be fine to add intrinsics to core::arch, but have the implementations just lower to the same old volatile loads/stores, with the set of intrinsics being based on which types we know LLVM produces single loads/stores for on a given architecture. After all, although the difference should be negligible in most cases, volatile loads/stores should produce slightly better code than inline assembly for a few reasons (e.g. LLVM can't incorporate inline asm into its cost model for instruction scheduling AFAIK). And read_volatile and write_volatile could keep their existing implementation, saving effort. But from a specification perspective, read_volatile and write_volatile should still be defined as equivalent to the new intrinsics for all types that have them (as well as, again, transparent wrappers around the same).

Lokathor commented 5 years ago

LLVM will lower to single instruction if the type is legal for it on that target:

Likewise, the backend should never split or merge target-legal volatile load/store instructions.

So, it's not for all T, it's only for specific sized and aligned types, by platform

EDIT:

@comex yeah integer newtyping is a really big deal and needs to be cleanly supported.

RalfJung commented 5 years ago

I see at least two discussions being intervoven here, which doesn't make this any easier to follow.^^ It would really help if we could separate the "how the heck to we specify that volatile is a single instruction, whatever that means" part from "what do we want to say about the interaction of volatile accesses and concurrency".

How to specify the part about not "duplicating" or "removing" volatile loads/stores

I've said my part on this at https://github.com/rust-lang/unsafe-code-guidelines/issues/33#issuecomment-429112051:

Basically, I think the way to think about volatile accesses is as follows: A volatile *x = val really is a call to write_volatile(x, $SIZE, val), and the compiler doesn't exactly know what that function does. It can make some assumptions ("doesn't mutate any memory I know about that is not aliased with x..x+size"), but no more than that. It's otherwise an unknown function call, or a syscall, or whatever you like to call this in your mental model. Of course you cannot just duplicate or remove that. For all you know, write_volatile might send your data over the network. Similar for volatile reads.

To my knowledge, this is how e.g. CompCert models volatile. I think this is an adequate formal model that exactly captures that volatile accesses cannot be eliminated even if they otherwise seem redundant, and cannot be duplicated either. It also intuitively captures the idea that volatile memory accesses are really a form of "external communication", much like syscalls.

What is left is specifying "tearing": when does a volatile load/store become a single access, vs. many accesses because it is too big. This C++ proposal suggests a volatile_non_tearing type-level predicate to reflect that distinction into the language. We could do something similar.

But really this is already being discussed at https://github.com/rust-lang/unsafe-code-guidelines/issues/33 and now these issues became a horrible mess because this discussion got intermixed with an entirely orthogonal topic... so can we please please stop talking about "single instructions" here and move that part to https://github.com/rust-lang/unsafe-code-guidelines/issues/33, so that this thread can become a place to talk about the atomic aspect?

How do volatile accesses interact with concurrency

The "seed" of the discussion here seems to be that a race between a non-atomic write and a volatile read cause UB. @HadrienG2 explained why:

In theory, concurrent volatile accesses are UB because in the C++11 memory model, concurrent non-atomic access to memory is a data race (notice that volatile is not part of this definition), and data races are UB.

This is fixed by the aforementioned proposal that exempts volatile loads from the definition of a data race. And anyway in LLVM read-write races are not UB, so our backend is already prepared for weakening the unsafety here. (Instead LLVM says racy reads return undef.)

But as @gnzlbg noted, none of this has anything to do with changes being "outside of the program". In other words, this

b) to state in the Unsafe Code Guidelines that it's legitimate to use volatile accesses to access memory that a thread of execution external to the Rust program might change concurrently.

is unnecessarily weak considering that you are asking for volatile loads to be exempt from UB -- in which case it immediately follows that "it's legitimate to use volatile to access memory that any thread might change concurrently".

The question, of course, is what the volatile read would return if there is a race. Under current LLVM semantics, it returns undef. (And again this does not and cannot depend on whether the writing thread that causes the race is "external" or not.) undef is worse than "unpredictable", undef is the same as uninitialized memory (think mem::uninitialized()) and padding. It's weird memory where even x == x might return false (and no we are not talking about floating points).

Basically, imagine that every bit in the Rust abstract machine has three states: 0, 1 and uninitialized. And doing pretty much anything with an uninitialized bit (other than copying it) is UB.

So, on top of exempting volatile reads from the data race UB (which we already do de-facto because LLVM does, but do not guarantee), you are asking that volatile reads should be required to return "stable" or "frozen" data where all bits are initialized to some fixed value (which I do not think LLVM currently ensures). That seems like a reasonable spec to me, but can we get LLVM to commit to that? In LLVM this would mean guaranteeing that volatile reads never return undef. As argued by others, volatile reads cannot be duplicated and that's where the "data race reads are undef" part in the LLVM semantics comes from. But to ensure that we never see undef, we also have to cover the case where the bits in memory legitimately are undef, e.g. because they never got initialized. LLVM would have to guarantee that those get "frozen" in the result of a volatile read. ("freezing" a value means replacing all uninitialized bits by some initialized bits, non-deterministically.)

I thought I had seen wording for C++ that suggested this, but I can't find it right now.

If we do both of these things (no data races and no undef), then volatile reads become the thing @HadrienG2 asked for: "I know that I might be reading uninitialized or otherwise fishy memory, please let me do so and return arbitrary initialized bytes on incorrect usage"

Collected notes

@comex

Unlike almost everything else in the language, volatile defers to hardware semantics.

No please no. I don't even think this is possible. You can't mix two memory models in one spec. That's not a spec, that's hand-waving and giving up. ;)

@hsivonen

The memory model itself is a whole-program model

Also see https://github.com/rust-lang/rust/issues/58599#issuecomment-502598073.

gnzlbg commented 5 years ago

I can get behind this if the existing read_volatile and write_volatile are then changed to use those intrinsics when available

If we do that, then we can guarantee that:

if T fits in a native size, only one atomic relaxed load instruction is generated - in case of concurrent writes, no data-races are introduced, and no tearing,
if T does not fit in a native size, it will be read by a sequence of atomic relaxed loads of a appropriate size - in case of concurrent writes, no data-races are introduced, but there might be tearing (this would allow using unaligned vector loads internally on x86_64 like @hsivonen was interested in using).

AFAICT this is subtly different from LLVM volatile load / stores, in that if T does not fit in a native size:

and there are concurrent writes, there is a data-race (the volatile load / stores are not synchronized, or at least not guaranteed to be synchronized in any way like above),
the load / stores don't use a loop to read sequentially, but do so by generating N consecutive load/store instructions (like in the [u8; 4096] example above that blew up).

If LLVM would define the semantics of volatile load/stores in the same way that we do, we wouldn't really need to change anything implementation-wise.

RalfJung commented 5 years ago

Or should I just open a new topic for the concurrency thing, and then @hsivonen and @HadrienG2 and me can talk about what this thread here was originally asking, and y'all can go on about how to specify load/store tearing (which has nothing to do with the original question!) here?

RalfJung commented 5 years ago

But to ensure that we never see undef, we also have to cover the case where the bits in memory legitimately are undef, e.g. because they never got initialized. LLVM would have to guarantee that those get "frozen" in the result of a volatile read. ("freezing" a value means replacing all uninitialized bits by some initialized bits, non-deterministically.)

At least this quick experiment indicates that LLVM does not try to predict which value will get read by a volatile load, which is a good sign. But still, before we add any statement like that to our docs, LLVM should state in their LangRef that a volatile load will never return undef, even if the memory it reads from is undef.

And even then, this is basically adding "freeze" to Rust. Freeze has been discussed in-depth recently and there have historically been some objections to that. The problem is that this makes it well-defined behavior to observe the values of uninitialized memory.

comex commented 5 years ago

@gnzlbg

If LLVM would define the semantics of volatile load/stores in the same way that we do, we wouldn't really need to change anything implementation-wise.

Well, for the native-size case, LLVM has load atomic volatile and store atomic volatile, which should address the concern that data races are UB. As I see it, I don't think atomic is actually necessary in this case*, because volatile inherently has hardware-defined semantics and those semantics include behavior in the face of concurrent accesses. But it shouldn't hurt to add it either. It shouldn't affect the instructions being generated.

* Cases where it would be necessary include: (a) when using an ordering other than relaxed; (b) when dealing with types that are wider than general-purpose registers but for which the architecture provides special atomic load/store instructions, such as 64-bit loads on 32-bit ARM.

@RalfJung

Unlike almost everything else in the language, volatile defers to hardware semantics. No please no. I don't even think this is possible. You can't mix two memory models in one spec. That's not a spec, that's hand-waving and giving up. ;)

We've had this discussion before in the case of inline assembly. The way I see it, inline assembly and volatile are unique as the only two core language features that cannot be defined without reference to a target machine. But it sounds like you'd appreciate @gnzlbg's proposal to (as I understand it) basically deprecate volatile in favor of architecture-specific intrinsics, at least for the purpose of interacting with MMIO registers.

When it comes to the shared memory use case... I have some thoughts about that which I'll post later.

RalfJung commented 5 years ago

We've had this discussion before in the case of inline assembly. The way I see it, inline assembly and volatile are unique as the only two core language features that cannot be defined without reference to a target machine.

volatile can and has been specified without reference to a target machine, see my comment above and the CompCert verified C compiler.

Inline assembly has never been specified at all to my knowledge, at least not in anything I would call a "specification" (which, as a minimum standard, has to be precise enough to prove something about it).

(Also everyone seems to insist on continuing the discussion about tearing here even though it does not connect to the OP's question. @hsivonen do you want me to open a new issue and rename this one?)

HadrienG2 commented 5 years ago

And even then, this is basically adding "freeze" to Rust. Freeze has been discussed in-depth recently and there have historically been some objections to that. The problem is that this makes it well-defined behavior to observe the values of uninitialized memory.

As a minor nitpick, I'd like to reiterate the point I made earlier that this is something slightly different from the freeze() that was proposed recently, in the sense that it has transactional semantics in the face of concurrent access.

This is unlike the original freeze(), which separated the operation of freezing memory from that of loading from it, and was thus vulnerable to concurrent insertion of invalid values back into the previously frozen shared memory region.

On bizarre architectures which have the ability to track unitialized memory (and undef data in general), it may also means that what I am proposing could have worse codegen than a freeze + read sequence, due to the additional atomicity requirement and the inability to eliminate redundant freeze operations. Therefore, there may be value in having both.

Of course, this does not invalidate your point regarding the very existence of freeze-like operations being considered worrisome by some people.

RalfJung commented 5 years ago

As a minor nitpick, I'd like to reiterate the point I made earlier that this is something slightly different from the freeze() that was proposed recently, in the sense that it has transactional semantics in the face of concurrent access.

That seems like a weird way to put it. I'd say what happens is that first you are making a local copy of whatever is in memory, copying all undef, and even adding more undef if there was a racing write -- and then you freeze your local copy, which is local and hence there cannot be anybody else writing more undef in parallel.

No need to talk about "transactions". And in particular, the data in memory remains unfrozen, which is important. That's very unlike ptr::freeze which was supposed to freeze some data in memory, but in a non-atomic way.

HadrienG2 commented 5 years ago

Ah, yes, that makes perfect sense. Then ptr::freeze is actually enough, you only need to run it on your local copy of whatever strange value you extracted from shared memory.

RalfJung commented 5 years ago

Then ptr::freeze is actually enough, you only need to run it on your local copy of whatever strange value you extracted from shared memory.

Yes. You can obtain the operation you want from the volatile read we already have "de facto" (but not "de jure") that returns undef in face of a race, and then calling freeze on the copy of the data that you got.

comex commented 5 years ago

volatile can and has been specified without reference to a target machine, see my comment above and the CompCert verified C compiler.

Well, that's "specified" in the sense of "what it does is outside the scope of this spec". That's fine if the spec in question is limited to the portable subset of the language. But there needs to be a guarantee somewhere about what it does on a particular architecture; if I'm using volatile for MMIO, I won't be pleased if a compiler upgrade suddenly changes write_volatile to start sending data over the network. :) And the only way to define what it does on a given architecture with adequate precision is in terms of hardware semantics.

I suspect that I don't actually disagree with you, but am just using different terminology.

gnzlbg commented 5 years ago

That seems like a reasonable spec to me, but can we get LLVM to commit to that?

We'd need to ask. If they were to specify things this way, then there would be little to do on our end beyond writing down these guarantees.

comex commented 5 years ago

For the record, I think "volatile_read implies freeze" should not be specified as a portable guarantee. There could be an architecture that tracks uninitializedness of memory, and on that architecture there might be a "freeze" or "load possibly uninitialized" instruction. But there would be no need to use that instruction when accessing MMIO registers, and it might not even work on them; therefore, volatile_read should not use it. Instead it should be a separate intrinsic.

Thus I think "volatile_read implies freeze" should be something that just happens to be implied by the architecture-specific definition of volatile_read on each of the current architectures.

Well, except for Valgrind. That would actually be an interesting use case for a separate "load potentially uninitialized" or "freeze" intrinsic, because Valgrind has an equivalent of freeze called VALGRIND_MAKE_MEM_DEFINED. In theory the compiler could have a "I'm building this for Valgrind" flag that automatically calls VALGRIND_MAKE_MEM_DEFINED for this new intrinsic.

Interestingly enough, this happens to be required in some cases for shared memory, even when the other process isn't hostile. From the Valgrind manual:

As explained above, Memcheck maintains 8 V bits for each byte in your process, including for bytes that are in shared memory. However, the same piece of shared memory can be mapped multiple times, by several processes or even by the same process (for example, if the process wants a read-only and a read-write mapping of the same page). For such multiple mappings, Memcheck tracks the V bits for each mapping independently. This can lead to false positive errors, as the shared memory can be initialised via a first mapping, and accessed via another mapping. The access via this other mapping will have its own V bits, which have not been changed when the memory was initialised via the first mapping. The bypass for these false positives is to use Memcheck's client requests VALGRIND_MAKE_MEM_DEFINED and VALGRIND_MAKE_MEM_UNDEFINED to inform Memcheck about what your program does (or what another process does) to these shared memory mappings.

RalfJung commented 5 years ago

Well, that's "specified" in the sense of "what it does is outside the scope of this spec". That's fine if the spec in question is limited to the portable subset of the language. But there needs to be a guarantee somewhere about what it does on a particular architecture; if I'm using volatile for MMIO, I won't be pleased if a compiler upgrade suddenly changes write_volatile to start sending data over the network. :) And the only way to define what it does on a given architecture with adequate precision is in terms of hardware semantics.

That's the same thing with syscalls though -- you expect them to do certain things, i.e., you expect the kernel to be well-behaved.

I suspect that I don't actually disagree with you, but am just using different terminology.

To some extend. But I'd argue that there is a huge difference between volatile and inline assembly, as the latter can't even be specified in a way that is good enough to reason about compiler optimizations (or, if it can, I haven't seen it).

For the record, I think "volatile_read implies freeze" should not be specified as a portable guarantee. There could be an architecture that tracks uninitializedness of memory, and on that architecture there might be a "freeze" or "load possibly uninitialized" instruction.

There could also be a 7-bit architecture. And unlike what you are describing, those actually exist/existed, and yet Rust does not support them. What I am saying is: I can imagine many crazy things. That's not a good argument to make it impossible to write things like what the OP asked for.

comex commented 5 years ago

That's the same thing with syscalls though -- you expect them to do certain things, i.e., you expect the kernel to be well-behaved.

Yes, and "what the kernel does for this syscall" is usually documented or specified somewhere (e.g. POSIX). But in our case there is no external definition of "what a Rust volatile access does on X architecture" to point to. There's the LLVM documentation, but that only applies to the LLVM backend. You could say – well, it's implementation defined, and LLVM is an implementation! But it doesn't make any sense that, e.g., the upcoming Cranelift backend, if and when it's completed, shouldn't be expected to provide the same guarantees.

What I am saying is: I can imagine many crazy things. That's not a good argument to make it impossible to write things like what the OP asked for.

I don't think it should be impossible. I think it should be nonportable, pending addition of better intrinsics that are better designed for the shared memory use case, which we want anyway because volatile is a poor fit in many ways. (It doesn't offer a memory ordering parameter, which is needed in many cases, though not OP's. And it provides guarantees that you never care about for shared memory, like forced access size, inability to remove dead loads, etc.)

hsivonen commented 5 years ago

The question, of course, is what the volatile read would return if there is a race. Under current LLVM semantics, it returns undef. (And again this does not and cannot depend on whether the writing thread that causes the race is "external" or not.)

How can LLVM statically distinguish a data race caused by write instructions by a different thread and a memory-mapped device changing what a mapped location returns? That is, does LLVM ever actually get to optimize on the assumption that the value was undef?

Also everyone seems to insist on continuing the discussion about tearing here even though it does not connect to the OP's question.

It is connected to my use case. The reason why I'm now looking at using volatile instead of relaxed is that relaxed provides invisibility and I want to do SIMD loads and store without locks.

@hsivonen do you want me to open a new issue and rename this one?

I think the name of this one still asks for what I'm trying to ask for.

pending addition of better intrinsics that are better designed for the shared memory use case, which we want anyway because volatile is a poor fit in many ways. (It doesn't offer a memory ordering parameter, which is needed in many cases, though not OP's. And it provides guarantees that you never care about for shared memory, like forced access size, inability to remove dead loads, etc.)

Indeed volatile provides guarantees that I don't want or need. However, I was hoping that volatile could provide at least the things I need with a mere documentation change while getting something like relaxed memory order without indivisibility would take much longer than having something this year. (And, presumably, with tearable atomics there'd still be the issue that they'd be defined with whole-program semantics and one would have to rely on the optimizer just not seeing the code for an external thread.)

RalfJung commented 5 years ago

@comex

Yes, and "what the kernel does for this syscall" is usually documented or specified somewhere (e.g. POSIX). But in our case there is no external definition of "what a Rust volatile access does on X architecture" to point to. There's the LLVM documentation, but that only applies to the LLVM backend. You could say – well, it's implementation defined, and LLVM is an implementation! But it doesn't make any sense that, e.g., the upcoming Cranelift backend, if and when it's completed, shouldn't be expected to provide the same guarantees.

Fair. That would basically be part of the "platform spec" and hence it might make some sense that it refers to things like CPU instructions.

I don't think it should be impossible. I think it should be nonportable

Nonportable = not (in general) possible. That's what I meant, sorry I should have been clearer.

@hsivonen

How can LLVM statically distinguish a data race caused by write instructions by a different thread and a memory-mapped device changing what a mapped location returns? That is, does LLVM ever actually get to optimize on the assumption that the value was undef?

That distinction makes little sense, that's exactly my point. It doesn't matter who caused the race. And yes, LLVM performs optimizations in non-atomic reads that are only sound because they return undef on a race; see for example section 2.3 of this paper.

However, LLVM does not perform any such optimization on volatile reads -- as I said, LLVM currently de facto makes it so that volatile reads return frozen data (to my knowledge). So if "de facto" is all you need, the primitive you are asking for already exists, and it is called read_volatile. (And you seem aware of that, so your question here confuses me, but whatever. ;)

But there is no guarantee that it will stay that way. I would not be surprised at all if the next version of LLVM started to optimize

fn foo() -> u32 { unsafe {
  let m = MaybeUninit::uninit();
  m.as_ptr().read_volatile() // can be assumed to read undef
} }

into a function that performs the load (it has to due to volatile), and then ignores what the load produced and returns undef. I don't think that would violate any spec of volatile that I have seen.

So, if we want to specify that in Rust, a volatile read returns frozen data, I'd say we should first get our backend(s) to commit to the same.

It is connected to my use case. The reason why I'm now looking at using volatile instead of relaxed is that relaxed provides invisibility and I want to do SIMD loads and store without locks.

What's "invisibility" here? Do you mean "indivisibility"? I am confused now though, do you want or do you not want indivisibility? If you want it, relaxed seems fine, if you don't want it then the outcome of the discussion about it doesn't matter...

However, I was hoping that volatile could provide at least the things I need with a mere documentation change

"mere" documentation change, that's funny. ;) You are asking for several fundamental changes in Rust's spec, and also for a guarantee that our backend does not currently commit to.

First of all, you are asking to switch Rust from the C++ concurrency memory model (read-write races are UB) to the LLVM memory model (read-write races make the read return undef), at least for volatile reads. Us following C++ is the reason why the documentation currently states that a volatile read racing with any other write is UB. I don't have a strong opinion either way, but the C++ memory model is much better studied. However it seems like C++ might get a similar rule for volatile accesses only; once that hits the standard we just have to update the version of C++ from which we copy our memory model, which seems unlikely to be very controversial.

Then you are asking to introduce a freeze operation into the language, and there are a bunch of good reasons not to do that. I am leaning in support of this, as are others, but this on its own is RFC-worthy. That's definitely not a "mere documentation change".

And then even if we decided that we are okay with having a freeze operation, I will argue against changing our documentation to state things that LLVM does not guarantee towards us. That seems like a mistake to me. We should work with them to be sure they can and will actually provide the guarantees we want. I'd definitely be on board in attempting that though!

RalfJung commented 5 years ago

(And, presumably, with tearable atomics there'd still be the issue that they'd be defined with whole-program semantics and one would have to rely on the optimizer just not seeing the code for an external thread.)

The discussion we are having about read_volatile not being UB with races and returning frozen data can be had without bringing up any of those whole-program / outside-of-program considerations. Under the proposed semantics for volatile reads, it doesn't matter if the writing thread is in the same address space, visible to the compiler, or whatever; a racing volatile read would be fine.

The part about C++ only having whole-program semantics came up another branch of this discussion I was recently having where people were worried about us (trusted) doing a write and the untrusted side using "too weak" synchronization and that being a problem. So far at least I have just been talking about volatile reads in this thread. If we do a write, that could race with a read or a write in the other thread and cause UB. Write-write races are UB even in LLVM, the only resort we have here is that "one half" of this race occurred "somewhere the compiler cannot see" and hence we should be fine. However, this issue is not at all unique to concurrency; UB is a whole-program condition, so this is basically the same question as "assuming the untrusted code does division by 0, can I be sure that UB does not leak into my code".

Really at this point, C/Rust is the wrong language to consider for the interaction between the trusted and the untrusted code. That interaction happens on the assembly level. So now we are getting into specifying multi-language semantics / linking semantics, which is an incredibly hard open research problem. The amount of work considering that in the context of concurrency in general and data races in particular is to my knowledge basically 0.

It would be instructive to figure out how it looks like when LLVM exploits that write-write races are UB, and whether that might even still be a concern with volatile writes -- and whether there is some way to specify the result of a racy volatile write in LLVM such that it does not cause whole-program UB.