## Pre-Pre-RFC: `core::arch::{load, store}` and stricter volatile semantics

rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do

https://rust-lang.github.io/unsafe-code-guidelines

Apache License 2.0

669 stars 58 forks source link

## Pre-Pre-RFC: `core::arch::{load, store}` and stricter volatile semantics #321

Open DemiMarie opened 2 years ago

DemiMarie commented 2 years ago

`core::arch::{load, store}`

This proposes new load and store functions (in core::arch), for raw hardware loads and stores, and a concept of an always-valid type that can safely be cast to a byte array. It also defines volatile accesses in terms of these functions.

The functions proposed here have the same semantics as raw machine load and store instructions. The compiler is not permitted to assume that the values loaded or stored are initialized, or even that they point to valid memory. However, it is permitted to assume that load and store do not violate Rust’s mutability rules.

In particular, it is valid to use these functions to manipulate memory that is being concurrently accessed or modified by any means whatsoever. Therefore, they can be used to access memory that is shared with untrusted code. For example, a kernel could use them to access userspace memory, and a user-mode server could use them to access memory shared with a less-privileged user-mode process. It is also safe to use these functions to manipulate memory that is being concurrently accessed via DMA, or that corresponds to a memory-mapped hardware register.

The core guarantee that makes load and store useful is this: A call to load or store is guaranteed to result in exactly one non-tearing non-interlocked load from or store to the exact address passed to the function, no matter what that address happens to be. To ensure this, load and store are considered partially opaque to the optimizer. The optimizer must consider them to be calls to functions that may or may not dereference their arguments. It is even possible that the operation triggers a hardware fault that some other code catches and recovers from. Hence, the compiler can never prove that a given call to core::arch::load and core::arch::store will have undefined behavior. In other ways, a call to load or store does not disable any optimizations that a call to an unknown function with the same argument would not also disable. In short: garbage in, garbage out.

The actual functions are as follows:

unsafe fn load<T>(ptr: *const T) -> T;
unsafe fn store<T>(ptr: *mut T, arg: T);

Performs a single memory access (of size size_of::<T>()) on ptr. The compiler must compile each these function calls into exactly one machine instruction. If this is not possible, it is a compile-time error. The types T for which a compiler can successfully generate code for these calls is dependent on the target architecture. Using a T that cannot safely be transmuted to or from a byte array is not forbidden, but is often erroneous, and thus triggers a lint (see below). Provided that ptr is properly aligned, these functions are guaranteed to not cause tearing. If ptr is not properly aligned, the results are architecture-dependent.

The optimizer is not permitted to assume that ptr is dereferenceable or that it is properly aligned. This allows these functions to be used for in-process debuggers, crash dumpers, and other applications that may need to access memory at addresses obtained from some external source, such as a debug console or /proc/self/maps. If load is used to violate the aliasing rules (by accessing memory the compiler thinks cannot be accessed), the value returned may be non-deterministic and may contain sensitive data. If store is used to overwrite memory the compiler can assume will not be modified, subsequent execution (after the call to store returns) has undefined behavior.

The semantics of `volatile`

A call to ptr::read_volatile desugars to one or more calls to load, and a call to ptr::write_volatile desugars to one or more calls to store. The compiler is required to minimize tearing to the extent possible, provided that doing so does not require the use of interlocked or otherwise synchronized instructions. const fn core::arch::volatile_non_tearing::<T>() -> bool returns true if T is such that tearing cannot occur for naturally-aligned accesses. It may still occur for non-aligned accesses (see below).

Unaligned volatile access

The compiler is not allowed to assume that the arguments of core::{ptr::{read_volatile, write_volatile}, arch::{load, store}} are aligned. However, it is also not required to generate code to handle unaligned access, if doing so would cause a performance penalty for the aligned case. In particular, whether the no-tearing guarantee applies to unaligned access is architecture dependent. On some architectures, it is even possible for unaligned access to cause a hardware trap.

New lints

Use of core::ptr::{read_volatile, write_volatile} with a type that cannot be safely transmuted to and from a byte slice will trigger a dubious_type_in_volatile lint. Use of core::arch::{load, store} with such types will trigger a dubious_type_in_load_or_store lint. Both are Warn by default. Thanks to @comex for the suggestion!

Lowering

LLVM volatile semantics are still unclear, and may turn out to be weaker than necessary. It is also possible that LLVM volatile requires dereferenceable or otherwise interacts poorly with some of the permitted corner-cases. Therefore, I recommend lowering core::{arch::{load, store}, ptr::{read_volatile, write_volatile}} to LLVM inline assembly instead, which is at least guaranteed to work. This may change in the future.

comex commented 2 years ago

I like this, except:

If such a transmute would be unsafe (in the sense of Project Safe Transmute), it is a compile-time error (regardless of the target platform)

This seems like it should be a trait bound.

write_volatile<T> is a compile-time error if T has any padding bits or an unspecified layout, as it as no useful semantics in that case. Similarly, read_volatile<T> is a compile-time error if T has any invalid representations.

This part is a breaking change and doesn't seem well motivated to me.

For writes: Writing padding bits is potentially a security concern due to the potential to leak memory contents, but it doesn't seem inherently unsound; any undefined bits should just be implicitly frozen to an arbitrary value. As for unspecified layout, if by that you mean things like repr(Rust), this layout can still be probed at runtime, or perhaps you don't care about the layout because you only need to read the value back later as the same type in the same program.

For reads: Just because volatile is well-suited for dealing with untrusted or potentially-corrupted memory doesn't mean that's the only possible use case. You may happen to know for whatever reason that the load will return a valid value. Perhaps you're reading it from an MMIO register; perhaps you're abusing volatile to implement atomics (bad idea but, in the right circumstances, not unsound); perhaps the load doesn't have to be volatile but is anyway due to some generics madness.

All of these cases seem dubious enough to be worth a lint, but I'm skeptical that they should be hard errors even with the new functions, let alone the existing already-stable functions.

DemiMarie commented 2 years ago

All of these cases seem dubious enough to be worth a lint, but I'm skeptical that they should be hard errors even with the new functions, let alone the existing already-stable functions.

Agreed, lint added.

Perhaps you're reading it from an MMIO register

I generally assume that MMIO devices are not automatically trustworthy, but your point stands.

bjorn3 commented 2 years ago

Therefore, the Rust compiler is never allowed to make assumptions about the memory accessed by these functions, or the results of such accesses.

That can't be done without forcing deoptimization of any program that may call this. To prevent deoptimization it would be better to say that it can access any memory which an opaque function that gets passed the pointer as argument may access. That would for example not include stack variables which don't have their address taken.

DemiMarie commented 2 years ago

Therefore, the Rust compiler is never allowed to make assumptions about the memory accessed by these functions, or the results of such accesses.

That can't be done without forcing deoptimization of any program that may call this. To prevent deoptimization it would be better to say that it can access any memory which an opaque function that gets passed the pointer as argument may access. That would for example not include stack variables which don't have their address taken.

Is this also the semantics of using inline assembly? The goal is that volatile operations will always operate on whatever happens to be at that memory address; the compiler can’t just say “I know this volatile load or store will have undefined behavior if X” and optimize accordingly. The situation you are referring to is supposed to be covered by, “At the same time, a call to load or store does not disable any optimizations that a call to an unknown function with the same argument would not also disable. In short: garbage in, garbage out.”

The reason for these seemingly contradictory requirements is that I want volatile memory access to be usable for in-process debuggers and crash dumpers. These programs need to be able to access whatever happens to be at an arbitrary caller-provided memory location and know the compiler will not try to outsmart them. This is also why using these functions to dereference null or dangling pointers is explicitly permitted. Just because you can use these functions to read from a piece of memory does not mean that Rust makes any guarantees whatsoever about what you will find there, or that your attempt won’t cause a hardware fault of some sort. Similarly, just because you can use these functions to write to a piece of memory does not mean that Rust makes any guarantees as to what impact that will have on other code. If you misuse them and something breaks, you get to keep both pieces.

DemiMarie commented 2 years ago

@bjorn3: do you have suggestions for better phrasing here? The intent is that you can use volatile memory access to peek and poke at whatever memory you want, but the consequences of doing so are entirely your responsibility. For instance, I might need to test that my program’s crash handler triggers successfully when I dereference a null pointer, or that a hardened memory allocator detects modification of freed memory and properly aborts.

bjorn3 commented 2 years ago

Is this also the semantics of using inline assembly?

I believe so.

the compiler can’t just say “I know this volatile load or store will have undefined behavior if X” and optimize accordingly.

It has to for any optimization to be possible.

The situation you are referring to is supposed to be covered by, “At the same time, a call to load or store does not disable any optimizations that a call to an unknown function with the same argument would not also disable. In short: garbage in, garbage out.”

Didn't see that sentence. I agree that covers my situation.

The reason for these seemingly contradictory requirements is that I want volatile memory access to be usable for in-process debuggers and crash dumpers.

Those things are UB either way. Individual compilers just do a best effort at trying to make them work the way a user expects them to work when optimizations are disabled. When optimizations are enabled it is even more in a best effort basis. For example it may not be possible to change function parameters if the compiler found that a function argument is constant and optimized accordingly.

DemiMarie commented 2 years ago

Is this also the semantics of using inline assembly?

I believe so.

Good to know!

the compiler can’t just say “I know this volatile load or store will have undefined behavior if X” and optimize accordingly.

It has to for any optimization to be possible.

To elaborate, what I am not okay with is the compiler optimizing out the entire basic block as unreachable code. Compilers have a nasty habit of doing that, so I wanted to be absolutely clear that is not permitted.

The situation you are referring to is supposed to be covered by, “At the same time, a call to load or store does not disable any optimizations that a call to an unknown function with the same argument would not also disable. In short: garbage in, garbage out.”

Didn't see that sentence. I agree that covers my situation.

Thank you.

The reason for these seemingly contradictory requirements is that I want volatile memory access to be usable for in-process debuggers and crash dumpers.

Those things are UB either way. Individual compilers just do a best effort at trying to make them work the way a user expects them to work when optimizations are disabled. When optimizations are enabled it is even more in a best effort basis. For example it may not be possible to change function parameters if the compiler found that a function argument is constant and optimized accordingly.

That behavior is perfectly acceptable (though ideally it would be reflected in the debug info). I wonder if our definitions of UB are slightly different. To mean, an execution with UB has no meaning at all, and the compiler is allowed to prune any basic block that invokes UB. core::arch::{load, store} never invoke this form of UB themselves, but are unsafe because they can cause a crash or even invoke UB in unrelated code.

Diggsey commented 2 years ago

To elaborate, what I am not okay with is the compiler optimizing out the entire basic block as unreachable code. Compilers have a nasty habit of doing that, so I wanted to be absolutely clear that is not permitted.

If the code is unreachable, then of course the compiler is permitted to optimize it away. The compiler is allowed to assume that no UB occurs within the program when determining what is reachable (and for all other optimizations). This is true for all code: there's no special case for atomics.

DemiMarie commented 2 years ago

To elaborate, what I am not okay with is the compiler optimizing out the entire basic block as unreachable code. Compilers have a nasty habit of doing that, so I wanted to be absolutely clear that is not permitted.

If the code is unreachable, then of course the compiler is permitted to optimize it away. The compiler is allowed to assume that no UB occurs within the program when determining what is reachable (and for all other optimizations). This is true for all code: there's no special case for atomics.

According to the model mentioned above (a volatile memory access is an opaque function call or inline assembler), the compiler does not actually know that std::ptr::read_volatile(x) actually dereferences x, merely that it may dereference x. So it cannot prove that std::ptr::read_volatile(x) is UB no matter what x is.

Diggsey commented 2 years ago

So it cannot prove that std::ptr::read_volatile(x) is UB no matter what x is.

Right, but whether or not x is dereferenceable is only one potential source of UB. The compiler could still decide that the call is unreachable for other reasons (say x is the result of an expression that invokes UB).

I'd also question whether it makes sense to be quite so lenient with x - it definitely makes sense that x might not be dereferenceable, but what if it's undef or poison? I don't really have the expertise to answer that though...

DemiMarie commented 2 years ago

So it cannot prove that std::ptr::read_volatile(x) is UB no matter what x is.

Right, but whether or not x is dereferenceable is only one potential source of UB. The compiler could still decide that the call is unreachable for other reasons (say x is the result of an expression that invokes UB).

That’s fine.

I'd also question whether it makes sense to be quite so lenient with x - it definitely makes sense that x might not be dereferenceable, but what if it's undef or poison? I don't really have the expertise to answer that though...

I decided to err on the side of the simplest possible semantics.

RalfJung commented 2 years ago

Thanks for writing this up!

I was confused what the difference to volatile would be. Is it fair to say that this is morally what volatile "ought to be", and the only reason you are avoiding the term volatile is that volatile has bad interactions with atomic accesses in C/C++/LLVM?

an always-valid type that can safely be cast to a byte array

This sounds contradictory: a byte array ([u8; N]) cannot hold all data (e.g., uninit memory), so an always-valid type cannot be safely cast to a byte array.

The functions proposed here have the same semantics as raw machine load and store instructions. The compiler is not permitted to assume that the values loaded or stored are initialized, or even that they point to valid memory. However, it is permitted to assume that loads via store do not violate Rust’s mutability rules.

"loads via store" seems odd; I assume "loads and stores" is what you meant?

I think this is too weak. In particular, I think the compiler should be able to assume that the given address is the only memory (that the compiler knows about) that this operation can access, and that it will not have other side-effects (like synchronization) either. So for example if arch::load is called on some *const i32 that does not alias some other pointer p, the compiler is allowed to reorder regular stores to p with those non-aliasing loads.

This also makes these operations quite different from inline assembly, which is allowed to do much more. With inline assembly, the intention is that even swapping out the assembly block by some other code at runtime (that still adheres to the same clobber annotations) is allowed. I don't think we want that for arch::{load, store}.

non-tearing non-interlocked load from or store to

What is an "interlocked" load/store?

The actual functions are as follows:

I am very strongly opposed to using usize to indicate memory accesses. In particular, the aliasing rules still matter (as you say), and hence provenance still matters. That means these functions should work on pointers, not integers.

T must be a type such that T can safely be transmuted to (resp. from) [u8; size_of::]. Otherwise, the behavior of this function is the same as that of inline assembly that contains a single non-atomic load (resp. store) instruction of the correct size.

"Otherwise" here sounds like "in case T cannot be safely transmuted as described", but I doubt that is what you mean.

"non-atomic" is a bad choice here, since assembly languages don't even distinguish atomic and non-atomic accesses -- that is a surface language thing. And the entire point of your proposal is that these accesses are atomic, in the sense of not introducing data races.

But this reminds me, the interactions with the concurrency model need to be specified better I think. If there are overlapping concurrent non-atomic writes to the same location, that is UB. If there are non-atomic reads concurrent to an overlapping arch::store, that is UB. And if there are concurrent overlapping atomic accesses, then what are the possible observed values on read accesses? If my hardware has strong consistency guarantees, like x86, can I use this to do the equivalent of an "acquire read"? That would be in contradiction with some of the reorderings I mentioned above.

To avoid problems with LLVM’s unclear volatile semantics, the LLVM backend should in fact lower this function to LLVM inline assembler.

What is wrong with using LLVM volatile atomic accesses?

Lokathor commented 2 years ago

Ralf I'm 99% certain that "always valid" is meant to be read as "all initialized bit patterns are valid", not that it's also allowing uninit memory.

RalfJung commented 2 years ago

Ralf I'm 99% certain that "always valid" is meant to be read as "all initialized bit patterns are valid", not that it's also allowing uninit memory.

That still leaves provenance as a concern -- transmuting pointers with provenance to integers is subtle at best, making (arrays of) integer types like u8 IMO unsuited as "generic data containers".

Lokathor commented 2 years ago

I'm pretty sure that everything you just said also applies if you "raw" read/write a pointer with memory, right? Because provenance isn't a physical concept that's actually stored in a device. So if you raw read a pointer from memory you already don't know the provenance, regardless of the array of u8 part or not. It would have to be treated as an int read with an automatic int2ptr cast, or something like that.

RalfJung commented 2 years ago

provenance isn't a physical concept that's actually stored in a device

Sure (except on CHERI ;). But that doesn't mean that the Abstract Machine model of an arch::load will always return provenance-free data. That would be a very strong assumption that I would not be willing to make. It would mean you couldn't use arch::load to load a pointer that has provenance attached to it, making a bunch of code UB that probably shouldn't be UB.

let mut loc = &mut 0i32;
let ptr = arch::load(&loc as *const _ as *const *mut i32);
let _val = *ptr; // UB, since this ptr does not have the right provenance to access `loc`

Lokathor commented 2 years ago

Hm. What are the provenance rules if asm! is involved? It seems that the intent of these load/store functions is to have things works "basically like the inline assembly would", and just limit off all the other things asm also does to have a simpler API for this particular use case.

RalfJung commented 2 years ago

Making asm! formally precise is a lot harder than making volatile accesses precise. ;) But I don't think we would want the "effect on the Abstract Machine state" that asm! can have to be limited to things not involving provenance. The equivalent of the above code written using asm! instead of arch::load should likely also be defined behavior, after all.

DemiMarie commented 2 years ago

@andyhhp what are your thoughts? What semantics do you want for volatile loads and stores?

DemiMarie commented 2 years ago

Thanks for writing this up!

You’re welcome!

I was confused what the difference to volatile would be. Is it fair to say that this is morally what volatile "ought to be", and the only reason you are avoiding the term volatile is that volatile has bad interactions with atomic accesses in C/C++/LLVM?

Yes and no. Volatile in C/C++/LLVM does have unclear semantics with respect to concurrency, but it also has unclear semantics in other ways as well, which means that those writing low-level code might find themselves resorting to asm!. I would rather avoid that.

an always-valid type that can safely be cast to a byte array

This sounds contradictory: a byte array ([u8; N]) cannot hold all data (e.g., uninit memory), so an always-valid type cannot be safely cast to a byte array.

In my mind, core::arch::load doesn’t know (or care) if the memory it accesses is initialized or not. It just grabs whatever happens to be at the given hardware address. So using it to access uninitialized memory isn’t UB, but rather produces an initialized but unpredictable value.

The functions proposed here have the same semantics as raw machine load and store instructions. The compiler is not permitted to assume that the values loaded or stored are initialized, or even that they point to valid memory. However, it is permitted to assume that loads via store do not violate Rust’s mutability rules.

"loads via store" seems odd; I assume "loads and stores" is what you meant?

Yes

I think this is too weak. In particular, I think the compiler should be able to assume that the given address is the only memory (that the compiler knows about) that this operation can access, and that it will not have other side-effects (like synchronization) either. So for example if arch::load is called on some *const i32 that does not alias some other pointer p, the compiler is allowed to reorder regular stores to p with those non-aliasing loads.

I’m not sure about that. @andyhhp @marmarek @HW42 thoughts?

This also makes these operations quite different from inline assembly, which is allowed to do much more. With inline assembly, the intention is that even swapping out the assembly block by some other code at runtime (that still adheres to the same clobber annotations) is allowed. I don't think we want that for arch::{load, store}.

That is a fair point. The reason I mentioned asm! is that asm! is the only thing I know of in current Rust that does what is needed here. In other words, I know asm! is sufficient, and I am not aware of anything else that is.

non-tearing non-interlocked load from or store to

What is an "interlocked" load/store?

core::arch::{load, store} should lower to a single mov instruction or similar, not cmpxchg or the like, and shouldn’t insert any memory barrier instructions.

The actual functions are as follows:

I am very strongly opposed to using usize to indicate memory accesses. In particular, the aliasing rules still matter (as you say), and hence provenance still matters. That means these functions should work on pointers, not integers.

That’s valid, though most use-cases I can think of will at very least want a newtype. What matters here is that I need to be able to write:

fn peek_u32(s: string_from_debug_prompt) -> Result<u32, std::num::TryFromIntError> {
    let s: usize = s.parse()?;
    Ok(core::arch::load(s as *const u32))
}

and not have the compiler second-guess me. I know that converting some user-provided string to a usize, casting it to a pointer, and then dereferencing the pointer is almost always wrong (and doesn’t work on CHERI), but kernel debuggers actually need to do it, and the compiler shouldn’t be breaking them.

T must be a type such that T can safely be transmuted to (resp. from) [u8; size_of::]. Otherwise, the behavior of this function is the same as that of inline assembly that contains a single non-atomic load (resp. store) instruction of the correct size.

"Otherwise" here sounds like "in case T cannot be safely transmuted as described", but I doubt that is what you mean.

Yeah, I wrote this when I was quite tired, so there are definitely some thinkos.

"non-atomic" is a bad choice here, since assembly languages don't even distinguish atomic and non-atomic accesses -- that is a surface language thing. And the entire point of your proposal is that these accesses are atomic, in the sense of not introducing data races.

Fair point. What I meant is that the compiler does not need to insert any memory barriers.

But this reminds me, the interactions with the concurrency model need to be specified better I think. If there are overlapping concurrent non-atomic writes to the same location, that is UB. If there are non-atomic reads concurrent to an overlapping arch::store, that is UB. And if there are concurrent overlapping atomic accesses, then what are the possible observed values on read accesses? If my hardware has strong consistency guarantees, like x86, can I use this to do the equivalent of an "acquire read"?

You can certainly use it as a “release read”; I am not sure about the others. My intuition is that a volatile load should imply a compile-time memory barrier, but I will leave that to the people who actually use them in anger.

That would be in contradiction with some of the reorderings I mentioned above.

To avoid problems with LLVM’s unclear volatile semantics, the LLVM backend should in fact lower this function to LLVM inline assembler.

What is wrong with using LLVM volatile atomic accesses?

Nothing, other than that I don’t known what the semantics of LLVM volatile atomic accesses are :slightly_smiling_face:.

More generally, the main uses I can think of for volatile operations are in machine-dependent code, which is why my proposal defers to hardware semantics in so many places. If I am manipulating page tables in a kernel or hypervisor, or accessing MMIO registers in a driver, I need to know that core::arch::{load, store} will lower to exactly one machine instruction, and that that instruction is guaranteed to not tear. I also need to know that the it will access whatever machine address I specify, even if that address is something that (from the compiler’s perspective) I have no business looking at. I might be running on a platform where address 0 really does have memory that I need to access, or I might be deliberately dereferencing NULL to make sure that my page fault handler works properly. There are almost certainly lots of other uses I haven’t considered here, but which should still work.

Making asm! formally precise is a lot harder than making volatile accesses precise. ;)

Welcome to the messy world of interacting with hardware :smile:.

But I don't think we would want the "effect on the Abstract Machine state" that asm! can have to be limited to things not involving provenance. The equivalent of the above code written using asm! instead of arch::load should likely also be defined behavior, after all.

Agreed.

To avoid problems with LLVM’s unclear volatile semantics, the LLVM backend should in fact lower this function to LLVM inline assembler.

What is wrong with using LLVM volatile atomic accesses?

The problem with LLVM volatile atomic accesses is that their semantics are not known to be what I would want for the kind of low-level applications I am envisioning. In particular, my understanding is that their semantics are architecture-independent, whereas I believe lots of use-cases explicitly want semantics that are architecture-dependent. Some synchronization primitives, for instance, explicitly rely on the fact that modern hardware (DEC Alpha is long dead) does not reorder dependent loads even if there is no hardware memory barrier.

RalfJung commented 2 years ago

Yes and no. Volatile in C/C++/LLVM does have unclear semantics with respect to concurrency, but it also has unclear semantics in other ways as well, which means that those writing low-level code might find themselves resorting to asm!. I would rather avoid that.

Which other ways are you referring to here? That seems worth discussing in depth in the RFC.

In my mind, core::arch::load doesn’t know (or care) if the memory it accesses is initialized or not. It just grabs whatever happens to be at the given hardware address. So using it to access uninitialized memory isn’t UB, but rather produces an initialized but unpredictable value.

We have to describe its effect in the Abstract Machine though. So it seems you are proposing that arch::load implicitly does something like freeze. This seems worth discussing explicitly. (And it doesn't entirely solve the problem as other "non-standard" Bytes remain, due to provenance -- see my last comments above.) Rust doesn't currently have freeze, so exposing this seems like a fundamental extension to the expressivity of the language. OTOH if you interact with untrusted parties and they can make that memory uninit, you have a problem -- for these cases you have to be able to rely on "no uninit".

That’s valid, though most use-cases I can think of will at very least want a newtype. What matters here is that I need to be able to write:

I think "where you get the pointer from" and having volatile (for lack of a better term) semantics on the access are entirely orthogonal concerns. So I don't think this should affect the type signature of these operations.

Integers have no provenance, so whether you parsed that integer from a string or got it via a ptr-to-int cast makes no difference, even with regular Rust memory accesses.

You can certainly use it as a “release read”

"release read"? In C11, release is for writes and acquire for reads, so a "release read" doesn't make a ton of sense to me.

Also after what you said earlier about "non-interlocked" semantics, I would strongly expect that these operations do not have any synchronizing side-effect (e.g., the compiler can treat them as nosync).

(Interestingly, I just noticed LLVM LangRef says volatile accesses are not nosync. I am very confused by this.)

The problem with LLVM volatile atomic accesses is that their semantics are not known to be what I would want for the kind of low-level applications I am envisioning. In particular, my understanding is that their semantics are architecture-independent, whereas I believe lots of use-cases explicitly want semantics that are architecture-dependent. Some synchronization primitives, for instance, explicitly rely on the fact that modern hardware (DEC Alpha is long dead) does not reorder dependent loads even if there is no hardware memory barrier.

My thinking was that we want to give an architecture-independent description that is enough to tell us what the compiler can and cannot do. The exact effect of course is architecture-dependent, but unlike asm! blocks I think here we can give a reasonable "envelope" of what the effect of an arch::{load,store} on the Abstract Machine state can be. (For example: can it ever return uninit bytes? can it return bytes with provenance? Which parts of the AM memory can be mutated?)

I fully agree re: accessing NULL and shenanigans like that. I am pretty sure that volatile atomic accesses are guaranteed not to tear (that is AFAIK generally true for atomic accesses). I think people want LLVM volatile (atomic) accesses to be what you describe here, so IMO it would make sense to use them and then work with LLVM in case things don't work out.

Either way, the Rust semantics should be described independently of LLVM, so we can change the implementation strategy as needed. But given that these things are so hard to describe, drawing comparisons with things that are at least somewhat established makes sense IMO.

RalfJung commented 2 years ago

Your example about synchronization patterns worries me quite a bit; I am not sure we want to encourage using this kind of operation as a replacement for our regular atomics. Relying on "dependent loads" establishing some sort of synchronization seems like a problem to me, and not least because the entire notion of a "dependency" is on very shaky grounds in a surface language; there's a reason consume isn't working out.

For example, it is generally a permissible transformation for a compiler to turn

let i = make_i();
work_on_i(i);

into

let i = make_i();
if i == 0 {
  work_on_i(0);
} else {
  work_on_i(i);
}

However, if i is initialized with a arch::load, and work_on_i does another arch::load on something like other_ptr.add(i), then this transformation will turn a data-dependent load into a merely control-dependent load (and some modern hardware does very shady things with control dependencies). This optimization is a universal principle, i.e., it works even when the code in make_i and work_on_i is not known to the compiler -- thus, no amount of wording in arch::load can make this optimization invalid.

In that sense, I don't think your idea of an architecture-dependent semantics works (or maybe I understood what you meant by it). The only way to exploit this kind of "dependent load" pattern is to have both loads in the same asm! block, so that the architecture notion of a "dependency" can be used. But there cannot be any "architecture dependency edges" between asm! blocks (or between arch::{load,store}); any kind of synchronization between those blocks must be fully captured by the Abstract Machine semantics.

DemiMarie commented 2 years ago

I fully agree re: accessing NULL and shenanigans like that.

Thanks. Good to know that we agree on that.

I am pretty sure that volatile atomic accesses are guaranteed not to tear (that is AFAIK generally true for atomic accesses).

It’s actually more subtle: volatile accesses should be guaranteed not to tear if the argument is properly aligned, but if the argument is not aligned, the behavior is hardware-dependent. On x86, accesses that cross a cache line may tear, while on some platforms unaligned access actually faults. Volatile accesses are a raw interface and should not try to hide this behavior.

I think people want LLVM volatile (atomic) accesses to be what you describe here, so IMO it would make sense to use them and then work with LLVM in case things don't work out.

That is a good idea, but I would like to make sure that the bugs are worked out before this gets pushed.

Relying on "dependent loads" establishing some sort of synchronization seems like a problem to me, and not least because the entire notion of a "dependency" is on very shaky grounds in a surface language; there's a reason consume isn't working out.

The problem is that Linux RCU has a hard dependency on consume to get unlimited read-side scaling. Right now they use relaxed and rely on GCC and Clang not breaking it. Rust ought to be able to express rcu_dereference without a huge performance penalty.

You can certainly use it as a “release read”

"release read"? In C11, release is for writes and acquire for reads, so a "release read" doesn't make a ton of sense to me.

Release loads are needed for seqlocks, and the alternatives available in Rust have such poor performance that in practice people wind up using an approach that is technically UB. Rust doesn’t have the atomic_thread_fence primitive needed to implement a high performing approach safely, and even that isn’t really what is wanted.

digama0 commented 2 years ago

You can certainly use it as a “release read”

"release read"? In C11, release is for writes and acquire for reads, so a "release read" doesn't make a ton of sense to me.

Release loads are needed for seqlocks, and the alternatives available in Rust have such poor performance that in practice people wind up using an approach that is technically UB. Rust doesn’t have the atomic_thread_fence primitive needed to implement a high performing approach safely, and even that isn’t really what is wanted.

It's a bit off topic but I got curious about what a "release load" is and whether we need one, since like Ralf this does not make any sense to me as a concept. Here's a simplified version of the code from the blog post focusing on order of operations:

fn read() {
  let seq1 = seq.load(Acquire); // R1
  let result = data.read_volatile(); // R2
  let seq2 = seq.load(Release); // R3
  // do stuff
}

static seq: AtomicUsize = _; // W0
static data: UnsafeCell<T> = _; // W0
fn write() {
  seq.compare_and_swap(_, _, Acquire); // W1
  data.write_volatile(_); // W2
  seq.store(_, Release); // W3
}

It is claimed that we want operation R3 to be a release load, because we want to ensure that R2 is not moved after R3. This makes some sense from the perspective of "roach motel" reorderings, but from the "message passing" interpretation a release load doesn't make sense since load operations can't broadcast information to synchronize on.

To break it down, the litmus test situation showing why it would be bad to reorder R2 and R3 is where R2 reads the result of W2, and R3 reads the result prior to W1, leading to a disallowed cycle since R3 <rf- W0 -mo> W2 -rf> R2 -sb> R3 is a violation of the coherence axiom of C11 atomics (where rf is reads-from (a write relates to a read that reads from it), mo is modification-order (subsequent writes to an atomic location), and sb is sequenced-before (program order within a thread)). However this requires the mo ordering on data, which requires that the accesses to data are atomic and should be release/acquire as well.

What is needed to make this work is to add a read-read fence between R2 and R3. This effectively upgrades the prior read_volatile operation to an acquire load. The nearest equivalent in rust is atomic::fence(Acquire), but this only works on atomic operations, so perhaps what we need is a stronger kind of fence that works on non-atomic operations as well. I don't think that a "release load" is the right way to solve the problem because the loads aren't publishing information to other threads (and in this example we explicitly don't want to incur the cost of getting exclusive ownership of the cache line in the reader thread).

DemiMarie commented 2 years ago

a stronger kind of fence that works on non-atomic operations

This should (hopefully!) be enough for seqlocks, but I am not sure if it is enough for RCU.

DemiMarie commented 2 years ago

I just made an edit that should hopefully clean up some stuff. I should probably make an actual RFC out of this.

comex commented 2 years ago

Relying on "dependent loads" establishing some sort of synchronization seems like a problem to me, and not least because the entire notion of a "dependency" is on very shaky grounds in a surface language; there's a reason consume isn't working out.

The problem is that Linux RCU has a hard dependency on consume to get unlimited read-side scaling. Right now they use relaxed and rely on GCC and Clang not breaking it. Rust ought to be able to express rcu_dereference without a huge performance penalty.

You may be fully aware of the following, but I'll say it anyway, since I've seen suggestions floating around before that RCU and seqlocks are uniquely problematic in Rust.

Today, rustc compiles atomics, volatile accesses (the existing ones), and inline assembly to the same LLVM IR as Clang does their C counterparts, at which point LLVM's optimizer applies the same optimizations to it. rustc does additionally have some MIR optimizations which are specific to Rust, but they're pretty minimal and unlikely to cause a problem here. So when it comes to relying on data dependencies being preserved, the Linux kernel's "wing it" approach should work equally well in Rust as in C.

Which is to say: it does work most of the time in practice, and most examples where it breaks are somewhat pathological/unrealistic code. But on the other hand, it's hard to rule out that there might be some real code where it breaks, and LLVM and GCC's optimizers will only get smarter in the future.

We shouldn't encourage typical Rust users to rely on it, but we also should make sure the Linux project doesn't see it as a reason to avoid adopting Rust.

This RFC's choice to use inline assembly rather than LLVM volatile accesses doesn't help: as Ralf showed, the main problematic optimization for data dependencies can happen regardless of what construct is used for the store to memory. I support this RFC's choice to use inline assembly, because LLVM's volatile semantics are unclear and might hypothetically be weaker than desired. But in practice I would be shocked if inline assembly blocked any optimizations that LLVM volatile doesn't, except by accident (e.g. due to a bug in LLVM volatile, or due to LLVM being overly conservative with inline assembly in a way that isn't related to the guarantees needed here). Though note: LLVM volatile doesn't have an implicit compiler barrier, so if we decide to add one to this API, the proper comparison would be to LLVM volatile combined with an explicit compiler barrier.

DemiMarie commented 2 years ago

Relying on "dependent loads" establishing some sort of synchronization seems like a problem to me, and not least because the entire notion of a "dependency" is on very shaky grounds in a surface language; there's a reason consume isn't working out.

The problem is that Linux RCU has a hard dependency on consume to get unlimited read-side scaling. Right now they use relaxed and rely on GCC and Clang not breaking it. Rust ought to be able to express rcu_dereference without a huge performance penalty.

You may be fully aware of the following, but I'll say it anyway, since I've seen suggestions floating around before that RCU and seqlocks are uniquely problematic in Rust.

Today, rustc compiles atomics, volatile accesses (the existing ones), and inline assembly to the same LLVM IR as Clang does their C counterparts, at which point LLVM's optimizer applies the same optimizations to it. rustc does additionally have some MIR optimizations which are specific to Rust, but they're pretty minimal and unlikely to cause a problem here. So when it comes to relying on data dependencies being preserved, the Linux kernel's "wing it" approach should work equally well in Rust as in C.

Which is to say: it does work most of the time in practice, and most examples where it breaks are somewhat pathological/unrealistic code. But on the other hand, it's hard to rule out that there might be some real code where it breaks, and LLVM and GCC's optimizers will only get smarter in the future.

We shouldn't encourage typical Rust users to rely on it, but we also should make sure the Linux project doesn't see it as a reason to avoid adopting Rust.

I agree, and I also think we should work on a solution for both Rust and C that doesn’t have undefined behavior.

This RFC's choice to use inline assembly rather than LLVM volatile accesses doesn't help: as Ralf showed, the main problematic optimization for data dependencies can happen regardless of what construct is used for the store to memory. I support this RFC's choice to use inline assembly, because LLVM's volatile semantics are unclear and might hypothetically be weaker than desired.

:+1:

But in practice I would be shocked if inline assembly blocked any optimizations that LLVM volatile doesn't, except by accident (e.g. due to a bug in LLVM volatile, or due to LLVM being overly conservative with inline assembly in a way th at isn't related to the guarantees needed here). Though note: LLVM volatile doesn't have an implicit compiler barrier, so if we decide to add one to this API, the proper comparison would be to LLVM volatile combined with an explicit compiler barrier.

@andyhhp @marmarek: would a compiler barrier be desired here?

RalfJung commented 2 years ago

volatile accesses should be guaranteed not to tear if the argument is properly aligned

That is certainly not the case in general as volatile accesses can be arbitrarily large. I doubt my read_volatile on [u64; 64] will be compiled in a non-tearing way. ;)

The problem is that Linux RCU has a hard dependency on consume to get unlimited read-side scaling. Right now they use relaxed and rely on GCC and Clang not breaking it. Rust ought to be able to express rcu_dereference without a huge performance penalty.

Is doing this with inline assembly an option (in the way I described above, where the syntactic dependency is all within the same asm block)? If not, then we'd still be no worse off than C if we tell people to use Relaxed...

Release loads are needed for seqlocks

I literally can't make sense of it -- I think of release/acquire in terms of models like this, and "release load" just doesn't make sense as a concept. 'Release' means that the write event generated by this operation captures the 'view' of the current thread such that when that write event is later observed by an 'acquire' read event, the attached 'view' is integrated into the 'view' of the thread that did the 'acquire'. A load does not generate a write event, so saying it is 'release' is a category error.

The blog post thinks of atomic operations as being define in terms of reorderings, but that is not how they actually work -- defining an operation in terms of its allowable optimization is a sure path to madness.

The main issue with seqlocks (as far as I am aware) is the data read, which people want to be non-atomic but that is not allowed under C11 because it's a data race. Using volatile reads is a gross and non-compliant hack to work around that. The proper solution IMO is to take a hint from the LLVM memory model (which is different from the C++ memory model) and say that a racy read returns undef, but is not UB. Then you could read self.data into a MaybeUninit, do the second self.seq load to verify that there was no race, and call assume_init. If all seq.load are acquire (and the RMWs that increment seq.load are RelAcq), is there still anything that goes wrong?

(After re-reading this old classic): okay I think I see the problem, but that paper also discusses solutions -- and a "release load" is not one of them. ;) It also mentions a fence-based solution that should work in Rust as well?

But anyway -- since "release read" doesn't make sense in the memory model, I don't think arch::load can possibly have an effect like that. There is no cheap way to close the gaps in our memory model, there is no shortcut around actually supporting whatever RCU and seqlocks need in our atomics. That is exactly what arch::{load,store} should explicitly not be allowed to do, since it simply doesn't work -- the compiler will still do language-based reasoning 'between' arch operations.

So I think we should keep "gaps in the concurrency memory model" entirely out of this thread, since those issues fundamentally cannot be solved by just adding a few new opaque functions in arch. Opaque functions still fundamentally have to have some behavior that is expressible in the Abstract Machine; the issues with seqlock and RCU are about behaviors the Abstract Machine cannot express; ergo the only way to fix that is by actually making the Abstract Machine more expressible -- e.g. by adding read-dont-modify-write operations.

IOW, please let's not mix up the following two mostly unrelated concerns:

expressing interactions with the "machine-level outside world" (MMIO registers, memory that is shared with untrusted other processes)
expressing some interesting concurrency patterns that cannot currently be expressed, for data structures that operate entirely inside the Rust Abstract Machine (but where that machine currently lacks some primitives)

DemiMarie commented 2 years ago

volatile accesses should be guaranteed not to tear if the argument is properly aligned

That is certainly not the case in general as volatile accesses can be arbitrarily large. I doubt my read_volatile on [u64; 64] will be compiled in a non-tearing way. ;)

Whoops! That guarantee applies for core::arch::{load, store}, but not for core::ptr::{read_volatile, write_volatile} unless a use of core::arch::{load, store} with the same type would compile, or (equivalently) if core::arch::volatile_non_tearing::<T>() returns true.

The problem is that Linux RCU has a hard dependency on consume to get unlimited read-side scaling. Right now they use relaxed and rely on GCC and Clang not breaking it. Rust ought to be able to express rcu_dereference without a huge performance penalty.

Is doing this with inline assembly an option (in the way I described above, where the syntactic dependency is all within the same asm block)?

Not that I am aware of.

If not, then we'd still be no worse off than C if we tell people to use Relaxed...

I agree (see also @comex’s comment).

Release loads are needed for seqlocks

I literally can't make sense of it -- I think of release/acquire in terms of models like this, and "release load" just doesn't make sense as a concept. 'Release' means that the write event generated by this operation captures the 'view' of the current thread such that when that write event is later observed by an 'acquire' read event, the attached 'view' is integrated into the 'view' of the thread that did the 'acquire'. A load does not generate a write event, so saying it is 'release' is a category error.

I’m no expert on memory models, and I am not surprised that “release load” is incorrect terminology.

The blog post thinks of atomic operations as being define in terms of reorderings, but that is not how they actually work -- defining an operation in terms of its allowable optimization is a sure path to madness.

Again, no expert on memory models.

The main issue with seqlocks (as far as I am aware) is the data read, which people want to be non-atomic but that is not allowed under C11 because it's a data race. Using volatile reads is a gross and non-compliant hack to work around that. The proper solution IMO is to take a hint from the LLVM memory model (which is different from the C++ memory model) and say that a racy read returns undef, but is not UB. Then you could read self.data into a MaybeUninit, do the second self.seq load to verify that there was no race, and call assume_init. If all seq.load are acquire (and the RMWs that increment seq.load are RelAcq), is there still anything that goes wrong?

No idea, but you should file a PR with the crate that actually implements seqlocks.

(After re-reading this old classic): okay I think I see the problem, but that paper also discusses solutions -- and a "release load" is not one of them. ;) It also mentions a fence-based solution that should work in Rust as well?

Unlike C++, Rust doesn’t have a fence that works on non-atomic memory accesses. That’s a limitation in Rust and has nothing to do with volatile, as you mentioned.

But anyway -- since "release read" doesn't make sense in the memory model, I don't think arch::load can possibly have an effect like that. There is no cheap way to close the gaps in our memory model, there is no shortcut around actually supporting whatever RCU and seqlocks need in our atomics. That is exactly what arch::{load,store} should explicitly not be allowed to do, since it simply doesn't work -- the compiler will still do language-based reasoning 'between' arch operations.

I agree, with a caveat: there are cases where one must perform atomic operations on memory shared with untrusted code, and so needs what is effectively a volatile atomic access unless I am missing something.

So I think we should keep "gaps in the concurrency memory model" entirely out of this thread, since those issues fundamentally cannot be solved by just adding a few new opaque functions in arch. Opaque functions still fundamentally have to have some behavior that is expressible in the Abstract Machine; the issues with seqlock and RCU are about behaviors the Abstract Machine cannot express; ergo the only way to fix that is by actually making the Abstract Machine more expressible -- e.g. by adding read-dont-modify-write operations.

IOW, please let's not mix up the following two mostly unrelated concerns:

expressing interactions with the "machine-level outside world" (MMIO registers, memory that is shared with untrusted other processes)

This is what core::{arch::{load, store}, ptr::{read_volatile, write_volatile}}} are actually for. There is no way their semantics can be expressed in the Rust Abstract Machine, because they are intended to have side effects outside of it. Fundamentally, these functions are as much I/O operations as the functions in std::io are.

expressing some interesting concurrency patterns that cannot currently be expressed, for data structures that operate entirely inside the Rust Abstract Machine (but where that machine currently lacks some primitives)

Agreed. Would you be willing to file a separate issue for those, or even make an RFC? This is very much off-topic.

Lokathor commented 2 years ago

I believe that compiler_fence applies to even non-atomic access.

RalfJung commented 2 years ago

This is what core::{arch::{load, store}, ptr::{read_volatile, write_volatile}}} are actually for. There is no way their semantics can be expressed in the Rust Abstract Machine, because they are intended to have side effects outside of it. Fundamentally, these functions are as much I/O operations as the functions in std::io are.

That's what I figured, so I was surprised when you brought up release reads, seqlocks, and RCU. ;)

I can try opening an issue for seqlocks (though probably not today), but I am actually not very firm on the exact needs of RCU so I will leave that to someone else.

DemiMarie commented 2 years ago

This is what core::{arch::{load, store}, ptr::{read_volatile, write_volatile}}} are actually for. There is no way their semantics can be expressed in the Rust Abstract Machine, because they are intended to have side effects outside of it. Fundamentally, these functions are as much I/O operations as the functions in std::io are.

That's what I figured, so I was surprised when you brought up release reads, seqlocks, and RCU. ;)

It’s mostly a case of “this is something that might be a (bad) workaround for a current limitation in Rust”.

I can try opening an issue for seqlocks (though probably not today), but I am actually not very firm on the exact needs of RCU so I will leave that to someone else.

Any kernel developers here?

RalfJung commented 2 years ago

So coming back to my question that started this entire tangent:

But this reminds me, the interactions with the concurrency model need to be specified better I think. If there are overlapping concurrent non-atomic writes to the same location, that is UB. If there are non-atomic reads concurrent to an overlapping arch::store, that is UB. And if there are concurrent overlapping atomic accesses, then what are the possible observed values on read accesses? If my hardware has strong consistency guarantees, like x86, can I use this to do the equivalent of an "acquire read"? That would be in contradiction with some of the reorderings I mentioned above.

I think the answer should be that these operations do not synchronize in any way. In particular, it is legal to re-order non-aliasing regular non-atomic memory accesses around them both ways (like it would be with std::io operations).

DemiMarie commented 2 years ago

So coming back to my question that started this entire tangent:

But this reminds me, the interactions with the concurrency model need to be specified better I think. If there are overlapping concurrent non-atomic writes to the same location, that is UB.

Personally, I would prefer a “if the values stored are different” caveat at the very least. But that is a totally different topic.

If there are non-atomic reads concurrent to an overlapping arch::store, that is UB.

Specifically, it is UB on the part of whoever is doing the non-atomic reads. Otherwise one could not safely use arch::store on memory shared with untrusted code, which is explicitly a supported use-case. Same for non-atomic writes concurrent to an overlapping arch::load: whoever is doing the writes has UB, but the arch::load does not (and doesn’t return undef, either).

And if there are concurrent overlapping atomic accesses, then what are the possible observed values on read accesses? If my hardware has strong consistency guarantees, like x86, can I use this to do the equivalent of an "acquire read"? That would be in contradiction with some of the reorderings I mentioned above. I think the answer should be that these operations do not synchronize in any way. In particular, it is legal to re-order non-aliasing regular non-atomic memory accesses around them both ways (like it would be with std::io operations).

I agree, since one can use atomic::compiler_fence to prevent such reordering if needed.

RalfJung commented 2 years ago

I believe that compiler_fence applies to even non-atomic access.

I wish I knew of a formal model of things like compiler_fence, but formally speaking that operation is a NOP so it's more of an (non-)optimization hint than something that can be used in a principled way to avoid UB.

Even disregarding the concurrency discussion, this becomes interesting in some MMIO situations where one first does a bunch of regular writes to initialize a data structure, and then uses arch::store to "publish" a pointer to that data structure such that 'the other side' may read it. There's nothing in the description of arch::store that would prevent the compiler from reordering the initialization writes to after the 'publish' write, which is obviously a problem.

Possible solutions I can imagine:

All writes that might be read by 'the other side' must be done using arch::store. That seems the most principled to me, but I also assume devs won't like it. ;)
Tell people to use a fence -- but I don't know of a way to actually say, in the Abstract Machine, why that is correct.
Adjust the spec and model of arch::store so that it effectively has 'release' semantics. The issue here is specifying what exactly 'the other side' can observe. With asm! (with the right clobber annotations) it is effectively the case that the asm block may observe all Abstract Machine memory (though not all of its contents are actually specified). Is that what we want here as well? This sounds like the most convenient option and might have a chance of being formalized, though I am not sure how to even precisely state this in the docs. Also, on hardware where release requires a fence, is it a problem if all arch::store emit such a fence?

asm! even becomes a full 2-way barrier, but the situation there is different since the asm code is running in the same thread. With arch::store, another operation that is just after this one might be ordered up and concurrent code cannot necessarily tell since it might just be the case that it happened really quickly after the arch::store.

The entirely symmetric situation also arises with arch::load, of course, though there it seems more reasonable to me to ask programmers to use arch::load for all the reads, not just the first one that 'discovers' the pointer. So via symmetry, maybe the same approach for writes is not such a bad idea after all.

RalfJung commented 2 years ago

Personally, I would prefer a “if the values stored are different” caveat at the very least. But that is a totally different topic.

No, a race is a race even if the values stored are the same. There are optimizations that rely on this, though I forgot the details. (But indeed this would be again a memory model question, not one for this thread.)

Specifically, it is UB on the part of whoever is doing the non-atomic reads.

Yes. I was implicitly assuming those other non-atomic read/writes are inside the same Rust Abstract Machine. "non-atomic" only makes sense as a concept as long as we stay inside the Abstract Machine. And inside the Abstract Machine, it's always the entire machine that has UB, so there's not really a question of "which side" has UB.

comex commented 2 years ago

Tell people to use a fence -- but I don't know of a way to actually say, in the Abstract Machine, why that is correct.

Intuitively I'd like to say that the Abstract Machine and the CPU-architecture-level state (call it the Concrete Machine) are two different observers. Just as atomic fences put ordering constraints on when other Abstract Machine threads observe accesses by the current Abstract Machine thread, compiler fences put ordering constraints on when the current Concrete Machine thread observes accesses by the current Abstract Machine thread. Each asm! block and FFI call can be considered to have implicit compiler barriers before and after it. Does that seem logical to you?

asm! even becomes a full 2-way barrier, but the situation there is different since the asm code is running in the same thread. With arch::store, another operation that is just after this one might be ordered up and concurrent code cannot necessarily tell since it might just be the case that it happened really quickly after the arch::store.

I don't understand the concern here. Can you elaborate?

Regardless, I would say that we only need to be concerned with how memory writes reach the current Concrete Machine thread; how they propagate from there to other Concrete Machine threads or non-thread memory observers (like DMA devices) is a question for the architecture.

DemiMarie commented 2 years ago

I believe that compiler_fence applies to even non-atomic access.

I wish I knew of a formal model of things like compiler_fence, but formally speaking that operation is a NOP so it's more of an (non-)optimization hint than something that can be used in a principled way to avoid UB.

Either it is an admission that the Rust Abstract Machine is incomplete, or it is has some effect on the Abstract Machine, even if it is a NOP at runtime.

Even disregarding the concurrency discussion, this becomes interesting in some MMIO situations where one first does a bunch of regular writes to initialize a data structure, and then uses arch::store to "publish" a pointer to that data structure such that 'the other side' may read it. There's nothing in the description of arch::store that would prevent the compiler from reordering the initialization writes to after the 'publish' write, which is obviously a problem.

Possible solutions I can imagine:

All writes that might be read by 'the other side' must be done using arch::store. That seems the most principled to me, but I also assume devs won't like it. ;)

Yeah, I highly doubt that will fly with the kernel devs. It’s also way more restrictive than is actually needed, because different calls to arch::store cannot be reordered.

Tell people to use a fence -- but I don't know of a way to actually say, in the Abstract Machine, why that is correct.

This is the approach I would take, but I have no idea of how to specify it except in terms of the generated assembler code.

Adjust the spec and model of arch::store so that it effectively has 'release' semantics. The issue here is specifying what exactly 'the other side' can observe. With asm! (with the right clobber annotations) it is effectively the case that the asm block may observe all Abstract Machine memory (though not all of its contents are actually specified). Is that what we want here as well? This sounds like the most convenient option and might have a chance of being formalized, though I am not sure how to even precisely state this in the docs. Also, on hardware where release requires a fence, is it a problem if all arch::store emit such a fence?

This is a question for kernel and driver developers.

asm! even becomes a full 2-way barrier, but the situation there is different since the asm code is running in the same thread. With arch::store, another operation that is just after this one might be ordered up and concurrent code cannot necessarily tell since it might just be the case that it happened really quickly after the arch::store.

The entirely symmetric situation also arises with arch::load, of course, though there it seems more reasonable to me to ask programmers to use arch::load for all the reads, not just the first one that 'discovers' the pointer. So via symmetry, maybe the same approach for writes is not such a bad idea after all.

I strongly disagree, mostly because it isn’t what driver devs actually do.

Specifically, it is UB on the part of whoever is doing the non-atomic reads.

Yes. I was implicitly assuming those other non-atomic read/writes are inside the same Rust Abstract Machine. "non-atomic" only makes sense as a concept as long as we stay inside the Abstract Machine. And inside the Abstract Machine, it's always the entire machine that has UB, so there's not really a question of "which side" has UB.

That should be made explicit so that the security people (like myself) don’t get nervous.

comex commented 2 years ago

I literally can't make sense of it -- I think of release/acquire in terms of models like this, and "release load" just doesn't make sense as a concept. 'Release' means that the write event generated by this operation captures the 'view' of the current thread such that when that write event is later observed by an 'acquire' read event, the attached 'view' is integrated into the 'view' of the thread that did the 'acquire'. A load does not generate a write event, so saying it is 'release' is a category error.

The semantics required here are something like: The total modification order of each atomic object is extended to become a total access order, including both reads and writes. A read is guaranteed to be placed somewhere between the write whose value it sees and the following write. Then, one access "observes" another access to the same object if it comes after it in the total access order. (So if you have two accesses A and B to the same object, it's guaranteed that either A observes B or B observes A.)

Common architectures do provide guarantees like this.

(After re-reading this old classic): okay I think I see the problem, but that paper also discusses solutions -- and a "release load" is not one of them. ;) It also mentions a fence-based solution that should work in Rust as well?

I think this paper's "read-dont-modify-write" operation is essentially a hack to work around the fact that objects have a total write order but not a total access order.

There is no cheap way to close the gaps in our memory model, there is no shortcut around actually supporting whatever RCU and seqlocks need in our atomics.

Well… I think volatile/asm can be soundly used for this, but only if the same concept of reads being observables is applied to the interaction between the Abstract Machine and the Concrete Machine. Which is essentially begging the question.

digama0 commented 2 years ago

The semantics required here are something like: The total modification order of each atomic object is extended to become a total access order, including both reads and writes. A read is guaranteed to be placed somewhere between the write whose value it sees and the following write. Then, one access "observes" another access to the same object if it comes after it in the total access order. (So if you have two accesses A and B to the same object, it's guaranteed that either A observes B or B observes A.)

Common architectures do provide guarantees like this.

This isn't good for the seqlock example since the atomic object isn't observing the write. The scenario (with the names from before) is that R1 reads 0, R3 reads 0, W1 writes 1, W2 writes data1 and R2 reads data1. In reordering terms, we have moved R2 after R3, which is legal for the acquire load. With your rule, the access order on the atomic object seq is observed: we have R1, R3, W1, W3. The issue is that a read to a different, non-atomic location is inconsistently placed in this order (R1 -sb> R2 -sb> R3 but also W1 -sb> W2 -rf> R2) and the atomics axioms do not prevent this.

I think this paper's "read-dont-modify-write" operation is essentially a hack to work around the fact that objects have a total write order but not a total access order.

The reason the RMW operation works is because it is a release store operation which synchronizes with the load on the writer's side. This induces a synchronizes-with from R2 to R3 (which is now an RMW) to W1 (which is a CAS and hence a read). Normally two reads on an atomic don't synchronize at all, but if one is an RMW then the other can read the written value, and setting the RMW to a release makes the value of data on the reader thread available to the writer (and more importantly, ensures that the reader's value of data can't be in the future from the writer's perspective). Even if you order the reads on the atomic itself, that will not synchronize anything else, and it is actually work to do this synchronization which is why the RMW is an expensive solution.

comex commented 2 years ago

This isn't good for the seqlock example since the atomic object isn't observing the write. The scenario (with the names from before) is that R1 reads 0, R3 reads 0, W1 writes 1, W2 writes data1 and R2 reads data1. In reordering terms, we have moved R2 after R3, which is legal for the acquire load. With your rule, the access order on the atomic object seq is observed: we have R1, R3, W1, W3. The issue is that a read to a different, non-atomic location is inconsistently placed in this order (R1 -sb> R2 -sb> R3 but also W1 -sb> W2 -rf> R2) and the atomics axioms do not prevent this.

You have to extend the definition of synchronizes-with. In the C++ spec or its formalization in Vafeiadis and Narayan 2013, synchronizes-with (sw) is defined as occurring between a release store and an acquire load if the acquire load reads-from (rf) the release store (or its release sequence, which isn't relevant here). When the modification order becomes an access order, rf and mo merge into one, since a load reads-from a store if and only if the load's position in the access order comes between between that store and any following store. Then we can say that a release access synchronizes-with an acquire access if the acquire is access-ordered-before the release. Since this definition doesn't depend on whether either access is a load or store, it naturally extends to load-release and store-acquire.

In this case, we know that R3 is access-ordered-before W1 because the alternative, R3 being access-ordered-after W1, would require it to read-from W1, which it doesn't. Therefore, since R3 is a release and W1 is an acquire, R3 -sw> W1. Since R2 -sb> R3 and W1 -sb> W2, we have R3 -hb> W2, so by transitivity R2 -hb> W2, so it would violate coherency to have W2 -rf> R2.

The hard part, I think, is proving that this stronger definition of synchronizes-with doesn't create unexpected obligations in other situations. In particular, you might be worried about fences. Regarding those, I would say that the current fences would continue to only allow a store to synchronize-with a load, not vice versa. If you wanted a fence that would allow a load to synchronize-with a store, that would require some hypothetical new fence operation. Here I'm inspired by RISC-V, whose fence instruction lets you independently select which types of predecessor accesses the fence applies to, and which types of successor accesses it applies to – in each case, loads, stores or both. This is in contrast to traditional barriers which apply to both types of accesses on both sides. According to this, a C/Rust acquire fence can be modeled as a fence instruction with only loads as predecessors (but both loads and stores as successors), while a release fence can be modeled as a fence instruction with only stores as successors (but both loads and stores as predecessors). Makes sense. But if you could use fences to achieve load-release and store-acquire, this would no longer be justified.

digama0 commented 2 years ago

I think that this proposal would work, but it is essentially equivalent to the read-dont-modify-write approach. There is clearly message passing from the reader to the writer in this setup (your R3 -sw> W1), and that's the costly operation. It implies, for example, that readers also synchronize with each other, i.e. R3 -sw> R1', which linearizes the accesses and corresponds to bounding an exclusive cache line around and a loss of the unlimited read-side scaling property that is the whole point of seqlocks.

Load-load fences can be used to express the requirement here, but this is not part of the C11 atomic lexicon, so I don't know how well they coexist. I'm glad to hear that RISC-V has direct support for this.

This is also straying somewhat from the topic, perhaps this should be continued in another thread.

DemiMarie commented 2 years ago

This is also straying somewhat from the topic, perhaps this should be continued in another thread.

Agreed, anyone else want to open an issue?

comex commented 2 years ago

This is also straying somewhat from the topic, perhaps this should be continued in another thread.

Agreed. I created a new issue:

What about: seqlocks, load-release/store-acquire?

DemiMarie commented 2 years ago

@RalfJung @comex is this at a point where an actual RFC should be made?

RalfJung commented 2 years ago

@comex

Intuitively I'd like to say that the Abstract Machine and the CPU-architecture-level state (call it the Concrete Machine) are two different observers. Just as atomic fences put ordering constraints on when other Abstract Machine threads observe accesses by the current Abstract Machine thread, compiler fences put ordering constraints on when the current Concrete Machine thread observes accesses by the current Abstract Machine thread. Each asm! block and FFI call can be considered to have implicit compiler barriers before and after it. Does that seem logical to you?

Yeah that makes sense. However what this does not explain is why compiler fences would have any effect if there are no asm! blocks or FFI calls in a thread.

Maybe another way to think about it is that a compiler fence marks a 'sync point' where the compiler has to ensure that the Abstract Machine state and the concrete machine state of this thread are in sync. Like, usually the compiler is allowed to temporarily break the relation of abstract state to concrete state as long as 'nobody can tell' (so e.g. we can delay actually doing a write if nobody could have seen it yet), but at the fence the relation has to hold

This means these fences have to be some kind of event in the trace produced by the program. Or maybe it's better to think of this as something like an FFI call, like some kind of observer may run now? But that needs careful specification of which state the observer is allowed to observe. Rust permits some memory accesses to be reordered around arbitrary unknown function calls; this necessarily also includes fences. IOW, a compiler fence does not protect against noalias-based optimizations.

I don't understand the concern here. Can you elaborate?

Basically, we can order other (relaxed or non-atomic) stores up across an arch::store (if they don't alias). We can't do that with asm! blocks.

Well… I think volatile/asm can be soundly used for this, but only if the same concept of reads being observables is applied to the interaction between the Abstract Machine and the Concrete Machine. Which is essentially begging the question.

The Abstract Machine doesn't have this concept to begin with, that is the entire point. So it also can't be applied to its interactions with the Concrete Machine.

@DemiMarie

Yeah, I highly doubt that will fly with the kernel devs. It’s also way more restrictive than is actually needed, because different calls to arch::store cannot be reordered.

But regular stores can be reordered with arch::store, so it's still needed to account for that.

is this at a point where an actual RFC should be made?

I've given all my comments. I think there are some open questions but they can probably be carried over.

scottmcm commented 2 years ago

The functions proposed here have the same semantics as raw machine load and store instructions.

Does this meant that to use them I have to understand the raw machine semantics of whichever instruction this picked to perform the load/store? It seems odd to offer a "portable" function that might have drastically different side-effects on different platforms.

If I have to understand the raw instruction semantics anyway, wouldn't it be fine to just have me write asm!?

DemiMarie commented 2 years ago

The functions proposed here have the same semantics as raw machine load and store instructions.

Does this meant that to use them I have to understand the raw machine semantics of whichever instruction this picked to perform the load/store? It seems odd to offer a "portable" function that might have drastically different side-effects on different platforms.

These functions are intended for working with memory that doesn’t behave the way memory normally does in Rust. It might be shared with another program or be used for DMA, or it might even be a hardware register. Such uses are always OS-dependent and are often hardware-dependent too.

If I have to understand the raw instruction semantics anyway, wouldn't it be fine to just have me write asm!?

asm! is an option in many cases, especially for kernels which need it for other reasons. However, it suffers from two problems:

It only works on one architecture, whereas these functions can work on more than one (provided that you understand the hardware’s memory model).
asm! unnecessarily constrains compiler optimizations.

comex commented 2 years ago

@RalfJung

Yeah that makes sense. However what this does not explain is why compiler fences would have any effect if there are no asm! blocks or FFI calls in a thread.

Well, I was thinking of arch::load/arch::store as a special case of asm! since that's how this pre-RFC proposes to implement them. But admittedly that would make the compiler_fence question moot.

@DemiMarie, the full RFC should make clear whether the guarantees of these new functions are like asm!, where memory accesses generally can't be reordered across the asm! call if the memory in question is exposed to FFI-land, or like C (and LLVM) volatile, where they can.

I'm tempted to say: disallow the reordering, deprecate compiler_fence, and if we ever switch the implementation from asm! back to LLVM volatile, then have it also include implicit compiler fences. After all, this kind of reordering doesn't strike me as very useful; it feels more like a historical artifact, a result of C compilers applying the same rule to volatile accesses that they were applying to normal memory accesses.

But if we do want to allow reordering, I'd say that compiler_fence should only have a specified effect in conjunction with volatile accesses, similar to how atomic fences only have an effect in conjunction with atomic accesses.

rust-lang / unsafe-code-guidelines

## Pre-Pre-RFC: `core::arch::{load, store}` and stricter volatile semantics #321

core::arch::{load, store}

The semantics of volatile

Unaligned volatile access

New lints

Lowering

`core::arch::{load, store}`

The semantics of `volatile`