Interaction between `volatile` and `fence`

cbiffle commented 3 years ago

Hi! I'm not sure if this is the best place for this question, but it seems worth a shot.

I'm trying to express an ordering between two volatile writes from a single mutator, but the docs don't appear to address this and so I am wary. The corresponding C++ docs are still vague but less so.

Details: A program needs to perform two writes, each volatile -- perhaps they are memory-mapped I/O. The writes must happen (which is to say, complete) in order -- perhaps the first one turns on the physical device that the second one addresses. Is there something from core that I can slip into the middle in the example below to ensure this?

let ptr_a: *mut u32 = ...;
let ptr_b: *mut u32 = ...; // not equal to ptr_a

ptr_a.write_volatile(0xDEAD);
// insert appropriate barrier/fence here
ptr_b.write_volatile(0xBEEF);

Were I willing to be architecture-specific, I know the specific barrier instruction I'm after, and I could express it using inline asm. But it'd be lovely to use something portable. core::sync::atomic::fence -- probably with Release since it's a write-write situation -- was the first thing I reached for, but seeing as these are not atomic accesses per se, the docs on fence imply that it has no effect on their ordering. (Specifically, there are no mentions of volatile anywhere in the atomics docs.)

The C++ memory order documentation does discuss the relationship with volatile, but (1) I admit I don't entirely understand its single relevant sentence, and (2) the remaining sentences are trying to scare off people attempting to use volatile for inter-thread synchronization, which I am not. Plus, I'm not writing C++. :-)

Random people on the Internet keep asserting that fence is sufficient for all our memory-barrier needs, but this doesn't seem obvious to me from the docs. (I'm also more accustomed to the traditional terms around barriers than the atomic memory ordering terms, so this may reflect my own ignorance!)

Pragmatically,

From reading threads here and on the LLVM archives, it looks like LLVM currently preserves relative ordering of atomic and volatile accesses, but I am hesitant to either rely on compiler behavior that may be subject to change, or assume that my backend is LLVM.
A number of Orderings given to fence currently produce the instruction I want on my particular target, but that feels fragile, particularly since my target has fewer barrier varieties than, say, PowerPC, so it might be working by accident.

More detailed context: The system I'm working on is an ARM Cortex-M7 based SoC. The M7 has a fairly complex bus interface, and can issue and retire memory accesses out of order if they issue on different bus ports (which, in practice, means that they apply to different coarse-grained sections of physical address space). The architecture-specific thing to do here is to insert a dmb instruction (available in the cortex_m crate, if you are using it, as cortex_m::asm::dmb()). However, the driver in question is for a generic IP block (a Synopsys Ethernet MAC) that is not inherently ARM-specific, so it'd be great to express this portably.

As you have likely inferred, the goal is to wait for the completion of the first write, not its issuance in program order, and so compiler_fence is not useful here.

Any insight would be greatly appreciated!

Lokathor commented 3 years ago

Volatile accesses are naturally ordered relative to all other volatile accesses in the same thread (but not with atomic and/or standard accesses).

EDIT: oops you don't want to assume LLVM, my bad.

cbiffle commented 3 years ago

Volatile accesses are naturally ordered relative to all other volatile accesses in the same thread (but not with atomic and/or standard accesses).

That is only true of the order that the accesses are generated in the compiler's output (machine code). If the CPU is capable of issuing memory operations out of order, you need to make the operations dependent using a barrier if you want an ordering -- even on LLVM. This is why compiler_fence is not sufficient.

Lokathor commented 3 years ago

Correct. But if the CPU doesn't support out of order operations in the first place then you don't even need a compiler fence.

comex commented 3 years ago

Well, according to ARM documentation for the Cortex-M7, for memory which has been marked in the MPU as "Device" or "Strongly-ordered":

The processor preserves transaction order relative to other transactions to Device or Strongly-ordered memory.

So you shouldn't need a DMB instruction. You also don't need a compiler fence, because the compiler is guaranteed not to reorder volatile accesses with respect to each other. (There isn't necessarily a good place where that guarantee is specified, yet, but it comes from some combination of the C spec and GCC's documentation on volatile.)

That said, the compiler can reorder volatile accesses with respect to non-volatile accesses, and the CPU can reorder Device/Strongly-ordered accesses with respect to normal accesses. Therefore, if the device in question is going to DMA to or from a memory buffer that you access from the CPU using non-volatile reads/writes, you will need some kind of fence.

But I wouldn't recommend atomic fences. In general, combining volatile with atomic fences is not guaranteed to work, because volatile's semantics are defined at the assembly level, whereas atomics are a pure Abstract Machine concept and implementations are allowed to implement them in any crazy way they want. compiler_fence might be an exception, but I wouldn't count on it.

Besides, depending on the target system, a 'standard' barrier instruction may not be enough to synchronize with DMA peripherals anyway. I say system, not architecture, as the type of synchronization required can vary even between e.g. different ARM processors. Thus it's not really something the compiler or standard library can be expected to know about.

As such, in any situation where you do need a DMB, I would recommend sticking with dmb() and adding support for other architectures/systems as necessary.

cbiffle commented 3 years ago

Well, according to ARM documentation for the Cortex-M7, for memory which has been marked in the MPU as "Device" or "Strongly-ordered":

The processor preserves transaction order relative to other transactions to Device or Strongly-ordered memory.

I'm still interested in an answer to the Rust part of this question independent of how the M7 behaves, but:

I don't have a citation for the behavior, but in the presence of store buffers and fancy AXI buses, the M7 will definitely issue Device writes to different buses out of order. The easiest way to reproduce this typically goes something like:

Write one register to ungate clock to a separate peripheral. This is often on some sort of "system control" block which, on ARM micros, is often on a relatively slow APB bus.
Turn around and immediately attempt to write a register on the newly ungated peripheral.

If the ungated peripheral is on a significantly faster bus (say, directly attached to AXI on an M7), and the two writes compile to two str instructions back-to-back (typical for release builds if the code is simple), the writes will be reordered and the operation will fail -- reliably, on the machines I've tested.

So, about fences:

But I wouldn't recommend atomic fences. In general, combining volatile with atomic fences is not guaranteed to work

Good to hear -- that was basically my interpretation of the docs, but it's left implicit. So under that interpretation, Rust doesn't currently have a portable write-write barrier operation that is useful for volatile operations, right?

We can certainly define an abstract one analogous to the one in the Linux kernel, and implement it for our architectures of interest, if required.

Besides, depending on the target system, a 'standard' barrier instruction may not be enough to synchronize with DMA peripherals anyway.

So, I consciously reduced this question to something that didn't involve DMA, to avoid its additional complexity. :-) I agree that none of the barriers we've discussed here are sufficient to do coherent DMA on an M7.

comex commented 3 years ago

I wonder if the behavior you're talking about is related to the difference between Device and Strongly-ordered. Apparently, Strongly-ordered forces the processor to wait until one access is complete before starting to issue another, whereas with Device memory, it just has to issue them in the right order – which is probably meaningless for accesses to two different devices, that may take differing amounts of time. Anyway, I don't have experience with it myself, and I suppose it's off topic.

gThorondorsen commented 3 years ago

I am far from being an expert but I would expect std::sync::atomic::fence(SeqCst) to suffice (and probably some weaker orderings too, at least AcqRel). (EDIT: Wrong. Thanks to @RalfJung for reminding me that fence only works on regular memory.)

My reasoning is that atomic orderings restrict the movement of even plain (non-atomic, non-volatile) memory accesses. Otherwise locks would not be implementable using only atomic operations. And I think of volatile accesses as being stronger than plain accesses even though they don't provide any inter-thread properties because the compiler is allowed to duplicate or remove plain accesses but not volatile ones. So volatile accesses are subject to at least as much movement restriction as plain accesses.

I would also expect such a fence to behave like @comex's Strongly-ordered, with a wording like "all selected events before the fence are fully visible to the selected events after the fence", but their comment makes me doubt it.

RalfJung commented 3 years ago

The best that Rust can guarantee is that the two volatile operations will be put into that order in the final assembly. This guarantee is already made by virtue of them being volatile, without any extra fences.

An atomic fence is not guaranteed to compile to any particular instruction, so I do not see how it can help here. In particular, atomic instructions are all about synchronizing CPU cores with each other; they say nothing and are (to my knowledge) pointless when it comes to synchronization with peripherals.

If you need a particular assembly instruction to be put between these two writes for low-level hardware reasons, I don't think there is a way to express this in a platform-independent way, I am afraid.

If the ungated peripheral is on a significantly faster bus (say, directly attached to AXI on an M7), and the two writes compile to two str instructions back-to-back (typical for release builds if the code is simple), the writes will be reordered and the operation will fail -- reliably, on the machines I've tested.

I wouldn't call this "reordered", this sounds more like a funny kind of race condition caused by the signals literally racing over buses that have different speed. If you need to perform the first operation, then wait, then perform the second operation -- you need to find a way to express this "now please wait until X is done". Fences have nothing to do with waiting for anything (at least on the level of C++/Rust concurrency primitives), they are only concerned with inducing ordering constraints, so they seem like the wrong primitive here.

If this would be all in terms of concurrent threads interacting, I'd say the bug arises from the fact that you sent a signal to two threads (almost) at the same time, and of course the second thread might "wake up" first and process the signal first even if it was sent later. No amount of fences can fix that.

IOW, if we view this as message-passing (which seems appropriate given your talking about buses), what you did is:

fn foo(ptr_a: Sender<i32>, ptr_b: Sender<i32>) {
  ptr_a.send(0xDEAD);
  ptr_b.send(0xBEEF);
}

Fences can only ensure that the messages are sent in the order given in the code, but they cannot ensure that they are processed in that order. For this you'll need some heavier mechanism -- something that actually tells you that the first message has been received and processed. I have no idea what such a mechanism might be in your case.

Do you know which assembly instruction you need to put between the two str to make things work, for the example that you gave?

cbiffle commented 3 years ago

If you need a particular assembly instruction to be put between these two writes for low-level hardware reasons, I don't think there is a way to express this in a platform-independent way, I am afraid.

There certainly doesn't appear to be a platform-independent way of doing it in Rust at the moment. The closest I've seen to an attempt at portable memory barriers is what they've done in the Linux kernel, but I haven't thought about whether their techniques make sense in a nonprivileged context -- so they may be totally irrelevant for a systems language runtime, and best left to kernels themselves.

I wouldn't call this "reordered", this sounds more like a funny kind of race condition caused by the signals literally racing over buses that have different speed.

Technically, in the two-stores case I'm describing, they are not issue reordered by the processor, but are allowed to complete out of order -- which was not true on, say, the M4. I chose this as the simplest case; if one were a load, they would be dual-issued and we'd have issue-level reordering. In case you're curious.

Fences can only ensure that the messages are sent in the order given in the code, but they cannot ensure that they are processed in that order. For this you'll need some heavier mechanism -- something that actually tells you that the first message has been received and processed. I have no idea what such a mechanism might be in your case.

Fortunately in my case knowing that the store instruction has completed (a traditional write-write barrier) is sufficient, I don't need an ACK -- which would be very hard to achieve portably. :-)

Do you know which assembly instruction you need to put between the two str to make things work, for the example that you gave?

I do! In case anyone's following along and wants to solve this problem on an M-class ARM processor, the conservative and halfway portable option is dmb sy (which will force an ordering between all memory accesses issued system-wide before this point, and all issued after). If you know which sharability domain you're trying to affect, you could try to use a more surgical barrier such as dmb osht (for ordering only stores on the outer sharability domain), but at that point we're using system-specific knowledge.

We are probably going to write some wrapper functions analogous to Linux's and implement them for our architectures of interest. In the long term, if a useful set of portable barriers can even be defined (which is not totally clear to me), they'd make an interesting crate and subsequent suggestion for inclusion in core, IMO.

cbiffle commented 3 years ago

FYI - Based on feedback from y'all I have attempted to improve the misleading Stack Overflow answer that brought me here.

mohtasham9 commented 3 years ago

Rust's std::sync::atomic::fence provides an atomic fence operation, which provides synchronization between other atomic fences and atomic memory operations. The terms folks use for describing the various atomic conditions can be a little daunting at first, but they are pretty well defined in the docs, though at the time of this writing there are some omissions.

rust-lang / unsafe-code-guidelines

Interaction between `volatile` and `fence` #260