rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do
https://rust-lang.github.io/unsafe-code-guidelines
Apache License 2.0

Does the concept of a compiler fence make any sense? #347

Open RalfJung opened 2 years ago

RalfJung commented 2 years ago

Atomic fences are specified in terms of things that happen when certain atomic reads occur:

In that particular situation, if the load reads-from the store, then the fences kick in and have an effect. That is the only effect they have, I think.

So, if your program contains no atomic accesses, but some atomic fences, those fences do nothing. We also think that an atomic fence has at least all the effects of a compiler fence, i.e., a compiler fence is strictly weaker than an atomic fence. But that means a compiler fence has no effect on programs without atomic accesses -- which is just wrong, that's not how they are supposed to behave.
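The situation described above can be written out as a minimal sketch (names are mine, not from the spec): the fences only do anything if the relaxed load of the flag reads-from the relaxed store.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicUsize, Ordering};

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

// Producer side: the release fence pairs with the acquire fence below.
fn publish() {
    DATA.store(42, Ordering::Relaxed);
    fence(Ordering::Release);
    READY.store(true, Ordering::Relaxed);
}

// Consumer side: only if the relaxed load reads-from the relaxed store do
// the two fences synchronize, making the write to DATA visible here.
fn consume() -> Option<usize> {
    if READY.load(Ordering::Relaxed) {
        fence(Ordering::Acquire);
        Some(DATA.load(Ordering::Relaxed))
    } else {
        None
    }
}
```

Delete the two atomic accesses and the fences have, per this reading, no remaining effect.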

So what is the operational spec of a compiler fence? I have no idea. See the discussion here for some more details. Let's go on discussing here.

Diggsey commented 2 years ago

I think atomic fences have additional guarantees that cannot be specified in terms of the abstract machine. Compiler fences provide a subset of those additional guarantees.

These guarantees would probably have to be specified in the same way that FFI or inline assembly is specified - as a relation between the abstract machine state and the underlying lower level state of the machine.

If you look at the uses for a compiler fence: https://stackoverflow.com/a/18454971

Both uses boil down to a single thread of execution being interrupted (either by an actual interrupt, or by a context switch) to run some other code, and the programmer wants to synchronize access to memory between the interrupted code and the interrupting code. This cannot be specified at an abstract machine level, because the abstract machine doesn't have a concept of a thread being interrupted in this way.

Going back to "atomic fences have additional guarantees that cannot be specified in terms of the abstract machine":

I think this is evidenced by the fact that we would expect atomic access from Rust code to be able to synchronize with atomic accesses from another language (or indeed assembly) in the same program. Therefore there must be "more to" atomic accesses than just their semantics within the abstract machine.

RalfJung commented 2 years ago

So you are going in the direction of what was discussed around here.

These guarantees would probably have to be specified in the same way that FFI or inline assembly is specified - as a relation between the abstract machine state and the underlying lower level state of the machine.

That would block a lot more optimizations that you would want when writing concurrent software with atomics. In particular we want there to be a meaningful difference between atomic fences with different orderings, which I do not think is the case under your proposal. So I don't think atomic fences should have any effect like that.

So maybe the statement that an atomic fence is strictly stronger than a compiler fence is wrong?

I think this is evidenced by the fact that we would expect atomic access from Rust code to be able to synchronize with atomic accesses from another language (or indeed assembly) in the same program.

I don't agree. Atomics are a feature inside the language and its memory model, so one should only expect them to synchronize inside that memory model. So whatever the other code does must be cast in terms of that memory model.

C++ atomics are not some abstract way of talking about what the hardware does, they are their completely own thing that is then implemented in terms of what the hardware does. There are sometimes even multiple different mutually incompatible schemes to implement the same model on a given hardware!

@DemiMarie (in the old thread)

Are there semantics for interrupts at all? That is a prerequisite for having semantics for compiler fences.

I guess part of the question here is whether interrupts exist as a language feature (that we'd have to add to the AM) and compiler fences interact with that, or whether we consider them to be outside the AM and compiler-fence to be a way to somehow coordinate with out-of-AM activities.

Atomic operations are not an out-of-AM mechanism. This is one of the many key differences between atomic accesses and volatile accesses. (E.g., the compiler is allowed to reorder and remove atomic accesses when that preserves the semantics of the abstract memory model.) So it would be very strange to me if atomic fences were out-of-AM mechanisms, and indeed that would put a lot of extra cost on concurrent algorithms that are perfectly happy staying entirely inside the AM.

DemiMarie commented 2 years ago

I think this is evidenced by the fact that we would expect atomic access from Rust code to be able to synchronize with atomic accesses from another language (or indeed assembly) in the same program.

I don't agree. Atomics are a feature inside the language and its memory model, so one should only expect them to synchronize inside that memory model. So whatever the other code does must be cast in terms of that memory model.

Cross-FFI atomics need to work. The question is how to specify atomics in such a way that they do.

Diggsey commented 2 years ago

In particular we want there to be a meaningful difference between atomic fences with different orderings, which I do not think is the case under your proposal. So I don't think atomic fences should have any effect like that.

That seems like a bit of a leap? I assume you are referring to:

There are sometimes even multiple different mutually incompatible schemes to implement the same model on a given hardware!

I would expect atomics to be like part of the ABI - ie. the compiler is not expected to make any kind of atomic accesses from FFI code work correctly, but as long as the FFI code follows the same rules (for that ordering) as the Rust compiler does, then they would be expected to synchronize. I don't see why this would make orderings redundant.

Diggsey commented 2 years ago

I don't agree. Atomics are a feature inside the language and its memory model, so one should only expect them to synchronize inside that memory model. So whatever the other code does must be cast in terms of that memory model.

It's my understanding that atomics are the expected way to do IPC via shared memory. I don't see how to reconcile your statement with that. (This is certainly a use-case supported by C++'s atomics)

RalfJung commented 2 years ago

It's my understanding that atomics are the expected way to do IPC via shared memory. I don't see how to reconcile your statement with that. (This is certainly a use-case supported by C++'s atomics)

For shared memory with other code that follows the same memory model. Basically, with other instances of (roughly) the same AM.

Lokathor commented 2 years ago

I'm unclear: are you saying that Rust atomics can sync with C++ atomics, or not?

RalfJung commented 2 years ago

That seems like a bit of a leap? I assume you are referring to:

No that's not what I am referring to. I'm saying atomic::fence has this ordering parameter and we agree it is relevant for the semantics, right? Because under your spec I don't see how it is.

I'm unclear: are you saying that Rust atomics can sync with C++ atomics, or not?

We are using the C++ memory model so that works fine.

But syncing e.g. with assembly code only makes sense if whatever that assembly code does can be expressed in terms of this memory model. You can't expect an atomic fence to have any effect other than what it says in the memory model.

RalfJung commented 2 years ago

I guess part of the question here is whether interrupts exist as a language feature (that we'd have to add to the AM) and compiler fences interact with that, or whether we consider them to be outside the AM and compiler-fence to be a way to somehow coordinate with out-of-AM activities.

To be clear, I think saying that compiler_fence is an out-of-AM syncing mechanism is not a bad idea. However, that would make compiler_fence incomparable in strength with atomic fences (and having atomic::compiler_fence just makes no sense).

That is actually consistent with what @comex wrote

But if we do want to allow reordering, I'd say that compiler_fence should only have a specified effect in conjunction with volatile accesses, similar to how atomic fences only have an effect in conjunction with atomic accesses.

Volatile accesses and atomic accesses are also incomparable in strength.

However, I just realized compiler_fence has an Ordering, and I have no idea what that is supposed to mean then...

DemiMarie commented 2 years ago

I'm unclear: are you saying that Rust atomics can sync with C++ atomics, or not?

We are using the C++ memory model so that works fine.

I think we should seriously consider looking at the Linux kernel memory model as well. This has the advantage that it is known to work for low-level programming at scale, whereas I am not aware of the C++ memory model being used for such code. Rust will need to integrate with the Linux kernel memory model to be used as part of the Linux kernel anyway.

Diggsey commented 2 years ago

I'm saying atomic::fence has this ordering parameter and we agree it is relevant for the semantics, right? Because under your spec I don't see how it is.

Oh I see - because the memory ordering is part of the C++/Rust memory model.

I guess another option would be to add the concept of "interruption" to the abstract machine, and then define compiler fences in terms of that. I think it's possible to define everything about "interruption" except for what triggers it - that's the only part that's "outside" what we can define.

RalfJung commented 2 years ago

I think we should seriously consider looking at the Linux kernel memory model as well. This has the advantage that it is known to work for low-level programming at scale, whereas I am not aware of the C++ memory model being used for such code. Rust will need to integrate with the Linux kernel memory model to be used as part of the Linux kernel anyway.

I have opinions about that, but if you want to propose/discuss this, please make it a new issue. This one is about fences. :)

m-ou-se commented 2 years ago

But that means a compiler fence has no effect on programs without atomic accesses -- which is just wrong, that's not how they are supposed to behave.

Why is that wrong?

Single-threaded / signal / compiler fences are relevant for signal handlers, interrupts, and things like Linux' SYS_membarrier.

Lokathor commented 2 years ago

What's wrong is the "that means a compiler fence has no effect..." part, because they're supposed to have some sort of effect.

We all seem to agree they should do something.

m-ou-se commented 2 years ago

They do have an effect, but not in a program without any atomic operations.

Lokathor commented 2 years ago

Ah, then I was incorrect and we seem to disagree.

Many people (myself included) have been under the impression that a compiler_fence has an effect even in a program with no atomic operations (particularly, even for targets where there are no atomics). If that's not the case we really need to fix the docs.

m-ou-se commented 2 years ago

The C++ standard defines atomic_signal_fence (aka our atomic::compiler_fence) as:

Equivalent to atomic_thread_fence(order)¹, except that the resulting ordering constraints are established only between a thread and a signal handler executed in the same thread.

(¹ Aka our atomic::fence().)

And adds:

[Note 1: atomic_signal_fence can be used to specify the order in which actions performed by the thread become visible to the signal handler. Compiler optimizations and reorderings of loads and stores are inhibited in the same way as with atomic_thread_fence, but the hardware fence instructions that atomic_thread_fence would have inserted are not emitted. — end note]

Other types of interrupts or things like SYS_membarrier aren't part of the C++ standard, but are also use cases for a single-threaded/compiler/signal fence.

A more practical/wider applicable way of defining atomic::compiler_fence would be to say it is identical to atomic::fence, except it doesn't work across hardware threads/cores. That means nothing in the abstract machine of course. So it's just a regular atomic fence, but with an additional assumption/requirement from outside the abstract machine. If the abstract machine defines signal handlers or interrupts, it could possibly mention something about those situations specifically.
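Under that reading, a sketch of the intended use (hypothetical names; note that per the C++ definition the flag must be atomic, so the accesses here are relaxed atomics) might look like this:

```rust
use std::sync::atomic::{compiler_fence, AtomicBool, AtomicU32, Ordering};

static DATA: AtomicU32 = AtomicU32::new(0);
static FLAG: AtomicBool = AtomicBool::new(false);

// Main code: prepare data, then raise the flag. The compiler fence keeps
// the relaxed stores ordered without emitting a hardware fence instruction,
// which is enough if the other side runs on the same core.
fn prepare() {
    DATA.store(7, Ordering::Relaxed);
    compiler_fence(Ordering::Release);
    FLAG.store(true, Ordering::Relaxed);
}

// Imagine this runs inside a signal handler interrupting the same thread.
fn handler() -> Option<u32> {
    if FLAG.load(Ordering::Relaxed) {
        compiler_fence(Ordering::Acquire);
        Some(DATA.load(Ordering::Relaxed))
    } else {
        None
    }
}
```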

Lokathor commented 2 years ago

Well, I've never programmed C++, and so honestly I only vaguely follow what that's saying.

I'm unclear on what the implications are for targets without atomics, where we might assume that there's only one core and possibly an interrupt handler.

m-ou-se commented 2 years ago

When you say "without atomics", do you mean without atomic read-modify-write/cas operations, or even without load and store?

If a platform has no atomic load or store for any size, then there's no way to communicate between a signal handler/interrupt and the main code at all through shared memory. Then there's indeed no use for a fence.

Lokathor commented 2 years ago

I mean for example ARMv4T, which can uninterruptedly read-write "swap" with swp (u32) and swpb (u8), but there's no atomic-compare-and-swap, and no atomic-read-modify-write.

And it does have interrupts.

But you're telling me there's no official/legal way for the interrupt to communicate anything to the main program? That seems... unfortunate.

m-ou-se commented 2 years ago

You don't need compare-and-swap or atomic-read-modify-write. Just a simple atomic store to set a flag and a load to check it later is enough.
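As a sketch of that (my names, not from the thread): the interrupt handler does one plain atomic store, and the main loop does one plain atomic load — no compare-and-swap or read-modify-write in sight.

```rust
use core::sync::atomic::{AtomicBool, Ordering};

static TICKED: AtomicBool = AtomicBool::new(false);

// Called from the interrupt handler: a single atomic store.
fn on_interrupt() {
    TICKED.store(true, Ordering::Relaxed);
}

// Polled from the main loop: a single atomic load.
fn saw_tick() -> bool {
    TICKED.load(Ordering::Relaxed)
}
```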

Lokathor commented 2 years ago

Well, there's normal load and store instructions, but not atomic anything instructions.

RalfJung commented 2 years ago

Aren't these loads and stores atomic, if they are "uninterruptable" and there is no multi-core version of this hardware?

m-ou-se commented 2 years ago

Well, there's normal load and store instructions, but not atomic anything instructions.

That's the same. All load and store instructions on nearly all architectures are also atomic.

Aren't these loads and stores atomic, if they are "uninterruptable" and there is no multi-core version of this hardware?

Indeed.

Lokathor commented 2 years ago

Oh, well then sure, there's atomic loading and storing.

chorman0773 commented 2 years ago

BTW, I interpret core::sync::atomic::compiler_fence as being Rust's version of atomic_signal_fence - that is, a fence that can only synchronize with atomic ops and fences on the same thread of execution as the calling code (including from a signal handler on that thread).

Reading it this way makes quite a bit of sense to me at least - it just modifies how the fence-atomic, atomic-fence, and fence-fence synchronization rules function.

Lokathor commented 2 years ago

Query: How does one enforce ordering between volatile and non-volatile ops.

Previously, I've also seen people assume that compiler_fence can do this.

But, again, apparently compiler_fence does nothing at all for volatile/plain ops ordering?

chorman0773 commented 2 years ago

Query: How does one enforce ordering between volatile and non-volatile ops.

You can't*. If the volatile op was also atomic, though, then you could get the same synchronization behaviour in a signal/interrupt handler as a full fence would.

And yes, rust needs volatile atomic operations.
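In the absence of volatile atomics, a workaround sometimes seen in embedded code is to reinterpret the location as an atomic type. Whether this carries volatile's cannot-be-elided guarantee is exactly what's unsettled in this thread (the compiler may remove or reorder atomic accesses), so this is an assumption, not established semantics:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Stand-in for a "volatile atomic" read of a shared/MMIO word: cast the
// pointer to AtomicU32 (same in-memory representation as u32) and do an
// atomic load. Correctness for MMIO is NOT guaranteed by any spec.
unsafe fn read_shared_word(addr: *const u32) -> u32 {
    (*(addr as *const AtomicU32)).load(Ordering::Acquire)
}
```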

repnop commented 2 years ago

How does one enforce ordering between volatile and non-volatile ops.

this is what I've been using compiler_fence for: I need some way to prevent the compiler from reordering non-volatile and volatile operations to write driver code, and the docs for compiler_fence don't mention anything about atomics until the example. So I was under the impression the reason for it being in core::sync::atomic was because it uses the Ordering enum from there. Looking at the source, it indeed calls atomic fence intrinsics:

#[inline]
#[stable(feature = "compiler_fences", since = "1.21.0")]
#[rustc_diagnostic_item = "compiler_fence"]
pub fn compiler_fence(order: Ordering) {
    // SAFETY: using an atomic fence is safe.
    unsafe {
        match order {
            Acquire => intrinsics::atomic_singlethreadfence_acq(),
            Release => intrinsics::atomic_singlethreadfence_rel(),
            AcqRel => intrinsics::atomic_singlethreadfence_acqrel(),
            SeqCst => intrinsics::atomic_singlethreadfence(),
            Relaxed => panic!("there is no such thing as a relaxed compiler fence"),
        }
    }
}

which makes me think that the docs really need to be improved to mention it only applies to atomic accesses, if that's the only guarantee it makes. It doesn't help that it links to the Linux memory barrier docs, which discuss much more than just atomic orderings. IMO there should be some other way to prevent non-volatile & volatile reordering, as it's necessary to be able to guarantee that for things like driver code. The Linux kernel has functions specifically for accessing MMIO with barriers to prevent reordering in code, but do we make any such guarantee about asm! preventing function calls from being reordered inside the caller?

chorman0773 commented 2 years ago

Maybe it should straight up link to atomic_signal_fence on cppreference if that is indeed what it is.

RalfJung commented 2 years ago

How does one enforce ordering between volatile and non-volatile ops.

Indeed I thought this was one of the motivations for compiler-fence: so that you can do a bunch of regular writes, then a fence, and then a single volatile write to "publish" something via MMIO to some other DMA participant.

So I think we basically have 3 choices here:

  1. Provide another kind of fence for that. Then we'll still have to figure out how to spec that fence though, and it'll probably have to be something like this. Or maybe the fence can just be implemented as an empty asm! block (with read-only memory clobber, if that's a thing), since it looks like that's what that spec boils down to.
  2. Allow using either a compiler-fence or a full atomic fence for that. This however means using atomic fences for atomics imposes volatile-related restrictions that that code does not need.
  3. Allow using compiler-fence for that but not atomic fences, which means it is no longer always correct to replace a compiler-fence by an atomic fence.

Looks like C/C++ use (1), but don't provide an official way to write that other fence? I am opposed to (2) since impacting purely atomic code like that seems bad.
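For concreteness, the pattern all three options are trying to support might look like this (hypothetical buffer/doorbell pointers; whether compiler_fence is *guaranteed* to order the non-volatile writes before the volatile one is exactly the open question, though today it does prevent the undesired reordering in practice):

```rust
use std::sync::atomic::{compiler_fence, Ordering};

// Fill a DMA buffer with ordinary writes, then "ring the doorbell" with a
// single volatile write so the device starts reading the buffer.
unsafe fn publish(buf: *mut u8, len: usize, doorbell: *mut u32) {
    for i in 0..len {
        buf.add(i).write(i as u8); // ordinary, non-volatile writes
    }
    compiler_fence(Ordering::Release); // options (2)/(3): keep the writes above
    doorbell.write_volatile(1); // volatile MMIO write
}
```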

Lokathor commented 2 years ago

read-only asm blocks are indeed a thing.
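A sketch of such a barrier (an assumption about how the options compose, not a vetted fence): an empty asm! block without nomem must be assumed to access memory, and readonly narrows that to "may read any memory", so prior stores cannot be sunk past it. Note the caveat raised elsewhere in this thread: aliasing assumptions such as &mut uniqueness can still license reorderings that no such block prevents.

```rust
use std::arch::asm;

// Empty asm! block acting as a compiler-level "read" barrier: the compiler
// must assume it may read any memory (readonly rules out writes), so stores
// before the call cannot be moved after it.
#[inline]
fn read_barrier() {
    unsafe { asm!("", options(readonly, nostack, preserves_flags)) };
}
```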

comex commented 2 years ago

Indeed I thought this was one of the motivations for compiler-fence: so that you can do a bunch of regular writes, then a fence, and then a single volatile write to "publish" something via MMIO to some other DMA participant.

So I think we basically have 3 choices here:

  1. Provide another kind of fence for that.

Hold on. For the "signal handler" use case, the goal is still to provide ordering between non-volatile accesses and a volatile access. So it's not so different from the MMIO use case.

This answer linked earlier suggests that a compiler fence could be used "by itself" to synchronize access to variables, without mentioning volatile, but if the intent is really that none of the variables in the example are volatile, it's wrong. You can't use a non-atomic non-volatile variable to synchronize between a thread and its signal handler, even with a compiler fence, for the same reasons you can't use one to synchronize between two threads even with an atomic fence. The problematic optimizations are the same in both cases.

chorman0773 commented 2 years ago

You can't use a non-atomic non-volatile variable to synchronize between a thread and its signal handler, even with a compiler fence, for the same reasons you can't use one to synchronize between two threads even with an atomic fence

Incidentally, you can't use just any volatile variable to communicate between a thread and a signal handler either. Specifically sig_atomic_t, which is some integer type, can be modified by a signal handler w/o being indeterminate at exit. And you still can't synchronize anything w/o using atomics (lock-free atomics specifically, but that's moot in Rust).

RalfJung commented 2 years ago

Hold on. For the "signal handler" use case, the goal is still to provide ordering between non-volatile accesses and a volatile access. So it's not so different from the MMIO use case.

AFAIK, for the signal handler, you are supposed to use atomic accesses in C++, if you plan to use compiler_fence.

My interpretation of C++ is that using compiler_fence + volatile to sync with a signal handler is UB, just like it is UB to use regular fence + volatile to sync with other threads.

I think it's quite different from MMIO since the other side you are syncing with is still in some sense "a thread within the language" (just a thread that definitely runs on the same physical core).

This answer linked earlier suggests that a compiler fence could be used "by itself" to synchronize access to variables,

I think that answer is wrong; is_shared_data_initialized needs to use relaxed atomic accesses.

If the compiler_fence was replaced by an atomic fence, it would definitely be wrong -- we agree on that, right? The same optimizations that break for the atomic case, can also still be done with a compiler_fence, so that version is just as wrong.

RalfJung commented 2 years ago

If the compiler_fence was replaced by an atomic fence, it would definitely be wrong -- we agree on that, right? The same optimizations that break for the atomic case, can also still be done with a compiler_fence, so that version is just as wrong.

Specifically, to my knowledge, a release fence followed by a non-atomic write can be reordered to do the write first. That optimization would break the example given in the answer.
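A sketch of the corrected pattern, with hypothetical names modeled on the linked answer: the flag must be a relaxed *atomic* store for the release fence to order anything. With a plain non-atomic bool, the fence imposes no ordering on that store, so the compiler may move it above the fence.

```rust
use std::sync::atomic::{compiler_fence, AtomicBool, Ordering};

static IS_SHARED_DATA_INITIALIZED: AtomicBool = AtomicBool::new(false);
static mut SHARED_DATA: u32 = 0;

fn init_shared_data() {
    // The non-atomic write is fine: the release fence orders it before the
    // atomic flag store, provided the reader uses an acquire fence after
    // reading the flag.
    unsafe { SHARED_DATA = 123 };
    compiler_fence(Ordering::Release);
    IS_SHARED_DATA_INITIALIZED.store(true, Ordering::Relaxed);
}

fn is_initialized() -> bool {
    IS_SHARED_DATA_INITIALIZED.load(Ordering::Relaxed)
}
```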

workingjubilee commented 2 years ago

Well, there's normal load and store instructions, but not atomic anything instructions.

That's the same. All load and store instructions on nearly all architectures are also atomic.

Aren't these loads and stores atomic, if they are "uninterruptable" and there is no multi-core version of this hardware?

Indeed.

Reaffirming but also adding a note:

DemiMarie commented 2 years ago

People will use write-barriers for things other than MMIO. Xen’s libvchan uses them for communication between virtual machines, for example. I can also state that driver writers will use ordinary (non-volatile, non-atomic) loads and stores plus memory barriers, instead of using volatile and/or atomic operations for every single access to shared memory.

RalfJung commented 2 years ago

Driver writers are not using the C++ memory model (assuming you are talking about Linux) so that's an apples-to-oranges comparison.

But anyway MMIO is just an example for "things outside the AM" -- "things you'd need volatile for".

Lokathor commented 2 years ago

So in current Rust, is there any way to sequence volatile and regular accesses? Does an empty asm block actually do that?

Or does the (1) option Ralf mentioned of "invent a new fence" have to actually happen for volatile/normal sequencing to even be possible?

repnop commented 2 years ago

empty `asm!("");`s definitely do not function as a complete fence: optimizations are allowed to use the lack of clobbers as a sign that it doesn't touch the AM state. It does heavily pessimize optimizations in some things I've thrown it at, but, for example, aliasing analysis with `&mut` can still cause reordering of code because the compiler knows there can't be any aliasing pointers.

workingjubilee commented 2 years ago

It is my opinion that relying on any properties of volatile beyond the "cache-defeating" property, the property for which it was initially introduced 30 years ago in C89, is unwise. The cache-defeating property has implications for ordering but that doesn't really constitute the same thing on its own. Thus @chorman0773's opinion that we may need atomic_volatile is plausible.

At the moment, LLVM's LangRef only defines two fences, fence (the atomic kind) and llvm.arithmetic.fence. I do not believe LLVM has a model of such sync that is not implicitly about atomic orderings, as a result.

Since our compiler fence appears to be atomic_signal_fence, which syncs normal and atomic accesses, then combined with the cache-defeating property I believe LLVM must nonetheless guarantee ordering for certain combinations of volatile and atomic accesses using compiler_fence. I also believe certain sequences of volatile reads and writes are naturally forced to sequence with "normal" reads and writes, but I say that with low confidence.

Lokathor commented 2 years ago

We can confidently say, for what it's worth, that all volatile ops are definitely kept in order relative to other volatile ops.

And as you say, Jubilee, normal data dependencies between volatile and standard accesses still apply: if you do a standard write of data that came from a volatile read, the standard write cannot happen before the volatile read, simply because the data doesn't exist before the read.

chorman0773 commented 2 years ago

FWIW, XIR has a fence and sequence instruction, comparable to atomic_thread_fence and atomic_signal_fence respectively, but that allows any XIR access-class (including merely volatile). The current semantics of a fence volatile or sequence volatile are not entirely worked out, but at a minimum, it will be treated as a side effect and forbid other volatile operations from being reordered relative to it. I think it could easily be defined to forbid the reordering of non-volatile accesses around them as well.

comex commented 2 years ago

empty asm!("");s definitely do not function as a complete fence, optimizations are allowed to use the lack of clobbers as a sign that it doesn't touch the AM state. it does heavily pessimise optimizations from some things I've thrown it at, but for example, aliasing analysis with &mut can cause reordering of code because it knows there can't be any aliasing pointers.

The issue isn't the lack of clobbers exactly. The "new" asm! has the memory clobber (in the syntax of GCC or the old asm!) on by default, unless you pass nomem to disable it.

But indeed, that is a caveat I wasn't aware of. In this test case:

use std::arch::asm;
pub unsafe fn foo(x: &mut i32, y: *mut *mut i32) {
    *x = 1;
    asm!("");
    y.write_volatile(x);
    // ^-- Imagine that is triggering an MMIO device
    // to start reading from x.
    *x = 2;
}

the *x = 1; is optimized out. The asm!("") is not enough to establish ordering because that asm block is not allowed to read from x.

The optimization can be prevented using std::sync::atomic::compiler_fence. It would also be prevented if volatile accesses were reimplemented using asm themselves.

Lokathor commented 2 years ago

Other than the "we would have to write some new code for every arch" problem I don't see anything particularly wrong with implementing volatile with asm blocks.

comex commented 2 years ago

If the compiler_fence was replaced by an atomic fence, it would definitely be wrong -- we agree on that, right? The same optimizations that break for the atomic case, can also still be done with a compiler_fence, so that version is just as wrong.

I don't know if we disagree on anything. Even supposing for argument's sake that no optimizations occur involving moving anything across the barrier, and no interprocedural optimizations occur…

Assuming is_shared_data_initialized is a bool, even if we assume that bool is 1 byte on the platform (it's not always) and thus the write to it cannot tear, the generated assembly could write an invalid value to that byte (not 0 or 1) before writing the final value; at that point, when the signal handler reads the value, it may misbehave due to the compiler assuming the value is 0 or 1.

...I did miss the fact that atomic_signal_fence is defined to only affect atomics. It doesn't even affect volatile sig_atomic_t, which, as @chorman0773 pointed out, is the only case where the C++ spec blesses using volatile to communicate between a thread and its signal handler.

But some people definitely are using compiler_fence for DMA, or in other words to provide a barrier between non-volatile accesses and volatile-but-not-atomic accesses.

Implementation-wise, today, compiler_fence prevents undesired optimizations in that case while asm!(""); may not, per my last comment. However, it's implemented using LLVM's fence instruction, which is documented by LangRef as only working with atomics.

Lokathor commented 2 years ago

why would there ever be code emitted that temporarily writes an invalid value to a bool, and if that can happen for one type of value why can't that randomly also happen for any other type of value? Should I be concerned that my &mut bool values might suddenly explode some day?

chorman0773 commented 2 years ago

Something could use the same storage the bool occupies while it knows it's not being used, and it just remembers the value to restore for the next use, because compiler.


RalfJung commented 2 years ago

aliasing analysis with &mut can cause reordering of code because it knows there can't be any aliasing pointers.

That is true with all fences, none of them allows you to violate the aliasing assumptions. If you are using volatile because that memory is shared with some other party (e.g. via MMIO), then you must also inform the type system of that sharing, by using shared references or raw pointers.
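A hedged variant of comex's test case above illustrates this: taking `x` as a raw pointer instead of `&mut i32` means the type system no longer asserts uniqueness, so the dead-store argument based on `&mut` no longer applies. (Whether the first write is then *guaranteed* to survive is still part of the open question in this thread.)

```rust
use std::sync::atomic::{compiler_fence, Ordering};

pub unsafe fn foo(x: *mut i32, y: *mut *mut i32) {
    x.write(1); // no longer removable on &mut-uniqueness grounds
    compiler_fence(Ordering::Release);
    y.write_volatile(x); // imagine this triggers an MMIO device reading from x
    x.write(2);
}
```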

And as you say Jubilee, normal data dependencies apply between volatile and standard accesses still apply: If you standard write data that came from a volatile read, the standard write cannot happen before the volatile read simply because it doesn't exist before the read.

The compiler might under some circumstances be able to speculate the write. Then such ordering can still be violated.

Reasoning with data dependencies is pretty much impossible in a programming language. Assembly languages can do it but that's a different situation. (This is why the 'consume' memory ordering is just inherently broken.)

The optimization can be prevented using std::sync::atomic::compiler_fence.

That is coincidence and not guaranteed by any spec. Mutable references are unique and the memory they point to must not be written to or read from through any other pointer (not derived from them). This includes MMIO and DMA accesses of other devices.