
Invalid optimisation: `atomicrmw <op>, 0` is not always the same as `load atomic` #56450

Closed cbeuw closed 1 year ago

cbeuw commented 2 years ago

Credits: @taiki-e first spotted this optimisation and found the offending patch https://github.com/crossbeam-rs/crossbeam/issues/859#issuecomment-1178539492 and @RalfJung pointed out that this may be a miscompilation https://github.com/crossbeam-rs/crossbeam/issues/859#issuecomment-1179355070

The patch [InstCombine] Optimize `atomicrmw <op>, 0` into `load atomic` when possible assumed that an identity Monotonic (aka Relaxed) or Acquire RMW operation (a "read-dont-modify-write") is the same as an atomic load. This is not always the case: an RMW must see the latest value in the modification order, and atomic loads do not provide this guarantee.
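At the source level, the effect is roughly to treat the following two functions as interchangeable (a sketch for illustration only; the patch itself operates on LLVM IR, rewriting `atomicrmw` into `load atomic`):

```rust
use std::sync::atomic::{AtomicU64, Ordering::Acquire};

// Identity RMW ("read-dont-modify-write"): must read the latest value
// in the modification order of `x`.
pub fn rmw_identity(x: &AtomicU64) -> u64 {
    x.fetch_add(0, Acquire)
}

// Plain atomic load: only required to read *some* coherent value,
// not necessarily the latest one. The patch rewrote the former into
// (the IR equivalent of) the latter for Relaxed and Acquire orderings.
pub fn atomic_load(x: &AtomicU64) -> u64 {
    x.load(Acquire)
}
```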

https://timsong-cpp.github.io/cppwp/n4868/atomics#order-10

Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.

For a counterexample, please see https://github.com/llvm/llvm-project/issues/56450#issuecomment-1183695905

Likely invalid test cases, kept just for archiving

These equivalent Rust and C++ programs panic on an Apple Silicon M1 Pro with any optimisation enabled.

```rust
use std::sync::atomic::AtomicU64;
use std::sync::atomic::Ordering::Relaxed;
use std::sync::atomic::Ordering::SeqCst;
use std::thread;

#[no_mangle]
pub fn writer(gte: &AtomicU64, lte: &AtomicU64) {
    for i in 0.. {
        // Since there is only one writer and the ordering is SeqCst,
        // at any point of the program execution, the latest value of lte is either equal or one less than the latest value of gte
        gte.store(i, SeqCst);
        lte.store(i, SeqCst);
    }
}

#[no_mangle]
pub fn reader(gte: &AtomicU64, lte: &AtomicU64) {
    /*
    Since the stores to gte and lte are SeqCst, they are both part of a Single Total Order, looking like this:
    gte lte gte lte gte (yet to run)
     1   1   2   2   3      ...
    */
    loop {
        /*
        gte lte gte lte gte (yet to run)
         1   1   2   2   3      ...
                     ↑
                    rmw on lte has to see this, not any earlier value
        */
        let smaller = lte.fetch_add(0, Relaxed);
        // The writer could increment lte and gte by any number of times
        // between the two fetch_add(0). This does not matter since smaller is still <= larger
        let larger = gte.fetch_add(0, Relaxed);
        assert!(larger >= smaller, "the latest gte value {larger} is less than the latest or earlier lte value {smaller}, this is not possible at any point in the program!");
    }
}

pub fn main() {
    static GTE: AtomicU64 = AtomicU64::new(0);
    static LTE: AtomicU64 = AtomicU64::new(0);

    let t_writer = thread::spawn(|| {
        writer(&GTE, &LTE);
    });
    let t_reader = thread::spawn(|| {
        reader(&GTE, &LTE);
    });
    t_writer.join().unwrap();
    t_reader.join().unwrap();
}
```

```console
andy@Andys-MacBook-Pro src % rustc --edition 2021 -C opt-level=1 main.rs && ./main
thread '<unnamed>' panicked at 'the latest gte value 45278 is less than the latest or earlier lte value 45312, this is not possible at any point in the program!', main.rs:22:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

void writer(std::atomic_uint64_t& gte, std::atomic_uint64_t& lte) {
    for (uint64_t i=0;;i++) {
        // Since there is only one writer and the ordering is SeqCst,
        // at any point of the program execution, the latest value of lte is either equal or one less than the latest value of gte
        gte.store(i, std::memory_order_seq_cst);
        lte.store(i, std::memory_order_seq_cst);
    }
}

void reader(std::atomic_uint64_t& gte, std::atomic_uint64_t& lte) {
    /*
    Since the stores to gte and lte are SeqCst, they are both part of a Single Total Order, looking like this:
    gte lte gte lte gte (yet to run)
     1   1   2   2   3      ...
    */
    while (true) {
        /*
        gte lte gte lte gte (yet to run)
         1   1   2   2   3      ...
                     ↑
                    rmw on lte has to see this, not any earlier value
        */
        uint64_t smaller = lte.fetch_add(0, std::memory_order_relaxed);
        // The writer could increment lte and gte by any number of times
        // between the two fetch_add(0). This does not matter since smaller is still <= larger
        uint64_t larger = gte.fetch_add(0, std::memory_order_relaxed);
        assert(larger >= smaller);
    }
}

int main() {
    std::atomic_uint64_t gte(0);
    std::atomic_uint64_t lte(0);

    std::thread t_writer(writer, std::ref(gte), std::ref(lte));
    std::thread t_reader(reader, std::ref(gte), std::ref(lte));
    t_writer.join();
    t_reader.join();
    return 0;
}
```

```console
andy@Andys-MacBook-Pro src % clang++ -O2 -Wall -Wextra -std=c++20 main.cpp && ./a.out
Assertion failed: (larger >= smaller), function reader, file main.cpp, line 21.
```

On a Cortex-A72 (Raspberry Pi 4 Model B), some padding is needed to prevent the two variables from ending up in the same cache line: https://gist.github.com/cbeuw/e5cb76b1d01b51a56fc35ff40248ab27#file-padded-rs

```console
andy@pi:~/rust/rdmw$ rustc --edition 2021 -C opt-level=3 padded.rs && ./padded
thread '<unnamed>' panicked at 'the latest gte value 40533 is less than the latest or earlier lte value 40602, this is not possible at any point in the program!', padded.rs:22:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

The patch needs to be reverted. There are some optimisations (https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf, Section 7) that can be done on read-dont-modify-write, though they are not trivial.

cbeuw commented 2 years ago

I updated the test case so that the stores are SeqCst, putting the values of the two variables in a single total order, so it's clear that reading the latest value of lte has implications for what a read of gte can see.

cbeuw commented 2 years ago

After some discussion and thinking, I've decided that my test case is (yet again) invalid. The error is here:

        /*
        gte lte gte lte gte (yet to run)
         1   1   2   2   3      ...
                     ↑
                    rmw on lte has to see this, not any earlier value
        */
        let smaller = lte.fetch_add(0, Relaxed);
        // The writer could increment lte and gte by any number of times
        // between the two fetch_add(0). This does not matter since smaller is still <= larger
        let larger = gte.fetch_add(0, Relaxed);

While a Single Total Order exists for all stores (which are SeqCst) to gte and lte, the Relaxed RMWs are not obligated to observe this Order in the same way. Although gte.fetch_add(0, Relaxed) is sequenced after lte.fetch_add(0, Relaxed), it may still observe a value earlier in the Single Total Order than what the RMW on lte has seen, through instruction reordering or otherwise.

In the absence of another test case, I have to assume the optimisation is indeed valid. Closing this for now.

RalfJung commented 2 years ago

Supposedly there is a counterexample to this optimization in https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf in Section 7 and Figure 8. @cbeuw can you take a look? I can't quite make sense of it.^^

cbeuw commented 2 years ago

That counterexample uses Release ordering, which does not get optimised ("the ordering of that operation is compatible with a load (i.e., anything that doesn't have a release semantic)").

When the author said

Hence it appears impossible to avoid generating a fence as part of the fetch add(0, ...) operation.

I think he meant in the general case? I've been told on Discord that "read the latest value in modification order" is meaningless for Relaxed RMWs, since there is no way to actually know whether you have seen the latest value or not. This optimisation assumes this is also the case for Acquire.

RalfJung commented 2 years ago

I guess he meant for the RMW in the seqlock, which has release semantics... or doesn't it?

RalfJung commented 2 years ago

(After a lot more discussion)

I think this issue does need to be reopened, and the optimization is indeed wrong. In the following program, the FAA in the 2nd thread is guaranteed (under the C++ memory model) to always read 1:

// All accesses relaxed unless specified otherwise
FAA(y,1)         || if (*x == 1)
*[release]x = 1  ||   FAA(y,0) // always reads 1

However, a relaxed read in the same place is not guaranteed to always read 1 (it could be speculated out of the if). Therefore, LLVM miscompiles this program.
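For concreteness, here is a rough Rust rendering of that litmus test (a sketch only; the function split and the names are for illustration, not part of the original example):

```rust
use std::sync::atomic::{AtomicU32, Ordering::{Relaxed, Release}};

// Thread 1: FAA(y,1); *[release]x = 1
pub fn thread1(x: &AtomicU32, y: &AtomicU32) {
    y.fetch_add(1, Relaxed);
    x.store(1, Release);
}

// Thread 2: if (*x == 1) FAA(y,0)
pub fn thread2(x: &AtomicU32, y: &AtomicU32) -> Option<u32> {
    if x.load(Relaxed) == 1 {
        // Per the argument above, this RMW has to read the latest value of y,
        // i.e. the 1 written by thread 1; a plain relaxed load in its place
        // could still legally return 0.
        Some(y.fetch_add(0, Relaxed))
    } else {
        None
    }
}
```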

Cc @jfbastien @qcolombet

Thanks to Hai Dang for showing me the example!

RalfJung commented 2 years ago

And here is another example, due to Ori Lahav:

// Thread 1:
Y := 1 rlx
fence(rel)
FAA(X, 0, rlx)
fence(acq)
a := Z rlx // loads 0

// Thread 2:
Z := 1 rlx
fence(rel)
FAA(X, 0, rlx)
fence(acq)
a := Y rlx // loads 0

The annotated result should be forbidden for the code as-is, because one of the two FAAs will read-from the other, and then, together with the fences, they establish synchronization between these threads. But once the FAAs are both just relaxed loads, seeing 0 everywhere suddenly becomes possible. So this is a counterexample for the optimization.

cbeuw commented 2 years ago

The above example is reproducible on real x86_64 hardware (i7-8750H, native Linux):

```rust
// From https://github.com/llvm/llvm-project/issues/56450#issuecomment-1183695905

use std::sync::atomic::fence;
use std::sync::atomic::Ordering::Acquire;
use std::sync::atomic::Ordering::Relaxed;
use std::sync::atomic::Ordering::Release;
use std::{sync::atomic::AtomicI32, sync::atomic::AtomicBool, thread};

#[no_mangle]
pub fn rdmw(storing: &AtomicI32, sync: &AtomicI32, loading: &AtomicI32) -> i32 {
    storing.store(1, Relaxed);
    fence(Release);
    sync.fetch_add(0, Relaxed);
    fence(Acquire);
    loading.load(Relaxed)
}

pub fn main() {
    loop {
        let x = Box::leak(Box::new(AtomicI32::new(0)));
        let y = Box::leak(Box::new(AtomicI32::new(0)));
        let z = Box::leak(Box::new(AtomicI32::new(0)));

        // Since each thread is so short, we need to make sure that they truly run at the same time
        // Otherwise t1 will finish before t2 even starts
        let go = Box::leak(Box::new(AtomicBool::new(false)));

        let t1 = thread::spawn(move || {
            while !go.load(Relaxed) {}
            rdmw(y, x, z)
        });

        let t2 = thread::spawn(move || {
            while !go.load(Relaxed) {}
            rdmw(z, x, y)
        });

        go.store(true, Relaxed);

        let a = t1.join().unwrap();
        let b = t2.join().unwrap();
        assert_ne!((a, b), (0, 0));
    }
}
```

```console
andy@andy-XPS-15-9570:~/Code/rdmw$ rustc --edition=2021 -C opt-level=1 lahav.rs && ./lahav
thread 'main' panicked at 'assertion failed: `(left != right)`
  left: `(0, 0)`,
 right: `(0, 0)`', lahav.rs:42:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

I can't get this to reproduce on Apple Silicon or Cortex-A72, presumably because the fences actually compile to dmb instructions there, whereas on x86 they are no-ops.

kprotty commented 2 years ago

the FAA in the 2nd thread is guaranteed (under the C++ memory model) to always read 1:

@RalfJung Why is this the case? The load of x which sees 1 is not Acquire, so there's no happens-before edge with the release fence/store on the first thread. Wouldn't this mean that the FAA(y, 1) is not guaranteed to be a visible side effect for the FAA(y, 0)?

ibraheemdev commented 2 years ago

I'm not sure about the first example, but the optimization seems wrong in not taking Release fences into account:

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

pub fn a(x: &AtomicUsize) -> usize {
    x.fetch_add(0, Ordering::Release)
}

pub fn b(x: &AtomicUsize) -> usize {
    fence(Ordering::Release);
    x.fetch_add(0, Ordering::Relaxed)
}

// example::a:
//         mfence
//         mov     rax, qword ptr [rdi]
//         ret

// example::b:
//         mov     rax, qword ptr [rdi]
//         ret
```

The release fence should give the relaxed fetch_add release semantics.

kprotty commented 2 years ago

One of the questions for that is whether FAA(0) is truly considered an RMW by LLVM (ditto for loads/stores that are optimized out). If it is an RMW, then there's clearly a miscompilation. If not, then this could be seen as a valid optimization.

The C++ spec seems to imply that any declared atomic RMW op is indeed an RMW, similar to volatile w.r.t. expressions. But then, on the same page, it also lists separate volatile RMW ops.

RalfJung commented 2 years ago

FAA is definitely always an RMW, just like doing an atomic store of 0 to a location that definitely already contains 0 is also a store. (And similar for noalias, which also distinguishes reads from writes: every write is a write, no matter which value it writes.) Whether or not the value changes is irrelevant to whether this is a "write". You don't get to just re-define the very notion of what a "write" is without something very explicit in the standard saying so.

Why is this the case? The load of x which sees 1 is not Acquire, so there's no happens-before edge with the release fence/store on the first thread. Wouldn't this mean that the FAA(y, 1) is not guaranteed to be a visible side effect for the FAA(y, 0)?

I agree there is no happens-before edge through x here. I didn't use such an edge in my argument.

As @cbeuw mentioned, RMWs always read the "last" value from the location. This is a special condition separate from all the usual reads-from/happens-before business. The way that is made precise implies, in particular, that two RMW instructions cannot both read-from the same other store operation. So one of the two FAA in my example has to read-from the other. If the right FAA reads-from the left one, it will obviously read 1. We could possibly declare that the left FAA reads-from the one on the right; but then we have something that looks a lot like an out-of-thin-air execution... we have a cycle in hb+rf. There is no generally accepted formal model of what exactly makes an execution out-of-thin-air, unfortunately. So if you want to use the OOTA loophole to justify this execution, then we are entering very subjective territory.
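(As a sketch for intuition only: a familiar consequence of that rule is that two concurrent fetch_adds can never both read-from the same store, which is exactly why a shared counter never loses increments.)

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
use std::thread;

fn main() {
    static C: AtomicU32 = AtomicU32::new(0);
    let t1 = thread::spawn(|| C.fetch_add(1, Relaxed));
    let t2 = thread::spawn(|| C.fetch_add(1, Relaxed));
    let (a, b) = (t1.join().unwrap(), t2.join().unwrap());
    // The two RMWs cannot both read-from the initial store of 0:
    // exactly one of them reads 0, the other reads 1, and the final value is 2.
    assert_eq!(a + b, 1);
    assert_eq!(C.load(Relaxed), 2);
}
```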

Lucky enough, we have another example, where the optimization is unambiguously wrong and no OOTA issues come up. So probably it is better to focus discussion on that 2nd example.

kprotty commented 2 years ago

FAA is definitely always an RMW, just like doing an atomic store of 0 to a location that definitely already contains already 0 is also a store.

Ah ok, LLVM then doesn't fit this model.

You don't get to just re-define the very notion of what a "write" is without something very explicit in the standard saying so.

How does this play into optimizations like Dead Store Elimination and Common Subexpression Elimination for normal reads & writes? I assume the standard explicitly says somewhere that these can be merged and turned into "not writes" if they're not volatile; if it doesn't, then it makes less sense for atomics.

Lucky enough, we have https://github.com/llvm/llvm-project/issues/56450#issuecomment-1183695905,

I can see the issue now with the FAA optimization breaking the happens-before edges, due to the FAAs, as loads, no longer being happens-before in their modification order (which is what I think the issue is).

talchas commented 2 years ago

They fall under the as-if rule, i.e. if the compiler can prove you can never notice, then it can do whatever it likes (and things like debuggers don't count as noticing). Normal writes do not count as a side effect in and of themselves; only volatile writes do. (And this lowering of atomicrmw didn't result in the wrong output; it would be fine to compile it like this.)

RalfJung commented 2 years ago

Ah ok, LLVM then doesn't fit this model.

I am pretty sure it does. Stores can still be turned into loads or removed if that does not change the behavior of the program. (More precisely: if it does not add any new behaviors that were not already possible before.) That's the as-if rule @talchas mentioned.

I assume the standard explicitly says somewhere that these can be merged and turned into "not writes"

No, it emphatically does not. The standard does not say which optimizations are allowed or not. The standard says which behaviors of the program are allowed or not. It is the job of optimization authors to ensure that their optimizations do not add any new behaviors that were not already possible before: the optimized program has to behave "as if" it were executed as described by the standard.

It is worth reading the relevant section of the C++ standard in detail, in particular this part. C contains similar wording.

In this case, for the 2nd example: the original program, per the standard, will never read 0 in the last line of both threads. (Equivalently: the assertion in this Rust version of the example will never fail.) The program as transformed by LLVM, however, can read 0 in both threads (so the assertion now can fail). This means LLVM added behavior that was not possible before (an assertion failure), and that makes the optimization incorrect. This is basically like giving LLVM a program that says assert!(3 < 4), and getting an assertion failure due to miscompilation -- only a more subtle version of it since atomics are a lot more complicated than integer comparison.
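For contrast, here is a sketch of the kind of atomic rewrite the as-if rule does justify (an illustration only, not a specific LLVM pass): the atomic below never escapes the function, so no other thread can observe it, and folding it away adds no new behavior.

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

pub fn local_counter() -> u32 {
    // This atomic never escapes the function, so no other thread can observe it.
    // A compiler may therefore fold the RMW away and reduce the body to `1`:
    // no conforming execution can tell the difference, which is exactly what
    // the as-if rule requires.
    let x = AtomicU32::new(0);
    x.fetch_add(1, Relaxed);
    x.load(Relaxed)
}
```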

taiki-e commented 2 years ago

There are some optimisations (https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf, Section 7) that can be done on read-dont-modify-write, though it is not trivial.

Btw, this seems to be implemented on x86: https://reviews.llvm.org/D5422 (https://github.com/llvm/llvm-project/commit/810739d17420d914b1a6f239f2ca7d5d82bd5f66)

(It seems that this patch is the reason why x86 fetch_add(0) is compiled as mfence;mov instead of lock xadd in cases where D57854's optimizations are not applied (i.e., when ordering is Release, AcqRel, or SeqCst). Also, all orderings were compiled as mfence;mov before D57854 was merged.)

gonzalobg commented 2 years ago

First, thank you all for this discussion and examples, I learned something.


@RalfJung

because one of the two FAA will read-from the other, and then together with the fences they will establish synchronization between these threads.

This makes sense and I agree with the argument @RalfJung makes.


@jfbastien, you were the reviewer: were you aware of Ori's litmus test (which Ralf posted here) during the review? What do you think of the rationale?

gonzalobg commented 1 year ago

@tstellar Tom, the optimization implemented here breaks the C, C++, Rust, and (imo) LLVM memory models, silently introducing undefined behavior in user programs. What's the process for adding the revert of this patch as a release blocker for the next release?

If the intent of the optimization was to fundamentally change the semantics of the LLVM-IR memory model, its commit message and review process should have mentioned this, and it would have made sense to discuss it more widely instead of silently breaking every programming language implemented on top of LLVM-IR.

qcolombet commented 1 year ago

FWIW, when we developed and reviewed this optimization we obviously didn’t intend to change the memory model; more precisely we thought it wasn’t changing the memory model.

I haven’t followed this conversation closely but from a high level it seemed to me that we expect more than what is actually promised by the language. Again, that’s a high level read. If people agree the optimization is unsafe, it should be reverted and reworked.

talchas commented 1 year ago

A bunch of the examples are somewhat uncertain because the relaxed OOTA rule in the spec is informal, and formalizations of the rule tend to be very strong. (Those formalizations definitely forbid the reorderings this optimization creates.)

What is certain however is this example: https://github.com/llvm/llvm-project/issues/56450#issuecomment-1193152334 (unless fetch_add(Release) doesn't actually need that mfence, which iirc it does)

RalfJung commented 1 year ago

This example also does not depend on OOTA, as far as I understand.

ibraheemdev commented 1 year ago

@RalfJung that stems from the same release fence issue.

RalfJung commented 1 year ago

Ah good. :)

ibraheemdev commented 1 year ago

Also, this is only a problem on strongly ordered archs (x86). On archs where release fences emit an instruction (ARM, Power), I believe the codegen is correct.

RalfJung commented 1 year ago

Codegen might happen to be correct, but the optimization is still wrong. Optimizations must be justified against the abstract memory model, not the one of the current target architecture. (Or else all the undef/poison-based optimizations would be wrong, since obviously most targets do not behave like that.) I expect there is a way to get wrong codegen also on ARM by chaining this with other optimizations.

gonzalobg commented 1 year ago

Independently of whether this optimization might be a correct backend-specific optimization for certain backends, the current optimization is a general LLVM-IR -> LLVM-IR transformation for which the resulting LLVM-IR does not preserve the semantics of the input LLVM-IR.

I think it makes sense to explore this optimization for the backends for which it is sound, but doing that should not block reverting the current patch to prevent landing a very subtle misoptimization on LLVM 16.

qcolombet commented 1 year ago

Finally got time to read through this. I agree the optimization doesn't seem correct for all backends.

I've put a patch up for review reverting it at https://reviews.llvm.org/D141277.

qcolombet commented 1 year ago

Fix landed in 6b85fa6d81b9f50d7a1c347dfd467c9ab01c4a5d

qcolombet commented 1 year ago

Fixing a similar issue with unused xchg -> atomic store at https://reviews.llvm.org/D142097

RalfJung commented 1 year ago

Fixing a similar issue with unused xchg -> atomic store at https://reviews.llvm.org/D142097

I opened a separate issue for that one, just to have it documented: https://github.com/llvm/llvm-project/issues/60418
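For reference, the pattern addressed there looks roughly like this (a sketch; see the review and the linked issue for the actual details):

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

pub fn unused_xchg(x: &AtomicU32) {
    // The result of the swap is unused, so the transformation turned this RMW
    // into a plain `x.store(1, Relaxed)`, dropping the read half of the RMW
    // and with it the "reads the latest value" guarantee discussed above.
    x.swap(1, Relaxed);
}
```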

gonzalobg commented 3 months ago

Does someone here have an analysis for why the rationale in the HPL paper is correct? Or a link to that report that's not dead? (I could not find it)

LLVM has a portable hook to enable backends to deliver this optimization, see https://github.com/llvm/llvm-project/issues/101551 , so I don't think this issue has been "fully" solved.

RalfJung commented 3 months ago

Here's an archived version of that report: https://web.archive.org/web/20230318132751/http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf

RalfJung commented 3 months ago

However, the optimization is unambiguously incorrect on LLVM IR itself, as that uses the C++ memory model (with some adjustments, but nothing relevant here); the HPL paper should not be needed for that. There is an example upthread.

If Machine IR has a different memory model, the optimization could be performed there.

gonzalobg commented 3 months ago

(for the people from the future) A colleague found it here: https://dl.acm.org/doi/10.1145/2247684.2247688 Not sure why google did not :/

gonzalobg commented 3 months ago

If Machine IR has a different memory model, the optimization could be performed there.

As long as the machine code generated preserves the LLVM IR semantics when run on real hardware. Otherwise, outcomes forbidden by LLVM IR would be produced on real hardware.

RalfJung commented 3 months ago

As long as the machine code generated preserves the LLVM IR semantics when run on real hardware. Otherwise, outcomes forbidden by LLVM IR would be produced on real hardware.

That's covered by each individual step being semantically correct (and Machine-IR having a coherent semantics in its own right). Specifically, we need: