I updated the test case so that the stores are SeqCst, putting the values of the two variables in a single total order, so it's clear that reading the latest value of lte has implications on what a read of gte can see
After some discussion and thinking, I've decided that my test case is (yet again) invalid. The error is here:
/*
gte lte gte lte gte (yet to run)
 1   1   2   2   3  ...
             ↑
             rmw on lte has to see this, not any earlier value
*/
let smaller = lte.fetch_add(0, Relaxed);
// The writer could increment lte and gte any number of times
// between the two fetch_add(0). This does not matter since smaller is still <= larger
let larger = gte.fetch_add(0, Relaxed);
While a Single Total Order exists for all stores (which are SeqCst) to gte and lte, the Relaxed RMWs are not obligated to observe this Order in the same way. Although gte.fetch_add(0, Relaxed) is sequenced after lte.fetch_add(0, Relaxed), it may still observe a value in the Single Total Order earlier than what the latter RMW has seen, through instruction reordering or otherwise.
In the absence of another test case, I have to assume the optimisation is indeed valid. Closing this for now.
Supposedly there is a counterexample to this optimization in https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf in Section 7 and Figure 8. @cbeuw can you take a look? I can't quite make sense of it.^^
That counterexample uses Release ordering which does not get optimised ("the ordering of that operation is compatible with a load (i.e., anything that doesn't have a release semantic).").
When the author said
Hence it appears impossible to avoid generating a fence as part of the fetch_add(0, ...) operation.
I think he meant in the general case? I've been told on Discord that "read the latest value in modification order" is meaningless for Relaxed RMWs since there is no way to actually know whether you have seen the latest or not. This optimisation assumes the same holds for Acquire.
I guess he meant for the RMW in the seqlock, which has release semantics... or doesn't it?
(After a lot more discussion)
I think this issue does need to be reopened, and the optimization is indeed wrong. In the following program, the FAA in the 2nd thread is guaranteed (under the C++ memory model) to always read 1:
// All accesses relaxed unless specified otherwise
FAA(y,1)        ||  if (*x == 1)
*[release]x = 1 ||      FAA(y,0) // always reads 1
However, a relaxed read in the same place is not guaranteed to always read 1 (it could be speculated out of the if). Therefore, LLVM miscompiles this program.
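For concreteness, here is a minimal Rust rendering of that litmus test (a sketch; the statics and thread structure are mine, not from the thread). The fetch_add in the second thread is the operation claimed to always read 1; a plain relaxed load in its place would have no such guarantee:

```rust
use std::sync::atomic::{AtomicI32, Ordering::{Relaxed, Release}};
use std::thread;

static X: AtomicI32 = AtomicI32::new(0);
static Y: AtomicI32 = AtomicI32::new(0);

fn main() {
    let t1 = thread::spawn(|| {
        Y.fetch_add(1, Relaxed);
        X.store(1, Release);
    });
    let t2 = thread::spawn(|| {
        if X.load(Relaxed) == 1 {
            // Per the argument above, this identity RMW should always read 1
            // under the C++ model; a plain relaxed load here could read 0.
            let v = Y.fetch_add(0, Relaxed);
            assert_eq!(v, 1);
        }
    });
    t1.join().unwrap();
    t2.join().unwrap();
}
```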
Cc @jfbastien @qcolombet
Thanks to Hai Dang for showing me the example!
And here is another example, due to Ori Lahav:
Y:=1 rlx              ||  Z:=1 rlx
fence(rel)            ||  fence(rel)
FAA(X,0,rlx)          ||  FAA(X,0,rlx)
fence(acq)            ||  fence(acq)
a := Z rlx // loads 0 ||  a := Y rlx // loads 0
The annotated result should be forbidden in the code as-is, because one of the two FAA will read-from the other, and then together with the fences they will establish synchronization between these threads. But once the FAA are both just relaxed loads, suddenly seeing 0 everywhere becomes possible. So this is a counterexample for the optimization.
The above example is reproducible on real x86_64 hardware (i7-8750H, native Linux)
// From https://github.com/llvm/llvm-project/issues/56450#issuecomment-1183695905
use std::sync::atomic::fence;
use std::sync::atomic::Ordering::Acquire;
use std::sync::atomic::Ordering::Relaxed;
use std::sync::atomic::Ordering::Release;
use std::{sync::atomic::AtomicI32, sync::atomic::AtomicBool, thread};
#[no_mangle]
pub fn rdmw(storing: &AtomicI32, sync: &AtomicI32, loading: &AtomicI32) -> i32 {
    storing.store(1, Relaxed);
    fence(Release);
    sync.fetch_add(0, Relaxed);
    fence(Acquire);
    loading.load(Relaxed)
}

pub fn main() {
    loop {
        let x: &'static AtomicI32 = Box::leak(Box::new(AtomicI32::new(0)));
        let y: &'static AtomicI32 = Box::leak(Box::new(AtomicI32::new(0)));
        let z: &'static AtomicI32 = Box::leak(Box::new(AtomicI32::new(0)));
        // Since each thread is so short, we need to make sure that they truly run at the same time
        // Otherwise t1 will finish before t2 even starts
        let go: &'static AtomicBool = Box::leak(Box::new(AtomicBool::new(false)));
        let t1 = thread::spawn(move || {
            while !go.load(Relaxed) {}
            rdmw(y, x, z)
        });
        let t2 = thread::spawn(move || {
            while !go.load(Relaxed) {}
            rdmw(z, x, y)
        });
        go.store(true, Relaxed);
        let a = t1.join().unwrap();
        let b = t2.join().unwrap();
        assert_ne!((a, b), (0, 0));
    }
}
andy@andy-XPS-15-9570:~/Code/rdmw$ rustc --edition=2021 -C opt-level=1 lahav.rs && ./lahav
thread 'main' panicked at 'assertion failed: `(left != right)`
left: `(0, 0)`,
right: `(0, 0)`', lahav.rs:42:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I can't get this to reproduce on Apple Silicon or Cortex-A72, presumably because the fences actually compile to dmb, whereas on x86 they are no-ops.
the FAA in the 2nd thread is guaranteed (under the C++ memory model) to always read 1:
@RalfJung Why is this the case? The load of x which sees 1 is not Acquire, so there's no happens-before edge with the release fence/store on the first thread. Wouldn't this mean that the FAA(y, 1) is not guaranteed to be a visible side effect for the FAA(y, 0)?
I'm not sure about the first example, but the optimization seems wrong in not taking Release fences into account:
use std::sync::atomic::{fence, AtomicUsize, Ordering};

pub fn a(x: &AtomicUsize) -> usize {
    x.fetch_add(0, Ordering::Release)
}
pub fn b(x: &AtomicUsize) -> usize {
    fence(Ordering::Release);
    x.fetch_add(0, Ordering::Relaxed)
}
// example::a:
// mfence
// mov rax, qword ptr [rdi]
// ret
// example::b:
// mov rax, qword ptr [rdi]
// ret
The release fence should give the relaxed fetch_add release semantics.
One of the questions here is whether FAA(0) is truly considered an RMW by LLVM (and likewise whether loads/stores may be optimised out). If it is an RMW, then there's clearly a miscompilation. If not, then this could be seen as a valid optimization.
The C++ spec seems to imply that any declared atomic RMW op is indeed an RMW, similar to how volatile works w.r.t. expressions. But then, on the same page, it also lists separate volatile RMW ops.
FAA is definitely always an RMW, just like doing an atomic store of 0 to a location that definitely already contains 0 is also a store. (And similar for noalias, which also distinguishes reads from writes: every write is a write, no matter which values it writes.) Whether or not the value changes is irrelevant to whether this is a "write". You don't get to just re-define the very notion of what a "write" is without something very explicit in the standard saying so.
Why is this the case? The load of x which sees 1 is not Acquire, so there's no happens-before edge with the release fence/store on the first thread. Wouldn't this mean that the FAA(y, 1) is not guaranteed to be a visible side effect for the FAA(y, 0)?
I agree there is no happens-before edge through x here. I didn't use such an edge in my argument.
As @cbeuw mentioned, RMWs always read the "last" value from the location. This is a special condition separate from all the usual reads-from/happens-before business. The way that is made precise implies, in particular, that two RMW instructions cannot both read-from the same other store operation. So one of the two FAA in my example has to read-from the other. If the right FAA reads-from the left one, it will obviously read 1. We could possibly declare that the left FAA reads-from the one on the right; but then we have something that looks a lot like an out-of-thin-air execution... we have a cycle in hb+rf. There is no generally accepted formal model of what exactly makes an execution out-of-thin-air, unfortunately. So if you want to use the OOTA loophole to justify this execution, then we are entering very subjective territory.
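As an aside (example mine, not from the thread), the same rule, that each RMW must read the latest value and hence two RMWs cannot both read-from the same store, is what guarantees concurrent increments never lose an update:

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
use std::thread;

static C: AtomicU32 = AtomicU32::new(0);

fn main() {
    // The two relaxed increments cannot both read-from the initial store of 0,
    // so one of them must read-from the other and the final value is always 2.
    let t1 = thread::spawn(|| C.fetch_add(1, Relaxed));
    let t2 = thread::spawn(|| C.fetch_add(1, Relaxed));
    t1.join().unwrap();
    t2.join().unwrap();
    assert_eq!(C.load(Relaxed), 2); // never 1
}
```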
Lucky enough, we have another example, where the optimization is unambiguously wrong and no OOTA issues come up. So probably it is better to focus discussion on that 2nd example.
FAA is definitely always an RMW, just like doing an atomic store of 0 to a location that definitely already contains 0 is also a store.
Ah ok, LLVM then doesn't fit this model.
You don't get to just re-define the very notion of what a "write" is without something very explicit in the standard saying so.
How does this play into optimizations like Dead-Store-Elimination and Common-Subexpression-Elimination for normal reads & writes? I assume the standard explicitly says somewhere that these can be merged and turned into "not writes" if they're not volatile, but if it doesn't then it makes less sense for atomics.
Lucky enough, we have https://github.com/llvm/llvm-project/issues/56450#issuecomment-1183695905,
I can see the issue now with the FAA optimization breaking the happens-before edges, due to the FAAs-as-loads no longer being happens-before in their modification order (which is what I think the issue is).
They fall under the as-if rule, i.e., if the compiler can prove you can never notice, then it can do whatever (and things like debuggers don't count as noticing). Normal writes do not count as a side-effect in and of themselves, only volatile writes. (And this lowering of atomicrmw didn't result in the wrong output; it would be fine to compile it like this.)
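As a rough illustration of the as-if rule just described (example mine, not from the thread): for non-atomic memory, a compiler may delete or merge stores whenever no conforming execution could tell the difference.

```rust
// The first store is dead: nothing can observe the value 1 in between (the
// data is behind an exclusive &mut), so eliminating it adds no new behaviour.
// The same reasoning does not let the compiler turn an atomic RMW into a
// plain load, because that does change which executions are possible.
fn overwrite(v: &mut u32) {
    *v = 1; // may be removed under the as-if rule
    *v = 2;
}

fn main() {
    let mut x = 0;
    overwrite(&mut x);
    assert_eq!(x, 2);
}
```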
Ah ok, LLVM then doesn't fit this model.
I am pretty sure it does. Stores can still be turned into loads or removed if that does not change the behavior of the program. (More precisely: if it does not add any new behaviors that were not already possible before.) That's the as-if rule @talchas mentioned.
I assume the standard explicitly says somewhere that these can be merged and turned into "not writes" somewhere
No, it emphatically does not. The standard does not say which optimizations are allowed or not. The standard says which behaviors of the program are allowed or not. It is the job of optimization authors to ensure that their optimizations do not add any new behaviors that were not already possible before: the optimized program has to behave "as if" it were executed as described by the standard.
It is worth reading the relevant section of the C++ standard in detail, in particular this part. C contains similar wording.
In this case, for the 2nd example: the original program, per the standard, will never read 0 in the last line of both threads. (Equivalently: the assertion in this Rust version of the example will never fail.) The program as transformed by LLVM, however, can read 0 in both threads (so the assertion now can fail). This means LLVM added behavior that was not possible before (an assertion failure), and that makes the optimization incorrect. This is basically like giving LLVM a program that says assert!(3 < 4), and getting an assertion failure due to miscompilation -- only a more subtle version of it since atomics are a lot more complicated than integer comparison.
There are some optimisations (https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf, Section 7) that can be done on read-dont-modify-write, though it is not trivial.
Btw, this seems to be implemented on x86: https://reviews.llvm.org/D5422 (https://github.com/llvm/llvm-project/commit/810739d17420d914b1a6f239f2ca7d5d82bd5f66)
(It seems that this patch is the reason why x86 fetch_add(0) is compiled as mfence;mov instead of lock xadd in cases where D57854's optimizations are not applied (i.e., when ordering is Release, AcqRel, or SeqCst). Also, all orderings were compiled as mfence;mov before D57854 was merged.)
First, thank you all for this discussion and examples, I learned something.
@RalfJung
because one of the two FAA will read-from the other, and then together with the fences they will establish synchronization between these threads.
This makes sense and I agree with the argument @RalfJung makes.
@jfbastien you were the reviewer; were you aware of Ori's litmus test (that Ralf posted here) during the review? What do you think of the rationale?
@tstellar Tom, the optimization implemented here breaks the C, C++, Rust, and (imo) LLVM memory models, silently introducing undefined behavior in user programs. What's the process for adding a revert of this patch as a release blocker for the next release?
If the intent of the optimization was to fundamentally change the semantics of the LLVM-IR memory model, its commit message and review process should have mentioned this, and it would have made sense to discuss this more widely instead of silently breaking every programming language implemented on top of LLVM-IR.
FWIW, when we developed and reviewed this optimization we obviously didn’t intend to change the memory model; more precisely we thought it wasn’t changing the memory model.
I haven’t followed this conversation closely but from a high level it seemed to me that we expect more than what is actually promised by the language. Again, that’s a high level read. If people agree the optimization is unsafe, it should be reverted and reworked.
A bunch of the examples are somewhat uncertain because the relaxed OOTA rule in the spec is informal, and formalizations of the rule tend to be very strong. (Those formalizations definitely forbid the reorderings the optimization creates.)
What is certain however is this example: https://github.com/llvm/llvm-project/issues/56450#issuecomment-1193152334 (unless fetch_add(Release) doesn't actually need that mfence, which iirc it does)
This example also does not depend on OOTA, as far as I understand.
@RalfJung that stems from the same release fence issue.
Ah good. :)
Also, this is only a problem on strongly ordered archs (x86). On archs where release fences emit an instruction (arm, power), I believe the codegen is correct.
Codegen might happen to be correct, but the optimization is still wrong. Optimizations must be justified against the abstract memory model, not the one of the current target architecture. (Or else all the undef/poison-based optimizations would be wrong, since obviously most targets do not behave like that.) I expect there is a way to get wrong codegen also on ARM by chaining this with other optimizations.
Independently of whether this optimization might be a correct backend-specific optimization for certain backends, the current optimization is a general LLVM-IR -> LLVM-IR transformation for which the resulting LLVM-IR does not preserve the semantics of the input LLVM-IR.
I think it makes sense to explore this optimization for the backends for which it is sound, but doing that should not block reverting the current patch to prevent landing a very subtle misoptimization on LLVM 16.
Finally got time to read through this. I agree the optimization doesn't seem correct for all backends.
I've put a patch up for review reverting it at https://reviews.llvm.org/D141277.
Fix landed in 6b85fa6d81b9f50d7a1c347dfd467c9ab01c4a5d
Fixing a similar issue with unused xchg -> atomic store at https://reviews.llvm.org/D142097
Fixing a similar issue with unused xchg -> atomic store at https://reviews.llvm.org/D142097
I opened a separate issue for that one, just to have it documented: https://github.com/llvm/llvm-project/issues/60418
Does someone here have an analysis for why the rationale in the HPL paper is correct? Or a link to that report that's not dead? (I could not find it)
LLVM has a portable hook to enable backends to deliver this optimization, see https://github.com/llvm/llvm-project/issues/101551 , so I don't think this issue has been "fully" solved.
Here's an archived version of that report: https://web.archive.org/web/20230318132751/http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf
However, the optimization is unambiguously incorrect on LLVM IR itself, as that uses the C++ memory model (with some adjustments, but nothing relevant here); the HPL paper should not be needed for that. There is an example upthread.
If Machine IR has a different memory model, the optimization could be performed there.
(for the people from the future) A colleague found it here: https://dl.acm.org/doi/10.1145/2247684.2247688 Not sure why google did not :/
If Machine IR has a different memory model, the optimization could be performed there.
As long as the machine code generated preserves the LLVM IR semantics when run on real hardware. Otherwise, outcomes forbidden by LLVM IR would be produced on real hardware.
As long as the machine code generated preserves the LLVM IR semantics when run on real hardware. Otherwise, outcomes forbidden by LLVM IR would be produced on real hardware.
That's covered by each individual step being semantically correct (and Machine-IR having a coherent semantics in its own right). Specifically, we need the LLVM-IR -> Machine-IR lowering to only produce behaviours that the LLVM-IR program allows, and the Machine-IR -> machine code lowering to only produce behaviours that the Machine-IR program allows.
Credits: @taiki-e first spotted this optimisation and found the offending patch https://github.com/crossbeam-rs/crossbeam/issues/859#issuecomment-1178539492 and @RalfJung pointed out that this may be a miscompilation https://github.com/crossbeam-rs/crossbeam/issues/859#issuecomment-1179355070
The patch "[InstCombine] Optimize atomicrmw <op>, 0 into load atomic when possible" assumed that an identity Monotonic (aka Relaxed) or Acquire RMW operation ("read-dont-modify-write") is the same as an atomic load. This is not always the case, as an RMW must see the latest value. Atomic loads do not provide this guarantee.
https://timsong-cpp.github.io/cppwp/n4868/atomics#order-10
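In other words (sketch mine, not part of the original report), the patch treated the first function below as if it could be compiled like the second:

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Acquire};

// An identity ("read-dont-modify-write") RMW: still a write in the memory
// model, and required to read the latest value in the modification order.
pub fn rdw(x: &AtomicUsize) -> usize {
    x.fetch_add(0, Acquire)
}

// A plain atomic load: no such guarantee. The patch rewrote the former into
// the latter for Relaxed and Acquire orderings.
pub fn as_load(x: &AtomicUsize) -> usize {
    x.load(Acquire)
}
```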
For a counterexample, please see https://github.com/llvm/llvm-project/issues/56450#issuecomment-1183695905
Likely invalid test cases just for archiving
These equivalent Rust and C++ programs panic on Apple Silicon M1 Pro with any optimisation enabled.

```rust
use std::sync::atomic::AtomicU64;
use std::sync::atomic::Ordering::Relaxed;
use std::sync::atomic::Ordering::SeqCst;
use std::thread;

#[no_mangle]
pub fn writer(gte: &AtomicU64, lte: &AtomicU64) {
    for i in 0.. {
        // Since there is only one writer and the ordering is SeqCst,
        // at any point of the program execution, the latest value of lte is either equal or one less than the latest value of gte
        gte.store(i, SeqCst);
        lte.store(i, SeqCst);
    }
}

#[no_mangle]
pub fn reader(gte: &AtomicU64, lte: &AtomicU64) {
    /*
    Since the stores to gte and lte are SeqCst, they are both part of a Single Total Order, looking like this:

    gte lte gte lte gte (yet to run)
     1   1   2   2   3  ...
    */
    loop {
        /*
        gte lte gte lte gte (yet to run)
         1   1   2   2   3  ...
                     ↑
                     rmw on lte has to see this, not any earlier value
        */
        let smaller = lte.fetch_add(0, Relaxed);
        // The writer could increment lte and gte any number of times
        // between the two fetch_add(0). This does not matter since smaller is still <= larger
        let larger = gte.fetch_add(0, Relaxed);
        assert!(larger >= smaller, "the latest gte value {larger} is less than the latest or earlier lte value {smaller}, this is not possible at any point in the program!");
    }
}

pub fn main() {
    static GTE: AtomicU64 = AtomicU64::new(0);
    static LTE: AtomicU64 = AtomicU64::new(0);

    let t_writer = thread::spawn(|| {
        writer(&GTE, &LTE);
    });
    let t_reader = thread::spawn(|| {
        reader(&GTE, &LTE);
    });
    t_writer.join().unwrap();
    t_reader.join().unwrap();
}
```

```console
andy@Andys-MacBook-Pro src % rustc --edition 2021 -C opt-level=1 main.rs && ./main
thread '
```

The patch needs to be reverted. There are some optimisations (https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf, Section 7) that can be done on read-dont-modify-write, though it is not trivial.