Weak memory emulation does not respect C++20 SCfix

RalfJung commented 2 years ago

Our weak memory emulation does not respect the C++20 SCfix, so we could sometimes produce values that are not actually possible on the real machine. This can lead to false UB reports when SeqCst fences are used.

Unfortunately, no literature exists at the time of writing which proposes an implementable and C++20-compatible relaxed memory model that supports all atomic operation existing in Rust.

SabrinaJewson commented 2 years ago

I believe that false positives can be produced even when fences are not used. In particular, consider this code:

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

let a = thread::spawn(|| X.store(true, Relaxed));
let b = thread::spawn(|| Y.store(true, Relaxed));
let c = thread::spawn(|| { while !X.load(SeqCst) {} Y.load(SeqCst) });
let d = thread::spawn(|| { while !Y.load(SeqCst) {} X.load(SeqCst) });

a.join().unwrap();
b.join().unwrap();
let c = c.join().unwrap();
let d = d.join().unwrap();

assert!(c || d);

playground, I had to loop it to get Miri to fail

Currently, Miri can panic on it, but by the C++20 rules I don’t think it is allowed to. The execution of the code can be represented with this diagram:

     a        static X         c               d        static Y         b
╭─────────╮   ┌───────┐   ╭─────────╮     ╭─────────╮   ┌───────┐   ╭─────────╮
│ store X ├─┐ │ false │ ┏━→ load X  │     │ load Y  ←━┓ │ false │ ┌─┤ store Y │
╰─────────╯ │ └───────┘ ┃ ╰────╥────╯     ╰────╥────╯ ┃ └───────┘ │ ╰─────────╯
            └─┬───────┐ ┃ ╭────⇓────╮     ╭────⇓────╮ ┃ ┌───────┬─┘
              │ true  ┝━┛ │ load Y  ┝━? ?━┥ load X  │ ┗━┥ true  │
              └───────┘   ╰─────────╯     ╰─────────╯   └───────┘

Using [thread name]-[operation number] notation to refer to each operation, the logic is as follows:

Either c-0 precedes d-1 in S, or d-0 precedes c-1 in S, or both: I will call the preceding operation A, the succeeding operation B, and the relevant memory location (either X or Y) M.
If B were to load false from M, it would be coherence-ordered before A.
- This is because “there exists an atomic modification X of M such that [B] is coherence-ordered before X and X is coherence-ordered before [A]”, where X is the Relaxed store of true to M.
  - B is coherence-ordered before X because “[B] and [X] are not the same atomic read-modify-write operation, and there exists an atomic modification [Y] of M such that [B] reads the value stored by [Y] and [Y] precedes [X] in the modification order of M”, where Y is just the initial value of M
  - X is coherence-ordered before A because “[X] is a modification, and [A] reads the value stored by [X]” (i.e. true)
However, if B would be coherence-ordered before A then it would have to precede A in S.
- This is because “for every pair of atomic operations [B] and [A] on an object M, where [B] is coherence-ordered before [A] […] if [B] and [A] are both [SeqCst] operations, then [B] precedes [A] in S”.
But since we already know A precedes B in S, such an execution where B would load false from M is not possible.
Therefore, at least one of c-1 or d-1 must load true from M.

RalfJung commented 2 years ago

Cc @cbeuw would be great if you could take a look at the above. :)

RalfJung commented 2 years ago

I will call the preceding operation A, the succeeding operation B

Here do you mean "succeeding" as in "success" or as in "second"?

SabrinaJewson commented 2 years ago

As in “second”.

RalfJung commented 2 years ago

Okay, makes sense. And thanks for the detailed description! I don't know the C++ memory model well enough to check if this is all according to spec, but the explanation made sense in itself. :)

cbeuw commented 2 years ago

@SabrinaJewson I think the logic deduction is a bit off. You started with the assumption of B loading false, reached a contradiction, thus proving that B cannot load false. But surely you have proved that B cannot load false for all possible values of B, and not "all possible values of B cannot simultaneously load false"?. Suppose A is c-0 and B is d-1, then you have shown that d-1 cannot load false regardless of what c-1 loads (since your proof does not even name it). So what you actually proved is that neither c-1 or d-1 can load false. Both must be true.

I actually do agree with your conclusion that at least one of them must be true though:

Assume d-1:X.load reads false (without assuming the S edge between c and d)
Since d-1:X.load reads false, it is coherence-ordered before a-0:X.store(true) (d-1:X.load "reads-before" a-0:X.store(true))
Since c-0:X.load reads true, a-0:X.store(true) is coherence-ordered before it.
So d-1:X.load is coherence-ordered before c-0:X.load
So d-1:X.load precedes c-0:X.load in S, since they are coherence-ordered before and both SC
So we have S as d-0:Y.load, d-1:X.load, c-0:X.load, c-1:Y.load
Now further assume c-1:Y.load reads false
Through the same deduction you can get c-1:Y.load precedes d-0:Y.load in S, which is a contradiction
Therefore c-1:Y.load must read true

So we get d-1:X.load reading false implies c-1:Y.load reading true, and the reverse obviously stands too, so d-1:X.load and c-1:Y.load cannot both be false.

cppmem says all 4 combinations of (c-1, d-1) can occur under C++11, so this appears to be due to SCFix indeed.

int main() {
  atomic_int x=0; atomic_int y=0;
  x.store(0, memory_order_relaxed); y.store(0, memory_order_relaxed);
  {{{ { x.store(1,memory_order_relaxed); }
  ||| { y.store(1,memory_order_relaxed); }
  ||| { x.load(memory_order_seq_cst).readsvalue(1); y.load(memory_order_seq_cst); }   
  ||| { y.load(memory_order_seq_cst).readsvalue(1); x.load(memory_order_seq_cst); } }}}
  return 0;
}

The crux of the issue here is that if you have two SC loads and a non-SC store, the first SC load doesn't see the store and the second one sees it, then under C++20 the first load must precede the second load in S, whereas previously this is not necessarily the case (unless covered by other coherence rules, such as if the two loads happen-before each other)

cbeuw commented 2 years ago

This is not hard to fix I think. Basically, if there's an SC load that reads from any store, then another SC load later in S cannot read before that store. This is because read from gives you a cob, read before gives you a cob, cob + cob is a cob, and a cob between SCs gives you an extra S edge, so we can't create the read before edge on the second load, otherwise we create a backwards S edge. When going through the store buffer, we need to check if the current store element has ever been SC loaded. If so, then we need to stop the search.

This is very similar to the basic Coherent-RR actually, which states that if two loads on the same location happen before each other, then the later load cannot read a value earlier in MO than what the first load saw. Except in C++20, it's both hb and S.

The remaining question is whether this is consistent with the standard. I don't have the ability to prove anything like the author did, though the modification on the race detector side for the release sequence changes seem to be in the same boat. In any case, since I'm restricting the "outdatedness" of values a load can see, the new possible behaviours is a strict subset of the current implementation, so it can't be more unsound than now.

RalfJung commented 2 years ago

Yeah, if this can definitely only rule out some behaviors Miri currently produces (and not add more behavior), and if it rules out some behavior that we agree is wrong, then we should do it even if it also rules out other behavior that would be allowed (we already have plenty of that).

RalfJung commented 2 years ago

All right, that particular issue should be fixed by https://github.com/rust-lang/miri/pull/2512. :) The wider problem of not respecting SC fences properly remains, though.

This is somewhat surprising since, to my knowledge, SC fences are actually much more well-behaved in C++20 than SC accesses. For example, the operational Promising Semantics model supports SC fences but not SC accesses. I wonder if we can copy the SC fence approach from the promising semantics, or are there too many other differences that make this impractical?

cbeuw commented 2 years ago

SC fences are better behaved. Adapting Promising Semantics' SC fences is possible and can actually simplify our code. The bit I'm not fully comfortable with is the interaction between SC fences and SC accesses, which are fairly intertwined.

I'm having another look at the C++20 SC definitions. I think the new wording is relatively friendly to translate to operational models since it's worded in terms of "if this happened, then there's an S edge", so all we need to focus on is "don't create a backwards S edge by reading the wrong store", since we always have S naturally as the operational steps. But because coherence-ordered before is transitive, it might be tricky to cover all cases.

RalfJung commented 2 years ago

Something we could do as mitigation for now is detect when the same location is used for both SC and non-SC accesses, and print a warning when that happens. It seems (based on what a colleague told me) that SCfix only affects the behavior of programs that mix SC and non-SC on the same location.

SabrinaJewson commented 2 years ago

C++20 SCFix comes in three parts:

The “Power, ARM, and GPU implementability” part, which weakens SeqCst by enabling two new possible executions that can be produced by mixing SC accesses with non-SC accesses
The strengthening of SC fences to rely on the “happens-before” semantic instead of the “sequenced-before” semantic to decide which operations are coherence-ordered-before SC operations after the fence (and coherence-ordered-after SC operations before the fence)
The strengthening of regular SC that I posted about earlier in this thread

So I believe that warning would only affect part (1). It still may be helpful to have, since users would be warning that Miri cannot test all branches of their code, but I don’t think it would help w.r.t. correctness.

RalfJung commented 2 years ago

Your earlier example mixes SC and non-SC on the same location, doesn't it?

I don't know this space enough to judge, I can just relay what other people tell me. :shrug: It might be that the statement (about it being sufficient to detect mixed accesses) was made wrt a different operational model -- e.g. if we implemented the promising semantics, instead of the dynamic data race model.

RalfJung commented 1 year ago

@SabrinaJewson do you have an example of a piece of code where Miri currently performs an execution that is incorrect under SCFix, and that does not mix SC and non-SC accesses on the same location?

SabrinaJewson commented 1 year ago

I am struggling to come up with such an example. I can easily make an example where I believe C++11 would allow the behaviour but it is forbidden by C++20, for example:

static CARRIER: AtomicBool = AtomicBool::new(false);
static CHAINER: AtomicBool = AtomicBool::new(false);
static TARGET: AtomicBool = AtomicBool::new(false);

fn main() {
    loop {
        CARRIER.store(false, atomic::Ordering::Relaxed);
        CHAINER.store(false, atomic::Ordering::Relaxed);
        TARGET.store(false, atomic::Ordering::Relaxed);

        let happens_before_thread = thread::spawn(|| {
            let target = TARGET.load(atomic::Ordering::Relaxed);
            CARRIER.store(true, atomic::Ordering::Release);
            target
        });

        let first_fence_thread = thread::spawn(|| {
            while !CARRIER.load(atomic::Ordering::Acquire) {}
            atomic::fence(atomic::Ordering::SeqCst);
            CHAINER.store(true, atomic::Ordering::Relaxed);
        });

        let second_fence_thread = thread::spawn(|| {
            while !CHAINER.load(atomic::Ordering::Relaxed) {}
            atomic::fence(atomic::Ordering::SeqCst);
            TARGET.store(true, atomic::Ordering::Relaxed);
        });

        let target = happens_before_thread.join().unwrap();
        first_fence_thread.join().unwrap();
        second_fence_thread.join().unwrap();
        assert!(!target);
    }
}

use std::sync::atomic;
use std::sync::atomic::AtomicBool;
use std::thread;

Here, C++11 would allow happens_before_thread to load true from TARGET because:

It does not happen-before any operation that interacts with TARGET at all, therefore it is unrestricted by coherence.
The SeqCst fence only applies restrictions to operations sequenced-before the fence, thus it has no effect on that load.

In contrast C++20 disallows this execution because any operation that happens-before the first fence (i.e. happens_before_thread’s load of TARGET) must not be coherence-ordered-after any operation that happens-after the second fence (i.e. second_fence_thread’s store of TARGET) — and the load of true would cause it to be coherence-ordered-after the store, which is disallowed.

However, this is not particularly helpful because even with the fences removed I could not get hardware, Miri, TSan or Loom to trigger on this code. It’s also true that depending on your interpretation of OOTA even in C++11 the execution without fences isn’t allowed in the first place. One can replace the communication used in CHAINER with a cas from 0 to 1 in first_fence_thread and a store of 2 in second_fence_thread, which weakens the argument that it is disallowed under OOTA (since it would be impossible with that structure to actually transfer data from first_fence_thread to second_fence_thread), but nevertheless none of the tools report a panic.

Do you have an example of Miri producing an invalid execution that does mix SeqCst and non-SeqCst accesses? Maybe if one existed I could try and adapt it for this. I don’t know much about how Miri implements the model so it’s hard for me to guess if there’s any form of the above program that Miri would actually error on.

RalfJung commented 1 year ago

I don’t know much about how Miri implements the model

Neither do I. :joy: It's basically an implementation of this paper, with slight adjustments described here and here.

Do you have an example of Miri producing an invalid execution that does mix SeqCst and non-SeqCst accesses?

I was under the impression that your example here qualifies?

RalfJung commented 1 year ago

I was under the impression that https://github.com/rust-lang/miri/issues/2301#issuecomment-1221502757 qualifies?

Ah, looks like https://github.com/rust-lang/miri/pull/2512 fixed that particular example.

Sadly I don't know enough about all this to come up with new examples. I just remember @cbeuw saying that this is not yet a proper fix. @cbeuw maybe you can help us with such an example?

RalfJung commented 1 year ago

For the record, here is a program where the assertion would be allowed to fail Under C++11 but no more with SCfix. However it does not fail in Miri, so this does not help to determine whether there still is a bug to fix here.

EDIT: The test had a bug oops. And the fixed test does fail in Miri!

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn main() {
    for _ in 0..100 {
        X.store(false, atomic::Ordering::Relaxed);
        Y.store(false, atomic::Ordering::Relaxed);

        let thread1 = thread::spawn(|| {
            let a = X.load(atomic::Ordering::Relaxed);
            atomic::fence(Ordering::SeqCst);
            let b = Y.load(atomic::Ordering::Relaxed);
            (a, b)
        });

        let thread2 = thread::spawn(|| {
            X.store(true, Ordering::Relaxed);
        });
        let thread3 = thread::spawn(|| {
            Y.store(true, Ordering::Relaxed);
        });

        let thread4 = thread::spawn(|| {
            let c = Y.load(atomic::Ordering::Relaxed);
            atomic::fence(Ordering::SeqCst);
            let d = X.load(atomic::Ordering::Relaxed);
            (c, d)
        });

        let (a, b) = thread1.join().unwrap();
        thread2.join().unwrap();
        thread3.join().unwrap();
        let (c, d) = thread4.join().unwrap();
        let bad = a == true && b == false && c == true && d == false;
        assert!(!bad);
    }
}

use std::sync::atomic::{self, Ordering};
use std::sync::atomic::AtomicBool;
use std::thread;

cbeuw commented 1 year ago

Looks like https://github.com/rust-lang/miri/issues/2301#issuecomment-1403915704 does fail under Miri. However, I don't think the asserted execution is forbidden under SC Fix.

An execution is only forbidden if there is a contradiction between the observed values and S. However, in the forbidden execution, none of the Relaxed accesses participate in S. To make two Relaxed accesses participate in S, we can only go by the "if a memory_order::seq_cst fence X happens before A and B happens before a memory_order::seq_cst fence Y, then X precedes Y in S." rule, where A is coherence-ordered before B. All other rules involve at least one SeqCst access which does not exist in the program.

In other words, we need SC fence -hb-> A, B -hb-> SC fence, plus A -cob-> B in the execution. The only way to get the first two edges is by having A = Y.load false in t1 and B = Y.load true in t4 (or the symmetric equivalent A = X.load false in t4 and B = X.load true in t1), but in the asserted execution there is no t1: Y.load false -cob-> t4: Y.load true, we can see this from all the coherence-ordered before edges in the execution. The labels on the edges are the sources of coherence-ordered before relation:


            rf^-1                                                  rf^-1
 ┌──────────────────────────────── Initialise X, Y ─────────────────────────────────────┐
 │                                      │   │                                           │
 │                                  mo  │   │  mo                                       │
 │                                 ┌────┘   └──────┐                                    │
 │                     rf^-1       ▼               ▼          rf^-1                     │
 │    X.load // true ◄──────  X.store(true)    Y.store(true) ─────► Y.load // true      │
 │    SC fence                     │       rb       │               SC fence            │
 └──► Y.load // false ◄────────────┼────────────────┘               X.load // false ◄───┘
                                   │                    rb                ▲
                                   └──────────────────────────────────────┘

Therefore S contains no Relaxed accesses and the execution is actually allowed under SC Fix.

RalfJung commented 1 year ago

I'm out of my depth here.^^ I was just told that this is the canonical example for what SCfix is intended to rule out.

I'll try to get an expert into this thread. :)

orilahav commented 1 year ago

The rf edges and rb edges in this drawing are all reversed.

We do have "cob" from "Y.load false" to "Y.load true":

"Y.load false" reads from initialization which is modification order before "Y.store(true)" - so we have "cob" from "Y.load false" to "Y.store(true)"
"Y.load true" reads from "Y.store(true)" - so we have "cob" from "Y.store(true)" to "Y.load true"
by transitivity, we have "cob" from "Y.load false" to "Y.load true"

orilahav commented 1 year ago

one of the main points of our "SCfix" work was to make sure that an SC-fence between every two instructions would actually guarantee that there are no weak behaviors.

tomtomjhj commented 1 year ago

By the way, genmc agrees that the bad execution from https://github.com/rust-lang/miri/issues/2301#issuecomment-1403915704 is forbidden.

equivalent c code

Run with ``` genmc --print-exec-graphs -check-consistency-type=full -check-consistency-point=step the-code.c ``` ```c #include #include #include #include #include atomic_bool X = false; atomic_bool Y = false; bool a = false; bool b = false; bool c = false; bool d = false; void *thread_1(void *unused) { a = atomic_load_explicit(&X, memory_order_relaxed); atomic_thread_fence(memory_order_seq_cst); b = atomic_load_explicit(&Y, memory_order_relaxed); return NULL; } void *thread_2(void *unused) { atomic_store_explicit(&X, true, memory_order_relaxed); return NULL; } void *thread_3(void *unused) { atomic_store_explicit(&Y, true, memory_order_relaxed); return NULL; } void *thread_4(void *unused) { c = atomic_load_explicit(&Y, memory_order_relaxed); atomic_thread_fence(memory_order_seq_cst); d = atomic_load_explicit(&X, memory_order_relaxed); return NULL; } int main() { pthread_t t1, t2, t3, t4; if (pthread_create(&t1, NULL, thread_1, NULL)) abort(); if (pthread_create(&t2, NULL, thread_2, NULL)) abort(); if (pthread_create(&t3, NULL, thread_3, NULL)) abort(); if (pthread_create(&t4, NULL, thread_4, NULL)) abort(); if (pthread_join(t1, NULL)) abort(); if (pthread_join(t2, NULL)) abort(); if (pthread_join(t3, NULL)) abort(); if (pthread_join(t4, NULL)) abort(); bool bad = a == true && b == false && c == true && d == false; assert(!bad); return 0; } ```

cbeuw commented 1 year ago

Indeed I have drawn the rb edges the wrong way round 😅. The execution is forbidden and Miri is wrong here

RalfJung commented 1 year ago

Okay, nice, we have an example. :)

Now, the next questions are:

is it reasonable to assume that this is the "only kind of example" that Miri currently gets wrong, or do we have to look out for other patterns as well?
is there some reasonably simple way to fix Miri behavior on this example?

rust-lang / miri

Weak memory emulation does not respect C++20 SCfix #2301