I should have also Cc @rust-lang/opsem
I believe there are additional options:
C) have the backend lower the non-temporal store to either a regular store or a non-temporal store followed by an sfence, whichever seems more beneficial. The sfence can be deferred up to the next atomic release (or stronger) operation, or anything that might contain one.
D) add a Safety requirement to the vendor intrinsic.
It's not clear to me that this "breaks the model" as opposed to "extends the model". Why can't we have nontemporal stores as an explicit part of the atomics model, and say that they are not necessarily synchronized by a release-acquire pair to another thread?
have the backend lower the non-temporal store to either
Is "deferred sfence" a thing we can tell LLVM to do?
Also I think if we lower _mm_stream_ps in a way that includes an sfence, people will be very surprised. We should rather not expose _mm_stream_ps than expose something that's not the same operation.
add a Safety requirement to the vendor intrinsic.
No, that doesn't work. We still need to explain how and why code has UB when it violates the safety requirement. Our current memory model has no way to make this code UB.
This is an intrinsic, not a library function. It is specified by the operational semantics, not a bunch of axioms in the "safety" comment. The safety comment is merely reflecting the constraints that arise from the operational semantics. Making the safety comment the spec would mean making the spec axiomatic and that is something we certainly don't want to do.
Why can't we have nontemporal stores as an explicit part of the atomics model,
We can, but then we'll have to develop our own model. The C++ model does not have such stores. I don't think we should have our own concurrency model, that well exceeds our capacity. Concurrency models are very hard to get right and being able to inherit all the formal studies of the C++ model is extremely valuable.
And process-wise, we certainly can't have a library feature like stdarch just change the memory model. That requires an RFC. It seems fairly clear that the impact of _mm_stream_ps on the whole language was not realized at the time of stabilization, and I think our only option is to remove this operation again -- or rather, get as close to removing it as we can within our stability guarantees: make it harmless, and deprecate it.
Is "deferred sfence" a thing we can tell LLVM to do?
Not that I know. But it may be something LLVM should be doing when a store is annotated with !nontemporal. Unless their memory model already knows how to specify nontemporal + release behavior (which may just be UB) because it's broader than the C++ model.
Also I think if we lower _mm_stream_ps in a way that includes an sfence, people will be very surprised. We should rather not expose _mm_stream_ps than expose something that's not the same operation.
I assume as long as the sfence is sunk far enough they might not care. Just like mixing AVX and SSE can result in VZEROUPPER being inserted by the backend.
This is an intrinsic, not a library function.
I see. That distinction isn't always obvious. But it makes sense if we want the compiler to optimize around the intrinsics more than around FFI calls.
We can, but then we'll have to develop our own model. The C++ model does not have such stores.
Can we say we use the C++ model with the modification that an axiom (all stores are ordered with release operations) is turned into a requirement (all stores must be ordered with release operations)? That seems minimally invasive.
This is an intrinsic, not a library function.
nontemporal_store is our intrinsic, but _mm_stream_ps is a vendor intrinsic. If nontemporal_store is to be directly exposed it certainly needs to participate in the operational semantics. But vendor intrinsics are somewhat special.
Vendor intrinsics are sort of halfway between standard Rust and inline asm. The semantics of the vendor intrinsic is whatever asm the vendor says it is, and it's the responsibility of the user of the intrinsic to understand what that means (or doesn't) on the Rust AM and use the intrinsics in a way consistent with the AM. Ideally, writing _mm_stream_ps(ptr, a) and writing asm!("MOVNTPS {ptr}, {a}") (with the correct operand flags, which I haven't bothered figuring out) should be functionally identical, except that the former is potentially better understood by the compiler and doesn't have the monkeypatchable semantics of inline asm.
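For concreteness, a minimal sketch of what that "equivalent" inline-asm form might look like; the function name, operand constraints, and options are my guesses, not anything the comment above pins down:

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn stream_ps_via_asm(ptr: *mut f32, a: std::arch::x86_64::__m128) {
    use std::arch::asm;
    // movntps m128, xmm: non-temporal store of four packed floats.
    // `ptr` must be 16-byte aligned, same as for _mm_stream_ps.
    asm!(
        "movntps [{ptr}], {a}",
        ptr = in(reg) ptr,
        a = in(xmm_reg) a,
        options(nostack, preserves_flags),
    );
}
```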
Is this complicated by the extra function boundaries imposed by us exposing vendor intrinsics as extern "Rust" fn instead of extern "vendor-intrinsic"? Certainly[^1], since _mm_stream_ps(ptr, a); _mm_sfence() and asm!("MOVNTPS {ptr}, {a}", "SFENCE") are different, but in a way that I think is easier to resolve than either fixing compilers to better respect weak/NT memory ordering on x86[_64] or auditing every vendor intrinsic for individually having consistent semantics on the Rust AM.
[^1]: We should ideally consider whether we can get away with "fixing" this such that vendor intrinsics can't be made into function pointers and aren't actually functions, to make their special status more evident. Though abusing the ABI marker for intrinsics is still a bit of a bodge, and this doesn't completely resolve the issues, since asm!("MOVNTPS {ptr}, {a}"); asm!("SFENCE") still leaves the AM in an inconsistent state between the two asm blocks.
Slightly reinterpreting: it's perfectly acceptable, even expected for Miri to error when encountering vendor intrinsics. Using them steps outside the AM in a similar manner to inline asm, and as such you're relying on target specifics for coherency and sacrificing the ability to use AM-level sanitizers like Miri except on a best-effort basis.
I see. That distinction isn't always obvious. But it makes sense if we want the compiler to optimize around the intrinsics more than around FFI calls.
FFI calls don't help, FFI calls are only allowed to perform operations that could also be performed by Rust code (insofar as Rust-controlled state is affected). Getting the program into a state where release/acquire synchronization does not work any more is not allowed in any way, not even via FFI.
Like, imagine the following code:
```rust
static mut DATA: usize = 0;
static INIT: AtomicBool = AtomicBool::new(false);

thread::spawn(|| {
    while INIT.load(Acquire) == false {}
    let data = DATA;
    if data != data { unreachable_unchecked(); }
});
some_function(&mut DATA, 42);
INIT.store(true, Release);
```
This code is obviously sound. The compiler is allowed to change this code such that the spawned thread becomes
```rust
while INIT.load(Acquire) == false {}
if DATA != DATA { unreachable_unchecked(); }
```
However, if some_function does a non-temporal store, this change can introduce UB! Now the non-temporal store might take effect between the two reads of DATA, and suddenly a value can seem unequal to itself.
Therefore, it is UB to leave an inline assembly block or FFI operation with any "pending non-temporal stores that haven't been guarded by a fence yet". The correctness of compilation relies on this assumption.
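A sketch of the pattern that rule permits (operand syntax is my own assumption, not something stated above): as long as the non-temporal store and the fence live inside one asm block, the Rust AM never runs while a non-temporal store is still pending.

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn nt_store_then_fence(ptr: *mut u32, val: u32) {
    use std::arch::asm;
    asm!(
        // movnti: non-temporal store of a 32-bit value.
        "movnti [{ptr}], {val:e}",
        // sfence before leaving the block, so no "pending" NT store escapes.
        "sfence",
        ptr = in(reg) ptr,
        val = in(reg) val,
        options(nostack, preserves_flags),
    );
}
```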
Can we say we use the C++ model with the modification that an axiom (all stores are ordered with release operations) is turned into a requirement (all stores must be ordered with release operations)? That seems minimally invasive.
No, that's not how defining a memory model works. You'll need to define a new kind of memory access that's even weaker than non-atomic accesses and adjust the entire model to account for them.
The semantics of the vendor intrinsic is whatever asm the vendor says it is,
That doesn't work, they need to be phrased in terms of the Rust AM. Lucky enough they are mostly about what happens to certain values of SIMD type, so the vendor semantics directly translate to Rust AM semantics.
But when it comes to synchronization, this approach falls flat. If _mm_stream_ps
was implemented via inline assembly it would be UB.
auditing every vendor intrinsic for individually having consistent semantics on the Rust AM.
There's no way around that, but I hope we don't have many intrinsics that have global synchronization effects.
Slightly reinterpreting: it's perfectly acceptable, even expected for Miri to error when encountering vendor intrinsics. Using them steps outside the AM in a similar manner to inline asm, and as such you're relying on target specifics for coherency and sacrificing the ability to use AM-level sanitizers like Miri except on a best-effort basis.
Even vendor intrinsics are subject to the rule that governs all FFI and inline assembly: you can only do things to Rust-visible state that you could also have done in Rust. _mm_stream_ps violates that rule.
I hold that while nontemporal_store as intended is not covered by the memory model: a) it is not a huge extension to add it to C/C++'s (though that's ugly to do for a single special case); b) there's a reasonable way you can model what users actually would use it for inside the existing model; and c) as other people said, no user cares about the purity of the Rust AM when doing this, even if you do.
The potential extension to the current C++ intro.races (using C++ mainly because there's nicer html for it) is something like:
Obviously this makes happens-before even more non-transitive, which is ugly, and it's totally possible I have the details wrong the same way every C/C++ spec has done, but it's really not implausible to do.
Alternatively within the existing rules you can completely ignore hardware and say that they act like this (pretending nontemporal_store is monomorphic on u8):
```rust
// blah blah boilerplate
#[derive(Clone, Copy)]
struct SendMe<T>(T);
unsafe impl<T> Send for SendMe<T> {}
unsafe impl<T> Sync for SendMe<T> {}

// Per-thread buffer of pending non-temporal stores.
#[thread_local]
static mut BUFFER: Option<SendMe<&'static Mutex<Vec<(*mut u8, u8)>>>> = None;

pub unsafe fn nontemporal_store(ptr: *mut u8, val: u8) {
    let buf = if let Some(b) = BUFFER {
        b
    } else {
        let b = SendMe(&*Box::leak(Box::new(Mutex::new(Vec::new()))));
        BUFFER = Some(b);
        b
    };
    // A background thread non-deterministically flushes the buffer.
    std::thread::spawn(move || {
        std::thread::sleep_ms(rand::random());
        let buf = buf;
        do_sfence(buf.0);
    });
    buf.0.lock().unwrap().push((ptr, val));
}

pub fn sfence() {
    unsafe {
        if let Some(b) = BUFFER {
            do_sfence(b.0)
        }
    }
}

fn do_sfence(buffer: &'static Mutex<Vec<(*mut u8, u8)>>) {
    let mut buffer = buffer.lock().unwrap();
    for (ptr, val) in buffer.drain(..) {
        unsafe { *ptr = val; }
    }
}
```
i.e. you do vaguely what the CPU will actually do: buffer all nontemporal stores until some indeterminate time later, or until you do a fencing operation.
This does not permit nontemporal_store(ptr, val); *ptr, which should be permitted, but it does show that modelling nontemporal_store as an opaque FFI operation is sufficient for the compiler. And the actual users can consider nontemporal_store to be either this or just *ptr = val; depending on whether or not they are sure they have the correct fencing, which the actual compiler cannot ruin, because it's opaque. (And Miri can either grow something complicated, or not detect errors in this use, or just not support this operation at all; any of these are fine.)
The semantics of the vendor intrinsic is whatever asm the vendor says it is,
That doesn't work, they need to be phrased in terms of the [language] AM. [abstraction ed.]
While this may be true for intrinsics to be coherently usable from [language], it's not quite true in practice; the Intel documentation says that _mm_stream_ps is the intrinsic equivalent of the MOVNTPS instruction. Its definition is "does this assembly mnemonic operation" and no more, though the compiler is in fairness expected to understand what that means and do what's necessary to make that coherent in the most efficient way possible.
I think my position can mostly be summed up as that if using the vendor intrinsic is (available and) immediate UB in GCC C, GCC C++, LLVM C, LLVM C++, and/or MSVC C++[^2] (i.e. due to breaking the atomic memory model), it's "fine" if using the intrinsic is immediate UB in rustc Rust. The behavior of the vendor intrinsic is the same as it is in C: an underspecified and incoherent extension to the memory model that might or might not work (i.e. UB).
[^2]: MSVC C only sort of exists, thus my omission of it. Also, I think Microsoft may only provide _mm_stream_ss and not _mm_stream_ps? The MSVC docs for the former link to the Intel docs for the latter (and _mm_sfence).
Though given that presumably the intrinsic should work as intended on Intel's compiler, and I think the Intel oneAPI C++ Compiler is LLVM-based, it's presumably handled "good enough" in LLVM for Intel's purposes.
I'd be fine with marking vendor intrinsics as "morally questionable" (for lack of a better descriptor) and potentially even deprecating them if the backend can't handle them coherently[^1], but I wouldn't want us to neuter them to use different operations. It's assumed, at a level equivalent to inline asm, that a developer using vendor intrinsics knows what they're doing. It's "just" (giant scare quotes) that knowing what you're doing with _mm_stream_ps and friends probably requires using an Intel-certified compiler with the whatever-Intel-says atomic model rather than the standard C++20 atomic model that most of computation relies on.
[^1]: Ideally marked in such a way that's backend-specific, such that use of broken vendor intrinsics can warn on the backends which don't handle it fully, but not on the vendor backend that does, if such a thing ever comes into existence. Though with MIR semantics not respecting NT operations, the vendor would also need to validate any middleend transforms as well as their backend.
My justification for not neutering the implementation is roughly that with mem::uninitialized, it's "our" intrinsic, and we failed to put sufficient requirements on how we told people to use it, so neutering it (i.e. initializing to fill with 0x00 or 0x01) to get it closer to being sound where we previously said it was is the correct move; but for _mm_stream_ps and other vendor intrinsics, they've always been documented as "does what the vendor says, figure it out," so it's not "our fault" if "what the vendor says" is incoherent like it is with mem::uninitialized.
Especially if this family of vendor intrinsics is UB in LLVM C++, this feels like a question that needs to bubble up to Intel's compiler team on how they think the intrinsic should be handled. Because this is a vendor intrinsic, it should behave however the vendor says it should, but we'd absolutely be justified to put a warning on it if it's fundamentally broken. But we shouldn't change how it behaves without input from the vendor.
I see maybe 2½+1 resolutions (one of them being to neuter the intrinsic, like we did for mem::uninitialized), but choosing one without vendor input seems improper.
[^3]: Unrelated tangential query: is the Rust AM memory model "C++20" or "C++20 without memory_order_consume"? IOW, is FFI using memory_order_consume in a way visible to the Rust AM defined, or UB? I very much do not know the full story of memory_order_consume and honestly don't particularly care to know much more — knowing how OOTA is permitted to break causality is cursed enough for me — but if memory_order_consume means something in the atomic model which Rust inherits from C++ and uses (i.e. it isn't just aliased to acquire), it should probably be possible from Rust (even if #[doc(hidden)] #[deprecated]) to communicate that reality.
I find it interesting that NT would be "easier" for Rust if it weren't same-thread consistent, because then it could probably be modeled as an access from an anonymous thread (i.e. exempted from the sequenced-before relation), as modeled by talchas. But it is, so this observation is mostly irrelevant. I make no attempt to say whether the relaxation of the model is accurate, nor whether it breaks the existing proof body built on the current model[^4]. Though I do think you're at least missing a buffer.clear() from your do_sfence (or should be iterating buffer.drain()).
[^4]: And this risk of breaking the rest of the model is the problematic risk of extending the model, especially if Rust requires an ad-hoc nontemporal extension without the C++ model (which everyone else is assuming) adopting the same extension. You need to prove not just that your extension is sufficient for a nontemporal memory order but that its presence also doesn't impact the rest of the model, such that existing proofs that ignore nontemporal remain accurate so long as the relevant memory locations are not accessed nontemporally.
Oh yes, I originally wrote that as consuming the buffer and forgot to shift it to drain. (fixed)
Since I got somewhat nerd-sniped:
However, even if the above would potentially work as a definition, what it definitely does not show is whether nontemporal stores are currently nonsense (given compilers' models) and "miscompiled" (given this model) by existing compiler transforms. I think it's probably fine — a code transformation relying on weakly sequenced-before (nontemporal) being sequenced-before would necessarily require inter-thread reasoning, and I don't think any existing compiler transforms do such, since "no sane compiler would optimize atomics" — but that's an extremely strong assertion to make.
Separately, and actually somewhat relevant — Rust might still not be able to simply use the C++20 memory model completely as-is if we permit mixed-size atomic operations. Though of course, even if the formal C++ description leaves such as a UB data race, the extension to make them at least not race is fairly trivial, and IIRC I don't think anyone expects them to actually synchronize.
Though I do think people would expect mixed-size atomic operations to be coherent w.r.t. modification order, and those rules talk about some "atomic object $M$" and not a "memory location," and I'm no longer sure the necessary modifications are in any way simple.
because LLVM currently compiles all but sequentially-consistent fences to no-ops on x86, I think it currently miscompiles non-temporal stores if they're anything remotely like normal stores, because you can have acquire/release fences as much as you please and LLVM will happily compile them to nothing. imho the extra guarantees provided by sequential consistency shouldn't be required to make non-temporal stores behave.
There are also non-temporal loads, which seem to be just as much fun! https://www.felixcloutier.com/x86/movntdqa
Non-temporal loads, in practice, are effectively normal loads, because their optimization is so conditional it can rarely trigger according to the letter of the ISA (it requires close reading of the fine print to tease this out). Because it is such a theoretical optimization, my understanding is it is largely unimplemented, and the one vestige is that the prefetchnta instruction is sometimes supported.
Yes, you need the instructions that happen to be generated by sequential consistency fences, or explicit calls to arch intrinsics/asm like sfence. If you wanted nontemporal stores to behave like normal stores in the language model, you wouldn't need a fence at all; requiring a release fence and pessimizing its codegen for other cases would be a weird worst of all worlds imo. (Requiring a seqcst fence that is in practice always going to generate an acceptable instruction or requiring an explicit sfence both seem fine to me)
And yeah, for loads note the "if the memory source is WC (write combining) memory type" in the instruction description - WC is not the normal memory type, it's the weak memory type and if your rust code built for x86_64-any-target-at-all has access to any memory of that type it's already broken wrt synchronization. (NT stores just treat any memory like it's WC)
And yeah, for loads note the "if the memory source is WC (write combining) memory type" in the instruction description - WC is not the normal memory type, it's the weak memory type and if your rust code built for x86_64-any-target-at-all has access to any memory of that type it's already broken wrt synchronization. (NT stores just treat any memory like it's WC)
Well, we still need to be able to have Rust properly handle it, because WC memory is often returned by 3D graphics APIs (e.g. Vulkan) for memory-mapped video memory.
Well uh that'll be fun if you wanted to program to spec, because any store there is basically a nontemporal store and isn't flagged as such in any way to the compiler.
It'll work of course so long as you either sfence manually in the right places or never try to have another thread do the commit (which presumably will do the right sync, but needs to happen on the cpu that did the WC write), since any memory given to you like that will be known by the compiler to be exposed, and asm sfence/etc would be marked as touching all exposed memory. (Don't make an &mut of it though probably? Who knows what llvm thinks the rules around noalias + asm-memory are)
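A hedged sketch of the "sfence manually in the right places" pattern; the mapped pointer and the publishing flag are hypothetical stand-ins for whatever a graphics API actually hands back:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// `mapped` is assumed to point at (possibly write-combining) mapped memory,
/// e.g. a persistently mapped staging buffer (hypothetical setup).
#[cfg(target_arch = "x86_64")]
unsafe fn write_then_publish(mapped: *mut u8, data: &[u8], ready: &AtomicBool) {
    use std::arch::x86_64::_mm_sfence;
    // Plain writes; if the destination is WC memory they behave like
    // non-temporal stores as far as ordering is concerned.
    std::ptr::copy_nonoverlapping(data.as_ptr(), mapped, data.len());
    // Order those stores before telling anyone else the data is ready.
    _mm_sfence();
    ready.store(true, Ordering::Release);
}
```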
I hold that while nontemporal_store as intended is not covered by the memory model, it is a) not a huge extension to add it to C/C++'s
It requires an entire new access class, so it is a bigger change than anything that happened since 2011. It also requires splitting "synchronizes-with" into "synchronization through seqcst fences" (which does synchronize nontemporal accesses) and everything else (which does not synchronize nontemporal accesses). So this is a huge change to the model, and all consistency theorems (like DRF), compiler transformations, lowering schemes etc. will have to be carefully reconsidered. We currently have proofs showing that the x86 instructions compilers use to implement memory operations give the right semantics. It would be a mistake to lower our standards below this; mistakes have been made in the past.
It is just false to claim that this is a simple extension.
Any attempt to do this needs to start with something like https://plv.mpi-sws.org/scfix/paper.pdf, the formulation of the model in the standard is just too vague to be reliable. (According to the authors of that paper, the standard doesn't even reflect the SC fix properly even though that was the intention of the standard authors.)
The C++ memory model needed many rounds of iteration with formal memory model researchers to get into its current state. It still has one major issue (out-of-thin-air), but the original version (before Mark Batty's thesis) was a lot worse, and even after that some things still had to be fixed (like SCfix). Anyone who claims they can just quickly modify this model and be sure not to reintroduce any of these problems is vastly underestimating the complexity of weak memory models.
@CAD97
Its definition is "does this assembly mnemonic operation" and no more, though the compiler is in fairness expected to understand what that means and do what's necessary to make that coherent in the most efficient way possible.
Again that just doesn't work. We have a very clear rule for inline assembly and FFI: they can only do things that Rust code could also have done, insofar as Rust-visible state is affected. It is completely and utterly meaningless to take an operation from one semantics and just take it into another, completely different semantics.
because LLVM currently compiles all but sequentially-consistent fences to no-ops on x86, I think it currently miscompiles non-temporal stores if they're anything remotely like normal stores, because you can have acquire/release fences as much as you please and LLVM will happily compile them to nothing. imho the extra guarantees provided by sequential consistency shouldn't be required to make non-temporal stores behave.
Well they don't say much about how nontemporal stores are supposed to behave, so it's unclear if they are miscompiling or just implementing a surprising (and unwritten) spec. I opened https://github.com/llvm/llvm-project/issues/64521 to find out.
We have a very clear rule for inline assembly and FFI: they can only do things that Rust code could also have done, insofar as Rust-visible state is affected. It is completely and utterly meaningless to take an operation from one semantics and just take it into another, completely different semantics.
To be clear, the point I'm making is actually that if the definition as "does this assembly sequence" has busted and unusable semantics on the AM, then the vendor intrinsic is busted and unusable, which is IIUC in agreement with you.
The main point where I disagree is that, given Intel has defined a busted and unusable intrinsic, the correct behavior is for us to expose the intrinsic as-is, busted and unusable though it may be, potentially with an editorial note saying as much.
(I have edited the OP to explicitly state that adjusting the memory model is an option in principle, but not one I consider realistic.)
The main point where I disagree is that, given Intel has defined a busted and unusable intrinsic, the correct behavior is for us to expose the intrinsic as-is, busted and unusable though it may be, potentially with an editorial note saying as much.
I don't see why we would expose a busted intrinsic. We already have inline assembly, and these days it is stable (it was not when stdarch got stabilized), so I think we are totally free to tell Intel that we won't expose operations that don't compose with the rest of our language -- for reasons that are entirely on them, since they told the world that their hardware is TSO(-like) and then also added operations which subvert TSO.
If these intrinsics weren't stable yet, is anyone seriously suggesting we would have stabilized them, knowing what we do now? I would be shocked if that were the case. The intrinsic even says in its doc comment that this will likely never be stabilized! Sadly, whoever implemented _mm_stream_ps didn't heed that comment. So to me this is a clear oversight, and the question is how we best do damage control.
I wonder if there is some way that we can argue that
_mm_stream_ps();
_mm_sfence();
is equivalent to a single inline assembly block with both of these operations. The CPU is in a strange state between those two inline assembly blocks, but can we formulate some conditions under which that strange state cannot have negative side-effects? IOW, can we weaken the rule that inline asm can only do things which Rust could have done, to something where one inline asm block gets the machine into a state that does not fully match what Rust could have done (but the difference only affects a few operations), and then a 2nd inline asm block fixes up the state?
Basically what's important is that there is no release operation between the two intrinsics. If the programmer ensures this is the case, then -- can we ensure the compiler doesn't introduce such operations? At the very least this means ensuring the inline asm blocks don't get reordered with release operations, but is that sufficient? I am a bit worried this might be too fragile, but OTOH a principle of "it's fine if machine state the Rust AM doesn't currently need is inconsistent" does seem useful. It's just hard for me to predict the impact of this on optimizations.
Yeah, I mean you can just say it doesn't explode unless execution reaches a release operation before it reaches an sfence (or other explicit NT sync op). The only way that you could really do that by accident is signals, or maybe it's plausible you'd call Vec::push, a.k.a. malloc, a.k.a. something that takes a lock in some paths. (Of course in actual codegen taking a lock is still fine since any RMW is an NT sync, so you'd need some faster path that does a release store, and the AM looking inside of the allocator is super broken in the first place.)
The compiler reordering a release operation before the sfence would be a clear bug in its implementation of sfence, since a release operation must be a store, and sfence is the store fence. (And if you somehow come up with a way to get a seqcst load to wind up happens-before another thread's operation in a useful fashion, then seqcst loads probably shouldn't reorder with much of anything)
The compiler could order a release operation from before the _mm_stream_ps down, though, maybe? If this was true inline asm then of course not, but with the intrinsics more things can go wrong. That would still work since the stuff before the _mm_stream_ps is properly released, just the _mm_stream_ps itself is not.
So I guess this only really becomes a problem if the compiler somehow synthesizes release operations for its own kind of synchronization, like what auto-parallelization transformations might do.
For what it's worth, the basic intended usage of movnti and movntdq, followed by sfence, was at the time basically equivalent to what is now encompassed by Enhanced REP MOVSB, which on CPUs with that feature means REP MOVSB also has a "please do not use this to write to semaphores, because it actually uses write-combining buffers in its implementation now" caveat. It has stronger ordering guarantees with other operations, but its own stores, within the range touched by the movsb loop, are more weakly ordered, almost like... a bunch of nontemporal writes and then sfence!
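A sketch of that "intended usage" pattern, using the real stdarch intrinsic names but with a function and signature that are my own illustration, not anything from the discussion:

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn nt_copy(dst: *mut i64, src: *const i64, len: usize) {
    use std::arch::x86_64::{_mm_sfence, _mm_stream_si64};
    for i in 0..len {
        // movnti: cache-bypassing, weakly ordered store.
        _mm_stream_si64(dst.add(i), *src.add(i));
    }
    // A single sfence after the loop restores ordering with whatever
    // release operation later publishes the copied data.
    _mm_sfence();
}
```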
A few questions while thinking about an actionable. When and where was this introduced? Was it part of one of our usual LLVM upgrades and this stabilization slipped in without us realizing? Would a partial revert be possible/useful while we figure out the long-term solution (I think what @RalfJung suggests as the first point in the opening comment: remove nontemporal_store and implement the _mm_stream intrinsics without it and mark them as deprecated)?
The nontemporal-store intrinsic was introduced ~forever ago, in https://github.com/rust-lang/rust/commit/fe53a8106dfb54b5fe04d2ce7e8ee6472b0d5b16, with a comment saying this will likely never be stable.
The first mm_stream wrappers were added here unstably. They got marked as stable in https://github.com/rust-lang/stdarch/pull/414. Nothing there discussed the problem that these intrinsics have unexpected behavior in the presence of concurrency, or the fact that an intrinsic got stably exposed despite it having an explicit comment to the opposite.
A compiler reordering a release operation down isn't actually a problem for the Rust spec that says nt_store(); release(); is immediate UB, that's only a problem for an IR spec that says that. (And before you go to even more absurd lengths by trying to make LLVM's not-really-existent-either spec a Rust problem, come up with an actual optimization that would misbehave; I'm pretty sure that would require a cross-thread "optimization", and just no, stop.)
They got marked as stable in https://github.com/rust-lang/stdarch/pull/414. Nothing there discussed the problem that these intrinsics have unexpected behavior in the presence of concurrency, or the fact that an intrinsic got stably exposed despite it having an explicit comment to the opposite.
That's also a massive PR which stabilizes a ton of things at once and says it was done by a script, so it is not a surprise that a don't stabilize me bro on the wrapped intrinsic wouldn't get caught in review.
That's also a massive PR which stabilizes a ton of things at once and says it was done by a script, so it is not a surprise that a don't stabilize me bro on the wrapped intrinsic wouldn't get caught in review.
Yeah, and it seems there are more intrinsics that should have had closer t-lang attention before being stabilized -- see https://github.com/rust-lang/stdarch/pull/1454. This might need a proper audit...
@digama0 did some research into how the stream intrinsics are used in the wild. Seems like basically nobody remembers to put the sfence at the end...
We certainly don't document it as required, so of course they don't.
Since I've seen my position misrepresented on Zulip, I want to clarify what my concern(s) are here.
First of all, there has been a process failure. Stably exposing operations from the standard library that extend the language needs explicit T-lang discussion, which didn't happen for _mm_stream*. (And we can't argue that this is just an inline asm block, since these operations violate the inline asm block rule that their overall effect on the machine state must be expressible in Rust. That's why they are language extensions to begin with.) This applies to everything the standard library exposes, "vendor intrinsics" don't have any special privilege to break Rust's general principles.
The reason this happened also seems fairly clear; this was a huge stabilization (FCP happened here) and a few odd ducks like this one or floating-point environment manipulation just slipped through.
I think we have consensus on that? I sure hope so, at least.
If we had followed process, stabilizing _mm_stream* would have been blocked on an RFC that defines Rust's own memory model, or at least T-lang explicit approval to do something ad-hoc and unprincipled like "after a nontemporal store, until the next sfence, any release operation (release or stronger write, release or stronger fence, thread spawn) is UB". I'm saying "unprincipled" since I am not aware of a principled argument that having the compiler apply optimizations when the machine is in a state where release operations are UB is correct.
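To make that ad-hoc rule concrete, a sketch (x86_64 only; my own illustration of the quoted rule, not documented semantics):

```rust
#[cfg(target_arch = "x86_64")]
mod sketch {
    use std::arch::x86_64::{__m128, _mm_sfence, _mm_stream_ps};
    use std::sync::atomic::{AtomicBool, Ordering};

    pub unsafe fn fine_under_the_rule(dst: *mut f32, v: __m128, ready: &AtomicBool) {
        _mm_stream_ps(dst, v);
        _mm_sfence();                         // the NT store is ordered again...
        ready.store(true, Ordering::Release); // ...so a release here is allowed.
    }

    pub unsafe fn ub_under_the_rule(dst: *mut f32, v: __m128, ready: &AtomicBool) {
        _mm_stream_ps(dst, v);
        // A release before any sfence: UB under the quoted ad-hoc rule.
        ready.store(true, Ordering::Release);
    }
}
```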
Now what shall we do with this? I think rushing a spec extension that can cover these intrinsics would be ill-advised; concurrency models are too subtle to rush anything. Some people seem determined to go the route of defining a Rust concurrency memory model. I was probably too dismissive of these efforts; this is exciting! However, I think that RFC will take a long time to go through; experience shows that these models are very hard to specify correctly so we should get some weak memory researchers to take a look and prove some theorems before declaring it official. (The C++ committee worked with Mark Batty in a similar way before C++11 got finalized; the standard was in a pretty terrible state before that happened. The C++20 SC fix similarly had academic analysis, and then they still managed to introduce ambiguities when translating the formal model into English.) So, IMO this is not a short-term solution; I'm both excited and terrified about Rust taking responsibility for its own memory model (and I'd love to help resolve ambiguities and work with weak memory researchers on the formalization) but this is not something that can happen quickly while we struggle to fix past mistakes.
What can we do short-term? Usually when a mistake was made the default answer is a revert. We cannot revert the stabilization but we can adjust these intrinsics to no longer be language extensions, by using regular stores. It's not a great solution since it defies the expectations associated with these intrinsics, but then the real-world data collected by @digama0 shows that people currently don't properly use these intrinsics (including the very person who added the nontemporal_store intrinsic to the compiler).
(The following part got updated to account for this reply.)
An alternative would be to turn these intrinsics into inline assembly blocks (i.e., make them fully opaque to the compiler), and then argue that we can come up with Rust code that safely approximates all the possible side-effects of using nontemporal stores, and adjust the documentation to require users to avoid all UB that would occur if this Rust code was actually used at run-time (which is likely more UB than the actual MOVNT instruction). Here is a proposal for such Rust code, here is another one. If we can convince ourselves that the actual MOVNT operation indeed has "no more strange behavior" than that Rust code, then this would be a reasonable solution.
Yet another alternative would be to only do documentation changes: e.g., we document that after these operations and until the next sfence, it is UB to perform a release operation. (Or a similar restriction.) There is a risk that this puts constraints on optimizations and analyses that we don't understand yet. I think we should be clear that we don't in general allow inline asm blocks (or FFI/intrinsics) to leave the machine in a "bad" state that is fixed up later by another inline asm block (or FFI/intrinsic), but that for this particular case this seems "good enough" and the desire to write such code without inline asm outweighs the desire for systematic correctness. If we do this then presumably it is because the previous alternative was somehow not acceptable and we need compiler insight into these intrinsics for optimizations -- that is concerning since this is correct only if the compiler is aware of the non-standard nature of the memory accesses that are performed by these intrinsics.
Nominating for T-lang discussion. @rust-lang/lang, my own view of the situation is described at the top of this issue and above in this comment. (If someone else wants to write a summary of their views, I'll happily add a link to it here.)
We certainly don't document it as required, so of course they don't.
Yeah we definitely need to at least update our docs. That said, how high are the chances that people will even read these docs, given that these operations are described as "vendor intrinsics" so people might have the reasonable expectation that the Intel manual tells them everything they need to know?
(I edited my summary a bit, if you follow by email please re-read on Github.)
(And we can't argue that this is just an inline asm block, since these operations violate the inline asm block rule that their overall effect on the machine state must be expressible in Rust. That's why they are language extensions to begin with.)
There is in fact a description that is slightly more restrictive to users than what _mm_stream_ps "should" be (but would be a defensible description for a cross-arch nontemporal store) right here #issuecomment-1668386167, near the start of all this. The additional restriction being that subsequent accesses from the same thread would be racy until an sfence.
Similarly there's the obvious other possible result of an inline asm block which is more restrictive to the compiler: a normal store.
So the only part that has any specification difficulty is pinning down what precisely is allowed in that in-between region, and my impression from a glance through the search for the few existing users was that they don't use that anyways. (And saying "doing anything to take advantage of x86 permitting same-thread accesses is yolo pending LLVM giving a more precise definition of !nontemporal" is also an option.)
So assuming LLVM doesn't come back reasonably quickly saying "oops we'll magically insert sfence the way we do vzeroupper" (or anything else useful), just changing the intrinsic to be inline asm seems fine.
Oh I see, that's where you were going with your proposed "thread-local buffer of pending stores". You are saying if these are inline assembly blocks (and not LLVM intrinsics like right now) we get to invent their Rust-level semantics as usual, and you are saying that the actual implementation with MOVNT is a correct refinement of those buffers. Sorry, I hadn't realized this is where you were going.
I'll try to poke holes into this from that angle, then.
So the documentation for the streaming operations would then say something like: after calling this operation and until this thread calls _mm_sfence, it is UB for any thread to read or write these memory locations? This would allow some code that my earlier proposal forbids (doing a release operation before an sfence, accepting that this will not release the streamed writes) but disallow some other code (doing same-thread accesses before the sfence).
Turns out there is a paper on nontemporal stores in x86. However it is on the hardware level, not on the level of a surface language like Rust or C++.
I should note that a lot of vendor intrinsics are specified as "like the instruction, but..." with that "but" being a sly carve-out for more wiggle-room where the compilers can use it. Assuming Intel doesn't descend from on-high tomorrow with their full formally verified model for nontemporal stores and write-combining memory (I assume they have one, somewhere, under lock and key, that they are distinctly not sharing), condemning any slightly suspicious uses of these intrinsics seems preferable to churning the ones that adhere to something like a sensible usage, modulo the store-store fence.
Also, I have opened PRs to fix all the libraries that need _mm_sfence:
Yeah we definitely need to at least update our docs. That said, how high are the chances that people will even read these docs, given that these operations are described as "vendor intrinsics" so people might have the reasonable expectation that the Intel manual tells them everything they need to know?
If we need to lean on Intel to update their website and suchlike, then we can do that, too. We have clout enough to demand an audience, at least.
One thing about this "explicit buffer" model that strikes me as odd is that the background thread will only non-deterministically flush the entire buffer. That implies there is some ordering: later writes will never be flushed before earlier writes. I think I had expected something different, where each nontemporal_store in a thread just adds its write to the buffer, and individual writes get flushed at non-deterministic points in arbitrary order. However, I can't find a way for anything to actually observe the order of stores here, so this might well be equivalent.
The other point that makes me feel uneasy is nontemporal_store acquiring a lock that will synchronize with the background thread. But again, since we cannot know which writes have already been flushed, I can't find a way to exploit that -- and obviously when a thread runs sfence then it is crucial that this synchronizes with the background thread that might have already flushed some of the writes.
I should note that a lot of vendor intrinsics are specified as "like the instruction, but..." with that "but" being a sly carve-out for more wiggle-room where the compilers can use it.
There's no problem with that when it's just about mutating some data by-value, or causing the effect of regular loads/stores. It's when these effects are not (or not trivially) expressible in regular Rust that we need to individually consider each case. These interactions are language-specific and Intel can't predict how their intrinsics behave with each and every language out there. (However they could really have predicted the fact that nontemporal stores are a massive footgun.)
Also, I have opened PRs to fix all the libraries that need _mm_sfence:
That's awesome. :)
Yes, I did them in order because it was simplest, and because doing two NT stores to the same location (with different values) and expecting the later one to win once you sfence() seemed plausible for graphics code or whatever and it's guaranteed by x86. Instead doing a "on insert replace any existing value" and "have each spawned thread only write a single random element" would be closer to the actual behavior, but way more of a pain to write, and as you note I don't think it actually is visible.
Mixing NT store + regular store is also guaranteed by x86 and an opaque asm block will provide enough constraint to the compiler to do the right thing, but specifying that as a rust implementation is not something I can even begin to see how to do.
doing two NT stores to the same location (with different values) and expecting the later one to win
Ah, that's a good point, so "randomly pick things from the buffer" is not equivalent. We should check that paper to figure out if their model guarantees same-thread ordering of MOVNT.
It does seem strange though that a regular write would be UB but an NT write would be allowed. I think ideally we make it so that every NT write can legally be replaced by a regular write. So I think I'd prefer a model that flushes the buffer out-of-order.
```rust
#[thread_local]
static mut BUFFER: Option<SendMe<&'static Mutex<Vec<(*mut u8, u8)>>>> = None;

pub unsafe fn nontemporal_store(ptr: *mut u8, val: u8) {
    let buf = if let Some(b) = BUFFER {
        b
    } else {
        let b = SendMe(&*Box::leak(Box::new(Mutex::new(Vec::new()))));
        BUFFER = Some(b);
        // spawn one flushing thread per "real" thread.
        std::thread::spawn(move || loop {
            std::thread::sleep_ms(rand::random());
            let mut buf = b.0.lock().unwrap();
            if !buf.is_empty() {
                // Flush one randomly chosen pending write, so writes can
                // become visible out of order.
                let (ptr, val) = buf.remove(rand::random::<usize>() % buf.len());
                unsafe { *ptr = val };
            }
        });
        b
    };
    buf.0.lock().unwrap().push((ptr, val));
}

pub fn sfence() {
    unsafe {
        if let Some(b) = BUFFER {
            // Just wait until the background thread drained the buffer.
            while b.0.lock().unwrap().len() > 0 {}
        }
    }
}
```
Every difference compared to the actual intrinsic is downside, so no, I don't think that's a good idea (if std wanted to expose a nontemporal_store outside of stdarch, then maybe, but not for _mm_sfence). Weird corner cases that only exist because of trying to shoehorn this into a spec aren't really a problem. It doesn't even make the fake code representation of the intrinsic look better.
(Also a bit of the point is that between the two possible asm blocks the compiler can't actually screw up asm!("movnti" ...): the spec is "movnti". So if someone did want to write that they could, for all that you'd refuse because ~AM~.)
The actual MOVNT does have the property that it is safe to replace it by a MOV. Though I agree that adding extra UB to satisfy a property like this is not a clear win.
But actually my proposal wouldn't even make it so that doing two MOVNT to the same location would be UB -- it would just be non-deterministic which write would win. So I'm not happy with that proposal either. I think this one has the desired effect:
```rust
#[thread_local]
static mut PENDING_WRITES: AtomicUsize = AtomicUsize::new(0);

pub unsafe fn nontemporal_store<T>(ptr: *mut T, val: T) {
    PENDING_WRITES.fetch_add(1, Relaxed);
    // Spawn a thread that will eventually do our write.
    let ptr = SendMe(ptr);
    let pending_writes = SendMe(addr_of!(PENDING_WRITES));
    std::thread::spawn(move || {
        let ptr = ptr; let pending_writes = pending_writes; // closure field capturing is annoying...
        std::thread::sleep_ms(rand::random()); // not really needed due to scheduler non-determinism
        unsafe {
            *ptr.0 = val;
            (&*pending_writes.0).fetch_sub(1, Release);
        }
    });
}

pub fn sfence() {
    unsafe {
        // Wait until there are no more pending writes.
        while PENDING_WRITES.load(Acquire) > 0 {}
    }
}
```
Here's a variant that actually builds and also uses Box::leak to avoid a use-after-free in the write-back thread. This also has the nice advantage of making it much easier to support writing arbitrary types.
(Also a bit of the point is that between the two possible asm blocks the compiler can't actually screw up "asm!("movnti" ...), the spec is movnti", so if someone did want to write that they could, for all that you'd refuse because AM)
:shrug: they could write whatever they want, of course, but it would not be a solid argument showing that the compiler cannot screw up. "We don't know a counterexample" is just not strong enough evidence IMO. Considering "we don't know a counterexample" good enough is what led to LLVM's semantics for uninit and pointer provenance being an inconsistent mess. Rust should strive to do better than that.
WG-prioritization assigning priority (Zulip discussion).
@rustbot label -I-prioritize +P-medium
On Thu, Aug 10, 2023 at 11:27:32PM -0700, Ralf Jung wrote:
But actually my proposal wouldn't even make it so that doing two MOVNT to the same location would be UB -- it would just be non-deterministic which write would win.
That's not a fatal flaw; sometimes being non-deterministic for performance is OK, as long as there's a well-defined way to be deterministic if you want to be, such as by adding additional barriers.
That's not a fatal flaw; sometimes being non-deterministic for performance is OK, as long as there's a well-defined way to be deterministic if you want to be, such as by adding additional barriers.
Yeah but it was my intent for that to be UB.^^ (And barriers make it defined, of course.) That is achieved by my later proposal, which can be summarized very succinctly: a nontemporal store is like starting a new background thread that will do the actual store using non-atomic ordering at some point in the future. The fence is waiting for all background threads that were spawned by nontemporal stores of the current thread. The hope is that all the behavior of nontemporal stores can be described with this model.
If we want to define that behavior even without fences we should go with something closer to @talchas' original proposal. This can be summarized as: there is a per-thread write-back buffer for nontemporal stores. A background thread flushes the entire buffer at non-deterministic intervals using non-atomic writes. (There is no partial flushing, it's always flushing the entire thing.) The fence flushes the entire buffer right then and there.
In both cases, MOVNT; MOV to the same location is UB, since there is a race between the write-back thread and the MOV in the real thread. MOVNT; SFENCE; MOV is fine since the fence guarantees that the write-back happened, so there cannot be a race.
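A tiny illustration of that distinction, written in terms of the nontemporal_store and sfence model functions sketched earlier in this thread (assumed to be in scope; the u8 location is just for illustration):

```rust
unsafe fn same_location(p: *mut u8) {
    nontemporal_store(p, 1);
    // *p = 2;  // UB in this model: races with the pending write-back of `1`.
    sfence();
    *p = 2;     // fine: the fence guarantees the write-back already happened.
}
```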
The _mm_maskmoveu_si128 intrinsic uses the maskmovdqu instruction.
According to Intel documentation, it looks like a non-temporal store, even though Rust documentation does not mention it.
Conditionally store 8-bit integer elements from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element) and a non-temporal memory hint. mem_addr does not need to be aligned on any particular boundary.
maskmovdqu, not movmskpd, they're very different instructions (thanks intel), but otherwise yes, it (and maskmovq and all of movnt*) is a nontemporal store.
Proposal: Can we lint on these intrinsics for now (by marking them as deprecated, or implementing a dedicated lint), warning people that they are unsound, and figure out how to address them longer term in the future?
This was discussed in the lang team meeting. We recognize that there needs to be more in-depth discussion on this, but a lint would be a useful band-aid for the future.
If, on the other hand, we agree that it is possible to use these intrinsics safely, we should update the documentation on how to do that, and consider adding a lint that looks for unsound uses of them.
I recently discovered this funny little intrinsic with the great comment saying it will likely never be stable. Unfortunately, the comment is wrong: this has become stable, through vendor intrinsics like _mm_stream_ps. Why is that a problem? Well, it turns out non-temporal stores completely break our memory model. The following assertion can fail under the current compilation scheme used by LLVM:
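A minimal sketch of the kind of program meant, reconstructed from the explanation below rather than copied from the original report (the raw intrinsic requires nightly's core_intrinsics feature):

```rust
// #![feature(core_intrinsics)]
use std::intrinsics::nontemporal_store;
use std::sync::atomic::{AtomicBool, Ordering::{Acquire, Release}};
use std::thread;

static mut DATA: u32 = 0;
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let t = thread::spawn(|| {
        while !READY.load(Acquire) {}
        // With a regular store this would be guaranteed to read 42;
        // with a non-temporal store it is not.
        assert_eq!(unsafe { DATA }, 42);
    });
    unsafe { nontemporal_store(std::ptr::addr_of_mut!(DATA), 42) };
    READY.store(true, Release);
    t.join().unwrap();
}
```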
The assertion can fail because the CPU may order MOVNT after later MOV (for different locations), so the nontemporal_store might occur after the release store. Sources for this claim:
This is a big problem -- we have a memory model that says you can use release/acquire operations to synchronize any (including non-atomic) memory accesses, and we have memory accesses which are not properly synchronized by release/acquire operations.
So what could be done?
- Remove nontemporal_store and implement the _mm_stream intrinsics without it, and mark them as deprecated to signal that they don't match the expected semantics of the underlying hardware operation. People should use inline assembly instead, and then it is their responsibility to have an sfence at the end of their asm block to restore expected synchronization behavior.
- In principle, adjust the memory model to account for these operations -- though I don't consider that realistic.

Thanks a lot to @workingjubilee and @the8472 for their help in figuring out the details of nontemporal stores.
Cc @rust-lang/lang @Amanieu
Also see the nomination comment here.