I should have also Cc @rust-lang/opsem
I believe there are additional options:
C) have the backend lower the non-temporal store to either a regular store or a non-temporal store followed by an sfence, whichever seems more beneficial. The sfence can be deferred up to the next atomic release (or stronger) operation, or anything that might contain one.
D) add a Safety requirement to the vendor intrinsic.
It's not clear to me that this "breaks the model" as opposed to "extends the model". Why can't we have nontemporal stores as an explicit part of the atomics model, and say that they are not necessarily synchronized by a release-acquire pair to another thread?
have the backend lower the non-temporal store to either
Is "deferred sfence" a thing we can tell LLVM to do?
Also I think if we lower _mm_stream_ps in a way that includes an sfence, people will be very surprised. We should rather not expose _mm_stream_ps than expose something that's not the same operation.
add a Safety requirement to the vendor intrinsic.
No, that doesn't work. We still need to explain how and why code has UB when it violates the safety requirement. Our current memory model has no way to make this code UB.
This is an intrinsic, not a library function. It is specified by the operational semantics, not a bunch of axioms in the "safety" comment. The safety comment is merely reflecting the constraints that arise from the operational semantics. Making the safety comment the spec would mean making the spec axiomatic and that is something we certainly don't want to do.
Why can't we have nontemporal stores as an explicit part of the atomics model,
We can, but then we'll have to develop our own model. The C++ model does not have such stores. I don't think we should have our own concurrency model, that well exceeds our capacity. Concurrency models are very hard to get right and being able to inherit all the formal studies of the C++ model is extremely valuable.
And process-wise, we certainly can't have a library feature like stdarch just change the memory model. That requires an RFC. It seems fairly clear that the impact of _mm_stream_ps on the whole language was not realized at the time of stabilization, and I think our only option is to remove this operation again -- or rather, get as close to removing it as we can within our stability guarantees: make it harmless, and deprecate it.
Is "deferred sfence" a thing we can tell LLVM to do?
Not that I know. But it may be something LLVM should be doing when a store is annotated with !nontemporal. Unless their memory model already knows how to specify nontemporal + release behavior (which may just be UB) because it's broader than the C++ model.
Also I think if we lower _mm_stream_ps in a way that includes an sfence, people will be very surprised. We should rather not expose _mm_stream_ps than expose something that's not the same operation.
I assume as long as the sfence is sunk far enough they might not care. Just like mixing AVX and SSE can result in VZEROUPPER being inserted by the backend.
This is an intrinsic, not a library function.
I see. That distinction isn't always obvious. But it makes sense if we want the compiler to optimize around the intrinsics more than around FFI calls.
We can, but then we'll have to develop our own model. The C++ model does not have such stores.
Can we say we use the C++ model with the modification that an axiom (all stores are ordered with release operations) is turned into a requirement (all stores must be ordered with release operations)? That seems minimally invasive.
This is an intrinsic, not a library function.
nontemporal_store is our intrinsic, but _mm_stream_ps is a vendor intrinsic. If nontemporal_store is to be directly exposed it certainly needs to participate in the operational semantics. But vendor intrinsics are somewhat special.
Vendor intrinsics are sort of halfway between standard Rust and inline asm. The semantics of the vendor intrinsic is whatever asm the vendor says it is, and it's the responsibility of the user of the intrinsic to understand what that means (or doesn't) on the Rust AM and use the intrinsics in a way consistent with the AM. Ideally, writing _mm_stream_ps(ptr, a) and writing asm!("MOVNTPS {ptr}, {a}") (with the correct operand flags, which I haven't bothered figuring out) should be functionally identical, except that the former is potentially better understood by the compiler and doesn't have the monkeypatchable semantics of inline asm.
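For concreteness, a minimal sketch of what that "equivalent" inline-asm form might look like; the function name, operand constraints, and options are my guesses, not anything the comment above pins down:

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn stream_ps_via_asm(ptr: *mut f32, a: std::arch::x86_64::__m128) {
    use std::arch::asm;
    // movntps m128, xmm: non-temporal store of four packed floats.
    // `ptr` must be 16-byte aligned, same as for _mm_stream_ps.
    asm!(
        "movntps [{ptr}], {a}",
        ptr = in(reg) ptr,
        a = in(xmm_reg) a,
        options(nostack, preserves_flags),
    );
}
```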
Is this complicated by the extra function boundaries imposed by us exposing vendor intrinsics as extern "Rust" fn instead of extern "vendor-intrinsic"? Certainly[^1], since _mm_stream_ps(ptr, a); _mm_sfence() and asm!("MOVNTPS {ptr}, {a}", "SFENCE") are different, but in a way that I think is easier to resolve than either fixing compilers to better respect weak/NT memory ordering on x86[_64] or auditing every vendor intrinsic for individually having consistent semantics on the Rust AM.
[^1]: We should ideally consider whether we can get away with "fixing" this such that vendor intrinsics can't be made into function pointers and aren't actually functions, to make their special status more evident. Though abusing the ABI marker for intrinsics is still a bit of a bodge, and this doesn't completely resolve the issues, since asm!("MOVNTPS {ptr}, {a}"); asm!("SFENCE") still leaves the AM in an inconsistent state between the two asm blocks.
Slightly reinterpreting: it's perfectly acceptable, even expected for Miri to error when encountering vendor intrinsics. Using them steps outside the AM in a similar manner to inline asm, and as such you're relying on target specifics for coherency and sacrificing the ability to use AM-level sanitizers like Miri except on a best-effort basis.
I see. That distinction isn't always obvious. But it makes sense if we want the compiler to optimize around the intrinsics more than around FFI calls.
FFI calls don't help, FFI calls are only allowed to perform operations that could also be performed by Rust code (insofar as Rust-controlled state is affected). Getting the program into a state where release/acquire synchronization does not work any more is not allowed in any way, not even via FFI.
Like, imagine the following code:
```rust
static mut DATA: usize = 0;
static INIT: AtomicBool = AtomicBool::new(false);

thread::spawn(|| {
    while INIT.load(Acquire) == false {}
    let data = DATA;
    if data != data { unreachable_unchecked(); }
});
some_function(&mut DATA, 42);
INIT.store(true, Release);
```
This code is obviously sound. The compiler is allowed to change this code such that the spawned thread becomes
```rust
while INIT.load(Acquire) == false {}
if DATA != DATA { unreachable_unchecked(); }
```
However, if some_function does a non-temporal store, this change can introduce UB! Now the non-temporal store might take effect between the two reads of DATA, and suddenly a value can seem unequal to itself.
Therefore, it is UB to leave an inline assembly block or FFI operation with any "pending non-temporal stores that haven't been guarded by a fence yet". The correctness of compilation relies on this assumption.
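A sketch of the pattern that rule permits (operand syntax is my own assumption, not something stated above): as long as the non-temporal store and the fence live inside one asm block, the Rust AM never runs while a non-temporal store is still pending.

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn nt_store_then_fence(ptr: *mut u32, val: u32) {
    use std::arch::asm;
    asm!(
        // movnti: non-temporal store of a 32-bit value.
        "movnti [{ptr}], {val:e}",
        // sfence before leaving the block, so no "pending" NT store escapes.
        "sfence",
        ptr = in(reg) ptr,
        val = in(reg) val,
        options(nostack, preserves_flags),
    );
}
```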
Can we say we use the C++ model with the modification that an axiom (all stores are ordered with release operations) is turned into a requirement (all stores must be ordered with release operations)? That seems minimally invasive.
No, that's not how defining a memory model works. You'll need to define a new kind of memory access that's even weaker than non-atomic accesses and adjust the entire model to account for them.
The semantics of the vendor intrinsic is whatever asm the vendor says it is,
That doesn't work, they need to be phrased in terms of the Rust AM. Lucky enough they are mostly about what happens to certain values of SIMD type, so the vendor semantics directly translate to Rust AM semantics.
But when it comes to synchronization, this approach falls flat. If _mm_stream_ps
was implemented via inline assembly it would be UB.
auditing every vendor intrinsic for individually having consistent semantics on the Rust AM.
There's no way around that, but I hope we don't have many intrinsics that have global synchronization effects.
Slightly reinterpreting: it's perfectly acceptable, even expected for Miri to error when encountering vendor intrinsics. Using them steps outside the AM in a similar manner to inline asm, and as such you're relying on target specifics for coherency and sacrificing the ability to use AM-level sanitizers like Miri except on a best-effort basis.
Even vendor intrinsics are subject to the rule that governs all FFI and inline assembly: you can only do things to Rust-visible state that you could also have done in Rust. _mm_stream_ps violates that rule.
I hold that while nontemporal_store as intended is not covered by the memory model: a) it is not a huge extension to add it to C/C++'s (though that's ugly to do for a single special case); b) there's a reasonable way you can model what users actually would use it for inside the existing model; and c) as other people said, no user cares about the purity of the Rust AM when doing this, even if you do.
The potential extension to the current C++ intro.races (using C++ mainly because there's nicer html for it) is something like:
Obviously this makes happens-before even more non-transitive, which is ugly, and it's totally possible I have the details wrong the same way every C/C++ spec has done, but it's really not implausible to do.
Alternatively within the existing rules you can completely ignore hardware and say that they act like this (pretending nontemporal_store is monomorphic on u8):
```rust
// blah blah boilerplate
#[derive(Clone, Copy)]
struct SendMe<T>(T);
unsafe impl<T> Send for SendMe<T> {}
unsafe impl<T> Sync for SendMe<T> {}

// Per-thread buffer of pending non-temporal stores.
#[thread_local]
static mut BUFFER: Option<SendMe<&'static Mutex<Vec<(*mut u8, u8)>>>> = None;

pub unsafe fn nontemporal_store(ptr: *mut u8, val: u8) {
    let buf = if let Some(b) = BUFFER {
        b
    } else {
        let b = SendMe(&*Box::leak(Box::new(Mutex::new(Vec::new()))));
        BUFFER = Some(b);
        b
    };
    // A background thread non-deterministically flushes the buffer.
    std::thread::spawn(move || {
        std::thread::sleep_ms(rand::random());
        let buf = buf;
        do_sfence(buf.0);
    });
    buf.0.lock().unwrap().push((ptr, val));
}

pub fn sfence() {
    unsafe {
        if let Some(b) = BUFFER {
            do_sfence(b.0)
        }
    }
}

fn do_sfence(buffer: &'static Mutex<Vec<(*mut u8, u8)>>) {
    let mut buffer = buffer.lock().unwrap();
    for (ptr, val) in buffer.drain(..) {
        unsafe { *ptr = val; }
    }
}
```
i.e. you do vaguely what the CPU will actually do: buffer all nontemporal stores until some indeterminate time later, or until you do a fencing operation.
This does not permit nontemporal_store(ptr, val); *ptr, which should be permitted, but it does show that modelling nontemporal_store as an opaque FFI operation is sufficient for the compiler. And the actual users can consider nontemporal_store to be either this or just *ptr = val; depending on whether or not they are sure they have the correct fencing, which the actual compiler cannot ruin, because it's opaque. (And Miri can either grow something complicated, or not detect errors in this use, or just not support this operation at all; any of these are fine.)
The semantics of the vendor intrinsic is whatever asm the vendor says it is,
That doesn't work, they need to be phrased in terms of the [language] AM. [abstraction ed.]
While this may be true for intrinsics to be coherently usable from [language], it's not quite true in practice; the Intel documentation says that _mm_stream_ps is the intrinsic equivalent of the MOVNTPS instruction. Its definition is "does this assembly mnemonic operation" and no more, though the compiler is in fairness expected to understand what that means and do what's necessary to make that coherent in the most efficient way possible.
I think my position can mostly be summed up as that if using the vendor intrinsic is (available and) immediate UB in GCC C, GCC C++, LLVM C, LLVM C++, and/or MSVC C++[^2] (i.e. due to breaking the atomic memory model), it's "fine" if using the intrinsic is immediate UB in rustc Rust. The behavior of the vendor intrinsic is the same as it is in C: an underspecified and incoherent extension to the memory model that might or might not work (i.e. UB).
[^2]: MSVC C only sort of exists, thus my omission of it. Also, I think Microsoft may only provide _mm_stream_ss and not _mm_stream_ps? The MSVC docs for the former link to the Intel docs for the latter (and _mm_sfence).
Though given that presumably the intrinsic should work as intended on Intel's compiler, and I think the Intel oneAPI C++ Compiler is LLVM-based, it's presumably handled "good enough" in LLVM for Intel's purposes.
I'd be fine with marking vendor intrinsics as "morally questionable" (for lack of a better descriptor) and potentially even deprecating them if the backend can't handle them coherently[^1], but I wouldn't want us to neuter them to use different operations. It's assumed, at a level equivalent to inline asm, that a developer using vendor intrinsics knows what they're doing. It's "just" (giant scare quotes) that knowing what you're doing with _mm_stream_ps and friends probably requires using an Intel-certified compiler with the whatever-Intel-says atomic model rather than the standard C++20 atomic model that most of computation relies on.
[^1]: Ideally marked in such a way that's backend-specific, such that use of broken vendor intrinsics can warn on the backends which don't handle it fully, but not on the vendor backend that does, if such a thing ever comes into existence. Though with MIR semantics not respecting NT operations, the vendor would also need to validate any middleend transforms as well as their backend.
My justification for not neutering the implementation is roughly that with mem::uninitialized, it's "our" intrinsic, and we failed to put sufficient requirements on how we told people to use it, so neutering it (i.e. initializing to fill with 0x00 or 0x01) to get it closer to being sound where we previously said it was is the correct move; but for _mm_stream_ps and other vendor intrinsics, they've always been documented as "does what the vendor says, figure it out," so it's not "our fault" if "what the vendor says" is incoherent like it is with mem::uninitialized.
Especially if this family of vendor intrinsics is UB in LLVM C++, this feels like a question that needs to bubble up to Intel's compiler team on how they think the intrinsic should be handled. Because this is a vendor intrinsic, it should behave however the vendor says it should, but we'd absolutely be justified to put a warning on it if it's fundamentally broken. But we shouldn't change how it behaves without input from the vendor.
I see maybe 2½+1 resolutions (one of them being to neuter the intrinsic, like we did for mem::uninitialized), but choosing one without vendor input seems improper.
[^3]: Unrelated tangential query: is the Rust AM memory model "C++20" or "C++20 without memory_order_consume"? IOW, is FFI using memory_order_consume in a way visible to the Rust AM defined, or UB? I very much do not know the full story of memory_order_consume and honestly don't particularly care to know much more — knowing how OOTA is permitted to break causality is cursed enough for me — but if memory_order_consume means something in the atomic model which Rust inherits from C++ and uses (i.e. it isn't just aliased to acquire), it should probably be possible from Rust (even if #[doc(hidden)] #[deprecated]) to communicate that reality.
I find it interesting that NT would be "easier" for Rust if it weren't same-thread consistent, because then it could probably be modeled as an access from an anonymous thread (i.e. exempted from the sequenced-before relation), as modeled by talchas. But it is, so this observation is mostly irrelevant. I make no attempt to say whether the relaxation of the model is accurate, nor whether it breaks the existing proof body built on the current model[^4]. Though I do think you're at least missing a buffer.clear() from your do_sfence (or should be iterating buffer.drain()).
[^4]: And this risk of breaking the rest of the model is the problematic risk of extending the model, especially if Rust requires an ad-hoc nontemporal extension without the C++ model (which everyone else is assuming) adopting the same extension. You need to prove not just that your extension is sufficient for a nontemporal memory order but that its presence also doesn't impact the rest of the model, such that existing proofs that ignore nontemporal remain accurate so long as the relevant memory locations are not accessed nontemporally.
Oh yes, I originally wrote that as consuming the buffer and forgot to shift it to drain. (fixed)
Since I got somewhat nerd-sniped:
However, even if the above would potentially work as a definition, what it definitely does not show is whether nontemporal stores are currently nonsense (given compilers' models) and "miscompiled" (given this model) by existing compiler transforms. I think it's probably fine — a code transformation relying on weakly sequenced-before (nontemporal) being sequenced-before would necessarily require inter-thread reasoning, and I don't think any existing compiler transforms do such, since "no sane compiler would optimize atomics" — but that's an extremely strong assertion to make.
Separately, and actually somewhat relevant — Rust might still not be able to simply use the C++20 memory model completely as-is if we permit mixed-size atomic operations. Though of course, even if the formal C++ description leaves such as a UB data race, the extension to make them at least not race is fairly trivial, and IIRC I don't think anyone expects them to actually synchronize.
Though I do think people would expect mixed-size atomic operations to be coherent w.r.t. modification order, and those rules talk about some "atomic object $M$" and not a "memory location," and I'm no longer sure the necessary modifications are in any way simple.
because LLVM currently compiles all but sequentially-consistent fences to no-ops on x86, I think it currently miscompiles non-temporal stores if they're anything remotely like normal stores, because you can have acquire/release fences as much as you please and LLVM will happily compile them to nothing. imho the extra guarantees provided by sequential consistency shouldn't be required to make non-temporal stores behave.
There are also non-temporal loads, which seem to be just as much fun! https://www.felixcloutier.com/x86/movntdqa
Non-temporal loads, in practice, are effectively normal loads, because their optimization is so conditional it can rarely trigger according to the letter of the ISA (it requires close reading of the fine print to tease this out). Because it is such a theoretical optimization, my understanding is it is largely unimplemented, and the one vestige is that the prefetchnta instruction is sometimes supported.
Yes, you need the instructions that happen to be generated by sequential consistency fences, or explicit calls to arch intrinsics/asm like sfence. If you wanted nontemporal stores to behave like normal stores in the language model, you wouldn't need a fence at all; requiring a release fence and pessimizing its codegen for other cases would be a weird worst of all worlds imo. (Requiring a seqcst fence that is in practice always going to generate an acceptable instruction or requiring an explicit sfence both seem fine to me)
And yeah, for loads note the "if the memory source is WC (write combining) memory type" in the instruction description - WC is not the normal memory type, it's the weak memory type and if your rust code built for x86_64-any-target-at-all has access to any memory of that type it's already broken wrt synchronization. (NT stores just treat any memory like it's WC)
And yeah, for loads note the "if the memory source is WC (write combining) memory type" in the instruction description - WC is not the normal memory type, it's the weak memory type and if your rust code built for x86_64-any-target-at-all has access to any memory of that type it's already broken wrt synchronization. (NT stores just treat any memory like it's WC)
Well, we still need to be able to have Rust properly handle it, because WC memory is often returned by 3D graphics APIs (e.g. Vulkan) for memory-mapped video memory.
Well uh that'll be fun if you wanted to program to spec, because any store there is basically a nontemporal store and isn't flagged as such in any way to the compiler.
It'll work of course so long as you either sfence manually in the right places or never try to have another thread do the commit (which presumably will do the right sync, but needs to happen on the cpu that did the WC write), since any memory given to you like that will be known by the compiler to be exposed, and asm sfence/etc would be marked as touching all exposed memory. (Don't make an &mut of it though probably? Who knows what llvm thinks the rules around noalias + asm-memory are)
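A hedged sketch of the "sfence manually in the right places" pattern; the mapped pointer and the publishing flag are hypothetical stand-ins for whatever a graphics API actually hands back:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// `mapped` is assumed to point at (possibly write-combining) mapped memory,
/// e.g. a persistently mapped staging buffer (hypothetical setup).
#[cfg(target_arch = "x86_64")]
unsafe fn write_then_publish(mapped: *mut u8, data: &[u8], ready: &AtomicBool) {
    use std::arch::x86_64::_mm_sfence;
    // Plain writes; if the destination is WC memory they behave like
    // non-temporal stores as far as ordering is concerned.
    std::ptr::copy_nonoverlapping(data.as_ptr(), mapped, data.len());
    // Order those stores before telling anyone else the data is ready.
    _mm_sfence();
    ready.store(true, Ordering::Release);
}
```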
I hold that while nontemporal_store as intended is not covered by the memory model, it is a) not a huge extension to add it to C/C++'s
It requires an entire new access class, so it is a bigger change than anything that happened since 2011. It also requires splitting "synchronizes-with" into "synchronization through seqcst fences" (which does synchronize nontemporal accesses) and everything else (which does not synchronize nontemporal accesses). So this is a huge change to the model, and all consistency theorems (like DRF), compiler transformations, lowering schemes etc. will have to be carefully reconsidered. We currently have proofs showing that the x86 instructions compilers use to implement memory operations give the right semantics. It would be a mistake to lower our standards below this; mistakes have been made in the past.
It is just false to claim that this is a simple extension.
Any attempt to do this needs to start with something like https://plv.mpi-sws.org/scfix/paper.pdf, the formulation of the model in the standard is just too vague to be reliable. (According to the authors of that paper, the standard doesn't even reflect the SC fix properly even though that was the intention of the standard authors.)
The C++ memory model needed many rounds of iteration with formal memory model researchers to get into its current state. It still has one major issue (out-of-thin-air), but the original version (before Mark Batty's thesis) was a lot worse, and even after that some things still had to be fixed (like SCfix). Anyone who claims they can just quickly modify this model and be sure not to reintroduce any of these problems is vastly underestimating the complexity of weak memory models.
@CAD97
Its definition is "does this assembly mnemonic operation" and no more, though the compiler is in fairness expected to understand what that means and do what's necessary to make that coherent in the most efficient way possible.
Again that just doesn't work. We have a very clear rule for inline assembly and FFI: they can only do things that Rust code could also have done, insofar as Rust-visible state is affected. It is completely and utterly meaningless to take an operation from one semantics and just take it into another, completely different semantics.
because LLVM currently compiles all but sequentially-consistent fences to no-ops on x86, I think it currently miscompiles non-temporal stores if they're anything remotely like normal stores, because you can have acquire/release fences as much as you please and LLVM will happily compile them to nothing. imho the extra guarantees provided by sequential consistency shouldn't be required to make non-temporal stores behave.
Well they don't say much about how nontemporal stores are supposed to behave, so it's unclear if they are miscompiling or just implementing a surprising (and unwritten) spec. I opened https://github.com/llvm/llvm-project/issues/64521 to find out.
We have a very clear rule for inline assembly and FFI: they can only do things that Rust code could also have done, insofar as Rust-visible state is affected. It is completely and utterly meaningless to take an operation from one semantics and just take it into another, completely different semantics.
To be clear, the point I'm making is actually that if the definition as "does this assembly sequence" has busted and unusable semantics on the AM, then the vendor intrinsic is busted and unusable, which is IIUC in agreement with you.
The main point where I disagree is that, given Intel has defined a busted and unusable intrinsic, the correct behavior is for us to expose the intrinsic as-is, busted and unusable though it may be, potentially with an editorial note saying as much.
(I have edited the OP to explicitly state that adjusting the memory model is an option in principle, but not one I consider realistic.)
The main point where I disagree is that, given Intel has defined a busted and unusable intrinsic, the correct behavior is for us to expose the intrinsic as-is, busted and unusable though it may be, potentially with an editorial note saying as much.
I don't see why we would expose a busted intrinsic. We already have inline assembly, and these days it is stable (it was not when stdarch got stabilized), so I think we are totally free to tell Intel that we won't expose operations that don't compose with the rest of our language -- for reasons that are entirely on them, since they told the world that their hardware is TSO(-like) and then also added operations which subvert TSO.
If these intrinsics weren't stable yet, is anyone seriously suggesting we would have stabilized them, knowing what we do now? I would be shocked if that were the case. The intrinsic even says in its doc comment that this will likely never be stabilized! Sadly, whoever implemented _mm_stream_ps didn't heed that comment. So to me this is a clear oversight, and the question is how we best do damage control.
I wonder if there is some way that we can argue that
_mm_stream_ps();
_mm_sfence();
is equivalent to a single inline assembly block with both of these operations. The CPU is in a strange state between those two inline assembly blocks, but can we formulate some conditions under which that strange state cannot have negative side-effects? IOW, can we weaken the rule that inline asm can only do things which Rust could have done, to something where one inline asm block gets the machine into a state that does not fully match what Rust could have done (but the difference only affects a few operations), and then a 2nd inline asm block fixes up the state?
Basically what's important is that there is no release operation between the two intrinsics. If the programmer ensures this is the case, then -- can we ensure the compiler doesn't introduce such operations? At the very least this means ensuring the inline asm blocks don't get reordered with release operations, but is that sufficient? I am a bit worried this might be too fragile, but OTOH a principle of "it's fine if machine state the Rust AM doesn't currently need is inconsistent" does seem useful. It's just hard for me to predict the impact of this on optimizations.
Yeah, I mean you can just say it doesn't explode unless execution reaches a release operation before it reaches an sfence (or other explicit NT sync op). The only way that you could really do that by accident is signals, or maybe it's plausible you'd call Vec::push, a.k.a. malloc, a.k.a. something that takes a lock in some paths. (Of course in actual codegen taking a lock is still fine since any RMW is an NT sync, so you'd need some faster path that does a release store, and the AM looking inside of the allocator is super broken in the first place.)
The compiler reordering a release operation before the sfence would be a clear bug in its implementation of sfence, since a release operation must be a store, and sfence is the store fence. (And if you somehow come up with a way to get a seqcst load to wind up happens-before another thread's operation in a useful fashion, then seqcst loads probably shouldn't reorder with much of anything)
The compiler could order a release operation from before the _mm_stream_ps down, though, maybe? If this was true inline asm then of course not, but with the intrinsics more things can go wrong. That would still work since the stuff before the _mm_stream_ps is properly released, just the _mm_stream_ps itself is not.
So I guess this only really becomes a problem if the compiler somehow synthesizes release operations for its own kind of synchronization, like what auto-parallelization transformations might do.
For what it's worth, the basic intended usage of movnti and movntdq, followed by sfence, was at the time basically equivalent to what is now encompassed by Enhanced REP MOVSB, which on CPUs with that feature means REP MOVSB also has a "please do not use this to write to semaphores, because it actually uses write-combining buffers in its implementation now" caveat. It has stronger ordering guarantees with other operations, but its own stores, within the range touched by the movsb loop, are more weakly ordered, almost like... a bunch of nontemporal writes and then sfence!
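A sketch of that "intended usage" pattern, using the real stdarch intrinsic names but with a function and signature that are my own illustration, not anything from the discussion:

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn nt_copy(dst: *mut i64, src: *const i64, len: usize) {
    use std::arch::x86_64::{_mm_sfence, _mm_stream_si64};
    for i in 0..len {
        // movnti: cache-bypassing, weakly ordered store.
        _mm_stream_si64(dst.add(i), *src.add(i));
    }
    // A single sfence after the loop restores ordering with whatever
    // release operation later publishes the copied data.
    _mm_sfence();
}
```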
A few questions while thinking about an actionable. When and where was this introduced? Was it part of one of our usual LLVM upgrades and this stabilization slipped in without us realizing? Would a partial revert be possible/useful while we figure out the long-term solution (I think what @RalfJung suggests as the first point in the opening comment: remove nontemporal_store and implement the _mm_stream intrinsics without it and mark them as deprecated)?
The nontemporal-store intrinsic was introduced ~forever ago, in https://github.com/rust-lang/rust/commit/fe53a8106dfb54b5fe04d2ce7e8ee6472b0d5b16, with a comment saying this will likely never be stable.
The first mm_stream wrappers were added here unstably. They got marked as stable in https://github.com/rust-lang/stdarch/pull/414. Nothing there discussed the problem that these intrinsics have unexpected behavior in the presence of concurrency, or the fact that an intrinsic got stably exposed despite it having an explicit comment to the opposite.
A compiler reordering a release operation down isn't actually a problem for the Rust spec that says nt_store(); release(); is immediate UB, that's only a problem for an IR spec that says that. (And before you go to even more absurd lengths by trying to make LLVM's not-really-existent-either spec a Rust problem, come up with an actual optimization that would misbehave; I'm pretty sure that would require a cross-thread "optimization", and just no, stop.)
They got marked as stable in https://github.com/rust-lang/stdarch/pull/414. Nothing there discussed the problem that these intrinsics have unexpected behavior in the presence of concurrency, or the fact that an intrinsic got stably exposed despite it having an explicit comment to the opposite.
That's also a massive PR which stabilizes a ton of things at once and says it was done by a script, so it is not a surprise that a don't stabilize me bro on the wrapped intrinsic wouldn't get caught in review.
That's also a massive PR which stabilizes a ton of things at once and says it was done by a script, so it is not a surprise that a don't stabilize me bro on the wrapped intrinsic wouldn't get caught in review.
Yeah, and it seems there are more intrinsics that should have had closer t-lang attention before being stabilized -- see https://github.com/rust-lang/stdarch/pull/1454. This might need a proper audit...
@digama0 did some research into how the stream intrinsics are used in the wild. Seems like basically nobody remembers to put the sfence at the end...
We certainly don't document it as required, so of course they don't.
Since I've seen my position misrepresented on Zulip, I want to clarify what my concern(s) are here.
First of all, there has been a process failure. Stably exposing operations from the standard library that extend the language needs explicit T-lang discussion, which didn't happen for _mm_stream*. (And we can't argue that this is just an inline asm block, since these operations violate the inline asm block rule that their overall effect on the machine state must be expressible in Rust. That's why they are language extensions to begin with.) This applies to everything the standard library exposes, "vendor intrinsics" don't have any special privilege to break Rust's general principles.
The reason this happened also seems fairly clear; this was a huge stabilization (FCP happened here) and a few odd ducks like this one or floating-point environment manipulation just slipped through.
I think we have consensus on that? I sure hope so, at least.
If we had followed process, stabilizing _mm_stream* would have been blocked on an RFC that defines Rust's own memory model, or at least T-lang explicit approval to do something ad-hoc and unprincipled like "after a nontemporal store, until the next sfence, any release operation (release or stronger write, release or stronger fence, thread spawn) is UB". I'm saying "unprincipled" since I am not aware of a principled argument that having the compiler apply optimizations when the machine is in a state where release operations are UB is correct.
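To make that ad-hoc rule concrete, a sketch (x86_64 only; my own illustration of the quoted rule, not documented semantics):

```rust
#[cfg(target_arch = "x86_64")]
mod sketch {
    use std::arch::x86_64::{__m128, _mm_sfence, _mm_stream_ps};
    use std::sync::atomic::{AtomicBool, Ordering};

    pub unsafe fn fine_under_the_rule(dst: *mut f32, v: __m128, ready: &AtomicBool) {
        _mm_stream_ps(dst, v);
        _mm_sfence();                         // the NT store is ordered again...
        ready.store(true, Ordering::Release); // ...so a release here is allowed.
    }

    pub unsafe fn ub_under_the_rule(dst: *mut f32, v: __m128, ready: &AtomicBool) {
        _mm_stream_ps(dst, v);
        // A release before any sfence: UB under the quoted ad-hoc rule.
        ready.store(true, Ordering::Release);
    }
}
```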
Now what shall we do with this? I think rushing a spec extension that can cover these intrinsics would be ill-advised; concurrency models are too subtle to rush anything. Some people seem determined to go the route of defining a Rust concurrency memory model. I was probably too dismissive of these efforts; this is exciting! However, I think that RFC will take a long time to go through; experience shows that these models are very hard to specify correctly so we should get some weak memory researchers to take a look and prove some theorems before declaring it official. (The C++ committee worked with Mark Batty in a similar way before C++11 got finalized; the standard was in a pretty terrible state before that happened. The C++20 SC fix similarly had academic analysis, and then they still managed to introduce ambiguities when translating the formal model into English.) So, IMO this is not a short-term solution; I'm both excited and terrified about Rust taking responsibility for its own memory model (and I'd love to help resolve ambiguities and work with weak memory researchers on the formalization) but this is not something that can happen quickly while we struggle to fix past mistakes.
What can we do short-term? Usually when a mistake was made the default answer is a revert. We cannot revert the stabilization but we can adjust these intrinsics to no longer be language extensions, by using regular stores. It's not a great solution since it defies the expectations associated with these intrinsics, but then the real-world data collected by @digama0 shows that people currently don't properly use these intrinsics (including the very person who added the nontemporal_store intrinsic to the compiler).
(The following part got updated to account for this reply.)
An alternative would be to turn these intrinsics into inline assembly blocks (i.e., make them fully opaque to the compiler), and then argue that we can come up with Rust code that safely approximates all the possible side-effects of using nontemporal stores, and adjust the documentation to require users to avoid all UB that would occur if this Rust code was actually used at run-time (which is likely more UB than the actual MOVNT instruction). Here is a proposal for such Rust code, here is another one. If we can convince ourselves that the actual MOVNT operation indeed has "no more strange behavior" than that Rust code, then this would be a reasonable solution.
Yet another alternative would be to only do documentation changes: e.g., we document that after these operations and until the next sfence, it is UB to perform a release operation. (Or a similar restriction.) There is a risk that this puts constraints on optimizations and analyses that we don't understand yet. I think we should be clear that we don't in general allow inline asm blocks (or FFI/intrinsics) to leave the machine in a "bad" state that is fixed up later by another inline asm block (or FFI/intrinsic), but that for this particular case this seems "good enough" and the desire to write such code without inline asm outweighs the desire for systematic correctness. If we do this then presumably it is because the previous alternative was somehow not acceptable and we need compiler insight into these intrinsics for optimizations -- that is concerning since this is correct only if the compiler is aware of the non-standard nature of the memory accesses that are performed by these intrinsics.
Nominating for T-lang discussion. @rust-lang/lang, my own view of the situation is described at the top of this issue and above in this comment. (If someone else wants to write a summary of their views, I'll happily add a link to it here.)
We certainly don't document it as required, so of course they don't.
Yeah we definitely need to at least update our docs. That said, how high are the chances that people will even read these docs, given that these operations are described as "vendor intrinsics" so people might have the reasonable expectation that the Intel manual tells them everything they need to know?
(I edited my summary a bit, if you follow by email please re-read on Github.)
(And we can't argue that this is just an inline asm block, since these operations violate the inline asm block rule that their overall effect on the machine state must be expressible in Rust. That's why they are language extensions to begin with.)
There is in fact a description that is slightly more restrictive to users than what _mm_stream_ps "should" be (but would be a defensible description for a cross-arch nontemporal store) right here #issuecomment-1668386167, near the start of all this. The additional restriction being that subsequent accesses from the same thread would be racy until an sfence.
Similarly there's the obvious other possible result of an inline asm block which is more restrictive to the compiler: a normal store.
So the only part that has any specification difficulty is pinning down what precisely is allowed in that in-between region, and my impression from a glance through the search for the few existing users was that they don't use that anyways. (And saying "doing anything to take advantage of x86 permitting same-thread accesses is yolo pending LLVM giving a more precise definition of !nontemporal" is also an option.)
So assuming LLVM doesn't come back reasonably quickly saying "oops we'll magically insert sfence the way we do vzeroupper" (or anything else useful), just changing the intrinsic to be inline asm seems fine.
Oh I see, that's where you were going with your proposed "thread-local buffer of pending stores". You are saying if these are inline assembly blocks (and not LLVM intrinsics like right now) we get to invent their Rust-level semantics as usual, and you are saying that the actual implementation with MOVNT is a correct refinement of those buffers. Sorry, I hadn't realized this is where you were going.
I'll try to poke holes into this from that angle, then.
So the documentation for the streaming operations would then say something like: after calling this operation and until this thread calls _mm_sfence, it is UB for any thread to read or write these memory locations? This would allow some code that my earlier proposal forbids (doing a release operation before an sfence, accepting that this will not release the streamed writes) but disallow some other code (doing same-thread accesses before the sfence).
Turns out there is a paper on nontemporal stores in x86. However it is on the hardware level, not on the level of a surface language like Rust or C++.
I should note that a lot of vendor intrinsics are specified as "like the instruction, but..." with that "but" being a sly carve-out for more wiggle-room where the compilers can use it. Assuming Intel doesn't descend from on-high tomorrow with their full formally verified model for nontemporal stores and write-combining memory (I assume they have one, somewhere, under lock and key, that they are distinctly not sharing), condemning any slightly suspicious uses of these intrinsics seems preferable to churning the ones that adhere to something like a sensible usage, modulo the store-store fence.
Also, I have opened PRs to fix all the libraries that need _mm_sfence:
Yeah we definitely need to at least update our docs. That said, how high are the chances that people will even read these docs, given that these operations are described as "vendor intrinsics" so people might have the reasonable expectation that the Intel manual tells them everything they need to know?
If we need to lean on Intel to update their website and suchlike, then we can do that, too. We have clout enough to demand an audience, at least.
One thing about this "explicit buffer" model that strikes me as odd is that the background thread will only non-deterministically flush the entire buffer. That implies there is some ordering: later writes will never be flushed before earlier writes. I think I had expected something different, where each nontemporal_store in a thread just adds its write to the buffer, and individual writes get flushed at non-deterministic points in arbitrary order. However, I can't find a way for anything to actually observe the order of stores here, so this might well be equivalent.
The other point that makes me feel uneasy is nontemporal_store acquiring a lock that will synchronize with the background thread. But again, since we cannot know which writes have already been flushed, I can't find a way to exploit that -- and obviously when a thread runs sfence then it is crucial that this synchronizes with the background thread that might have already flushed some of the writes.
I should note that a lot of vendor intrinsics are specified as "like the instruction, but..." with that "but" being a sly carve-out for more wiggle-room where the compilers can use it.
There's no problem with that when it's just about mutating some data by-value, or causing the effect of regular loads/stores. It's when these effects are not (or not trivially) expressible in regular Rust that we need to individually consider each case. These interactions are language-specific and Intel can't predict how their intrinsics behave with each and every language out there. (However they could really have predicted the fact that nontemporal stores are a massive footgun.)
Also, I have opened PRs to fix all the libraries that need _mm_sfence:
That's awesome. :)
Yes, I did them in order because it was simplest, and because doing two NT stores to the same location (with different values) and expecting the later one to win once you sfence() seemed plausible for graphics code or whatever and it's guaranteed by x86. Instead doing a "on insert replace any existing value" and "have each spawned thread only write a single random element" would be closer to the actual behavior, but way more of a pain to write, and as you note I don't think it actually is visible.
Mixing NT store + regular store is also guaranteed by x86 and an opaque asm block will provide enough constraint to the compiler to do the right thing, but specifying that as a rust implementation is not something I can even begin to see how to do.
doing two NT stores to the same location (with different values) and expecting the later one to win
Ah, that's a good point, so "randomly pick things from the buffer" is not equivalent. We should check that paper to figure out if their model guarantees same-thread ordering of MOVNT.
It does seem strange though that a regular write would be UB but an NT write would be allowed. I think ideally we make it so that every NT write can legally be replaced by a regular write. So I think I'd prefer a model that flushes the buffer out-of-order.
```rust
#[thread_local]
static mut BUFFER: Option<SendMe<&'static Mutex<Vec<(*mut u8, u8)>>>> = None;

pub unsafe fn nontemporal_store(ptr: *mut u8, val: u8) {
    let buf = if let Some(b) = BUFFER {
        b
    } else {
        let b = SendMe(&*Box::leak(Box::new(Mutex::new(Vec::new()))));
        BUFFER = Some(b);
        // spawn one flushing thread per "real" thread.
        std::thread::spawn(move || loop {
            std::thread::sleep_ms(rand::random());
            let mut buf = b.0.lock().unwrap();
            if !buf.is_empty() {
                // Flush one randomly chosen pending write, so writes can
                // become visible out of order.
                let (ptr, val) = buf.remove(rand::random::<usize>() % buf.len());
                unsafe { *ptr = val };
            }
        });
        b
    };
    buf.0.lock().unwrap().push((ptr, val));
}

pub fn sfence() {
    unsafe {
        if let Some(b) = BUFFER {
            // Just wait until the background thread drained the buffer.
            while b.0.lock().unwrap().len() > 0 {}
        }
    }
}
```
Every difference compared to the actual intrinsic is downside, so no, I don't think that's a good idea (if std wanted to expose a nontemporal_store outside of stdarch, then maybe, but not for _mm_sfence). Weird corner cases that only exist because of trying to shoehorn this into a spec aren't really a problem. It doesn't even make the fake code representation of the intrinsic look better.
(Also a bit of the point is that between the two possible asm blocks the compiler can't actually screw up asm!("movnti" ...): the spec is "movnti". So if someone did want to write that they could, for all that you'd refuse because ~AM~.)
The actual MOVNT does have the property that it is safe to replace it by a MOV. Though I agree that adding extra UB to satisfy a property like this is not a clear win.
But actually my proposal wouldn't even make it so that doing two MOVNT to the same location would be UB -- it would just be non-deterministic which write would win. So I'm not happy with that proposal either. I think this one has the desired effect:
```rust
#[thread_local]
static mut PENDING_WRITES: AtomicUsize = AtomicUsize::new(0);

pub unsafe fn nontemporal_store<T>(ptr: *mut T, val: T) {
    PENDING_WRITES.fetch_add(1, Relaxed);
    // Spawn a thread that will eventually do our write.
    let ptr = SendMe(ptr);
    let pending_writes = SendMe(addr_of!(PENDING_WRITES));
    std::thread::spawn(move || {
        let ptr = ptr; let pending_writes = pending_writes; // closure field capturing is annoying...
        std::thread::sleep_ms(rand::random()); // not really needed due to scheduler non-determinism
        unsafe {
            *ptr.0 = val;
            (&*pending_writes.0).fetch_sub(1, Release);
        }
    });
}

pub fn sfence() {
    unsafe {
        // Wait until there are no more pending writes.
        while PENDING_WRITES.load(Acquire) > 0 {}
    }
}
```
Here's a variant that actually builds and also uses Box::leak to avoid a use-after-free in the write-back thread. This also has the nice advantage of making it much easier to support writing arbitrary types.
(Also a bit of the point is that between the two possible asm blocks the compiler can't actually screw up "asm!("movnti" ...), the spec is movnti", so if someone did want to write that they could, for all that you'd refuse because AM)
:shrug: they could write whatever they want, of course, but it would not be a solid argument showing that the compiler cannot screw up. "We don't know a counterexample" is just not strong enough evidence IMO. Considering "we don't know a counterexample" good enough is what led to LLVM's semantics for uninit and pointer provenance being an inconsistent mess. Rust should strive to do better than that.
WG-prioritization assigning priority (Zulip discussion).
@rustbot label -I-prioritize +P-medium
On Thu, Aug 10, 2023 at 11:27:32PM -0700, Ralf Jung wrote:
But actually my proposal wouldn't even make it so that doing two MOVNT to the same location would be UB -- it would just be non-deterministic which write would win.
That's not a fatal flaw; sometimes being non-deterministic for performance is OK, as long as there's a well-defined way to be deterministic if you want to be, such as by adding additional barriers.
That's not a fatal flaw; sometimes being non-deterministic for performance is OK, as long as there's a well-defined way to be deterministic if you want to be, such as by adding additional barriers.
Yeah but it was my intent for that to be UB.^^ (And barriers make it defined, of course.) That is achieved by my later proposal, which can be summarized very succinctly: a nontemporal store is like starting a new background thread that will do the actual store using non-atomic ordering at some point in the future. The fence is waiting for all background threads that were spawned by nontemporal stores of the current thread. The hope is that all the behavior of nontemporal stores can be described with this model.
If we want to define that behavior even without fences we should go with something closer to @talchas' original proposal. This can be summarized as: there is a per-thread write-back buffer for nontemporal stores. A background thread flushes the entire buffer at non-deterministic intervals using non-atomic writes. (There is no partial flushing, it's always flushing the entire thing.) The fence flushes the entire buffer right then and there.
In both cases, MOVNT; MOV to the same location is UB, since there is a race between the write-back thread and the MOV in the real thread. MOVNT; SFENCE; MOV is fine since the fence guarantees that the write-back happened, so there cannot be a race.
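A tiny illustration of that distinction, written in terms of the nontemporal_store and sfence model functions sketched earlier in this thread (assumed to be in scope; the u8 location is just for illustration):

```rust
unsafe fn same_location(p: *mut u8) {
    nontemporal_store(p, 1);
    // *p = 2;  // UB in this model: races with the pending write-back of `1`.
    sfence();
    *p = 2;     // fine: the fence guarantees the write-back already happened.
}
```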
The _mm_maskmoveu_si128 intrinsic uses the maskmovdqu instruction.
According to Intel documentation, it looks like a non-temporal store, even though Rust documentation does not mention it.
Conditionally store 8-bit integer elements from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element) and a non-temporal memory hint. mem_addr does not need to be aligned on any particular boundary.
maskmovdqu, not movmskpd, they're very different instructions (thanks intel), but otherwise yes, it (and maskmovq and all of movnt*) is a nontemporal store.
Proposal: Can we lint on these intrinsics for now (by marking them as deprecated, or implementing a dedicated lint), warning people that they are unsound, and figure out how to address them longer term in the future?
This was discussed in the lang team meeting. We recognize that there needs to be more in-depth discussion on this, but a lint would be a useful band-aid for the future.
If, on the other hand, we agree that it is possible to use these intrinsics safely, we should update the documentation on how to do that, and consider adding a lint that looks for unsound uses of them.
I recently discovered this funny little intrinsic with the great comment saying it will likely never be stable. Unfortunately, the comment is wrong: this has become stable, through vendor intrinsics like _mm_stream_ps. Why is that a problem? Well, it turns out non-temporal stores completely break our memory model. The following assertion can fail under the current compilation scheme used by LLVM:
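A minimal sketch of the kind of program meant, reconstructed from the explanation below rather than copied from the original report (the raw intrinsic requires nightly's core_intrinsics feature):

```rust
// #![feature(core_intrinsics)]
use std::intrinsics::nontemporal_store;
use std::sync::atomic::{AtomicBool, Ordering::{Acquire, Release}};
use std::thread;

static mut DATA: u32 = 0;
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let t = thread::spawn(|| {
        while !READY.load(Acquire) {}
        // With a regular store this would be guaranteed to read 42;
        // with a non-temporal store it is not.
        assert_eq!(unsafe { DATA }, 42);
    });
    unsafe { nontemporal_store(std::ptr::addr_of_mut!(DATA), 42) };
    READY.store(true, Release);
    t.join().unwrap();
}
```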
The assertion can fail because the CPU may order MOVNT after later MOV (for different locations), so the nontemporal_store might occur after the release store. Sources for this claim:
This is a big problem -- we have a memory model that says you can use release/acquire operations to synchronize any (including non-atomic) memory accesses, and we have memory accesses which are not properly synchronized by release/acquire operations.
So what could be done?
- Remove nontemporal_store and implement the _mm_stream intrinsics without it, and mark them as deprecated to signal that they don't match the expected semantics of the underlying hardware operation. People should use inline assembly instead, and then it is their responsibility to have an sfence at the end of their asm block to restore expected synchronization behavior.
- In principle, adjust the memory model to account for these operations -- though I don't consider that realistic.

Thanks a lot to @workingjubilee and @the8472 for their help in figuring out the details of nontemporal stores.
Cc @rust-lang/lang @Amanieu
Also see the nomination comment here.