Non-temporal stores (and _mm_stream operations in stdarch) break our memory model

RalfJung commented 1 year ago

I recently discovered this funny little intrinsic with the great comment

Emits a !nontemporal store according to LLVM (see their docs). Probably will never become stable.

Unfortunately, the comment is wrong: this has become stable, through vendor intrinsics like _mm_stream_ps.

Why is that a problem? Well, turns out non-temporal stores completely break our memory model. The following assertion can fail under the current compilation scheme used by LLVM:

static mut DATA: usize = 0;
static INIT: AtomicBool = AtomicBool::new(false);

thread::spawn(|| {
  while INIT.load(Acquire) == false {}
  assert_eq!(DATA, 42); // can this ever fail? that would be bad
});

nontemporal_store(&mut DATA, 42);
INIT.store(true, Release);

The assertion can fail because the CPU may order MOVNT after later MOV (for different locations), so the nontemporal_store might occur after the release store. Sources for this claim:

Peter Cordes answer here: "A mutex unlock on x86 is sometimes a lock add, in which case that's a full fence for NT stores already. But if you can't rule out a mutex implementation using a simple mov store then you need at least sfence at some point after NT stores, before unlock."
glibc fixing their memcpy (which uses nontemporal stores) to have a trailing sfence.

This is a big problem -- we have a memory model that says you can use release/acquire operations to synchronize any (including non-atomic) memory accesses, and we have memory accesses which are not properly synchronized by release/acquire operations.

So what could be done?

Remove nontemporal_store and implement the _mm_stream intrinsics without it and mark them as deprecated to signal that they don't match the expected semantics of the underlying hardware operation. People should use inline assembly instead and then it is their responsibility to have an sfence at the end of their asm block to restore expected synchronization behavior.
Change the way release stores are compiled such that an sfence is emitted. This is mostly a theoretical option though: this is an ABI-breaking change, at least my understanding of the x86 ABI is that a regular mov is considered to be sufficient synchronization. To make this work reliably all compilers for all languages that have something like release writes need to emit the sfence, or else Rust code cannot be soundly linked with code produced by those other compilers.
There's a third hypothetical option of adjusting our concurrency memory model to be able to support these operations. But that's (a) a huge design space -- which fences are supposed to interact how with this? We might even have to add a new kind of fence! And (b) modifying the C++ concurrency memory model should be done with utmost care and thorough formal analysis; the current model is the result of a decade worth of research that shouldn't be thrown away lightly.
Just ignore the problem and hope it doesn't explode? I am not happy with this.

Thanks a lot to @workingjubilee and @the8472 for their help in figuring out the details of nontemporal stores.

Cc @rust-lang/lang @Amanieu

Also see the nomination comment here.

traviscross commented 1 year ago

If we do add a lint, we should make sure to tell people that if they do want to live on the wild side and use these (as clearly some people do), that they need to add the SFENCE, as that often seems to be neglected.

RalfJung commented 1 year ago

I'm still in favor of the approach described above:

reimplement these functions in inline asm (since LLVM isn't properly modeling nontemporal stores so there's no way we can rely on it doing anything sensible here)
extend the documentation to explain that they behave like this Rust code, or like this one.

traviscross commented 1 year ago

@rustbot labels -I-lang-nominated

I'm going to unnominate because, while this issue is important, T-lang is not likely to have much more to say about it without a document that explores the issue and the feasible resolutions in some detail. Or additionally or alternatively, T-lang would likely have something more to say about a PR adding e.g. a lint (as suggested above) or about a specific addition to the documentation.

Please nominate any of those things, or re-nominate this issue if there's new information and a specific question that T-lang might be able to answer.

RalfJung commented 1 year ago

T-lang is not likely to have much more to say about it without a document that explores the issue and the feasible resolutions in some detail

This comment (linked from the top comment) was supposed to be such a "document" / summary.

workingjubilee commented 1 year ago

Sorry but, as far as I can tell, "a document that explores the issue in detail" means nothing if the information provided so far is insufficient. I am assuming that T-lang does not literally mean they need us to copy and paste that comment's body text into a HackMD link.

May we have the pleasure of T-lang specifying what they feel is actually needed in terms of "exploring the issue in detail"?

Admittedly, the text offered is incomplete.

@tmandry I haven't said it yet in this thread, but the problem actually does not start or end with these intrinsics, so asking about linting on them alone is a distraction. The property attributed to "non-temporal stores" is a property that can apply to anywhere in memory if the page is mapped appropriately, and sometimes such interactions can happen for graphics API code which exposes the ability to intentionally map memory pages appropriately to userspace, and which is used to obtain high-performance graphics (in the sense that for the past 10-20 years we have had "high-performance graphics" on computers). Because it makes a lot of things a great deal more efficient for certain write-heavy sequences. Unlike "you can open /proc/self/mem and sabotage yourself on purpose", this is done intentionally for not-a-joke reasons by ordinary programs and is expected to have a meaningful operational semantics.

RalfJung commented 1 year ago

Admittedly, the text offered is incomplete.

Yeah, that's fair.

I think my preferred way to proceed is a PR against stdarch that

changes the implementations to inline asm, so that we avoid the LLVM nontemporal flag (which clearly LLVM itself has no clue how to treat properly so we should take a wide berth around it)
document that these stores behave like "spawn a background thread that does the store eventually (as a non-atomic store)" and "sfence synchronizes with all background threads that were previously spawned by the current thread". Accesses that race with that background thread store lead to data races and hence UB.

Once we have that PR we can ask t-lang for FCP on that I guess. I'd be happy to write the docs but I'd probably need help with the inline asm part.

tmandry commented 1 year ago

That sounds like a reasonable way forward to me, @RalfJung. For the larger questions, it sounds like we should probably have a design meeting (or an RFC) dedicated to the topic.

RalfJung commented 9 months ago

I've opened https://github.com/rust-lang/stdarch/pull/1534 to update the stdarch docs. However I need help updating the implementations of these intrinsics to move away from the !nontemporal LLVM attribute -- either using LLVM functions specific for a particular instruction, or if that is not possible, using inline assembly.

Where can I find out which instruction-specific LLVM functions exist? (I mean functions like llvm.x86.sse4a.movnt.sd.)

RalfJung commented 8 months ago

And the last step, moving away from LLVM's !nontemporal which (from what can be gleaned from the LangRef) entirely fails to account for the fact that this has subtle interactions with the memory model: https://github.com/rust-lang/stdarch/pull/1541.

iximeow commented 8 months ago

i missed most of the existence of this issue, but thank you for also updating the documentation on {core,std}::intrinsics::nontemporal_store to point away from the intrinsic. i'd missed that at first and was going to suggest as much here, before finding that PR :)

i'd also like to, after the fact, thank folks here for concluding on the inline assembly implementation for _mm_stream_*, as someone who would use these. to go back to a very early comment here,

Also I think if we lower _mm_stream_ps in a way that includes an sfence, people will be very surprised. We should rather not expose _mm_stream_ps than expose something that's not the same operation.

to which i very strongly agree. i have a very specific expectation for _mm_stream_* and what nontemporal_store would mean on x86/x86_64 targets: "results in use of a movnt* instruction appropriate for the data being operated on". maintaining that is something of a relief :)

i realize the conversation here is focused mostly on the unsoundness of the existing implementation of these intrinsics, but i'd like to emphasize as a user of these instructions that their primary value is for the microarchitectural hardware effect. it's not particularly useful to any use i've seen that these stores are write-combining and weakly ordered, or that corresponding non-temporal loads are also WC and weakly-ordered. in fact (opinion warning) it's something of an annoying x86/x86_64 implementation detail rather than an expected or desired behavior. the primary value of these intrinsics is to avoid pulling known-unlikely data into caches only because it happened to be needed once.

using write-combining semantics to achieve that effect is, in my understanding, mostly a convenience to reuse an existing x86 memory type, rather than invent a new type used only for a handful of instructions. so, i hope this underlines why simply implementing these intrinsics as regular loads an stores would be so surprising.

one other earlier comment i'd like to understand better, on a hypothetical alternative to having simply stabilized these intrinsics a while ago:

something ad-hoc and unprincipled like "after a nontemporal store, until the next sfence, any release operation (release or stronger write, release or stronger fence, thread spawn) is UB".

would it be correct to phrase this a little more precisely as "after a nontemporal store and until the next sfence, any concurrent access to the memory nontemporally stored to is UB"? this is an important distinction as where i've typically used non-temporal instructions is to write to a buffer that i may not ever read again from the Rust function doing writing in the first place. in those cases it's on me to ensure (sfence or other target-appropriate) ordering guarantees, but it would not result in UB in terms of the Rust AM?

of course, "simply integrate the Rust AM, cross-language requirements, and subtle target architecture semantics and make sure you got it all right" is a tall order and part of why i cannot, for example, Just Point To where i do or would use _mm_stream_*.

and one other thought on why these are useful intrinsics to have, re.

For what it's worth, the basic intended usage of movnti and movntdq, followed by sfence, is in fact basically as what was, at the time, equivalent to what is now basically encompassed by Enhanced REP MOVSB, which on CPUs with that feature means ...

Enhanced REP MOVSB, and the related fast-string feature, as far as i've read, do not guarantee use anything about cache lines related to the buffers that are operated on. in practice cache lines may not be filled, but this would be architecture-dependent in a way that movnt*/sfence are not. additionally, movnt*/sfence are also used to implement non-cache-polluting memset, which rep stosb + fast-string doesn't guarantee. if the Rust project could pressure Intel into committing to non-cache-polluting semantics for fast string operations from some architecture onwards, i and all 200 people who are particularly invested in this weird corner of x86 would be greatly appreciative :pray:

finally, a kind of meta thought on the entire notion of these instrinsics as someone who would use them: in 2022 i'd considered using nontemporal_store but also needed a nontemporal_load. asm!() was not yet stable, and so i used neither and wrote a larger routine as assembly in a separate .s and linked it in. revisiting that from the perspective of this issue in 2024 i:

would be warned away from nontemporal_store,
not realize that _mm_stream_*() exist, just like i did not know they were present in 2022,
know that i just want movnti or vmovntdqa, write an asm!() block anyway.

if i'd known to look at _mm_stream_*() i'd probably use them in 2024. to opine, the entire notion of vendor-defined intrinsics is a little bit of a mess; i know the instruction i want on the far end of all of this, and at all points i'm basically working backwards to discover the most stable way to get that instruction produced in the resulting program, so the value of a vendor-defined intrinsic to me is that in C i could use that rather than compiler-specific assembly directives. not the kind of issue i'd have here.. :)

that all said, https://github.com/rust-lang/stdarch/pull/1534 and https://github.com/rust-lang/stdarch/pull/1541 seem quite reasonable, so thanks again.

RalfJung commented 8 months ago

i'd also like to, after the fact, thank folks here for concluding on the inline assembly implementation for _mmstream*, as someone who would use these.

I am not sure we have concluded yet. https://github.com/rust-lang/stdarch/pull/1541 is still open and I honestly don't know if people generally think that this is a good idea or not.

I'd prefer if we could use an LLVM intrinsic like for most of the other vendor intrinsics, but I don't know if LLVM would even accept taking such an intrinsic given that they have this !nontemporal attribute already.

something ad-hoc and unprincipled like "after a nontemporal store, until the next sfence, any release operation (release or stronger write, release or stronger fence, thread spawn) is UB".

would it be correct to phrase this a little more precisely as "after a nontemporal store and until the next sfence, any concurrent access to the memory nontemporally stored to is UB"? this is an important distinction as where i've typically used non-temporal instructions is to write to a buffer that i may not ever read again from the Rust function doing writing in the first place. in those cases it's on me to ensure (sfence or other target-appropriate) ordering guarantees, but it would not result in UB in terms of the Rust AM?

What you were quoting was an earlier, different proposal. What you said is very different from the text you were quoting. So no it wouldn't be correct to view what you said as a rephrasing of what you quoted. When I wrote "release operations are UB" I meant exactly that, I don't see how what you are saying is even similar to what I said.

But anyway please let's not discuss an outdated proposal that has since been overhauled. The updated proposal is what landed in https://github.com/rust-lang/stdarch/pull/1534.

rust-lang / rust

Non-temporal stores (and _mm_stream operations in stdarch) break our memory model #114582