Closed RalfJung closed 3 months ago
If we do add a lint, we should make sure to tell people that if they do want to live on the wild side and use these (as clearly some people do), that they need to add the SFENCE
, as that often seems to be neglected.
I'm still in favor of the approach described above:
@rustbot labels -I-lang-nominated
I'm going to unnominate because, while this issue is important, T-lang is not likely to have much more to say about it without a document that explores the issue and the feasible resolutions in some detail. Or additionally or alternatively, T-lang would likely have something more to say about a PR adding e.g. a lint (as suggested above) or about a specific addition to the documentation.
Please nominate any of those things, or re-nominate this issue if there's new information and a specific question that T-lang might be able to answer.
T-lang is not likely to have much more to say about it without a document that explores the issue and the feasible resolutions in some detail
This comment (linked from the top comment) was supposed to be such a "document" / summary.
Sorry but, as far as I can tell, "a document that explores the issue in detail" means nothing if the information provided so far is insufficient. I am assuming that T-lang does not literally mean they need us to copy and paste that comment's body text into a HackMD link.
May we have the pleasure of T-lang specifying what they feel is actually needed in terms of "exploring the issue in detail"?
Admittedly, the text offered is incomplete.
@tmandry I haven't said it yet in this thread, but the problem actually does not start or end with these intrinsics, so asking about linting on them alone is a distraction. The property attributed to "non-temporal stores" is a property that can apply to anywhere in memory if the page is mapped appropriately, and sometimes such interactions can happen for graphics API code which exposes the ability to intentionally map memory pages appropriately to userspace, and which is used to obtain high-performance graphics (in the sense that for the past 10-20 years we have had "high-performance graphics" on computers). Because it makes a lot of things a great deal more efficient for certain write-heavy sequences. Unlike "you can open /proc/self/mem
and sabotage yourself on purpose", this is done intentionally for not-a-joke reasons by ordinary programs and is expected to have a meaningful operational semantics.
Admittedly, the text offered is incomplete.
Yeah, that's fair.
I think my preferred way to proceed is a PR against stdarch that
nontemporal
flag (which clearly LLVM itself has no clue how to treat properly so we should take a wide berth around it)Once we have that PR we can ask t-lang for FCP on that I guess. I'd be happy to write the docs but I'd probably need help with the inline asm part.
That sounds like a reasonable way forward to me, @RalfJung. For the larger questions, it sounds like we should probably have a design meeting (or an RFC) dedicated to the topic.
I've opened https://github.com/rust-lang/stdarch/pull/1534 to update the stdarch docs. However I need help updating the implementations of these intrinsics to move away from the !nontemporal
LLVM attribute -- either using LLVM functions specific for a particular instruction, or if that is not possible, using inline assembly.
Where can I find out which instruction-specific LLVM functions exist? (I mean functions like llvm.x86.sse4a.movnt.sd
.)
And the last step, moving away from LLVM's !nontemporal
which (from what can be gleaned from the LangRef) entirely fails to account for the fact that this has subtle interactions with the memory model: https://github.com/rust-lang/stdarch/pull/1541.
i missed most of the existence of this issue, but thank you for also updating the documentation on {core,std}::intrinsics::nontemporal_store
to point away from the intrinsic. i'd missed that at first and was going to suggest as much here, before finding that PR :)
i'd also like to, after the fact, thank folks here for concluding on the inline assembly implementation for _mm_stream_*
, as someone who would use these. to go back to a very early comment here,
Also I think if we lower _mm_stream_ps in a way that includes an sfence, people will be very surprised. We should rather not expose _mm_stream_ps than expose something that's not the same operation.
to which i very strongly agree. i have a very specific expectation for _mm_stream_*
and what nontemporal_store
would mean on x86/x86_64 targets: "results in use of a movnt*
instruction appropriate for the data being operated on". maintaining that is something of a relief :)
i realize the conversation here is focused mostly on the unsoundness of the existing implementation of these intrinsics, but i'd like to emphasize as a user of these instructions that their primary value is for the microarchitectural hardware effect. it's not particularly useful to any use i've seen that these stores are write-combining and weakly ordered, or that corresponding non-temporal loads are also WC and weakly-ordered. in fact (opinion warning) it's something of an annoying x86/x86_64 implementation detail rather than an expected or desired behavior. the primary value of these intrinsics is to avoid pulling known-unlikely data into caches only because it happened to be needed once.
using write-combining semantics to achieve that effect is, in my understanding, mostly a convenience to reuse an existing x86 memory type, rather than invent a new type used only for a handful of instructions. so, i hope this underlines why simply implementing these intrinsics as regular loads an stores would be so surprising.
one other earlier comment i'd like to understand better, on a hypothetical alternative to having simply stabilized these intrinsics a while ago:
something ad-hoc and unprincipled like "after a nontemporal store, until the next sfence, any release operation (release or stronger write, release or stronger fence, thread spawn) is UB".
would it be correct to phrase this a little more precisely as "after a nontemporal store and until the next sfence, any concurrent access to the memory nontemporally stored to is UB"? this is an important distinction as where i've typically used non-temporal instructions is to write to a buffer that i may not ever read again from the Rust function doing writing in the first place. in those cases it's on me to ensure (sfence or other target-appropriate) ordering guarantees, but it would not result in UB in terms of the Rust AM?
of course, "simply integrate the Rust AM, cross-language requirements, and subtle target architecture semantics and make sure you got it all right" is a tall order and part of why i cannot, for example, Just Point To where i do or would use _mm_stream_*
.
and one other thought on why these are useful intrinsics to have, re.
For what it's worth, the basic intended usage of
movnti
andmovntdq
, followed bysfence
, is in fact basically as what was, at the time, equivalent to what is now basically encompassed by Enhanced REP MOVSB, which on CPUs with that feature means ...
Enhanced REP MOVSB, and the related fast-string
feature, as far as i've read, do not guarantee use anything about cache lines related to the buffers that are operated on. in practice cache lines may not be filled, but this would be architecture-dependent in a way that movnt*/sfence
are not. additionally, movnt*/sfence
are also used to implement non-cache-polluting memset
, which rep stosb + fast-string
doesn't guarantee. if the Rust project could pressure Intel into committing to non-cache-polluting semantics for fast string operations from some architecture onwards, i and all 200 people who are particularly invested in this weird corner of x86 would be greatly appreciative :pray:
finally, a kind of meta thought on the entire notion of these instrinsics as someone who would use them: in 2022 i'd considered using nontemporal_store
but also needed a nontemporal_load
. asm!()
was not yet stable, and so i used neither and wrote a larger routine as assembly in a separate .s and linked it in. revisiting that from the perspective of this issue in 2024 i:
nontemporal_store
,_mm_stream_*()
exist, just like i did not know they were present in 2022,movnti
or vmovntdqa
, write an asm!()
block anyway.if i'd known to look at _mm_stream_*()
i'd probably use them in 2024. to opine, the entire notion of vendor-defined intrinsics is a little bit of a mess; i know the instruction i want on the far end of all of this, and at all points i'm basically working backwards to discover the most stable way to get that instruction produced in the resulting program, so the value of a vendor-defined intrinsic to me is that in C i could use that rather than compiler-specific assembly directives. not the kind of issue i'd have here.. :)
that all said, https://github.com/rust-lang/stdarch/pull/1534 and https://github.com/rust-lang/stdarch/pull/1541 seem quite reasonable, so thanks again.
i'd also like to, after the fact, thank folks here for concluding on the inline assembly implementation for _mmstream*, as someone who would use these.
I am not sure we have concluded yet. https://github.com/rust-lang/stdarch/pull/1541 is still open and I honestly don't know if people generally think that this is a good idea or not.
I'd prefer if we could use an LLVM intrinsic like for most of the other vendor intrinsics, but I don't know if LLVM would even accept taking such an intrinsic given that they have this !nontemporal
attribute already.
something ad-hoc and unprincipled like "after a nontemporal store, until the next sfence, any release operation (release or stronger write, release or stronger fence, thread spawn) is UB".
would it be correct to phrase this a little more precisely as "after a nontemporal store and until the next sfence, any concurrent access to the memory nontemporally stored to is UB"? this is an important distinction as where i've typically used non-temporal instructions is to write to a buffer that i may not ever read again from the Rust function doing writing in the first place. in those cases it's on me to ensure (sfence or other target-appropriate) ordering guarantees, but it would not result in UB in terms of the Rust AM?
What you were quoting was an earlier, different proposal. What you said is very different from the text you were quoting. So no it wouldn't be correct to view what you said as a rephrasing of what you quoted. When I wrote "release operations are UB" I meant exactly that, I don't see how what you are saying is even similar to what I said.
But anyway please let's not discuss an outdated proposal that has since been overhauled. The updated proposal is what landed in https://github.com/rust-lang/stdarch/pull/1534.
I recently discovered this funny little intrinsic with the great comment
Unfortunately, the comment is wrong: this has become stable, through vendor intrinsics like
_mm_stream_ps
.Why is that a problem? Well, turns out non-temporal stores completely break our memory model. The following assertion can fail under the current compilation scheme used by LLVM:
The assertion can fail because the CPU may order MOVNT after later MOV (for different locations), so the nontemporal_store might occur after the release store. Sources for this claim:
This is a big problem -- we have a memory model that says you can use release/acquire operations to synchronize any (including non-atomic) memory accesses, and we have memory accesses which are not properly synchronized by release/acquire operations.
So what could be done?
_mm_stream
intrinsics without it and mark them as deprecated to signal that they don't match the expected semantics of the underlying hardware operation. People should use inline assembly instead and then it is their responsibility to have an sfence at the end of their asm block to restore expected synchronization behavior.Thanks a lot to @workingjubilee and @the8472 for their help in figuring out the details of nontemporal stores.
Cc @rust-lang/lang @Amanieu
Also see the nomination comment here.