Closed RalfJung closed 1 month ago
r? @Amanieu
rustbot has assigned @Amanieu. They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.
Use r? to explicitly pick a reviewer
My understanding is that LLVM can turn a nontemporal store into a normal one, but not the other way around. This seems to be fine as far as I understand.
The CI failure happens because the target_feature
attribute only enables sse2
and rustc isn't smart enough to figure out that this implies sse
(only LLVM knowns that). You fix it by enabling the sse
feature as well.
My understanding is that LLVM can turn a nontemporal store into a normal one, but not the other way around. This seems to be fine as far as I understand.
It's completely unclear. LangRef talks about it as a hint:
The optional !nontemporal metadata must reference a single metadata name
corresponding to a metadata node with one i32 entry of value 1. The existence of the !nontemporal metadata on the instruction tells the optimizer and code generator that this load is not expected to be reused in the cache. The code generator may select special instructions to save cache bandwidth, such as the MOVNT instruction on x86.
That would mean the flag can be added or removed arbitrarily ("this load is not expected to be reused in the cache" -- but no semantic constraints or anything). But that's clearly wrong. LLVM doesn't acknowledge in the slightest the extra UB that can be caused by non-temporal stores (https://github.com/llvm/llvm-project/issues/64521). Therefore I have zero confidence that anyone thought about how !nontemporal
interacts with all the LLVM passes that work on load
(almost all of which probably just ignore the attribute entirely). I'm not even aware of any cross-platform memory model with support for nontemporal stores that they could be using here -- and they clearly need a cross-platform memory model since they are doing optimizations in the context of a C++11-style model.
SDE ERROR: TID: 1064 executed instruction with an unaligned memory reference to address 0x7f27229035e0 INSTR: 0x562d8a5e21f3: IFORM: VMOVNTPS_MEMf32_ZMMf32_AVX512 :: vmovntps zmmword ptr [rax], zmm0
IMAGE: /checkout/target/x86_64-unknown-linux-gnu/release/deps/core_arch-59198cd2fc79a24a
FUNCTION: _ZN9core_arch9core_arch3x867avx512f5tests20test_mm512_stream_ps20test_mm512_stream_ps17hb7f0b28acc824410E.llvm.13799798511543115899
FUNCTION ADDR: 0x562d8a5e21c0
Hm, yes, this requires alignment, but that shouldn't be new...?
I think I found where LLVM defines the x86 intrinsics: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/IntrinsicsX86.td.
I found nothing with "stream" in the name, and the only "movnt" is int_x86_mmx_movnt_dq
, probably accessible via llvm.x86.mmx.movnt.dq
, which I assume is not the right thing.
There seem to be already quite a few asm!
in stdarch so I guess using that here is acceptable? IMO it's better than just using normal loads since presumably people actually want the streaming semantics when using this operation.
Awesome, thanks. :)
After the next stdarch bump we can then remove the intrinsic from rustc.
The intrinsic is only broken on x86, it still has value on other targets.
Hm, fair. Maybe we should then document the intrinsic as "it is semantically equivalent to a regular load, just a hint", and on x86 actually compile it to just a load since that architecture doesn't have a "just a hint" version of this. For all other architectures we'd have to check whether what LLVM does there is sensible or not.
@Amanieu any chance we could get a stdarch bump in the rustc repo that includes this change? :)
We're waiting on a bootstrap bump that should happen next week.
LLVM treats
!nontemporal
as just a hint on store operations, which is unsound -- they have a totally different semantics, similar to atomic memory orderings. So I'd like to avoid any risk of that causing any issues by entirely avoiding their!nontemporal
attribute. Is it acceptable to use inline assembly to implement these intrinsics?Note that this is my first time ever writing inline assembly, so the code may or may not make any sense.^^