rust-lang / stdarch

Rust's standard library vendor-specific APIs and run-time feature detection
https://doc.rust-lang.org/stable/core/arch/
Apache License 2.0
599 stars 260 forks source link

non-temporal stores: use inline assembly #1541

Closed RalfJung closed 1 month ago

RalfJung commented 5 months ago

LLVM treats !nontemporal as just a hint on store operations, which is unsound -- they have a totally different semantics, similar to atomic memory orderings. So I'd like to avoid any risk of that causing any issues by entirely avoiding their !nontemporal attribute. Is it acceptable to use inline assembly to implement these intrinsics?

Note that this is my first time ever writing inline assembly, so the code may or may not make any sense.^^

rustbot commented 5 months ago

r? @Amanieu

rustbot has assigned @Amanieu. They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Amanieu commented 5 months ago

My understanding is that LLVM can turn a nontemporal store into a normal one, but not the other way around. This seems to be fine as far as I understand.


The CI failure happens because the target_feature attribute only enables sse2 and rustc isn't smart enough to figure out that this implies sse (only LLVM knowns that). You fix it by enabling the sse feature as well.

RalfJung commented 5 months ago

My understanding is that LLVM can turn a nontemporal store into a normal one, but not the other way around. This seems to be fine as far as I understand.

It's completely unclear. LangRef talks about it as a hint:

The optional !nontemporal metadata must reference a single metadata name corresponding to a metadata node with one i32 entry of value 1. The existence of the !nontemporal metadata on the instruction tells the optimizer and code generator that this load is not expected to be reused in the cache. The code generator may select special instructions to save cache bandwidth, such as the MOVNT instruction on x86.

That would mean the flag can be added or removed arbitrarily ("this load is not expected to be reused in the cache" -- but no semantic constraints or anything). But that's clearly wrong. LLVM doesn't acknowledge in the slightest the extra UB that can be caused by non-temporal stores (https://github.com/llvm/llvm-project/issues/64521). Therefore I have zero confidence that anyone thought about how !nontemporal interacts with all the LLVM passes that work on load (almost all of which probably just ignore the attribute entirely). I'm not even aware of any cross-platform memory model with support for nontemporal stores that they could be using here -- and they clearly need a cross-platform memory model since they are doing optimizations in the context of a C++11-style model.

RalfJung commented 5 months ago
SDE ERROR:  TID: 1064 executed instruction with an unaligned memory reference to address 0x7f27229035e0 INSTR: 0x562d8a5e21f3: IFORM: VMOVNTPS_MEMf32_ZMMf32_AVX512 :: vmovntps zmmword ptr [rax], zmm0
    IMAGE:    /checkout/target/x86_64-unknown-linux-gnu/release/deps/core_arch-59198cd2fc79a24a
    FUNCTION: _ZN9core_arch9core_arch3x867avx512f5tests20test_mm512_stream_ps20test_mm512_stream_ps17hb7f0b28acc824410E.llvm.13799798511543115899
    FUNCTION ADDR: 0x562d8a5e21c0

Hm, yes, this requires alignment, but that shouldn't be new...?

RalfJung commented 1 month ago

I think I found where LLVM defines the x86 intrinsics: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/IntrinsicsX86.td.

I found nothing with "stream" in the name, and the only "movnt" is int_x86_mmx_movnt_dq, probably accessible via llvm.x86.mmx.movnt.dq, which I assume is not the right thing.

There seem to be already quite a few asm! in stdarch so I guess using that here is acceptable? IMO it's better than just using normal loads since presumably people actually want the streaming semantics when using this operation.

RalfJung commented 1 month ago

Awesome, thanks. :)

After the next stdarch bump we can then remove the intrinsic from rustc.

Amanieu commented 1 month ago

The intrinsic is only broken on x86, it still has value on other targets.

RalfJung commented 1 month ago

Hm, fair. Maybe we should then document the intrinsic as "it is semantically equivalent to a regular load, just a hint", and on x86 actually compile it to just a load since that architecture doesn't have a "just a hint" version of this. For all other architectures we'd have to check whether what LLVM does there is sensible or not.

RalfJung commented 2 weeks ago

@Amanieu any chance we could get a stdarch bump in the rustc repo that includes this change? :)

Amanieu commented 2 weeks ago

We're waiting on a bootstrap bump that should happen next week.