rust-lang / portable-simd

The testing ground for the future of portable SIMD in Rust
Apache License 2.0

Status of AVX-512? #28

Open ManuelCostanzo opened 3 years ago

ManuelCostanzo commented 3 years ago

Hello !

I want to ask whether this crate supports AVX-512 instructions. If not, are there plans to support it? Would this be the definitive crate for SIMD in Rust? I understand that the one in the official documentation is no longer supported.

Thanks

Lokathor commented 3 years ago

Hello.

We will support 512-bit vectors. However, you'll need to enable the relevant target features at compile time, because by default Rust binaries are not built with AVX-512 enabled.

ManuelCostanzo commented 3 years ago

Thank you for the reply! Which features do I have to enable?

Lokathor commented 3 years ago

You'd usually pass a `target-feature` list in the `RUSTFLAGS` value during the build.

The allowed features are the same as for the `target_feature` attribute.

It appears that you can't enable AVX-512 on stable yet.

Perhaps @Amanieu knows more? I've seen them merging work in stdarch lately.

ManuelCostanzo commented 3 years ago

I made an N-body algorithm implementation, and on my server, compiling with `-C target-cpu=knl` works much better than with `-C target-cpu=native`.

It's as if it vectorizes better, but without adding any `target-feature`.

Although I'm not on a KNL, the server does support similar instructions, and for some reason it performs better (that is, the algorithm takes less time to finish).

Amanieu commented 3 years ago

Currently we tie the stabilization of `target_feature` features to the implementation of the relevant intrinsics in `stdarch` (the AVX-512 intrinsics are still incomplete). However, I think we should separate these now that we have `stdsimd`.

workingjubilee commented 3 years ago

Knight's Landing chips lack the narrower-width SSE instructions, so it is likely that some things that lower to SSE instructions with `-C target-cpu=native` are lowering to AVX instructions with `-C target-cpu=knl`.

workingjubilee commented 3 years ago

I just pestered everyone by mentioning this in the Zulip, so I should mention it here too: "AVX-512" is by no means a single unitary feature. There is `avx512f` ("F" for "Foundation", perhaps?), plus a set of extension features that build on top of `avx512f`, so that's something to be aware of. Our main attention will be on supporting 512-bit vectors abstractly in our API, in a vendor-neutral manner, so that the compiler can do the best job it can with the programmer's intent without the programmer having to get into the nitty-gritty specifics of Intel's API.

jedbrown commented 3 years ago

@ManuelCostanzo Note that gcc/clang/icc generally avoid 512-bit registers even when compiling for `skylake-avx512`, due to license-based downclocking, which includes stalls at frequency transitions. You have to specifically request them with something like `-mprefer-vector-width=512`. Ice Lake brings a big improvement in downclocking, so we might see compilers use 512-bit registers by default. `rustc --print target-features` suggests that there is currently no way to encourage the compiler to actually use 512-bit registers (which is what you want if you spend lots of time in sustained AVX-512 code).

calebzulawski commented 3 years ago

Unfortunately I believe that will also be entirely out of our hands unless LLVM provides a mechanism for encouraging it. Using target-cpu=native may help in some cases?

jedbrown commented 3 years ago

Aha! `rustc -C target-cpu=skylake-avx512 -C target-feature=-prefer-256-bit`. It's confusing because `+prefer-256-bit` is the default and one requests 512-bit by disabling it; I'd been looking for `+prefer-512-bit`, which doesn't exist.

https://godbolt.org/z/nEMWz9

workingjubilee commented 3 years ago

At first I was considering to myself, "shouldn't this issue be closed?" since it's not something the Portable SIMD API can help with per se. However, past and future Jubilees, please consider: These specific instructions on targeting AVX512-enabled architectures should probably go somewhere, and from that "guide-level" perspective, this is within the scope of our mandate.

tarcieri commented 3 years ago

AVX512 would certainly be nice for cryptography. For example, curve25519-dalek has a backend leveraging AVX512-IFMA.

GHASH (used by AES-GCM) also benefits from VPCLMULQDQ, but it's already possible to leverage that from Rust just by using `target-cpu=skylake`.

The Keccak sponge function (used by the SHA3 family and the KangarooTwelve XOF) is another example of an algorithm that could benefit: https://github.com/XKCP/K12/blob/master/lib/Optimized64/KeccakP-1600-AVX512-plainC.c

workingjubilee commented 3 years ago

@tarcieri When I said "specific instructions" I meant for human usage.

Conversely, guaranteeing that specific machine instructions, including those of specific SIMD architectures, are compiled into the binary is not actually in scope for the SIMD API project, as much of a paradox as that may seem, so usages like those will likely continue to depend on `core::arch::x86_64`, etc.

tarcieri commented 3 years ago

If I understand what you're saying, there are specific logical operations the above AVX512 use cases map to, but there may not be corresponding Rust traits for those operations.

The curve25519-dalek use case requires a multiply-accumulate operation, namely fused multiply–add.

The GHASH use case is carryless multiplication. I'm not sure what a good API is for distinguishing that from a more traditional multiply-with-carry.

Keccak is simple bitwise ops like XOR and shuffles.

calebzulawski commented 3 years ago

FMA will likely be supported at some point (regardless of AVX-512). Unfortunately, LLVM doesn't expose carry-less multiply (https://groups.google.com/g/llvm-dev/c/5cpOboKOBg4/m/kJ9z_xkVAQAJ), so you'd probably need to use `std::arch` for that.

workingjubilee commented 3 years ago

Knight's Landing chips lack the narrower-width SSE instructions, so it is likely that some things that lower to SSE instructions with `-C target-cpu=native` are lowering to AVX instructions with `-C target-cpu=knl`.

This was wrong, actually! It is Knight's Corner and Knight's Ferry that don't support SSE! KNL does support SSE, but it has the really wide vectors plus some other performance quirks that cause LLVM to favor using big, full-width vectors.

jedbrown commented 3 years ago

It'll be 256-bit AVX/FMA versus AVX-512. KNL didn't suffer the license-based downclocking so compilers issue 512-bit (zmm) instructions by default. They need coaxing to issue those when targeting skylake-avx512 (https://github.com/rust-lang/stdsimd/issues/28#issuecomment-729869709). Note that the skylake target does not support AVX-512 at all.

mhnatiuk commented 2 years ago

Hi, I'm starting to learn Rust for scientific computing. Is this issue already resolved elsewhere by rust developers?

jedbrown commented 2 years ago

@mhnatiuk It depends what you're striving for. rustc makes portable binaries by default, but you can either change the target globally (see examples :point_up:; this is the most common approach in scientific computing) or compile multiple variants of hot vectorizable parts of your code and specialize at run-time (nicer for packaging and distribution).

jorgecarleitao commented 2 years ago

portable-simd does seem to hit the AVX-512 instruction set when compiled with `-C target-cpu=native`.

This is inferred from the fact that the performance of a masked sum equals that of an un-masked sum when the mask is represented as a bitmap. See https://github.com/DataEngineeringLabs/simd-benches#bench-results-on-native for details. The relevant comparison is "Sum of nullable values (Bitmap)" vs "Sum of values".

JeWaVe commented 5 months ago

Hi,

Any news on this issue? We are now in 2024, and with rustc 1.76 I get

the target feature `avx512f` is currently unstable

How could I help stabilize it?

HadrienG2 commented 4 months ago

I guess the right place to ask would be https://github.com/rust-lang/stdarch/issues/310 ?

calebzulawski commented 4 months ago

Currently we tie the stabilization of `target_feature` features to the implementation of the relevant intrinsics in `stdarch` (the AVX-512 intrinsics are still incomplete). However, I think we should separate these now that we have `stdsimd`.

I think this is the relevant comment: someone will need to spend some time splitting the feature, stabilizing the target features, and leaving the intrinsics for another time. I'm not sure there's any good reason for holding back stabilization at this point.

Amanieu commented 4 months ago

I agree that it's fine not to block the target feature on the intrinsics.

tarcieri commented 4 months ago

Notably, it would be nice to have the `target_feature` stable so that ZMM registers can be used with inline assembly, even if the relevant intrinsics aren't stable.

mert-kurttutan commented 6 days ago

Notably, it would be nice to have the `target_feature` stable so that ZMM registers can be used with inline assembly, even if the relevant intrinsics aren't stable.

Using ZMM registers as clobbered registers (i.e. `out("zmm0") _`), together with runtime detection via the `raw-cpuid` or `cpufeatures` crate, seems to work. If you don't need ZMM registers as inputs or outputs, this may be enough. @tarcieri I would love to hear your thoughts on this approach.

tarcieri commented 2 days ago

@mert-kurttutan While we could potentially go out of our way to avoid using ZMM registers as inputs/outputs, what we'd really like to eventually use are intrinsics like `_mm512_aesenc_epi128`, which take ZMM registers as inputs and outputs. In the meantime we can use an `asm!` polyfill instead, or at least we could if ZMM registers were stable.

Really, these operations benefit most from being able to keep data in ZMM registers at all times; without a stable way to get data in and out of them, more and more has to be hoisted into the inline assembly just to fill those registers. We also have algorithms factored into different crates, where it would be nice to keep data in ZMM registers even when calling functions across crates.