ManuelCostanzo opened this issue 3 years ago
Hello.
We will support 512-bit vectors. However, you'll need to enable the relevant target features during compilation, because by default Rust binaries are not compiled with AVX-512 enabled.
Thank you for the reply! And what features do I have to enable?
You'd usually pass a target-feature list via RUSTFLAGS during the build.
The allowed features are the same as for the target_feature attribute.
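For instance, an illustrative invocation might look like the following (the exact feature list depends on your CPU; these flag names are the standard rustc codegen options, but the chosen features here are just examples):

```shell
# Illustrative only: enable the AVX-512 foundation feature (plus one
# extension) for this build. A binary built this way will fault at
# startup of the vectorized code on CPUs that lack these features.
RUSTFLAGS="-C target-feature=+avx512f,+avx512dq" cargo build --release

# Alternatively, let rustc enable every feature of the build machine:
RUSTFLAGS="-C target-cpu=native" cargo build --release
```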
It appears that you can't enable AVX-512 on stable yet.
Perhaps @Amanieu knows more? I've seen them merging work in stdarch lately.
I made an N-body algorithm implementation, and on my server, compiling with target-cpu=knl performs much better than with target-cpu=native.
It's as if it vectorizes better, without adding any target-feature.
Although I'm not on a KNL, the server does support similar instructions, and for some reason it works better (I mean, the algorithm takes less time to finish).
Currently we tie the stabilization of target_feature features with the implementation of the relevant intrinsics in stdarch (the AVX-512 intrinsics are still incomplete). However, I think we should separate these now that we have stdsimd.
Knight's Landing chips lack the narrower-width SSE instructions, so it is likely that some things that lower to SSE instructions with -Ctarget-cpu=native are lowering to AVX instructions with -Ctarget-cpu=knl.
I just pestered everyone by mentioning this in the Zulip, so I should mention it here too: "AVX-512" is by no means a single unitary feature. There is avx512f ("F" for "Foundation", perhaps?) plus extension features for AVX-512 that build on top of avx512f, so that's something to be aware of. Our main attention will be on supporting the concept of 512-bit vectors abstractly in our API, in a vendor-neutral manner, so that the compiler can do the best job it can with a desired intention without the programmer having to get into the nitty-gritty specifics of Intel's API.
@ManuelCostanzo Note that gcc/clang/icc generally avoid 512-bit registers even when compiling for skylake-avx512, due to license-based downclocking, which includes stalls at frequency transitions. You have to specifically request them with something like -mprefer-vector-width=512. Ice Lake has a big improvement in downclocking, so we might see compilers using 512-bit registers by default there. rustc --print target-features suggests that there is no way to encourage the compiler to actually use 512-bit registers (which is what you want if you spend lots of time in sustained AVX-512 code). Unfortunately, I believe that will also be entirely out of our hands unless LLVM provides a mechanism for encouraging it. Using target-cpu=native may help in some cases?
Aha! rustc -Ctarget-cpu=skylake-avx512 -Ctarget-feature=-prefer-256-bit. It's confusing because +prefer-256-bit is the default, and you specify that you want 512-bit registers by disabling it. I'd been looking for +prefer-512-bit, which doesn't exist.
At first I thought to myself, "shouldn't this issue be closed?", since it's not something the Portable SIMD API can help with per se. However, past and future Jubilees, please consider: these specific instructions for targeting AVX-512-enabled architectures should probably go somewhere, and from that "guide-level" perspective, this is within the scope of our mandate.
AVX512 would certainly be nice for cryptography. For example, curve25519-dalek has a backend leveraging AVX512-IFMA.
GHASH (used by AES-GCM) also benefits from VPCLMULQDQ, but it's already possible to leverage that from Rust just by using target-cpu=skylake.
The Keccak sponge function (used by the SHA3 family and the KangarooTwelve XOF) is another example of an algorithm that could benefit: https://github.com/XKCP/K12/blob/master/lib/Optimized64/KeccakP-1600-AVX512-plainC.c
@tarcieri When I said "specific instructions" I meant for human usage.
Conversely, guaranteeing that specific machine instructions (including for specific SIMD architectures) are compiled into the binary is not actually in scope for the SIMD API project, as paradoxical as that may seem, so usages like those will likely continue to depend on core::arch::x86_64, etc.
If I understand what you're saying, there are specific logical operations the above AVX512 use cases map to, but there may not be corresponding Rust traits for those operations.
The curve25519-dalek use case requires a multiply-accumulate operation, namely fused multiply-add.
The GHASH use case is carryless multiplication. I'm not sure what a good API is for distinguishing that from a more traditional multiply-with-carry.
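For illustration (a hypothetical scalar sketch, not any crate's API), carry-less multiplication treats its operands as polynomials over GF(2), so partial products are combined with XOR rather than addition; that is exactly why it can't be expressed as an ordinary integer multiply:

```rust
// Scalar sketch of carry-less (polynomial) multiplication over GF(2).
// Hardware does this in one instruction (PCLMULQDQ / VPCLMULQDQ); this
// loop just demonstrates the semantics.
fn clmul(a: u64, b: u64) -> u128 {
    let mut r = 0u128;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            // XOR instead of add: partial products never carry.
            r ^= (a as u128) << i;
        }
    }
    r
}

fn main() {
    // With carries, 3 * 3 = 9; carry-less, 3 clmul 3 = 5,
    // because (x + 1)^2 = x^2 + 1 over GF(2).
    println!("{}", clmul(3, 3));
}
```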
Keccak is simple bitwise ops like XOR and shuffles.
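As a sketch of that point (plain arrays, no intrinsics, assuming the standard 5x5 lane layout of the Keccak state), the theta column-parity step is exactly the kind of straight-line XOR code that auto-vectorizes well:

```rust
// Column parity from Keccak's theta step: XOR the five lanes in each
// column of the 5x5 state. Pure bitwise ops over independent columns,
// so LLVM can map this onto wide vector XORs.
fn column_parity(state: &[u64; 25]) -> [u64; 5] {
    let mut c = [0u64; 5];
    for x in 0..5 {
        c[x] = state[x] ^ state[x + 5] ^ state[x + 10] ^ state[x + 15] ^ state[x + 20];
    }
    c
}

fn main() {
    let mut state = [0u64; 25];
    for (i, lane) in state.iter_mut().enumerate() {
        *lane = i as u64;
    }
    println!("{:?}", column_parity(&state));
}
```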
FMA will likely be supported at some point (regardless of AVX-512). Unfortunately, LLVM doesn't expose carry-less multiply (https://groups.google.com/g/llvm-dev/c/5cpOboKOBg4/m/kJ9z_xkVAQAJ), so you'd probably need to use std::arch for that.
Knight's Landing chips lack the narrower-width SSE instructions, so it is likely that some things that lower to SSE instructions with -Ctarget-cpu=native are lowering to AVX instructions with -Ctarget-cpu=knl.
This was wrong, actually! It is Knight's Corner and Knight's Ferry that don't support SSE! KNL does support SSE, but it has the really wide vectors plus some other performance quirks that cause LLVM to favor using big fat full vectors.
It'll be 256-bit AVX/FMA versus AVX-512. KNL didn't suffer the license-based downclocking, so compilers issue 512-bit (zmm) instructions by default there. They need coaxing to issue those when targeting skylake-avx512 (https://github.com/rust-lang/stdsimd/issues/28#issuecomment-729869709). Note that the skylake target does not support AVX-512 at all.
Hi, I'm starting to learn Rust for scientific computing. Has this issue already been resolved elsewhere by the Rust developers?
@mhnatiuk It depends what you're striving for. rustc makes portable binaries by default, but you can either change the target globally (see the examples :point_up:; this is the most common approach in scientific computing) or compile multiple variants of the hot, vectorizable parts of your code and specialize at run time (nicer for packaging and distribution).
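The run-time specialization approach can be sketched like this (illustrative names, not any crate's API; AVX2 is used because its detection macro and target_feature attribute are stable today, but the same shape applies once the AVX-512 features stabilize):

```rust
// Runtime dispatch: detect a CPU feature once, then call a variant of
// the hot loop compiled with that feature enabled.
#[cfg(target_arch = "x86_64")]
fn sum(v: &[f32]) -> f32 {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: we just checked that AVX2 is available on this CPU.
        unsafe { sum_avx2(v) }
    } else {
        sum_scalar(v)
    }
}

// Same source as the scalar version; the attribute lets LLVM
// auto-vectorize this copy with AVX2 instructions.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(v: &[f32]) -> f32 {
    v.iter().sum()
}

fn sum_scalar(v: &[f32]) -> f32 {
    v.iter().sum()
}

fn main() {
    let v: Vec<f32> = (0..8).map(|i| i as f32).collect();
    #[cfg(target_arch = "x86_64")]
    println!("{}", sum(&v));
    #[cfg(not(target_arch = "x86_64"))]
    println!("{}", sum_scalar(&v));
}
```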
portable-simd does seem to hit the AVX-512 instruction set when compiled with target-cpu=native.
This is implicitly deduced from the fact that the performance of a masked sum equals that of an un-masked sum when the mask is represented as a bitmap. See https://github.com/DataEngineeringLabs/simd-benches#bench-results-on-native for details. The particular comparison is "Sum of nullable values (Bitmap)" vs "Sum of values".
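For context (an illustrative reconstruction of the benchmarked operation, not the benchmark's actual code), a masked sum over a validity bitmap is a branch-free select-then-add, which is why it can run at the same speed as the unmasked sum once vectorized:

```rust
// Illustrative masked sum: each value is kept or zeroed according to a
// validity mask, then summed. Multiplying by 0/1 (rather than branching)
// keeps the loop straight-line, so it vectorizes like the plain sum.
fn masked_sum(values: &[u64], mask: &[bool]) -> u64 {
    values
        .iter()
        .zip(mask)
        .map(|(&v, &keep)| v * keep as u64)
        .sum()
}

fn sum(values: &[u64]) -> u64 {
    values.iter().sum()
}

fn main() {
    let values: Vec<u64> = (1..=8).collect();
    let mask = [true, false, true, false, true, false, true, false];
    println!("{} {}", sum(&values), masked_sum(&values, &mask));
}
```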
Hi,
any news on this issue? We are now in 2024, and with rustc 1.76 I get:
the target feature avx512f is currently unstable
How could I help to stabilize it?
I guess the right place to ask would be https://github.com/rust-lang/stdarch/issues/310 ?
Currently we tie the stabilization of target_feature features with the implementation of the relevant intrinsics in stdarch (the AVX512 intrinsics are still incomplete). However I think we should separate these now that we have stdsimd.
I think this is the relevant comment: someone will need to spend some time splitting the feature and stabilizing the target features, leaving the intrinsics for another time. I'm not sure if there's any good reason for holding back stabilization at this point.
I agree that it's fine not to block the target feature on the intrinsics.
Notably, it would be nice to have the target_feature stable so ZMM registers can be used with inline assembly, even if the relevant intrinsics aren't stable.
Notably, it would be nice to have the target_feature stable so ZMM registers can be used with inline assembly, even if the relevant intrinsics aren't stable.
Using ZMM registers as clobbered registers (i.e. out("zmm0") _) together with runtime detection via the raw-cpuid or cpufeatures crate seems to work. If you don't need ZMM registers as inputs or outputs, this may be enough.
@tarcieri I would love your insight on this method.
@mert-kurttutan While we could potentially go out of our way to avoid using ZMM registers as inputs/outputs, what we'd really like to eventually use are intrinsics like _mm512_aesenc_epi128, which take ZMM registers as inputs and outputs. But in the meantime, we could use an asm! polyfill instead, or at least we could if ZMM registers were stable.
Really these operations benefit the most from always being able to keep data in ZMM registers, and unless we have a stable way to get data in and out of them it involves hoisting more and more into the inline assembly to fill those ZMM registers. We also have algorithms factored into different crates where it would be nice to be able to keep data in ZMM registers even when calling functions between crates.
Hello!
I want to ask whether this crate supports AVX-512 instructions. If not, is it in the plans to support it? Would this be the definitive crate for SIMD in Rust? Because I understand that the one in the official documentation no longer has support.
Thanks