cvijdea-bd opened 3 weeks ago
It can be reduced to a minimal example which doesn't require -Zbuild-std (Godbolt link):
#![feature(core_intrinsics)]
#![feature(portable_simd)]
use std::simd::prelude::*;

#[inline(never)]
#[no_mangle]
pub fn test_lt_select_mask_raw(idxs: u8x32) -> u8x32 {
    unsafe {
        let m: i8x32 = core::intrinsics::simd::simd_lt(idxs, Simd::splat(32u8));
        // changing to `m: u32` here gets the good codegen; using `[u8; 4]` gets the bad codegen
        let m: [u8; 4] = core::intrinsics::simd::simd_bitmask(m);
        let m: i8x32 =
            core::intrinsics::simd::simd_select_bitmask(m, Simd::splat(-1), Simd::splat(0));
        core::intrinsics::simd::simd_select(m, idxs, Simd::splat(u8::MAX))
    }
}
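
For reference, here is a scalar sketch of what the minimal example is meant to compute, written in stable Rust with a plain `[u8; 32]` standing in for `u8x32`. The name `lt_select_reference` is made up for illustration and is not part of the issue or of `std::simd`.

```rust
/// Scalar model of `test_lt_select_mask_raw`, with `[u8; 32]` standing in
/// for `u8x32` (hypothetical helper, for illustration only).
/// Lanes below 32 are kept; every other lane becomes `u8::MAX`.
pub fn lt_select_reference(idxs: [u8; 32]) -> [u8; 32] {
    // Start from the "else" value of the select.
    let mut out = [u8::MAX; 32];
    for (o, &idx) in out.iter_mut().zip(idxs.iter()) {
        // Same predicate as `simd_lt(idxs, splat(32))`, one lane at a time.
        if idx < 32 {
            *o = idx;
        }
    }
    out
}
```

Whatever codegen the compiler picks, the vector version should agree with this lane by lane.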
Seems to be caused by https://github.com/rust-lang/portable-simd/blob/master/crates/core_simd/src/masks/bitmask.rs
It's worth noting that AVX2 / SSE codegen is equally bad when using the [u8; 4] simd_bitmask variant. It is only avoided because the mask::bitmask::Mask
implementation is only cfg-ed in on avx512f (in the non-bitmask case it uses just the simd_lt + simd_select intrinsics).
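
To illustrate why the extra mask shuffling is redundant, here is a scalar sketch of the bitmask round trip. It assumes a little-endian target where lane 0 maps to the least significant bit of the first byte; the function names are hypothetical and only model the intrinsics, they are not the actual `simd_bitmask` / `simd_select_bitmask` implementations.

```rust
/// Pack 32 lane booleans into a `u32`, lane 0 in the least significant bit.
/// (Assumed little-endian bit order; hypothetical model of the `u32` form.)
pub fn to_bitmask_u32(lanes: [bool; 32]) -> u32 {
    let mut bits = 0u32;
    for (i, &set) in lanes.iter().enumerate() {
        if set {
            bits |= 1u32 << i;
        }
    }
    bits
}

/// The same 32 bits viewed as `[u8; 4]`, modelling the array form.
pub fn to_bitmask_bytes(lanes: [bool; 32]) -> [u8; 4] {
    to_bitmask_u32(lanes).to_le_bytes()
}

/// Expand a byte-array bitmask back into vector-style mask lanes:
/// -1 (all bits set) where the bit is set, 0 otherwise.
pub fn select_bitmask(bytes: [u8; 4]) -> [i8; 32] {
    let mut out = [0i8; 32];
    for (i, o) in out.iter_mut().enumerate() {
        if (bytes[i / 8] >> (i % 8)) & 1 == 1 {
            *o = -1;
        }
    }
    out
}
```

Under these assumptions `select_bitmask(to_bitmask_bytes(m))` gives back exactly the lanes of `m`, and the `[u8; 4]` and `u32` forms hold identical bits, so ideally both would lower to the same instructions.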
Discussed on Zulip: https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd/topic/simd.3A.3AMask.20codegen.20on.20avx512
I tried this code (Godbolt link):
I expected to see this happen: simd_lt + simd_select is compiled to a clean 3-instruction sequence (AVX512: vpcmpltub + vpcmpeqd + vmovdqu8-with-mask, AVX2: vpmaxub + vpcmpeqb + vpor). This is the case with pre-built std, see Godbolt.
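
As a sanity check on the AVX2 sequence, here is a per-lane sketch of how I read vpmaxub + vpcmpeqb + vpor. This is my interpretation of the instruction trio rather than something taken from the compiler output, and both function names are made up for the example.

```rust
/// Per-lane model of the AVX2 sequence (an interpretation, not compiler output).
pub fn lt_select_avx2_model(x: u8) -> u8 {
    // vpmaxub: unsigned max of the lane and the splatted constant 32.
    let max = x.max(32);
    // vpcmpeqb: all-ones when max == x, which holds exactly when x >= 32.
    let ge_mask: u8 = if max == x { 0xFF } else { 0x00 };
    // vpor: lanes with x >= 32 become 0xFF, the others pass through unchanged.
    x | ge_mask
}

/// Direct statement of `select(x < 32, x, u8::MAX)` for one lane.
pub fn lt_select_scalar(x: u8) -> u8 {
    if x < 32 { x } else { u8::MAX }
}
```

The two functions agree for every `u8` value, which is why no explicit mask register or blend is needed on AVX2.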
Instead, this happened: with -Zbuild-std, vpcmpltub is followed by lots of redundant shuffling of mask registers
RUSTFLAGS=-Ctarget-cpu=sapphirerapids cargo build --release -Z build-std --target x86_64-unknown-linux-gnu
Messy assembly
```
00000000000596b0 <_ZN9test_simd14test_lt_select17h4bc8c118b05e9d0dE>:
   596b0: 62 f3 7d 28 3e 05 25    vpcmpltub k0,ymm0,YMMWORD PTR [rip+0xfffffffffffab725]        # 4de0
```
Without -Zbuild-std, the generated LLVM IR is a beautiful icmp ult followed by select.
With -Zbuild-std, the LLVM IR is as much of a mess as the generated assembly:
Messy IR
```llvm
; test_simd::test_lt_select
; Function Attrs: mustprogress nofree noinline norecurse nosync nounwind nonlazybind willreturn memory(argmem: write) uwtable
define internal fastcc void @_ZN9test_simd14test_lt_select17h4bc8c118b05e9d0dE(ptr dead_on_unwind noalias nocapture noundef writable writeonly align 32 dereferenceable(32) %_0, <32 x i8> %idxs.0.val) unnamed_addr #3 !dbg !2383 {
start:
    #dbg_declare(ptr undef, !2385, !DIExpression(), !2386)
    #dbg_declare(ptr undef, !2387, !DIExpression(), !2391)
    #dbg_declare(ptr undef, !2393, !DIExpression(), !2398)
    #dbg_value(<32 x i8>
```
With -Zbuild-std, but a target-cpu without avx512 (e.g. x86-64-v3), the IR and assembly are beautiful again.
rustc --version --verbose
Reproduced with and without lto = "thin", also on Windows, and also with different target-cpu values (x86-64-v4, skylake-avx512).