rust-lang / portable-simd

The testing ground for the future of portable SIMD in Rust
Apache License 2.0

swizzle_dyn does not compile to pshufb #428

Closed cvijdea-bd closed 3 months ago

cvijdea-bd commented 3 months ago

I tried this code (Godbolt link):

#![feature(portable_simd)]
use std::simd::u8x16;

const LOOKUP: [u8; 16] = [0, 1, 2, 6, 1, 2, 3, 7, 2, 3, 4, 8, 6, 7, 8, 12];

pub fn do_the_swizzle(idx: &[u8; 16]) -> u8x16 {
    let lookup = u8x16::from_array(LOOKUP);
    let idx = u8x16::from_array(*idx);

    // if you uncomment the following line, some bounds checks are removed
    // let idx = idx & u8x16::splat(0b1111);
    // should be pshufb, compiles to pinsrb / pextrb monstrosity
    lookup.swizzle_dyn(idx)
}

I expected to see this happen: swizzle_dyn compiles to pshufb

Instead, this happened: compiles to 16 pextrb / pinsrb pairs

Output with `-Copt-level=3 -Ctarget-cpu=skylake-avx512`:

```asm
.LCPI0_1:
        .byte   0
        .byte   1
        .byte   2
        .byte   6
        .byte   1
        .byte   2
        .byte   3
        .byte   7
        .byte   2
        .byte   3
        .byte   4
        .byte   8
        .byte   6
        .byte   7
        .byte   8
        .byte   12
.LCPI0_2:
        .zero   4,15
example::do_the_swizzle::hff4dc3528cebccd9:
        mov     rax, rdi
        vmovdqu xmm0, xmmword ptr [rsi]
        vpandd  xmm0, xmm0, dword ptr [rip + .LCPI0_2]{1to4}
        vmovaps xmm1, xmmword ptr [rip + .LCPI0_1]
        vmovaps xmmword ptr [rsp - 24], xmm1
        vpextrb ecx, xmm0, 0
        movzx   ecx, byte ptr [rsp + rcx - 24]
        vmovd   xmm1, ecx
        vpextrb ecx, xmm0, 1
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 1
        vpextrb ecx, xmm0, 2
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 2
        vpextrb ecx, xmm0, 3
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 3
        vpextrb ecx, xmm0, 4
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 4
        vpextrb ecx, xmm0, 5
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 5
        vpextrb ecx, xmm0, 6
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 6
        vpextrb ecx, xmm0, 7
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 7
        vpextrb ecx, xmm0, 8
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 8
        vpextrb ecx, xmm0, 9
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 9
        vpextrb ecx, xmm0, 10
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 10
        vpextrb ecx, xmm0, 11
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 11
        vpextrb ecx, xmm0, 12
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 12
        vpextrb ecx, xmm0, 13
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 13
        vpextrb ecx, xmm0, 14
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 14
        vpextrb ecx, xmm0, 15
        vpinsrb xmm0, xmm1, byte ptr [rsp + rcx - 24], 15
        vmovdqa xmmword ptr [rdi], xmm0
        ret
```
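For reference, the per-lane behaviour that the extract/insert sequence implements can be written as a scalar model. This is only a sketch, assuming the documented `swizzle_dyn` semantics that an out-of-range index selects 0; the name `swizzle_dyn_scalar` is hypothetical and not part of `std::simd`.

```rust
/// Scalar model of a 16-lane dynamic byte swizzle (hypothetical helper,
/// not a std::simd API). Each output lane is `table[idx[lane]]`, or 0
/// when the index is out of range (assumed from the swizzle_dyn docs).
pub fn swizzle_dyn_scalar(table: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (o, &i) in out.iter_mut().zip(idx.iter()) {
        // `get` returns None for indices >= 16, which maps to 0.
        *o = table.get(i as usize).copied().unwrap_or(0);
    }
    out
}
```

When every index is first masked with `0b1111`, the out-of-range case can never occur, which is presumably why the mask in the snippet above lets the compiler drop some bounds checks.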

Meta

rustc --version --verbose:

rustc 1.82.0-nightly (6de928dce 2024-08-18)

cvijdea-bd commented 3 months ago

Oh, I missed the note:

/// Note that the current implementation is selected during build-time
/// of the standard library, so `cargo build -Zbuild-std` may be necessary
/// to unlock better performance, especially for larger vectors.
/// A planned compiler improvement will enable using `#[target_feature]` instead.

This likely explains it.

cvijdea-bd commented 3 months ago

Yes, testing locally with -Zbuild-std produces the expected code.
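Where rebuilding the standard library is not an option, one possible alternative is to call the SSSE3 intrinsic directly through `std::arch` with runtime feature detection. This is a sketch of that approach, not what was tested in this thread; the names `shuffle_ssse3` and `lookup16` are hypothetical. Note that it follows `pshufb` semantics (an index with the high bit set yields 0, otherwise only the low 4 bits are used), which differs from `swizzle_dyn` for indices 16..=127.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_shuffle_epi8, _mm_storeu_si128};

/// Byte shuffle via the SSSE3 `pshufb` instruction (hypothetical helper).
/// Safety: the caller must ensure the CPU supports SSSE3.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn shuffle_ssse3(table: &[u8; 16], idx: &[u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    unsafe {
        // Unaligned loads, since plain byte arrays carry no 16-byte alignment.
        let t = _mm_loadu_si128(table.as_ptr() as *const __m128i);
        let i = _mm_loadu_si128(idx.as_ptr() as *const __m128i);
        let r = _mm_shuffle_epi8(t, i);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, r);
    }
    out
}

/// 16-entry table lookup with pshufb semantics (hypothetical helper).
/// Uses SSSE3 when detected at runtime, otherwise a scalar fallback.
pub fn lookup16(table: &[u8; 16], idx: &[u8; 16]) -> [u8; 16] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            // Sound because SSSE3 support was just checked.
            return unsafe { shuffle_ssse3(table, idx) };
        }
    }
    // Scalar fallback mirroring pshufb: a set high bit zeroes the lane,
    // otherwise the low 4 bits select the table entry.
    let mut out = [0u8; 16];
    for (o, &i) in out.iter_mut().zip(idx.iter()) {
        if i & 0x80 == 0 {
            *o = table[(i & 0x0F) as usize];
        }
    }
    out
}
```

This trades portability for control over instruction selection; the `-Zbuild-std` route confirmed above keeps the portable `swizzle_dyn` call unchanged.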