Closed by cvijdea-bd 3 months ago
Oh, I missed the note:
> /// Note that the current implementation is selected during build-time
> /// of the standard library, so `cargo build -Zbuild-std` may be necessary
> /// to unlock better performance, especially for larger vectors.
> /// A planned compiler improvement will enable using `#[target_feature]` instead.
This likely explains it.
Yes, testing locally with `-Zbuild-std` produces the expected code.
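For anyone else hitting this, here is a rough sketch of why the flag matters. It assumes the standard library picks its `swizzle_dyn` backend with a compile-time `cfg` check on target features; the exact condition used inside the library is an assumption here, and `selected_backend` is a hypothetical name used only for illustration.

```rust
/// Hypothetical stand-in for the build-time selection described in the note.
/// `cfg!` is evaluated when the crate containing it is compiled, so a
/// precompiled standard library keeps whatever answer it got at its own
/// build time, regardless of the `-Ctarget-cpu` passed to the user crate.
pub fn selected_backend() -> &'static str {
    if cfg!(all(target_arch = "x86_64", target_feature = "ssse3")) {
        // Chosen only when this crate is built with SSSE3 enabled.
        "pshufb"
    } else {
        // Chosen for a baseline x86_64 build (SSE2 only) or another arch.
        "per-lane fallback"
    }
}
```

If the same check lives in a library built for baseline x86_64, it takes the fallback branch, which would match the output in the report. Rebuilding the standard library with `-Zbuild-std` under the same flags re-evaluates the check.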
I tried this code (Godbolt link):
I expected to see this happen: `swizzle_dyn` compiles to `pshufb`
Instead, this happened: compiles to 16 `pextrb` / `pinsrb` pairs
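As a reading aid, the emitted code appears to be equivalent to the scalar loop sketched here. The table bytes and the `& 15` mask are read off the `.LCPI0_1` and `.LCPI0_2` constants in the assembly output in this report; the function name `swizzle_scalar` is hypothetical, and this is a model of the generated code, not the original source behind the Godbolt link.

```rust
/// Lookup table taken from the `.LCPI0_1` constant in the assembly output.
const TABLE: [u8; 16] = [0, 1, 2, 6, 1, 2, 3, 7, 2, 3, 4, 8, 6, 7, 8, 12];

/// Scalar model of the 16 extract/insert pairs: each output lane is a
/// table byte selected by the corresponding (masked) index byte.
pub fn swizzle_scalar(idx: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for lane in 0..16 {
        // `vpandd` with a broadcast of 15 keeps every index in 0..=15.
        let i = (idx[lane] & 15) as usize;
        // One `vpextrb` (read index) plus one `vpinsrb` (write table byte).
        out[lane] = TABLE[i];
    }
    out
}
```

A single `pshufb` performs all 16 lookups at once, which is why the per-lane sequence is a noticeable regression.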
Output with `-Copt-level=3 -Ctarget-cpu=skylake-avx512`
```asm
.LCPI0_1:
        .byte   0
        .byte   1
        .byte   2
        .byte   6
        .byte   1
        .byte   2
        .byte   3
        .byte   7
        .byte   2
        .byte   3
        .byte   4
        .byte   8
        .byte   6
        .byte   7
        .byte   8
        .byte   12
.LCPI0_2:
        .zero   4,15
example::do_the_swizzle::hff4dc3528cebccd9:
        mov     rax, rdi
        vmovdqu xmm0, xmmword ptr [rsi]
        vpandd  xmm0, xmm0, dword ptr [rip + .LCPI0_2]{1to4}
        vmovaps xmm1, xmmword ptr [rip + .LCPI0_1]
        vmovaps xmmword ptr [rsp - 24], xmm1
        vpextrb ecx, xmm0, 0
        movzx   ecx, byte ptr [rsp + rcx - 24]
        vmovd   xmm1, ecx
        vpextrb ecx, xmm0, 1
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 1
        vpextrb ecx, xmm0, 2
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 2
        vpextrb ecx, xmm0, 3
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 3
        vpextrb ecx, xmm0, 4
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 4
        vpextrb ecx, xmm0, 5
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 5
        vpextrb ecx, xmm0, 6
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 6
        vpextrb ecx, xmm0, 7
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 7
        vpextrb ecx, xmm0, 8
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 8
        vpextrb ecx, xmm0, 9
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 9
        vpextrb ecx, xmm0, 10
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 10
        vpextrb ecx, xmm0, 11
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 11
        vpextrb ecx, xmm0, 12
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 12
        vpextrb ecx, xmm0, 13
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 13
        vpextrb ecx, xmm0, 14
        vpinsrb xmm1, xmm1, byte ptr [rsp + rcx - 24], 14
        vpextrb ecx, xmm0, 15
        vpinsrb xmm0, xmm1, byte ptr [rsp + rcx - 24], 15
        vmovdqa xmmword ptr [rdi], xmm0
        ret
```

Meta
`rustc --version --verbose`: