
SSE-AVX stall when using target-cpu=znver2 makes code ~25 times slower. #120108

Open Barfussmann opened 9 months ago

Barfussmann commented 9 months ago

The performance of the following function is really poor when I use target-cpu=znver2, my own CPU. Without a target-cpu set, it is ~25 times faster.

fn main() {
    let row = std::hint::black_box(&[125u8; 8]);
    let mut data = [0u32; 8];
    let data = std::hint::black_box(&mut data);

    for _ in 0..500_000_000u64 {
        std::hint::black_box(slow(*row, data));
    }
}
#[inline(never)]
pub fn slow(mut row: [u8; 8], data: &mut [u32; 8]) -> [u8; 8] {
    for data in data {
        *data <<= 5;
    }

    row[7] = 0;
    row
}

I have run it twice with perf, once with the target-cpu set and once without:

$ RUSTFLAGS="-C target-cpu=znver2" perf stat -e "sse_avx_stalls" cargo run --release
   Compiling perf_test v0.1.0 (/home/barfussmann/Documents/perf_test)
    Finished release [optimized + debuginfo] target(s) in 0.20s
     Running `target/release/perf_test`

 Performance counter stats for 'cargo run --release':

    10.486.365.719      sse_avx_stalls:u                                                      
      32,070525175 seconds time elapsed
      31,812318000 seconds user
       0,125145000 seconds sys

$ perf stat -e "sse_avx_stalls" cargo run --release
   Compiling perf_test v0.1.0 (/home/barfussmann/Documents/perf_test)
    Finished release [optimized + debuginfo] target(s) in 0.19s
     Running `target/release/perf_test`

 Performance counter stats for 'cargo run --release':

               344      sse_avx_stalls:u                                                      
       1,321682733 seconds time elapsed
       1,213303000 seconds user
       0,104931000 seconds sys

The culprit seems to be SSE-AVX stalls. Looking at the assembly of the slow function with the target-cpu set (Compiler Explorer: https://godbolt.org/z/69TxxzG4T), there is an AVX instruction before the SSE4a instruction "extrq", an AMD-specific instruction, and there is no vzeroupper between the two. This should be the cause of the SSE-AVX stall, if I'm not mistaken.


example::slow:
        vmovdqu ymm0, ymmword ptr [rsi]   // AVX
        vmovq   xmm1, rdi                  
        extrq   xmm1, 56, 0               // SSE4a
        vmovq   rax, xmm1
        vpslld  ymm0, ymm0, 5
        vmovdqu ymmword ptr [rsi], ymm0
        vzeroupper
        ret

Meta: I'm running Fedora:

$ rustc --version --verbose
rustc 1.75.0 (82e1608df 2023-12-21)
binary: rustc
commit-hash: 82e1608dfa6e0b5569232559e3d385fea5a93112
commit-date: 2023-12-21
host: x86_64-unknown-linux-gnu
release: 1.75.0
LLVM version: 17.0.6
quaternic commented 8 months ago

@Barfussmann Just to verify that the issue is due to the combination of AVX and SSE4a instructions, could you also benchmark these two (note that the signatures are slightly changed):

pub fn use_shifts(mut row: u64, data: &mut [u32; 8]) -> u64 {
    for data in data {
        *data <<= 5;
    }
    unsafe {
        std::arch::asm!(
            "shl {0}, 8",
            "shr {0}, 8",
            inout(reg) row,
            options(pure,nostack,nomem),
        );
    }
    row
}
pub unsafe fn xmm_only(mut row: [u8; 8], data1: &mut [u32; 4], data2: &mut [u32; 4]) -> [u8; 8] {
    let mut x = *data1;
    let mut y = *data2;
    for e in &mut x {
        *e <<= 5;
    }
    for e in &mut y {
        *e <<= 5;
    }

    row[7] = 0;

    *data1 = x;
    *data2 = y;
    row
}

Using the same compiler flags, -C opt-level=3 -C target-cpu=znver2, they should generate assembly that looks like:

example::use_shifts:
        vmovdqu ymm0, ymmword ptr [rsi]
        mov     rax, rdi

        shl     rax, 8
        shr     rax, 8

        vpslld  ymm0, ymm0, 5
        vmovdqu ymmword ptr [rsi], ymm0
        vzeroupper
        ret

example::xmm_only:
        vmovdqu xmm0, xmmword ptr [rsi]
        vmovdqu xmm1, xmmword ptr [rdx]
        vmovq   xmm2, rdi
        extrq   xmm2, 56, 0
        vmovq   rax, xmm2
        vpslld  xmm0, xmm0, 5
        vpslld  xmm1, xmm1, 5
        vmovdqu xmmword ptr [rsi], xmm0
        vmovdqu xmmword ptr [rdx], xmm1
        ret

Secondly, could you add -C target-feature=-avx to the flags and benchmark all three again? The assembly should no longer have VEX prefixes:

example::slow:
        movdqu  xmm0, xmmword ptr [rsi]
        movdqu  xmm1, xmmword ptr [rsi + 16]
        pslld   xmm0, 5
        pslld   xmm1, 5
        movdqu  xmmword ptr [rsi], xmm0
        movdqu  xmmword ptr [rsi + 16], xmm1
        movq    xmm1, rdi
        extrq   xmm1, 56, 0
        movq    rax, xmm1
        ret
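
If you want a simple harness for the timings, something like this minimal sketch would do, assuming the three functions above are in scope (the bench helper and the ns-per-iteration reporting here are illustrative, not part of the original code):

use std::hint::black_box;
use std::time::Instant;

// Illustrative helper: runs `f` many times and prints approximate ns per iteration.
fn bench(name: &str, mut f: impl FnMut()) {
    const ITERS: u64 = 100_000_000;
    let start = Instant::now();
    for _ in 0..ITERS {
        f();
    }
    println!("{name}: {}ns", start.elapsed().as_nanos() / ITERS as u128);
}

fn main() {
    let mut data = [0u32; 8];
    let (mut d1, mut d2) = ([0u32; 4], [0u32; 4]);
    bench("slow", || {
        black_box(slow(black_box([125u8; 8]), black_box(&mut data)));
    });
    bench("use_shifts", || {
        black_box(use_shifts(black_box(125), black_box(&mut data)));
    });
    bench("xmm_only", || unsafe {
        black_box(xmm_only(black_box([125u8; 8]), black_box(&mut d1), black_box(&mut d2)));
    });
}
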
Barfussmann commented 8 months ago

I've checked the assembly of all the functions you provided, and it matches yours. The benchmarks with the stalls consistently take ~60 ns per iteration. Benchmark with target-cpu=znver2:

[barfussmann@fedora perf_test]$ RUSTFLAGS="-C target-cpu=znver2" cargo run --release
slow: 62ns
use_shifts: 2ns
xmm_only: 2ns

Benchmarks with target-feature=-avx, once with target-cpu=znver2 and once without:

[barfussmann@fedora perf_test]$ RUSTFLAGS="-C target-cpu=znver2 -C target-feature=-avx" cargo run --release
slow: 2ns
use_shifts: 2ns
xmm_only: 2ns

[barfussmann@fedora perf_test]$ RUSTFLAGS="-C target-feature=-avx" cargo run --release
slow: 2ns
use_shifts: 2ns
xmm_only: 2ns
Barfussmann commented 8 months ago

I've also checked one of the other SSE4a instructions, and insertq is affected as well. For the other two (MOVNTSD, MOVNTSS) I don't know whether they can be generated from normal code. Code:

#[inline(never)]
pub fn slow_insertq(mut row: [u8; 8], data: &mut [u32; 8]) -> [u8; 8] {
    for data in data {
        *data <<= 5;
    }

    row[5] = 0xFF;
    row[6] = 0xFF;
    row[7] = 0xFF;
    row 
}

Assembly (Compiler Explorer):

example::slow_insertq:
        vmovdqu ymm0, ymmword ptr [rsi]
        vmovq   xmm2, rdi
        vpcmpeqd        xmm1, xmm1, xmm1
        insertq xmm1, xmm2, 40, 0
        vmovq   rax, xmm1
        vpslld  ymm0, ymm0, 5
        vmovdqu ymmword ptr [rsi], ymm0
        vzeroupper
        ret

Runtimes (the longer times are because I'm on battery power, but 'slow' can be used as a reference):

[barfussmann@fedora perf_test]$ RUSTFLAGS="-C target-cpu=znver2" cargo run --release
slow: 109ns
slow_insertq: 107ns
use_shifts: 3ns
xmm_only: 3ns
quaternic commented 8 months ago

Yeah, that makes sense. One potential issue is that, unlike with typical SSE instructions, the compiler can't simply replace these two AMD-specific instructions with AVX equivalents, since none exist.

In your original example there isn't really any benefit to using EXTRQ in the first place: since both the input and the output are in general-purpose registers, two plain shifts would do the job more efficiently.
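
For illustration, the scalar equivalent looks like this (a sketch; the function name is made up):

// Clearing the top byte of a u64 with two shifts (or equivalently a mask)
// keeps everything in general-purpose registers, so no SSE/AVX transition
// can occur. `zero_top_byte` is a made-up name for illustration.
pub fn zero_top_byte(row: u64) -> u64 {
    (row << 8) >> 8
}
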

The LLVM IR that rustc generates seems fine; LLVM optimizes it to this IR (variables renamed for clarity):

define i64 @example::slow(i64 %row_i64, ptr noalias nocapture noundef align 4 dereferenceable(32) %data) unnamed_addr {
start:
  %loaded_data = load <8 x i32>, ptr %data, align 4
  %shifted_data = shl <8 x i32> %loaded_data, <i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5>
  store <8 x i32> %shifted_data, ptr %data, align 4
  %row0 = bitcast i64 %row_i64 to <8 x i8>
  %row1 = insertelement <8 x i8> %row0, i8 0, i64 7
  %ret = bitcast <8 x i8> %row1 to i64
  ret i64 %ret
}

I tried to look into LLVM a bit, but I'm not that familiar with the internals. It seems that x86 instruction selection, with sse4a available, recognizes that zeroing the last byte of the vector can be done with EXTRQ, and it never recovers from that unfortunate decision.
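
To reproduce this at the LLVM level, saving the IR above to a file and feeding it to llc should show the instruction selection directly (a sketch; the file name is assumed, and the demangled name above may need quoting as @"example::slow" for the IR to parse):

$ llc -O3 -mcpu=znver2 repro.ll -o - | grep -B2 -A2 extrq
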

Could you create an LLVM issue for this? I wasn't able to find anything similar.

And as a workaround, you could probably compile with -C target-cpu=znver2 -C target-feature=-sse4a to disable these particular instructions.
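
In the same style as the earlier runs, that would be:

$ RUSTFLAGS="-C target-cpu=znver2 -C target-feature=-sse4a" cargo run --release
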

Barfussmann commented 8 months ago

Thank you. I've found another workaround for my original problem.

The problematic code was the following:

#![feature(portable_simd)]
use std::simd::{u8x8, u32x8};
use std::simd::num::SimdUint; // for reduce_sum; the trait path may differ on older nightlies

fn slow<const N: usize>(mut renamed_reordered_colors: u8x8) -> u32 {
    const INDEX_MULTIPLIERS: u32x8 = u32x8::from_array([1, 4, 6, 123, 4567, 5678, 6789, 12345]);
    // Zero the tail lanes one element at a time; this per-lane store is what
    // LLVM lowers to EXTRQ/INSERTQ.
    for i in N..8 {
        renamed_reordered_colors[i] = 0;
    }
    (renamed_reordered_colors.cast() * INDEX_MULTIPLIERS).reduce_sum()
}

It's only slow with N=5 and N=7. The code in my initial example was the smallest test case in which I could reproduce the bug.
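
One conceivable shape for such a workaround (purely illustrative; `masked_sum` and the mask-select approach are assumptions, not necessarily the fix I actually used) is to zero the tail lanes with a select instead of scalar stores, so the whole function stays in vector code:

#![feature(portable_simd)]
use std::simd::{u8x8, u32x8, Mask};
use std::simd::num::SimdUint; // for reduce_sum; the trait path may differ on older nightlies

// Hypothetical alternative: blend lanes N..8 to zero with a constant mask
// instead of per-lane scalar stores, giving LLVM no reason to reach for
// EXTRQ/INSERTQ.
fn masked_sum<const N: usize>(colors: u8x8) -> u32 {
    const INDEX_MULTIPLIERS: u32x8 = u32x8::from_array([1, 4, 6, 123, 4567, 5678, 6789, 12345]);
    let mut keep = [false; 8];
    let mut i = 0;
    while i < N {
        keep[i] = true;
        i += 1;
    }
    let masked = Mask::<i8, 8>::from_array(keep).select(colors, u8x8::splat(0));
    (masked.cast::<u32>() * INDEX_MULTIPLIERS).reduce_sum()
}
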

I will definitely open an LLVM issue.