Barfussmann opened this issue 9 months ago (status: Open)
@Barfussmann Just to verify that the issue is due to the combination, could you also benchmark these two: (note that the signatures are slightly changed)
pub fn use_shifts(mut row: u64, data: &mut [u32; 8]) -> u64 {
    for data in data {
        *data <<= 5;
    }
    // Clear the top byte of `row` with plain GPR shifts instead of EXTRQ.
    unsafe {
        std::arch::asm!(
            "shl {0}, 8",
            "shr {0}, 8",
            inout(reg) row,
            options(pure, nostack, nomem),
        );
    }
    row
}
pub unsafe fn xmm_only(mut row: [u8; 8], data1: &mut [u32; 4], data2: &mut [u32; 4]) -> [u8; 8] {
    // Split the 8 x u32 into two 4-lane halves so only XMM registers are used.
    let mut x = *data1;
    let mut y = *data2;
    for e in &mut x {
        *e <<= 5;
    }
    for e in &mut y {
        *e <<= 5;
    }
    row[7] = 0;
    *data1 = x;
    *data2 = y;
    row
}
Using the same compiler flags, -C opt-level=3 -C target-cpu=znver2, they should generate assembly that looks like:
example::use_shifts:
        vmovdqu ymm0, ymmword ptr [rsi]
        mov     rax, rdi
        shl     rax, 8
        shr     rax, 8
        vpslld  ymm0, ymm0, 5
        vmovdqu ymmword ptr [rsi], ymm0
        vzeroupper
        ret
example::xmm_only:
        vmovdqu xmm0, xmmword ptr [rsi]
        vmovdqu xmm1, xmmword ptr [rdx]
        vmovq   xmm2, rdi
        extrq   xmm2, 56, 0
        vmovq   rax, xmm2
        vpslld  xmm0, xmm0, 5
        vpslld  xmm1, xmm1, 5
        vmovdqu xmmword ptr [rsi], xmm0
        vmovdqu xmmword ptr [rdx], xmm1
        ret
Secondly, could you add -C target-feature=-avx to the flags and benchmark all three again. The assembly should no longer have VEX prefixes:
example::slow:
        movdqu  xmm0, xmmword ptr [rsi]
        movdqu  xmm1, xmmword ptr [rsi + 16]
        pslld   xmm0, 5
        pslld   xmm1, 5
        movdqu  xmmword ptr [rsi], xmm0
        movdqu  xmmword ptr [rsi + 16], xmm1
        movq    xmm1, rdi
        extrq   xmm1, 56, 0
        movq    rax, xmm1
        ret
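For the timings, a minimal harness along these lines should be enough (just a sketch; your actual perf_test setup may of course differ):
use std::hint::black_box;
use std::time::Instant;

fn bench(name: &str, mut f: impl FnMut()) {
    const ITERS: u32 = 10_000_000;
    let start = Instant::now();
    for _ in 0..ITERS {
        f();
    }
    // Average time per call; black_box keeps the calls from being optimized out.
    println!("{name}: {:?}", start.elapsed() / ITERS);
}

fn main() {
    let mut data = [1u32; 8];
    bench("use_shifts", || {
        black_box(use_shifts(black_box(0x0102_0304_0506_0708), black_box(&mut data)));
    });
}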
I've checked the assembly of all the functions you've provided, and it matches what you posted. The benches with the stalls always take ~60 ns per iteration.
Bench with -C target-cpu=znver2:
[barfussmann@fedora perf_test]$ RUSTFLAGS="-C target-cpu=znver2" cargo run --release
slow: 62ns
use_shifts: 2ns
xmm_only: 2ns
Bench with -C target-feature=-avx, once with target-cpu=znver2 and once without:
[barfussmann@fedora perf_test]$ RUSTFLAGS="-C target-cpu=znver2 -C target-feature=-avx" cargo run --release
slow: 2ns
use_shifts: 2ns
xmm_only: 2ns
[barfussmann@fedora perf_test]$ RUSTFLAGS="-C target-feature=-avx" cargo run --release
slow: 2ns
use_shifts: 2ns
xmm_only: 2ns
I've also checked one of the other SSE4a instructions: insertq is also affected. As for the other two (MOVNTSD, MOVNTSS), I don't know whether they can be generated from normal code.
Code:
#[inline(never)]
pub fn slow_insertq(mut row: [u8; 8], data: &mut [u32; 8]) -> [u8; 8] {
    for data in data {
        *data <<= 5;
    }
    // Setting the top three bytes to 0xFF is what gets lowered to INSERTQ.
    row[5] = 0xFF;
    row[6] = 0xFF;
    row[7] = 0xFF;
    row
}
Assembly (Compiler Explorer):
example::slow_insertq:
        vmovdqu  ymm0, ymmword ptr [rsi]
        vmovq    xmm2, rdi
        vpcmpeqd xmm1, xmm1, xmm1
        insertq  xmm1, xmm2, 40, 0
        vmovq    rax, xmm1
        vpslld   ymm0, ymm0, 5
        vmovdqu  ymmword ptr [rsi], ymm0
        vzeroupper
        ret
Runtimes (the longer runtimes are because I'm on battery power, but 'slow' can be used as a reference):
[barfussmann@fedora perf_test]$ RUSTFLAGS="-C target-cpu=znver2" cargo run --release
slow: 109ns
slow_insertq: 107ns
use_shifts: 3ns
xmm_only: 3ns
Yeah, that makes sense. One potential issue is that, unlike with typical SSE instructions, the compiler can't just replace these two AMD-specific instructions with an AVX equivalent, since none exists.
In your original example there isn't really any benefit to using EXTRQ in the first place: both the input and the output are in general purpose registers, so two shifts would do it more efficiently.
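For illustration, a scalar sketch of that same operation (clearing the top byte of a u64) without ever touching XMM state:
// Two shifts clear the high 8 bits, exactly like the shl/shr pair in
// use_shifts; a mask with 0x00FF_FFFF_FFFF_FFFF would do the same.
pub fn zero_top_byte(row: u64) -> u64 {
    (row << 8) >> 8
}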
The LLVM-IR that rustc generates seems fine, LLVM optimizes it to this IR: (variables renamed for clarity)
define i64 @example::slow(i64 %row_i64, ptr noalias nocapture noundef align 4 dereferenceable(32) %data) unnamed_addr {
start:
  %loaded_data = load <8 x i32>, ptr %data, align 4
  %shifted_data = shl <8 x i32> %loaded_data, <i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5>
  store <8 x i32> %shifted_data, ptr %data, align 4
  %row0 = bitcast i64 %row_i64 to <8 x i8>
  %row1 = insertelement <8 x i8> %row0, i8 0, i64 7
  %ret = bitcast <8 x i8> %row1 to i64
  ret i64 %ret
}
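For reference, that IR corresponds to roughly the following Rust (reconstructed from the IR above, since the original slow isn't quoted in this comment):
pub fn slow(mut row: [u8; 8], data: &mut [u32; 8]) -> [u8; 8] {
    // Vectorizable part: load 8 x u32, shift each lane left by 5, store back.
    for d in data.iter_mut() {
        *d <<= 5;
    }
    // Zeroing the last byte is the part that sse4a lowers to EXTRQ.
    row[7] = 0;
    row
}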
I tried to look into LLVM a bit, but I'm not that familiar with the internals. It seems that x86 instruction selection with sse4a available recognizes that zeroing the last byte of the vector can be done with EXTRQ, and it never recovers from that unfortunate decision.
Could you create an LLVM issue for this? I wasn't able to find anything similar.
And as a workaround, you could probably compile with -C target-cpu=znver2 -C target-feature=-sse4a to disable these particular instructions.
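For example:
RUSTFLAGS="-C target-cpu=znver2 -C target-feature=-sse4a" cargo run --release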
Thank you. I've found another workaround for my original problem. The problematic code was the following:
// Requires nightly Rust with #![feature(portable_simd)]; the exact
// std::simd import paths have shifted between nightlies.
use std::simd::{num::SimdUint, u32x8, u8x8};

fn slow<const N: usize>(mut renamed_reordered_colors: u8x8) -> u32 {
    const INDEX_MULTIPLIERS: u32x8 = u32x8::from_array([1, 4, 6, 123, 4567, 5678, 6789, 12345]);
    // Zero the unused tail lanes; this is the part that becomes EXTRQ.
    for i in N..8 {
        renamed_reordered_colors[i] = 0;
    }
    (renamed_reordered_colors.cast() * INDEX_MULTIPLIERS).reduce_sum()
}
It's only slow with N = 5 and N = 7. The code from my starting example was the smallest test case in which I could reproduce the bug.
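For comparison, the tail-zeroing can also be written with a constant lane mask instead of per-element writes. This is only a sketch (not necessarily the workaround I used, and I haven't verified that it avoids the EXTRQ selection); it assumes the same std::simd imports as above:
fn masked<const N: usize>(renamed_reordered_colors: u8x8) -> u32 {
    const INDEX_MULTIPLIERS: u32x8 = u32x8::from_array([1, 4, 6, 123, 4567, 5678, 6789, 12345]);
    // Build an all-ones mask for the first N lanes, zeros for the tail.
    let mut m = [0u8; 8];
    m[..N].fill(0xFF);
    let masked = renamed_reordered_colors & u8x8::from_array(m);
    (masked.cast::<u32>() * INDEX_MULTIPLIERS).reduce_sum()
}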
I will definitely open an LLVM issue.
The performance of the following function is really slow when I use target-cpu=znver2 (my own CPU). Without a target-cpu set, it is ~25 times faster.
I have run it twice with perf, once with the target-cpu set and once without:
The culprit seems to be SSE-AVX transition stalls. Looking at the assembly of the slow function with the target-cpu set (Compiler Explorer: https://godbolt.org/z/69TxxzG4T), there is an AVX instruction before the SSE4a instruction extrq, an AMD-specific instruction, with no vzeroupper between the two. This should be the cause of the SSE-AVX stall, if I'm not mistaken.
Meta: I'm running Fedora: