Open ds84182 opened 3 weeks ago
This is loop unrolling. So...
I expected to see this happen: assembly composed of `cmov` and/or `adox` instructions, or at least `mov` + `je` to a single exit branch.
Why do you think this would be better? If you think the unrolled version is slower, do you have a benchmark? If you think the code size is problematic, does `-Copt-level=s` do what you want? Is that optimization setting a better fit for your codebase?
std's `partition_point` implements a binary search; your simplified version is a linear scan. LLVM has no optimization that would turn a linear search into a binary one, since it cannot know that the array is sorted/partitioned.
Is there a reason that std's `partition_point` does not serve your code well, @ds84182?
Is this code not faster for the small array size you are concerned about?
The "a bunch of jumps" approach will need fewer comparisons than the "a bunch of `cmov`s" approach if the zero is usually near the start of the array. Since the compiler doesn't know whether this is often the case in your workload, it assumes that you know what you're doing and therefore preserves what your code does.
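To make that trade-off concrete, here is a sketch (mine, not from the thread) of the early-exit scan that the "a bunch of jumps" lowering corresponds to: it performs about k + 1 comparisons when the first zero sits at index k, while any branchless variant always touches all 24 elements.

```rust
// Hypothetical sketch of the early-exit ("bunch of jumps") strategy: it
// stops at the first zero, so it is cheap when zeros cluster near the
// front, and worst-case when the array contains no zeros at all.
pub fn partition_point_early_exit(array: &[usize; 24]) -> usize {
    for (i, &x) in array.iter().enumerate() {
        if x == 0 {
            return i; // one comparison per element, exit at the first zero
        }
    }
    24 // no zero found: the partition point is the array length
}
```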
FWIW, if you want a branchless implementation, then SIMD will likely be 2-5x faster than the scalar version, depending on which target features you can afford to use. Reducing the array element size from `usize` would also help.
```rust
#![feature(portable_simd)]
use std::simd::cmp::SimdPartialEq;
use std::simd::Simd;

pub fn partition_point_simd(array: &[usize; 24]) -> usize {
    // Zero-extend to 32 lanes (a supported SIMD width); the padding lanes
    // are zero, so the mask below is never empty and the result caps at 24.
    let mut array_zext = [0; 32];
    array_zext[..24].copy_from_slice(array);
    let array = Simd::from_array(array_zext);
    // Lane i of the mask is set when element i is zero; the index of the
    // first set bit is the partition point.
    let mask = array.simd_eq(Simd::splat(0));
    mask.to_bitmask().trailing_zeros() as usize
}
```
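A side note on the zero-extension above: the eight padding lanes always set bits 24..32 of the bitmask, so `trailing_zeros` naturally caps the result at 24 when the input contains no zeros. A scalar model of that bitmask trick, in plain stable Rust just for illustration:

```rust
// Scalar model of the SIMD bitmask trick: bit i of `mask` is set when
// element i is zero. Pre-setting bits 24..32 stands in for the zero
// padding lanes, so the mask is never empty and the result caps at 24.
pub fn partition_point_bitmask(array: &[usize; 24]) -> usize {
    let mut mask: u32 = 0xFF00_0000; // the eight padding "lanes"
    for (i, &x) in array.iter().enumerate() {
        if x == 0 {
            mask |= 1 << i;
        }
    }
    mask.trailing_zeros() as usize
}
```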
However, in the latest nightly, std's binary search is also branchless (#128254), and it turns out it's the fastest implementation on inputs where the partition point is randomly distributed in `0..=24`:
```rust
pub fn partition_point_std(array: &[usize; 24]) -> usize {
    array.partition_point(|x| *x != 0)
}
```
It's also remarkably small:
```asm
playground::partition_point_std:
        mov     rax, qword ptr [rcx + 96]
        test    rax, rax
        mov     edx, 12
        cmove   rdx, rax
        lea     rax, [rdx + 6]
        mov     r8d, eax
        cmp     qword ptr [rcx + 8*r8], 0
        cmovne  rdx, rax
        lea     rax, [rdx + 3]
        mov     r8d, eax
        cmp     qword ptr [rcx + 8*r8], 0
        cmovne  rdx, rax
        lea     r8, [rdx + 1]
        cmp     qword ptr [rcx + 8*rdx + 8], 0
        cmove   r8, rdx
        lea     rax, [r8 + 1]
        cmp     qword ptr [rcx + 8*r8 + 8], 0
        cmove   rax, r8
        cmp     qword ptr [rcx + 8*rax], 1
        sbb     rax, -1
        ret
```
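Reading the assembly: the `cmov` steps probe offsets 12, 6, 3, 1, 1 (a fixed-shape binary search over the 25 possible answers), and the final `cmp ..., 1` / `sbb rax, -1` pair is a branchless increment, since unsigned `x < 1` sets the carry flag exactly when `x == 0`, and `sbb rax, -1` computes `rax + 1 - carry`. A hedged Rust model of what the emitted code computes (the names are mine, not std's):

```rust
// Model of the emitted code: advance a candidate index through a fixed
// step sequence whenever the probed element is nonzero, then do a final
// conditional increment (the `cmp`/`sbb` pair in the assembly).
pub fn partition_point_cmov_model(a: &[usize; 24]) -> usize {
    let mut idx = 0usize;
    for step in [12, 6, 3, 1, 1] {
        let probe = idx + step; // never exceeds 23
        if a[probe] != 0 {
            idx = probe; // branch-free when lowered to a cmov
        }
    }
    // `sbb rax, -1`: advance only past a nonzero element
    idx + (a[idx] != 0) as usize
}
```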
`[usize; 24]` arg with `-Ctarget-cpu=x86-64`:

```text
test bench_partition_point_branchless ... bench:   1,209,903.75 ns/iter (+/- 683,113.25) = 15869 MB/s
test bench_partition_point_linear     ... bench:   1,889,390.00 ns/iter (+/- 371,780.75) = 10162 MB/s
test bench_partition_point_simd       ... bench:     792,851.25 ns/iter (+/- 637,814.88) = 24216 MB/s
test bench_partition_point_std        ... bench:     558,115.62 ns/iter (+/- 306,352.19) = 34401 MB/s
```
`[u32; 24]` arg with `-Ctarget-cpu=x86-64-v3`:

```text
test bench_partition_point_branchless ... bench:     312,764.69 ns/iter (+/- 104,581.22) = 30694 MB/s
test bench_partition_point_linear     ... bench:   1,153,777.50 ns/iter (+/- 239,229.50) =  8320 MB/s
test bench_partition_point_simd       ... bench:     198,122.81 ns/iter (+/-  39,717.12) = 48454 MB/s
test bench_partition_point_std        ... bench:     235,001.56 ns/iter (+/-  98,562.62) = 40850 MB/s
```
`partition_point_branchless` is the scalar equivalent of `partition_point_simd`; it's faster in the second case because it gets auto-vectorized, but badly.
```rust
// Assumes N <= 32 so that every index fits in the u32 mask.
fn partition_point_branchless<const N: usize>(array: &[u32; N]) -> usize {
    let mut mask: u32 = 0;
    for i in 0..N {
        if array[i] == 0 {
            mask |= 1 << i;
        }
    }
    // Cap at N: with no zeros the mask is empty and trailing_zeros() is 32.
    (mask.trailing_zeros() as usize).min(N)
}
```
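As a sanity check (mine, not the benchmark harness behind the numbers above), the scalar branchless version can be compared against `slice::partition_point` across every possible partition point; the result is capped at N to cover the no-zeros case.

```rust
// Self-contained consistency check: the scalar branchless version must
// agree with std's slice::partition_point for every k in 0..=24.
fn partition_point_branchless<const N: usize>(array: &[u32; N]) -> usize {
    let mut mask: u32 = 0;
    for i in 0..N {
        if array[i] == 0 {
            mask |= 1 << i;
        }
    }
    // Cap at N: with no zeros the mask is empty and trailing_zeros() is 32.
    (mask.trailing_zeros() as usize).min(N)
}

pub fn agrees_with_std() -> bool {
    (0..=24).all(|k| {
        // Build an array whose partition point is exactly k.
        let mut a = [1u32; 24];
        for i in k..24 {
            a[i] = 0;
        }
        partition_point_branchless(&a) == a.partition_point(|x| *x != 0)
    })
}
```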
I tried this code:

I expected to see this happen: assembly composed of `cmov` and/or `adox` instructions, or at least `mov` + `je` to a single exit branch.

Instead, this happened:

First occurs in 1.19.0 with the alternative code snippet: