It would be interesting to compare the codegen here. Microbenchmarking is subject to random perturbations: e.g. unrelated code changes can shift instruction alignment, change cache-line sharing, and so on.
Godbolt (x86-64 clang 15.0.0, -O3) says the following (includes numhi <<= shift):
numhi |= (numlo >> (-shift & 63)) & (-(int64_t)shift >> 63)
shl    rdi, cl
neg    cl
mov    r13, rsi
shr    r13, cl
test   r10, r10
cmove  r13, r10
or     r13, rdi
if (shift != 0) numhi |= numlo >> (64 - shift);
shld r12, rsi, cl
And for armv8-a clang 15.0.0, -O3:
numhi |= (numlo >> (-shift & 63)) & (-(int64_t)shift >> 63)
lsl  x11, x0, x8
neg  w10, w8
cmp  x8, #0
lsr  x10, x1, x10
csel x10, xzr, x10, eq
orr  x12, x10, x11
if (shift != 0) numhi |= numlo >> (64 - shift);
lsr  x11, x1, #1
mvn  w10, w8
lsl  x12, x0, x8
lsr  x10, x11, x10
orr  x12, x12, x10
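If anyone wants to reproduce the comparison, here is a self-contained sketch of the two variants as standalone functions (the function names are mine, not libdivide's) that can be pasted straight into Godbolt:

#include <stdint.h>

// Branchfree variant: the shift count (-shift & 63) is always in range,
// and the arithmetic-shift mask zeroes the OR'd term when shift == 0.
uint64_t merge_branchfree(uint64_t numhi, uint64_t numlo, int shift) {
    numhi <<= shift;
    numhi |= (numlo >> (-shift & 63)) & (uint64_t)(-(int64_t)shift >> 63);
    return numhi;
}

// Branching variant: skip the OR entirely when shift == 0 to avoid
// the undefined shift by 64.
uint64_t merge_branch(uint64_t numhi, uint64_t numlo, int shift) {
    numhi <<= shift;
    if (shift != 0) numhi |= numlo >> (64 - shift);
    return numhi;
}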
Did you see my comment?
No.
You need a recent version of clang that has M2 support; see clang++ -print-supported-cpus
No, I don't. According to Godbolt, there is no difference in codegen between -mcpu=apple-m1 and -mcpu=apple-m2 for clang-trunk.
Not only that, it also appears to be faster.
I wasn't able to show an improvement branching on the shift amount instead of the existing branchfree tricks. It's slower in every case I tried - results and benchmarks here.
# Benchmark results, lower is better
# Apple M1 Max
> clang++ -std=c++14 -O3 divlu_benchmark.cpp; ./a.out
hackers 11.5198
libdiv org 8.9480 # without branch
libdiv brn 9.0656 # with branch
# Ryzen 9 5900x
> clang++ -O3 divlu_benchmark.cpp; ./a.out
hackers 10.8785
libdiv org 9.2789 # without branch
libdiv brn 9.9835 # with branch
divq 2.5347
> g++ -O3 divlu_benchmark.cpp; ./a.out
hackers 14.5077
libdiv org 9.0915 # without branch
libdiv brn 9.5235 # with branch
divq 2.5241
> g++ -O3 -march=native divlu_benchmark.cpp; ./a.out
hackers 9.7027
libdiv org 7.8284 # without branch
libdiv brn 8.7095 # with branch
divq 2.5382
Can you do an ifdef on the version of clang that no longer has that codegen bug?
There's no need; the code is optimal. x >> (-shift & 63) is the same speed or faster than x >> (64 - shift), and most importantly it does not invoke undefined behavior should shift be 0.
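To spell out the undefined-behavior point, here is a small self-checking illustration (my own, not code from the thread):

#include <assert.h>
#include <stdint.h>

int main(void) {
    // For shift in 1..63, (-shift & 63) == 64 - shift, so both
    // right-shift expressions compute the same value.
    for (int shift = 1; shift < 64; shift++)
        assert((-shift & 63) == 64 - shift);

    // For shift == 0, (-shift & 63) is 0: x >> 0 is well-defined (it is x),
    // and the mask -(int64_t)0 >> 63 == 0 discards it. By contrast,
    // x >> (64 - 0) shifts a 64-bit value by 64, which is undefined behavior.
    int shift = 0;
    uint64_t x = 0x0123456789abcdefu;
    uint64_t merged = (x >> (-shift & 63)) & (uint64_t)(-(int64_t)shift >> 63);
    assert(merged == 0);
    return 0;
}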
since the M1 apparently does not have HW division (BTW, M2 has BF16 and I8MM but not SM4)
None of those ISAs would be used here. Those are for floating point and/or vectorized code, but we're only using scalar integer arithmetic here. The M1 has a pipelined integer divider and that's what's being used.
@ridiculousfish When I hid everything behind an if, like you do here, it was also slower on my machine(s). But when I tried to if only the numhi |= part (as above), it was faster. Perhaps that's what you see?
Is there no 128-bit division at all?
Correct, ARM64 does not have a 128 ÷ 64 => 64 instruction analogous to divq on x64.
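For reference, the divq rows in the benchmarks above exercise that x64 instruction directly; a minimal wrapper looks roughly like this (my own inline-asm sketch, not libdivide code):

#include <stdint.h>

// x64 only: divides RDX:RAX by the operand, leaving the quotient in RAX
// and the remainder in RDX. Faults (#DE) if the quotient does not fit
// in 64 bits, i.e. when numhi >= denom.
static inline uint64_t divq_u128(uint64_t numhi, uint64_t numlo,
                                 uint64_t denom, uint64_t *rem) {
    uint64_t quot;
    __asm__("divq %[v]"
            : "=a"(quot), "=d"(*rem)
            : [v] "r"(denom), "a"(numlo), "d"(numhi));
    return quot;
}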
Is libdivide needed for M1 at all?
libdivide in its use for optimizing division by runtime constants is almost 2x faster than hardware for scalar, and about 4x faster for u32 SIMD, on M1. Full results are here.
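For readers unfamiliar with the library, the runtime-constant pattern being benchmarked looks roughly like this (based on libdivide's public C API; the loop itself is my own example):

#include <stddef.h>
#include <stdint.h>
#include "libdivide.h"

// Replace many divisions by a single runtime-constant denominator with a
// precomputed multiply-and-shift.
void divide_all(uint64_t *data, size_t n, uint64_t denom) {
    struct libdivide_u64_t div = libdivide_u64_gen(denom); // precompute once
    for (size_t i = 0; i < n; i++)
        data[i] = libdivide_u64_do(data[i], &div);
}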
The divllu function we're discussing in this issue is a narrowing 128 ÷ 64 => 64 division. On ARM64, you can achieve the same via (uint64_t)((__uint128_t)num / denom); however, this can be substantially slower than divllu because the compiler may emit a runtime call to a non-narrowing division. That's why libdivide does not use __uint128_t.
LLVM has since fixed this, but even so, their new version is slower than the divllu presented here because their correction step doesn't have these optimizations.
When I hid everything behind an if... it was also slower on my machine(s). But when I tried to if only the numhi |= part (as above), it was faster
I don't see the same. Besides, that suggests we're seeing artifacts instead of real improvements. The two proposed normalizing steps are:
shift = __builtin_clzll(den);
if (shift > 0) {
den <<= shift;
numhi <<= shift;
numhi |= (numlo >> (64 - shift));
numlo <<= shift;
}
and:
shift = __builtin_clzll(den);
den <<= shift;
numhi <<= shift;
if (shift > 0) numhi |= (numlo >> (64 - shift));
numlo <<= shift;
There's no reason why an optimizing compiler should compile these differently, so there's no reason to "optimize" the first to the second.
Maybe you should write assembly by hand? SHLD is what must be used when you do 64-bit shifts
SHLD sucks, which is why compilers are reluctant to generate it - e.g. on Zen 3 it's 5 macro-ops with a throughput of 0.5 per cycle, compared to a normal SHL at 1 op and a throughput of 2 per cycle.
Basically this normalization step can be done with a branch or branchfree; on all of my test cases the branchfree version is never slower and often much faster so that's why it works that way.
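For reference, here is the branchfree normalization step in full, reassembled from the snippets in this thread (my reconstruction; the canonical version lives in libdivide's doc/divlu.c):

#include <stdint.h>

// Normalize so the divisor's high bit is set; den must be nonzero.
static void normalize(uint64_t *den, uint64_t *numhi, uint64_t *numlo) {
    int shift = __builtin_clzll(*den);
    *den <<= shift;
    *numhi <<= shift;
    *numhi |= (*numlo >> (-shift & 63)) & (uint64_t)(-(int64_t)shift >> 63);
    *numlo <<= shift;
}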
@ridiculousfish According to Godbolt, the first version creates a branch, whereas the latter does not.
This is turning into bikeshedding so I'm going to close this. Here are my results for the "if statement but only on one line" approach; it's not faster on any of the hardware I have available and it's > 7% slower in one case. For that reason I don't want to adjust the normalization step. Thanks for filing.
Hello!
I've tried to run the benchmark from Narrowing Division Insight and Benchmarks and these are the numbers I got for "libdiv mul":
M2       7.1
i5-8500  21.4
When I changed this line from
numhi |= (numlo >> (-shift & 63)) & (-(int64_t)shift >> 63);
to
if (shift != 0) numhi |= numlo >> (64 - shift);
I got the following numbers:
M2       7.0
i5-8500  20.4
I thought you might find it interesting and perhaps update https://github.com/ridiculousfish/libdivide/blob/master/doc/divlu.c, if you see a similar speed-up.
PS: And thanks for writing libdivide!