ridiculousfish / libdivide

Official git repository for libdivide: optimized integer division
http://libdivide.com

Faster divlu #103

Closed · sukop closed this 1 year ago

sukop commented 1 year ago

Hello!

I've tried to run the benchmark from Narrowing Division Insight and Benchmarks and these are the numbers I got for "libdiv mul":

M2: 7.1
i5-8500: 21.4

When I changed this line from

```c
numhi |= (numlo >> (-shift & 63)) & (-(int64_t)shift >> 63);
```

to

```c
if (shift != 0) numhi |= numlo >> (64 - shift);
```

I got the following numbers:

M2: 7.0
i5-8500: 20.4

I thought you might find it interesting and perhaps update https://github.com/ridiculousfish/libdivide/blob/master/doc/divlu.c, if you see a similar speed-up.

PS: And thanks for writing libdivide!

ridiculousfish commented 1 year ago

It would be interesting to compare the codegen here. Microbenchmarking is subject to random perturbations, e.g. unrelated code changes shifting instruction alignment, or cache line sharing, or whatever.

sukop commented 1 year ago

Godbolt (x86-64 clang 15.0.0, -O3) says the following (includes `numhi <<= shift`):

And for armv8-a clang 15.0.0, -O3:
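
For anyone re-running the comparison on Godbolt, here is a minimal pair of functions isolating just the normalization OR (a sketch; the function names are illustrative, not from libdivide):

```c
#include <stdint.h>

// Branchless variant, as in libdivide's divlu.c: when shift == 0 the
// shift count (-shift & 63) folds to 0 and the mask -(int64_t)shift >> 63
// folds to 0, so the OR contributes nothing.
uint64_t norm_branchfree(uint64_t numhi, uint64_t numlo, int shift) {
    numhi <<= shift;
    numhi |= (numlo >> (-shift & 63)) & (-(int64_t)shift >> 63);
    return numhi;
}

// Branchy variant proposed above: guard the OR so that
// numlo >> (64 - shift) is never evaluated with a shift count of 64.
uint64_t norm_branch(uint64_t numhi, uint64_t numlo, int shift) {
    numhi <<= shift;
    if (shift != 0) numhi |= numlo >> (64 - shift);
    return numhi;
}
```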

sukop commented 1 year ago

> Did you see my comment?

No.

> You need a recent version of clang that has M2 support; see `clang++ -print-supported-cpus`.

No, I don't. According to Godbolt, there is no difference in codegen between `-mcpu=apple-m1` and `-mcpu=apple-m2` for clang-trunk.

sukop commented 1 year ago

Not only that, it also appears to be faster.

ridiculousfish commented 1 year ago

I wasn't able to show an improvement from branching on the shift amount instead of the existing branch-free tricks. It's slower in every case I tried; results and benchmarks are below.

```
# Benchmark results, lower is better
# Apple M1 Max
> clang++ -std=c++14 -O3 divlu_benchmark.cpp; ./a.out
           hackers  11.5198
        libdiv org  8.9480 # without branch
        libdiv brn  9.0656 # with branch

# Ryzen 9 5900x
> clang++ -O3 divlu_benchmark.cpp; ./a.out
           hackers  10.8785
        libdiv org  9.2789 # without branch
        libdiv brn  9.9835 # with branch
              divq  2.5347

> g++ -O3 divlu_benchmark.cpp; ./a.out
           hackers  14.5077
        libdiv org  9.0915 # without branch
        libdiv brn  9.5235 # with branch
              divq  2.5241

> g++ -O3 -march=native divlu_benchmark.cpp; ./a.out
           hackers  9.7027
        libdiv org  7.8284 # without branch
        libdiv brn  8.7095 # with branch
              divq  2.5382
```

> Can you do an ifdef on the version of clang that no longer has that codegen bug?

There's no need; the code is optimal. `x >> (-shift & 63)` is the same or faster than `x >> (64 - shift)`, but most importantly it does not create undefined behavior should `shift` be 0.
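
To spell out the `shift == 0` case, here is a small self-contained check (an illustration added here, assuming 64-bit operands):

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint64_t numlo = 0x123456789abcdef0u;
    int shift = 0;  // denominator already normalized

    // Branchless form: the shift count (-shift & 63) folds to 0 and the
    // mask -(int64_t)shift >> 63 folds to 0, so the OR term is 0 and
    // numhi would be left unchanged, with no branch and no UB.
    uint64_t term = (numlo >> (-shift & 63)) & (uint64_t)(-(int64_t)shift >> 63);
    assert(term == 0);

    // The naive form, numlo >> (64 - shift), would be numlo >> 64 here,
    // which is undefined behavior for a 64-bit operand in C.
    return 0;
}
```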

> since M1 does not have HW division apparently (BTW, M2 has BF16 and I8MM but not SM4)

None of those extensions would be used here: they're for floating-point and/or vectorized code, but this is scalar integer arithmetic only. The M1 has a pipelined integer divider, and that's what's being used.

sukop commented 1 year ago

@ridiculousfish When I hid everything behind an if, as you do here, it was also slower on my machine(s). But when I guarded only the `numhi |=` line (as above), it was faster. Perhaps that's what you see?

ridiculousfish commented 1 year ago

> Is there no 128 bit division at all?

Correct, ARM64 does not have a 128 ÷ 64 => 64 instruction analogous to divq on x64.

> Is libdivide needed for M1 at all?

libdivide, in its use for optimizing division by runtime constants, is almost 2x faster than hardware for scalar division, and about 4x faster for u32 SIMD, on M1. Full results are here.
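
For context, here is roughly how libdivide is used for runtime-constant divisors. This sketch uses libdivide's public C API (`libdivide_u64_gen` / `libdivide_u64_do`); check libdivide.h for exact signatures:

```c
#include <stddef.h>
#include <stdint.h>
#include "libdivide.h"

// Precompute a fast divider once, then reuse it across many dividends.
uint64_t sum_of_quotients(const uint64_t *vals, size_t n, uint64_t den) {
    struct libdivide_u64_t fast_den = libdivide_u64_gen(den);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += libdivide_u64_do(vals[i], &fast_den);  // == vals[i] / den
    }
    return sum;
}
```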

The divllu function we're discussing in this issue is a narrowing 128 ÷ 64 => 64 division. On ARM64, you can achieve the same via `(uint64_t)((__uint128_t)num / denom)`; however, this can be substantially slower than divllu because the compiler may emit a runtime call to a non-narrowing division. That's why libdivide does not use `__uint128_t`.
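
That `__uint128_t` route looks like this (a sketch; the function name is illustrative):

```c
#include <stdint.h>

// Narrowing 128 / 64 => 64 division via the compiler's 128-bit type.
// Instead of inlining a narrowing sequence, the compiler may emit a
// call to a runtime routine such as __udivti3, a full 128 / 128
// division, which is the slowdown described above.
uint64_t div_u128_by_u64(uint64_t numhi, uint64_t numlo, uint64_t den) {
    __uint128_t num = ((__uint128_t)numhi << 64) | numlo;
    return (uint64_t)(num / den);
}
```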

LLVM has since fixed this, but even so, its new version is slower than the divllu presented here because its correction step doesn't have these optimizations.

> When I hid everything behind an if... it was also slower on my machine(s). But when I guarded only the `numhi |=` line (as above), it was faster.

I don't see the same. Besides, that suggests we're seeing artifacts instead of real improvements. The two proposed normalizing steps are:

```c
shift = __builtin_clzll(den);
if (shift > 0) {
    den <<= shift;
    numhi <<= shift;
    numhi |= (numlo >> (64 - shift));
    numlo <<= shift;
}
```

and:

```c
shift = __builtin_clzll(den);
den <<= shift;
numhi <<= shift;
if (shift > 0) numhi |= (numlo >> (64 - shift));
numlo <<= shift;
```

There's no reason why an optimizing compiler should compile these differently, so there's no reason to "optimize" the first to the second.

> Maybe you should write assembly by hand? SHLD is what must be used when you do 64 bit shifts.

SHLD sucks, which is why compilers are reluctant to generate it: e.g. on Zen 3 it's 5 macro-ops with a throughput of 0.5 per cycle, compared to a normal SHL at 1 op and a throughput of 2 per cycle.
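
For reference, SHLD corresponds to the classic funnel-shift idiom in C (a sketch, assuming 0 < shift < 64; whether a compiler emits SHLD or separate shifts plus an OR depends on its cost model):

```c
#include <stdint.h>

// Shift the 128-bit value (hi:lo) left by shift and return the new
// high half: exactly what x86-64's SHLD computes for 0 < shift < 64.
uint64_t shift_left_high(uint64_t hi, uint64_t lo, int shift) {
    return (hi << shift) | (lo >> (64 - shift));
}
```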

Basically this normalization step can be done with a branch or branch-free; in all of my test cases the branch-free version is never slower and often much faster, so that's why it works that way.

sukop commented 1 year ago

@ridiculousfish According to Godbolt, the first version creates a branch, whereas the latter does not.

ridiculousfish commented 1 year ago

This is turning into bikeshedding, so I'm going to close this. Here are my results for the "if statement on only one line" approach; it's not faster on any of the hardware I have available, and it's > 7% slower in one case. For that reason I don't want to adjust the normalization step. Thanks for filing.

sukop commented 1 year ago

And thanks for looking into this.

PS: You are benchmarking "divllu_v2_branch", not "libdiv mul" as I am (see the very top of this thread). For archival purposes, here is what I see.