ridiculousfish / libdivide

Official git repository for libdivide: optimized integer division
http://libdivide.com

Faster divlu #103

Closed · sukop closed this 1 year ago

sukop commented 1 year ago

Hello!

I've tried to run the benchmark from Narrowing Division Insight and Benchmarks and these are the numbers I got for "libdiv mul":

M2: 7.1
i5-8500: 21.4

When I changed this line from

```c
numhi |= (numlo >> (-shift & 63)) & (-(int64_t)shift >> 63);
```

to

```c
if (shift != 0) numhi |= numlo >> (64 - shift);
```

I got the following numbers:

M2: 7.0
i5-8500: 20.4

I thought you might find it interesting and perhaps update https://github.com/ridiculousfish/libdivide/blob/master/doc/divlu.c, if you see a similar speed-up.

PS: And thanks for writing libdivide!

ridiculousfish commented 1 year ago

It would be interesting to compare the codegen here. Microbenchmarking is subject to random perturbations, e.g. unrelated code changes shifting instruction alignment, or cache line sharing, or whatever.

sukop commented 1 year ago

Godbolt (x86-64 clang 15.0.0, -O3) says the following (includes `numhi <<= shift`):

And for armv8-a clang 15.0.0, -O3:
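
For anyone re-running the comparison on Godbolt, here is a minimal pair of functions isolating just the normalization OR (a sketch; the function names are illustrative, not from libdivide):

```c
#include <stdint.h>

// Branchless variant, as in libdivide's divlu.c: when shift == 0 the
// shift count (-shift & 63) folds to 0 and the mask -(int64_t)shift >> 63
// folds to 0, so the OR contributes nothing.
uint64_t norm_branchfree(uint64_t numhi, uint64_t numlo, int shift) {
    numhi <<= shift;
    numhi |= (numlo >> (-shift & 63)) & (-(int64_t)shift >> 63);
    return numhi;
}

// Branchy variant proposed above: guard the OR so that
// numlo >> (64 - shift) is never evaluated with a shift count of 64.
uint64_t norm_branch(uint64_t numhi, uint64_t numlo, int shift) {
    numhi <<= shift;
    if (shift != 0) numhi |= numlo >> (64 - shift);
    return numhi;
}
```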

sukop commented 1 year ago

> Did you see my comment?

No.

> You need a recent version of clang that has M2 support; see `clang++ -print-supported-cpus`.

No, I don't. According to Godbolt, there is no difference in codegen between `-mcpu=apple-m1` and `-mcpu=apple-m2` for clang-trunk.

sukop commented 1 year ago

Not only that, it also appears to be faster.

ridiculousfish commented 1 year ago

I wasn't able to show an improvement from branching on the shift amount instead of the existing branch-free tricks. It's slower in every case I tried; results and benchmarks are below.

```
# Benchmark results, lower is better
# Apple M1 Max
> clang++ -std=c++14 -O3 divlu_benchmark.cpp; ./a.out
           hackers  11.5198
        libdiv org  8.9480 # without branch
        libdiv brn  9.0656 # with branch

# Ryzen 9 5900x
> clang++ -O3 divlu_benchmark.cpp; ./a.out
           hackers  10.8785
        libdiv org  9.2789 # without branch
        libdiv brn  9.9835 # with branch
              divq  2.5347

> g++ -O3 divlu_benchmark.cpp; ./a.out
           hackers  14.5077
        libdiv org  9.0915 # without branch
        libdiv brn  9.5235 # with branch
              divq  2.5241

> g++ -O3 -march=native divlu_benchmark.cpp; ./a.out
           hackers  9.7027
        libdiv org  7.8284 # without branch
        libdiv brn  8.7095 # with branch
              divq  2.5382
```

> Can you do an ifdef on the version of clang that no longer has that codegen bug?

There's no need; the code is optimal. `x >> (-shift & 63)` is the same or faster than `x >> (64 - shift)`, but most importantly it does not create undefined behavior should `shift` be 0.
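
To spell out the `shift == 0` case, here is a small self-contained check (an illustration added here, assuming 64-bit operands):

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint64_t numlo = 0x123456789abcdef0u;
    int shift = 0;  // denominator already normalized

    // Branchless form: the shift count (-shift & 63) folds to 0 and the
    // mask -(int64_t)shift >> 63 folds to 0, so the OR term is 0 and
    // numhi would be left unchanged, with no branch and no UB.
    uint64_t term = (numlo >> (-shift & 63)) & (uint64_t)(-(int64_t)shift >> 63);
    assert(term == 0);

    // The naive form, numlo >> (64 - shift), would be numlo >> 64 here,
    // which is undefined behavior for a 64-bit operand in C.
    return 0;
}
```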

> since M1 does not have HW division apparently (BTW, M2 has BF16 and I8MM but not SM4)

None of those extensions would be used here: they're for floating-point and/or vectorized code, but this is scalar integer arithmetic only. The M1 has a pipelined integer divider, and that's what's being used.

sukop commented 1 year ago

@ridiculousfish When I hid everything behind an if, as you do here, it was also slower on my machine(s). But when I guarded only the `numhi |=` line (as above), it was faster. Perhaps that's what you see?

ridiculousfish commented 1 year ago

> Is there no 128 bit division at all?

Correct, ARM64 does not have a 128 ÷ 64 => 64 instruction analogous to divq on x64.

> Is libdivide needed for M1 at all?

libdivide, in its use for optimizing division by runtime constants, is almost 2x faster than hardware for scalar division, and about 4x faster for u32 SIMD, on M1. Full results are here.
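
For context, here is roughly how libdivide is used for runtime-constant divisors. This sketch uses libdivide's public C API (`libdivide_u64_gen` / `libdivide_u64_do`); check libdivide.h for exact signatures:

```c
#include <stddef.h>
#include <stdint.h>
#include "libdivide.h"

// Precompute a fast divider once, then reuse it across many dividends.
uint64_t sum_of_quotients(const uint64_t *vals, size_t n, uint64_t den) {
    struct libdivide_u64_t fast_den = libdivide_u64_gen(den);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += libdivide_u64_do(vals[i], &fast_den);  // == vals[i] / den
    }
    return sum;
}
```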

The divllu function we're discussing in this issue is a narrowing 128 ÷ 64 => 64 division. On ARM64, you can achieve the same via `(uint64_t)((__uint128_t)num / denom)`; however, this can be substantially slower than divllu because the compiler may emit a runtime call to a non-narrowing division. That's why libdivide does not use `__uint128_t`.
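
That `__uint128_t` route looks like this (a sketch; the function name is illustrative):

```c
#include <stdint.h>

// Narrowing 128 / 64 => 64 division via the compiler's 128-bit type.
// Instead of inlining a narrowing sequence, the compiler may emit a
// call to a runtime routine such as __udivti3, a full 128 / 128
// division, which is the slowdown described above.
uint64_t div_u128_by_u64(uint64_t numhi, uint64_t numlo, uint64_t den) {
    __uint128_t num = ((__uint128_t)numhi << 64) | numlo;
    return (uint64_t)(num / den);
}
```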

LLVM has since fixed this, but even so, its new version is slower than the divllu presented here because its correction step doesn't have these optimizations.

> When I hid everything behind an if... it was also slower on my machine(s). But when I guarded only the `numhi |=` line (as above), it was faster.

I don't see the same. Besides, that suggests we're seeing artifacts instead of real improvements. The two proposed normalizing steps are:

```c
shift = __builtin_clzll(den);
if (shift > 0) {
    den <<= shift;
    numhi <<= shift;
    numhi |= (numlo >> (64 - shift));
    numlo <<= shift;
}
```

and:

```c
shift = __builtin_clzll(den);
den <<= shift;
numhi <<= shift;
if (shift > 0) numhi |= (numlo >> (64 - shift));
numlo <<= shift;
```

There's no reason why an optimizing compiler should compile these differently, so there's no reason to "optimize" the first to the second.

> Maybe you should write assembly by hand? SHLD is what must be used when you do 64 bit shifts.

SHLD sucks, which is why compilers are reluctant to generate it: e.g. on Zen 3 it's 5 macro-ops with a throughput of 0.5 per cycle, compared to a normal SHL at 1 op and a throughput of 2 per cycle.
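
For reference, SHLD corresponds to the classic funnel-shift idiom in C (a sketch, assuming 0 < shift < 64; whether a compiler emits SHLD or separate shifts plus an OR depends on its cost model):

```c
#include <stdint.h>

// Shift the 128-bit value (hi:lo) left by shift and return the new
// high half: exactly what x86-64's SHLD computes for 0 < shift < 64.
uint64_t shift_left_high(uint64_t hi, uint64_t lo, int shift) {
    return (hi << shift) | (lo >> (64 - shift));
}
```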

Basically this normalization step can be done with a branch or branch-free; in all of my test cases the branch-free version is never slower and often much faster, so that's why it works that way.

sukop commented 1 year ago

@ridiculousfish According to Godbolt, the first version creates a branch, whereas the latter does not.

ridiculousfish commented 1 year ago

This is turning into bikeshedding, so I'm going to close this. Here are my results for the "if statement on only one line" approach; it's not faster on any of the hardware I have available, and it's > 7% slower in one case. For that reason I don't want to adjust the normalization step. Thanks for filing.

sukop commented 1 year ago

And thanks for looking into this.

PS: You are benchmarking "divllu_v2_branch", not "libdiv mul" as I am (see the very top of this thread). For archival purposes, here is what I see.