yb303 closed this pull request 5 years ago.
Below is a simple C++ test function that I have used to compare the assembly output of both libdivide's current branchfree divider and your branchfree divider version that supports +-1.
#include "libdivide.h"

uint64_t divide(uint64_t x, void* divider)
{
    auto& fast_d = *((libdivide::branchfree_divider<uint64_t>*) divider);
    return x / fast_d;
}
When I compile this code using our current libdivide.h and Clang 8.0.0 with -O3 on x64 (clang++ -O3 -S test.cpp), I get the following assembly output, which does not contain any cmp or cmov instructions.
divide(unsigned long, void*):
mov rax, rdi
mul QWORD PTR [rsi]
movzx ecx, BYTE PTR [rsi+8]
sub rdi, rdx
mov rax, rdx
shr rdi
add rax, rdi
shr rax, cl
ret
Now when I compile the same code using your libdivide version that supports +-1, I get the following assembly output, which contains cmp and cmov instructions.
divide(unsigned long, void*):
mov cl, byte ptr [rsi + 8]
mov rax, rdi
mul qword ptr [rsi]
xor eax, eax
cmp cl, 64
cmove rax, rdi
add rax, rdx
sub rdi, rax
shr rdi
add rax, rdi
shr rax, cl
ret
One could argue that your implementation is still branchfree because it does not use any jmp instructions. But when compiling your code on POWERPC64 I was even able to generate a jmp instruction, see https://godbolt.org/z/SKGjuI.
So my personal feeling is that your code is not branchfree in all circumstances.
You're right. I think the compilers are not too smart about it.
The conditional line can be changed like this:
q += -((denom->more >> __builtin_ctz(LIBDIVIDE_ONE_MARKER)) & 1) & numer;
The performance of that is a bit worse than using cmov. The /1 support costs like +15% latency... maybe I should make the 1 support optional. Do you have some proper i7 to see how it's doing?
Yes, I have an i7-6700 Skylake CPU. I can run benchmarks when I am back home.
q += -((denom->more >> __builtin_ctz(LIBDIVIDE_ONE_MARKER)) & 1) & numer;
I think we can get rid of 1 instruction for u32 and u64 and change your code to:

q += -(denom->more >> __builtin_ctz(LIBDIVIDE_ONE_MARKER)) & numer;
Another idea would be to add a new int8_t one to the branchfree structs:
struct libdivide_u64_branchfree_t {
    uint64_t magic;
    uint8_t more;
    int8_t one;
};
This one would be set to -1 if the number is +-1 and to 0 otherwise. This would save a few more instructions:
q += numer & one;
And we would also not need to mask denom->more when shifting:
// without the new one variable
return q >> (more & LIBDIVIDE_64_SHIFT_MASK);

// with the new one variable
return q >> more;
The drawback of this solution is that this relies on C/C++ integer promotion rules and it would make it a little more difficult to port libdivide to other languages.
Here is yet another idea that may potentially improve performance:
uint64_t libdivide_u64_branchfree_do(uint64_t numer, const struct libdivide_u64_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    // Here is what we want to achieve without branches:
    // if q = numer then t = numer and we return numer.
    // (pseudocode; the divider itself is not stored in the struct)
    // if (divider == 1 || divider == -1)
    //     q = numer;
    uint64_t t = ((numer - q) >> 1) + q;
    return t >> denom->more;
}
Our current idea to support +-1 is to set denom->magic = 0. But what if we set it to the largest possible value, e.g. 2^64-1, and denom->more = 0? In this case:
uint64_t q = libdivide_mullhi_u64(2^64-1, numer);
// Now q = numer-1 (provided numer >= 1)
// So we just need to find a way to add +1 if the divider is +-1
// one is set to 1 for +-1 and 0 otherwise.
// The one variable is explained in my previous post.
q += one;
// Maybe we can also use,
// if this works we don't need to add the one variable
q += (denom->more == 0);
I will try to implement these ideas and report benchmarks...
The 'one' idea is nice. Reduces complexity. But on my i5-3470 the performance is the same as the other impl.
I believe this approach would work with 2^64, not 2^64 - 1, but I didn't feel right enlarging the u64 magic to uint128_t.
But, at least for uint32_t this got me another idea. Use the more (or one) bit to serve as bit 32 in magic, like this:
uint64_t magic = denom->magic + (uint64_t)(denom->more >> 6) << 32;
uint32_t q = libdivide_mullhi_u32(denom->magic, numer); // q == numer
uint32_t t = ((numer - q) >> 1) + q; // t == q
return t >> (denom->more & LIBDIVIDE_32_SHIFT_MASK);

But I'm getting some wrong results, same as with 2^32-1.
This works for u32:
static inline uint32_t libdivide_mullhi_u32(uint64_t x, uint64_t y) {
    uint64_t rl = x * y;
    return (uint32_t)(rl >> 32);
}

uint32_t libdivide_u32_branchfree_do(uint32_t numer, const struct libdivide_u32_branchfree_t *denom) {
    uint64_t magic = denom->magic;
    magic += (uint64_t)(denom->more >> 6) << 32;
    uint32_t q = libdivide_mullhi_u32(magic, numer);
    uint32_t t = ((numer - q) >> 1) + q;
    return t >> (denom->more & LIBDIVIDE_32_SHIFT_MASK);
}
And with good performance.
The same trick on u64 would need uint128_t... I will try it out later.
This works for u32: And with good performance.
Great, my idea with magic = 2^64-1 and q += one does not work if the numerator is 0. The code then returns 1 instead of 0 and I don't think I can easily fix that issue.

Note that you need the following changes inside tester.cpp (at line 164) to test the 1 case for unsigned u32 and u64 (and you need to disable vector division if you have not implemented that yet):
// Don't try dividing by +/- 1 with branchfree signed
// Don't try dividing by +/- 1 with branchfree signed
if (ALGO == BRANCHFREE && std::is_signed<T>::value &&
    (denom == 1 || (limits::is_signed && denom == T(-1)))) {
    return;
}
I see that your code will run very fast if the same divider is reused inside a hot loop, as the compiler is able to hoist the additional calculations that you have introduced out of the loop body. This is most likely the reason why you see great performance using the benchmark program. The primes_benchmark program however iterates over an array of dividers, so the compiler cannot hoist those calculations. I suggest you also run the primes_benchmark program and report how much slowdown you experience.
The primes benchmark timing includes the divider generation. But you're right about the invariant code motion... I'll change the benchmark a bit. Thanks!
I have now changed the primes benchmark myself and renamed it to benchmark_branchfree.cpp. The benchmark now iterates over an array of dividers and computes divisions and nothing else, so it is close to my use case in primecount. Support for +-1 in the branchfree divider would be nice, but not at any cost; primecount's users don't care about +-1 support, they only care about performance...
I've eliminated the loop invariant motion... On my i5 box:
hw 3.34 7.80
lib 1.74 1.75
lib+1 2.20 2.27
20% performance hit
I've eliminated the loop invariant motion...
Can you please push your new code to your branchfree_div_1 branch (and possibly also update the code of this pull request) so I can benchmark it as well on my Intel i7-6700 CPU?
Great, my idea with magic = 2^64-1 and q += one does not work if the numerator is 0. The code then returns 1 instead of 0 and I don't think I can easily fix that issue.
I have now found a trick to make it work (though I am not sure yet the new code is 100% correct because the tester program does not yet support +-1). So far I have only implemented the +1 support for u32 and u64 branchfree. The code is available on the branchfree_1 branch.
Here is what the new code looks like:
uint64_t libdivide_u64_branchfree_do(uint64_t numer,
                                     const struct libdivide_u64_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    // 1 divider support: '+ (denom->one & !!numer)'
    uint64_t t = ((numer - q) >> 1) + q + (denom->one & !!numer);
    return t >> denom->more;
}
The code looks fast but unfortunately I have measured a slowdown of 30% for u64 using both the benchmark and benchmark_branchfree programs on my i7-6700 CPU.
I've pushed my code to branchfree_div_1 in two separate commits: one for the lib and one for the testers.
TBH, I don't understand how come + (denom->one & !!numer) works... got to look through the rest of it.
The code looks fast but unfortunately I have measured a slowdown of 30% for u64 using both the benchmark and benchmark_branchfree programs on my i7-6700 CPU.
I think the placement of the addition, on t or on q, affects out-of-order execution.
TBH, I don't understand how come + (denom->one & !!numer) works ... got to look through the rest of it.
+ denom->one is the correction required if the divider is 1. However, this correction would also return 1 if the numerator is 0, which is not correct. Hence & !!numer is used to not add one if the numerator is 0.
I will benchmark your code tomorrow.
Your code incurs about 35% slowdown on my i7-6700.
I have now combined both our approaches: the bitmask correction (q += -(denom->more >> __builtin_ctz(LIBDIVIDE_ONE_MARKER)) & numer;) and the denom->one variable.
This is what the new code looks like:
uint64_t libdivide_u64_branchfree_do(uint64_t numer,
                                     const struct libdivide_u64_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    uint64_t t = ((numer - q) >> 1) + q + (numer & denom->one);
    return t >> denom->more;
}
denom->one is a bitmask that is set to -1 if the divider is 1 and to 0 otherwise. This new code incurs only a 15% slowdown using GCC and uses fewer assembly instructions than any other variant I have tested so far.
This new code introduces only about 3% slowdown in one of primecount's formulas, hence the overall slowdown in primecount should be about 1%. I can live with that slowdown...
I have opened a new issue https://github.com/ridiculousfish/libdivide/issues/55 where I explain the implementation details of the +-1 branchfree divider algorithms that we have investigated.
At the end of this issue I ask for more feedback from other users in order to decide how to proceed. Hence I close this pull request now.
Given the performance impact that's fair enough. I guess the 1 support can be made optional using another algo, if need be.
@yb303: I just had a brilliant idea on how we can support +-1 branchfree dividers without any performance degradation. My idea was inspired by your comment which you wrote when you finally made your code 100% branchfree:
The conditional line can be changed like this:
q += -((denom->more >> __builtin_ctz(LIBDIVIDE_ONE_MARKER)) & 1) & numer;
The performance of that is a bit worse than using cmov.
The branchfree divider was invented to speed up the use case when iterating over an array of dividers. And actually we don't care if the branchfree divider is 100% branchfree, all we care about is that it runs as fast as possible when iterating over an array of dividers. If we make an exception and allow a branch if the divider is 1 I cannot possibly imagine a real world use case where this would lead to bad performance.
Now for the implementation, I think we don't need to add the int8_t one bitmask to the branchfree structs. For the unsigned case we can use the 8th bit of the more variable (which is usually used to indicate a negative divider) to indicate that the divider is 1. So the implementation for supporting 1 as a divider in the unsigned case looks like:
uint64_t libdivide_u64_branchfree_do(uint64_t numer,
                                     const struct libdivide_u64_branchfree_t *denom) {
    if (denom->more & LIBDIVIDE_DIVIDER_1) {
        return numer;
    }
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    uint64_t t = ((numer - q) >> 1) + q;
    return t >> denom->more;
}
On my Intel i7-6700 CPU this runs just as fast as the old code using the benchmark_branchfree program.
I like the above implementation, it is a very clean solution. Now we just need to find a clean solution to support +-1 as a divider for signed branchfree...
if (...) return numer; will definitely produce a branch.
I actually don't have a good use case for a divider of 1, so yes, it was a proof of concept mostly for completeness.
Benchmarks are not reality. Hitting the BP (branch predictor) should be avoided, where possible, if you're aiming for the ultimate low latency (and low jitter). That is, I'd prefer to spend another 1ns and not hit the BP.

I think my tests showed that dividers of +/-1 work as is w/o any code modifications.
Holy crap I was not aware of this :-) I have now enabled +-1 support for signed branchfree dividers: https://github.com/ridiculousfish/libdivide/commit/4a1d5a7008af7f401f8f39c2f44f3dd0116a9839
Benchmarks are not reality. Hitting BP should be avoided, where possible, if you're aiming for the ultimate low latency (and low jitter). That is, I'd prefer to spend another 1ns and not hit BP.
Personally I am using libdivide inside a hot loop and hence the only thing I care about is high throughput i.e. how fast that loop completes.
Can you please describe a libdivide use case where low latency and low jitter are more important than high throughput? I don't have experience with that kind of workloads...
When running a market data feed in a high frequency trading app, you want low latency and low jitter. Packets come from a mix of sources, and contain different messages, so there's no hot loop of crunching a lot of the same data.
OK, thanks for the information. I will leave everything as is and not add a branch to the unsigned branchfree algorithm. The current algorithm might also get auto vectorized by the compiler, adding a branch to that algorithm would potentially stop this compiler optimization.
On my home box I was able to only test m128 so could you please test 256 and 512 vectors before merging? Thanks!