yb303 closed this pull request 5 years ago.
Below is a simple C++ test function that I have used to compare the assembly output of both libdivide's current branchfree divider and your branchfree divider version that supports +-1.
#include "libdivide.h"

uint64_t divide(uint64_t x, void* divider)
{
    auto& fast_d = *((libdivide::branchfree_divider<uint64_t>*) divider);
    return x / fast_d;
}
When I compile this code using our current libdivide.h and Clang 8.0.0 with -O3 on x64 (clang++ -O3 -S test.cpp), I get the following assembly output, which does not contain any cmp or cmov instructions.
divide(unsigned long, void*):
mov rax, rdi
mul QWORD PTR [rsi]
movzx ecx, BYTE PTR [rsi+8]
sub rdi, rdx
mov rax, rdx
shr rdi
add rax, rdi
shr rax, cl
ret
Now when I compile the same code using your libdivide version that supports +-1, I get the following assembly output, which contains cmp and cmov instructions.
divide(unsigned long, void*):
mov cl, byte ptr [rsi + 8]
mov rax, rdi
mul qword ptr [rsi]
xor eax, eax
cmp cl, 64
cmove rax, rdi
add rax, rdx
sub rdi, rax
shr rdi
add rax, rdi
shr rax, cl
ret
One could argue that your implementation is still branchfree because it does not use any jmp instructions. But when compiling your code on POWERPC64 I was even able to generate a jmp instruction, see https://godbolt.org/z/SKGjuI.
So my personal feeling is that your code is not branchfree in all circumstances.
You're right. I think the compilers are not too smart about it.
The conditional line can be changed like this:
q += -((denom->more >> __builtin_ctz(LIBDIVIDE_ONE_MARKER)) & 1) & numer;
The performance of that is a bit worse than using cmov. The /1 support costs like +15% latency... maybe I should make the 1 support optional. Do you have some proper i7 to see how it's doing?
Yes, I have an i7-6700 Skylake CPU. I can run benchmarks when I am back home.
q += -((denom->more >> __builtin_ctz(LIBDIVIDE_ONE_MARKER)) & 1) & numer;
I think we can get rid of 1 instruction for u32 and u64 and change your code to:

q += -(denom->more >> __builtin_ctz(LIBDIVIDE_ONE_MARKER)) & numer;
Another idea would be to add a new int8_t one to the branchfree structs:
struct libdivide_u64_branchfree_t {
    uint64_t magic;
    uint8_t more;
    int8_t one;
};
This one would be set to -1 if the number is +-1 and to 0 otherwise. This would save a few more instructions:
q += numer & one;
And we would also not need to mask denom->more when shifting:
// without the new one variable
return q >> (more & LIBDIVIDE_64_SHIFT_MASK);

// with the new one variable
return q >> more;
The drawback of this solution is that this relies on C/C++ integer promotion rules and it would make it a little more difficult to port libdivide to other languages.
Here is yet another idea that may potentially improve performance:
uint64_t libdivide_u64_branchfree_do(uint64_t numer, const struct libdivide_u64_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    // Here is what we want to achieve without branches:
    // if q = numer then t = numer and we return numer.
    // (pseudocode; the divider itself is not stored in the struct)
    // if (divider == 1 || divider == -1)
    //     q = numer;
    uint64_t t = ((numer - q) >> 1) + q;
    return t >> denom->more;
}
Our current idea to support +-1 is to set denom->magic = 0. But what if we set it to the largest possible value, e.g. 2^64-1, and denom->more = 0? In this case:
uint64_t q = libdivide_mullhi_u64(2^64-1, numer);
// Now q = numer-1 (provided numer >= 1)
// So we just need to find a way to add +1 if the divider is +-1
// one is set to 1 for +-1 and 0 otherwise.
// The one variable is explained in my previous post.
q += one;
// Maybe we can also use,
// if this works we don't need to add the one variable
q += (denom->more == 0);
I will try to implement these ideas and report benchmarks...
The 'one' idea is nice. Reduces complexity. But on my i5-3470 the performance is the same as the other impl.
I believe this approach would work with 2^64, not 2^64 - 1, but I didn't feel right enlarging the u64 magic to uint128_t.
But, at least for uint32_t this got me another idea. Use the more (or one) bit to serve as bit 32 in magic, like this:
uint64_t magic = denom->magic + (uint64_t)(denom->more >> 6) << 32;
uint32_t q = libdivide_mullhi_u32(denom->magic, numer); // q == numer
uint32_t t = ((numer - q) >> 1) + q; // t == q
return t >> (denom->more & LIBDIVIDE_32_SHIFT_MASK);

But I'm getting some wrong results, same as with 2^32-1.
This works for u32:
static inline uint32_t libdivide_mullhi_u32(uint64_t x, uint64_t y) {
    uint64_t rl = x * y;
    return (uint32_t)(rl >> 32);
}

uint32_t libdivide_u32_branchfree_do(uint32_t numer, const struct libdivide_u32_branchfree_t *denom) {
    uint64_t magic = denom->magic;
    magic += (uint64_t)(denom->more >> 6) << 32;
    uint32_t q = libdivide_mullhi_u32(magic, numer);
    uint32_t t = ((numer - q) >> 1) + q;
    return t >> (denom->more & LIBDIVIDE_32_SHIFT_MASK);
}
And with good performance.
The same trick on u64 would need uint128_t... I will try it out later.
This works for u32: And with good performance.
Great, my idea with magic = 2^64-1 and q += one does not work if the numerator is 0. The code then returns 1 instead of 0 and I don't think I can easily fix that issue.

Note that you need the following changes inside tester.cpp (at line 164) to test the 1 case for unsigned u32 and u64 (and you need to disable vector division if you have not implemented that yet):
// Don't try dividing by +/- 1 with branchfree signed
// Don't try dividing by +/- 1 with branchfree signed
if (ALGO == BRANCHFREE && std::is_signed<T>::value &&
    (denom == 1 || (limits::is_signed && denom == T(-1)))) {
    return;
}
I see that your code will run very fast if the same divider is reused inside a hot loop, as the compiler is able to hoist the additional calculations that you have introduced out of the loop body. This is most likely the reason why you see great performance using the benchmark program. The primes_benchmark program however iterates over an array of dividers, so the compiler cannot hoist those calculations. I suggest you also run the primes_benchmark program and report how much slowdown you experience.
The primes benchmark timing includes the divider generation. But you're right about the invariant code motion... I'll change the benchmark a bit. Thanks!
I have now changed the primes benchmark myself and renamed it to benchmark_branchfree.cpp. The benchmark now iterates over an array of dividers and computes divisions and nothing else, so it is close to my use case in primecount. Support for +-1 in the branchfree divider would be nice, but not at any cost; primecount's users don't care about +-1 support, they only care about performance...
I've eliminated the loop invariant motion... On my i5 box:
hw 3.34 7.80
lib 1.74 1.75
lib+1 2.20 2.27
20% performance hit
I've eliminated the loop invariant motion...
Can you please push your new code to your branchfree_div_1 branch (and possibly also update the code of this pull request) so I can benchmark it as well on my Intel i7-6700 CPU?
Great, my idea with magic = 2^64-1 and q += one does not work if the numerator is 0. The code then returns 1 instead of 0 and I don't think I can easily fix that issue.
I have now found a trick to make it work (though I am not sure yet the new code is 100% correct because the tester program does not yet support +-1). So far I have only implemented the +1 support for u32 and u64 branchfree. The code is available on the branchfree_1 branch.
Here is what the new code looks like:
uint64_t libdivide_u64_branchfree_do(uint64_t numer,
                                     const struct libdivide_u64_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    // 1 divider support: '+ (denom->one & !!numer)'
    uint64_t t = ((numer - q) >> 1) + q + (denom->one & !!numer);
    return t >> denom->more;
}
The code looks fast but unfortunately I have measured a slowdown of 30% for u64 using both the benchmark and benchmark_branchfree programs on my i7-6700 CPU.
I've pushed my code to branchfree_div_1 in two separate commits: one for the lib and one for the testers.
TBH, I don't understand how come + (denom->one & !!numer) works... got to look through the rest of it.
The code looks fast but unfortunately I have measured a slowdown of 30% for u64 using both the benchmark and benchmark_branchfree programs on my i7-6700 CPU.
I think the placement of the addition, on t or on q, affects out-of-order execution.
TBH, I don't understand how come + (denom->one & !!numer) works ... got to look through the rest of it.
+ denom->one is the correction required if the divider is 1. However, this correction would also return 1 if the numerator is 0, which is not correct. Hence & !!numer is used to not add one if the numerator is 0.
I will benchmark your code tomorrow.
Your code incurs about 35% slowdown on my i7-6700.
I have now combined both our approaches: the bitmask correction (q += -(denom->more >> __builtin_ctz(LIBDIVIDE_ONE_MARKER)) & numer;) and the denom->one variable.
This is what the new code looks like:
uint64_t libdivide_u64_branchfree_do(uint64_t numer,
                                     const struct libdivide_u64_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    uint64_t t = ((numer - q) >> 1) + q + (numer & denom->one);
    return t >> denom->more;
}
denom->one is a bitmask that is set to -1 if the divider is 1 and to 0 otherwise. This new code incurs only a 15% slowdown using GCC and uses fewer assembly instructions than any other variant I have tested so far.
This new code introduces only about 3% slowdown in one of primecount's formulas, hence the overall slowdown in primecount should be about 1%. I can live with that slowdown...
I have opened a new issue https://github.com/ridiculousfish/libdivide/issues/55 where I explain the implementation details of the +-1 branchfree divider algorithms that we have investigated.
At the end of this issue I ask for more feedback from other users in order to decide how to proceed. Hence I close this pull request now.
Given the performance impact that's fair enough. I guess the 1 support can be made optional using another algo, if need be.
@yb303: I just had a brilliant idea on how we can support +-1 branchfree dividers without any performance degradation. My idea was inspired by your comment which you wrote when you finally made your code 100% branchfree:
The conditional line can be changed like this:
q += -((denom->more >> __builtin_ctz(LIBDIVIDE_ONE_MARKER)) & 1) & numer;
The performance of that is a bit worse than using cmov.
The branchfree divider was invented to speed up the use case when iterating over an array of dividers. And actually we don't care if the branchfree divider is 100% branchfree, all we care about is that it runs as fast as possible when iterating over an array of dividers. If we make an exception and allow a branch if the divider is 1 I cannot possibly imagine a real world use case where this would lead to bad performance.
Now for the implementation, I think we don't need to add the int8_t one bitmask to the branchfree structs. For the unsigned case we can use the 8th bit of the more variable (which is usually used to indicate a negative divider) to indicate that the divider is 1. So the implementation for supporting 1 as a divider in the unsigned case looks like:
uint64_t libdivide_u64_branchfree_do(uint64_t numer,
                                     const struct libdivide_u64_branchfree_t *denom) {
    if (denom->more & LIBDIVIDE_DIVIDER_1) {
        return numer;
    }
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    uint64_t t = ((numer - q) >> 1) + q;
    return t >> denom->more;
}
On my Intel i7-6700 CPU this runs just as fast as the old code using the benchmark_branchfree program.
I like the above implementation, it is a very clean solution. Now we just need to find a clean solution to support +-1 as a divider for signed branchfree...
if (...) return numer; will definitely produce a branch.
I actually don't have a good use case for a divider of 1, so yes, it was a proof of concept mostly for completeness.
Benchmarks are not reality. Hitting the BP (branch predictor) should be avoided, where possible, if you're aiming for the ultimate low latency (and low jitter). That is, I'd prefer to spend another 1ns and not hit the BP.

I think my tests showed that dividers of +/-1 work as is w/o any code modifications.
Holy crap I was not aware of this :-) I have now enabled +-1 support for signed branchfree dividers: https://github.com/ridiculousfish/libdivide/commit/4a1d5a7008af7f401f8f39c2f44f3dd0116a9839
Benchmarks are not reality. Hitting BP should be avoided, where possible, if you're aiming for the ultimate low latency (and low jitter). That is, I'd prefer to spend another 1ns and not hit BP.
Personally I am using libdivide inside a hot loop and hence the only thing I care about is high throughput i.e. how fast that loop completes.
Can you please describe a libdivide use case where low latency and low jitter are more important than high throughput? I don't have experience with that kind of workloads...
When running a market data feed in a high frequency trading app, you want low latency and low jitter. Packets come from a mix of sources, and contain different messages, so there's no hot loop of crunching a lot of the same data.
OK, thanks for the information. I will leave everything as is and not add a branch to the unsigned branchfree algorithm. The current algorithm might also get auto vectorized by the compiler, adding a branch to that algorithm would potentially stop this compiler optimization.
On my home box I was able to only test m128 so could you please test 256 and 512 vectors before merging? Thanks!