Special support for 63-bit division (unsigned)?

zielaj commented 2 years ago

Hi,

I was wondering whether it would make sense to provide special support for 63-bit division. The motivation is that such divisions can, I think, be done a bit faster than 64-bit divisions, and in all cases in which I personally needed fast division, 63 bits were enough.

For example, the current branch-free 64-bit division code is

uint64_t libdivide_u64_branchfree_do(
    uint64_t numer, const struct libdivide_u64_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    uint64_t t = ((numer - q) >> 1) + q;
    return t >> denom->more;
}

On x86-64 the last two lines of that function compile to something like this:

sub    rax,rdi           ; numer - q
shr    rax,1             ; >> 1
add    rax,rdi           ; + q
shrx   rax,rax,r10       ; >> denom->more

These are four 1-cycle instructions that have sequential data dependencies (via rax), so they have a combined latency of 4 cycles.

If both the numerator and denominator were only 63-bits, I think this can be improved to

uint64_t libdivide_u63_branchfree_do(
    uint64_t numer, const struct libdivide_u63_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    return (numer + q) >> denom->more_plus_1;   // more_plus_1 == more + 1
}

Here the last line should compile to something like this, which should take 2 cycles rather than 4.

add    rax,rdi           ; numer + q
shrx   rax,rax,r10       ; >> denom->more_plus_one

As a bonus, division by 1 would work, by setting magic to 0 or 1, and more_plus_one to 0.

On the other hand, there are code maintenance and API clarity drawbacks, which should be weighed against this proposal.

ridiculousfish commented 2 years ago

Yes and even better: if your numerator is N-1 bits, then we only need an N bit magic number, and we can do without the add path entirely:

uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
return t >> denom->more;

I'm not sure how to evaluate whether N-1 bit division is broadly useful. The API surface area is already large and it's becoming unwieldy to maintain. N-1 bit division would nearly double the size.

zielaj commented 2 years ago

That's fair. I'll close this thread. If N-1 bit division is indeed broadly useful, someone will eventually re-open this thread and describe their use case.

Just for the record, I was just playing around with the idea of keeping the API the same and making a dynamic decision whether to use N-1 bit division. In order to minimize branch mispredictions, this decision would have to be sticky: once we've seen a numerator with bit N-1 set, the next 100 or so invocations of the division function would use the full N bit division, even if the numerators were small. The necessary state would be opportunistically kept on the stack (hoping that it would be preserved across function invocations, with should be true in hot loops, and shouldn't matter otherwise). Unfortunately, this approach turned out to generate way too much overhead (at least in my implementation).

ridiculousfish / libdivide

Special support for 63-bit division (unsigned)? #95