Investigate LKK remainder and divisibility algorithms

StephanTLavavej commented 5 years ago

See Faster remainders when the divisor is a constant: beating compilers and libdivide and the paper Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries by Daniel Lemire, Owen Kaser, and Nathan Kurz. This has the potential to improve f2s and possibly d2s.

Ideally, compilers would implement this, so we could continue using % 5 and so forth in the source code.

ulfjack commented 5 years ago

Thanks for filing! I've read the blog post a couple days ago, and wanted to take a look at where it might be applicable.

vinniefalco commented 5 years ago

Amazing that advances in basic integer operations are still being made. If I am reading this right, does this mean that for an hash table it is possible to calculate the constants for multiplication once when the number of buckets change and then use those values to calculate the remainder after dividing the digest by the number of buckets?

cassioneri commented 3 years ago

I don't think LKK can be useful for Ryu since, AFAIK, it only uses divisors that are powers of 2, 5 and 10 known at compile time. My two cents:

LKK looks good when the divisor is a runtime variable. Indeed, Section 5.3 of LKK's paper compares their algorithm with Granlund-Montgomery (GM) (a.k.a. modular inverse) and concludes the superiority of LKK.

To check divisibility by an odd divisor d, i.e., n % d == 0, both approaches first calculate two numbers m and b that depend only on d such that m * n <= b if, and only if, n % d == 0. Indeed, in the following quote from LKK's paper, m is what they refer to as "our c" (for LKK) and "their required constant" (for modular inverse). They also mention "a multiplication and comparison" which are those required in the evaluation of m * n <= b. (When d is even, GM requires an additional rotate instruction.)

In the unsigned case, our LKK has a significant code-size advantage over GM—approximately 3 arithmetic instructions to compute our c versus about 11 to compute their required constant. Both fast approaches use a multiplication and comparison for each subsequent divisibility check. GM requires an additional instruction to rotate the result of the multiplication.

However, when d is known at compile time then m and b are calculated during compilation, removing the 3 (for LKK) or 11 (for modular inverse) arithmetic instructions from the emitted code. Hence, in this case and when d is odd, in both approaches only m * n <= b is evaluated at runtime. In that case, modular inverse has an advantage: for 32-bits operands, it only needs 32-bits operations whereas LKK needs 64-bits operations. (For 64-bits operands, LKK needs the 128-bits product of two 64-bits multiplicands.)

Finally, at the time of the paper's publication, gcc and clang suffer from two problems (fixed in versions 9 of both compilers) which affected performance:

1) Even for x86-64 CPUs, rather than profiting from full 64-bits instructions and registers, they were using a legacy x86 mul where the 64-bits product of two 32-bits multiplicands is stored in eax and edx.

2) They did not implement modular inverse and, instead, evaluated n % 5 == 0 as n == 5 * (n / 5).

msvc and icc still suffer from these two problems.

https://godbolt.org/z/4q6nav

ulfjack / ryu

Investigate LKK remainder and divisibility algorithms #98