Fastest way of trimming trailing zeros

Currently we use the method proposed in the Granlund-Montgomery paper, which combines a modular inverse with std::rotr.
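For reference, here is a minimal sketch of what the modular inverse plus rotation approach can look like for $32$-bit inputs. The names `rotr32` and `remove_trailing_zeros_rotr` are made up for this illustration, and the rotation is written by hand so the sketch does not depend on C++20 (`std::rotr` from `<bit>` computes the same thing). It assumes a nonzero input, and it is not necessarily the exact code in the library.

```cpp
#include <cstdint>

// Hand-written right rotation; std::rotr from <bit> does the same in C++20.
// Requires 0 < r < 32.
constexpr std::uint32_t rotr32(std::uint32_t x, int r) noexcept {
    return std::uint32_t((x >> r) | (x << (32 - r)));
}

// Hypothetical name. Removes trailing decimal zeros from a nonzero n in place
// and returns the number of zeros removed.
int remove_trailing_zeros_rotr(std::uint32_t& n) noexcept {
    constexpr std::uint32_t mod_inv_5 = UINT32_C(0xCCCCCCCD);   // 5 * mod_inv_5 == 1 (mod 2^32)
    constexpr std::uint32_t mod_inv_25 = UINT32_C(0xC28F5C29);  // 25 * mod_inv_25 == 1 (mod 2^32)
    int s = 0;
    while (true) {
        // For n = 100k the product is 4k (mod 2^32); rotating right by 2 yields k.
        auto const q = rotr32(std::uint32_t(n * mod_inv_25), 2);
        if (q > UINT32_MAX / 100) {
            break;  // n is not a multiple of 100
        }
        n = q;
        s += 2;
    }
    // At most one trailing zero is left at this point.
    auto const q = rotr32(std::uint32_t(n * mod_inv_5), 1);
    if (q <= UINT32_MAX / 10) {
        n = q;
        s += 1;
    }
    return s;
}
```

Note that the rotation is computed on every iteration, before the comparison decides whether the result is used.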

However, there are at least two other contenders. Assume $q$-bit integers and let $d$ be the divisor. Then:

Lemire's method: multiply the input by $\left\lceil\frac{2^{q}}{d}\right\rceil$ and test whether the lower $q$ bits of the product are smaller than $\left\lceil\frac{2^{q}}{d}\right\rceil$; if they are, the input is a multiple of $d$. Regardless of the comparison result, the upper $q$ bits are the quotient of the division. Thus it needs a $2q$-bit widening multiplication (upper $q$ bits for the quotient, lower $q$ bits for the divisibility test), but there is no shift, and only one constant needs to be loaded into a register.
A different generalization of Granlund-Montgomery to non-odd divisors: by applying Theorem 4.2 from our paper, the Granlund-Montgomery style modular inverse method can be generalized to non-odd divisors. In this new method, the bit rotation is replaced by a regular bit shift, and the shift is needed only when the comparison succeeds, that is, when the input is determined to be a multiple of $d$, whereas Granlund-Montgomery performs the bit rotation unconditionally. Note that this method does not work for all $q$-bit unsigned integers, but the range of valid inputs is enough for our application.
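To make the first item concrete, here is a rough sketch of a single Lemire-style step for $q = 32$ and $d = 10$. The type `div10_result` and the function `lemire_div10` are hypothetical names used only for illustration. The sketch assumes small inputs (the range Dragonbox needs, n up to 16'777'215); it makes no claim about larger inputs.

```cpp
#include <cstdint>

// Hypothetical helper type for this sketch.
struct div10_result {
    std::uint32_t quotient;
    bool divisible;
};

// Hypothetical name. One Lemire-style step for q = 32, d = 10.
// Assumption: n is small (n <= 16777215); larger n is not covered here.
inline div10_result lemire_div10(std::uint32_t n) noexcept {
    constexpr std::uint32_t magic = UINT32_C(429496730);   // ceil(2^32 / 10)
    std::uint64_t const prod = std::uint64_t(n) * magic;   // 64-bit widening multiplication
    auto const lower = std::uint32_t(prod);                // low 32 bits: divisibility test
    auto const upper = std::uint32_t(prod >> 32);          // high 32 bits: quotient
    return {upper, lower < magic};
}
```

For example, `lemire_div10(20)` yields quotient 2 with `divisible == true`, while `lemire_div10(21)` yields quotient 2 with `divisible == false`. The same constant serves as both the multiplier and the comparison threshold, which is why only one constant has to be loaded.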

For my own future reference, here is the code for the last method:

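#include <cstdint>

// Removes trailing decimal zeros from n in place and returns how many were removed.
// n must be nonzero: 0 passes every divisibility test, so the loop would never end.
int remove_trailing_zeros(std::uint32_t& n) noexcept {
    int s = 0;
    // Strip two zeros at a time. 25 * 42949673 = 2^30 + 1, so for n = 100k the
    // product wraps to 4k, which is below the threshold 42949673 = ceil(2^32 / 100).
    while (true) {
        auto const nm_mod = std::uint32_t(n * UINT32_C(42949673));
        if (nm_mod < UINT32_C(42949673)) {
            s += 2;
            n = std::uint32_t(nm_mod >> 2);
        }
        else {
            break;
        }
    }
    // At most one zero is left. 5 * 1288490189 = 2^32 + 2^31 + 1, so n = 10k maps to 2k.
    auto const nm_mod = std::uint32_t(n * UINT32_C(1288490189));
    if (nm_mod < UINT32_C(429496731)) {
        s += 1;
        n = std::uint32_t(nm_mod >> 1);
    }
    return s;
}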

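// 64-bit counterpart; n must be nonzero here as well.
int remove_trailing_zeros(std::uint64_t& n) noexcept {
    int s = 0;
    // Strip two zeros at a time. 25 * 14941862699704736809 == 2^62 + 1 (mod 2^64),
    // so for n = 100k the product wraps to 4k, which is below the threshold
    // 184467440737095517 = ceil(2^64 / 100).
    while (true) {
        auto const nm_mod = std::uint64_t(n * UINT64_C(14941862699704736809));
        if (nm_mod < UINT64_C(184467440737095517)) {
            s += 2;
            n = std::uint64_t(nm_mod >> 2);
        }
        else {
            break;
        }
    }
    // At most one zero is left. 5 * 5534023222112865485 = 2^64 + 2^63 + 1, so n = 10k maps to 2k.
    auto const nm_mod = std::uint64_t(n * UINT64_C(5534023222112865485));
    if (nm_mod < UINT64_C(1844674407370955163)) {
        s += 1;
        n = std::uint64_t(nm_mod >> 1);
    }
    return s;
}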

According to a computation, the $32$-bit version is valid up to n = 1'073'741'899, and the $64$-bit version is valid up to n = 4'611'686'018'427'387'999. In Dragonbox, we need these routines only for n up to 16'777'215 ($2^{24}-1$) in the $32$-bit case and up to 9'007'199'254'740'991 ($2^{53}-1$) in the $64$-bit case, so the valid range is more than enough.
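One simple way to double-check the $32$-bit routine over the range Dragonbox actually needs is a brute-force comparison against plain division. The sketch below repeats the $32$-bit routine so that it is self-contained; `remove_trailing_zeros_naive` and `agrees_on_range` are hypothetical helper names, and this is not necessarily how the bounds quoted above were obtained.

```cpp
#include <cstdint>

// Copy of the 32-bit routine from this post, repeated for self-containment.
int remove_trailing_zeros(std::uint32_t& n) noexcept {
    int s = 0;
    while (true) {
        auto const nm_mod = std::uint32_t(n * UINT32_C(42949673));
        if (nm_mod < UINT32_C(42949673)) {
            s += 2;
            n = std::uint32_t(nm_mod >> 2);
        }
        else {
            break;
        }
    }
    auto const nm_mod = std::uint32_t(n * UINT32_C(1288490189));
    if (nm_mod < UINT32_C(429496731)) {
        s += 1;
        n = std::uint32_t(nm_mod >> 1);
    }
    return s;
}

// Hypothetical reference implementation using plain division; n must be nonzero.
int remove_trailing_zeros_naive(std::uint32_t& n) noexcept {
    int s = 0;
    while (n % 10 == 0) {
        n /= 10;
        ++s;
    }
    return s;
}

// Hypothetical helper. Returns true if both routines agree (same count, same
// remaining value) for every n in [first, last]. Requires 1 <= first <= last.
bool agrees_on_range(std::uint32_t first, std::uint32_t last) noexcept {
    for (std::uint32_t n = first;; ++n) {
        std::uint32_t a = n;
        std::uint32_t b = n;
        if (remove_trailing_zeros(a) != remove_trailing_zeros_naive(b) || a != b) {
            return false;
        }
        if (n == last) {
            return true;  // checked here so that last == UINT32_MAX does not overflow
        }
    }
}
```

Calling `agrees_on_range(1, 16777215)` covers every input the $32$-bit version has to handle in Dragonbox. The $64$-bit range is far too large to enumerate this way, so it would need a different argument.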
