jk-jeon / dragonbox

Reference implementation of Dragonbox in C++
Apache License 2.0

Optimize decimal_length_minus_1 #19

Closed Alexhuszagh closed 2 years ago

Alexhuszagh commented 3 years ago

As discussed in https://github.com/dtolnay/dragonbox/issues/1, the algorithm for calculating the decimal length is rather inefficient, since the number of digits can be computed with minimal branching using a shr, add, and a table lookup instruction as described here.

The current implementation is:

JKJ_FORCEINLINE static constexpr std::uint32_t decimal_length_minus_1(std::uint32_t const v) {
    assert(v < 1000000000);
    if (v >= 100000000) { return 8; }
    if (v >= 10000000) { return 7; }
    if (v >= 1000000) { return 6; }
    if (v >= 100000) { return 5; }
    if (v >= 10000) { return 4; }
    if (v >= 1000) { return 3; }
    if (v >= 100) { return 2; }
    if (v >= 10) { return 1; }
    return 0;
}

This optimizes quite poorly (for example, on x86):

decimal_length_minus_1(unsigned int):
        mov     eax, 8
        cmp     edi, 99999999
        ja      .L1
        mov     eax, 7
        cmp     edi, 9999999
        ja      .L1
        mov     eax, 6
        cmp     edi, 999999
        ja      .L1
        mov     eax, 5
        cmp     edi, 99999
        ja      .L1
        mov     eax, 4
        cmp     edi, 9999
        ja      .L1
        mov     eax, 3
        cmp     edi, 999
        ja      .L1
        mov     eax, 2
        cmp     edi, 99
        ja      .L1
        xor     eax, eax
        cmp     edi, 9
        seta    al
.L1:
        ret

Using a built-in ctlz function, we can make this much more efficient:

#ifdef _MSC_VER
#include <intrin.h>
#endif

JKJ_FORCEINLINE static std::uint32_t ctlz(std::uint32_t v) {
    // Optimize for GCC/Clang when possible.
#ifdef __has_builtin
  #if __has_builtin(__builtin_clz)
    return __builtin_clz(v);
  #endif
#endif

    // Optimize for MSVC
#ifdef _MSC_VER
  #if defined(_M_X64) || defined(_M_ARM64)
    unsigned long leading_zero = 0;
    _BitScanReverse(&leading_zero, v);
    return static_cast<std::uint32_t>(31 - leading_zero);
  #endif
#endif

    // Fallback algorithm
    int count = 0;
    if (v & std::uint32_t(0xffff0000)) {
        v >>= 16;
        count += 16;
    }
    if (v & std::uint32_t(0xff00)) {
        v >>= 8;
        count += 8;
    }
    if (v & std::uint32_t(0xf0)) {
        v >>= 4;
        count += 4;
    }
    if (v & std::uint32_t(0xc)) {
        v >>= 2;
        count += 2;
    }
    if (v & std::uint32_t(0x2)) {
        v >>= 1;
        count += 1;
    }
    return 31 - count;
}

JKJ_FORCEINLINE static std::uint32_t fast_log2(std::uint32_t const v) {
    return 31 - ctlz(v | 1);
}

JKJ_FORCEINLINE static std::uint32_t decimal_length_minus_1(std::uint32_t const v) {
    static constexpr std::uint32_t table[] = {9, 99, 999, 9999, 99999, 999999, 9999999, 99999999, 999999999};
    std::uint32_t y = (9 * fast_log2(v)) >> 5;
    y += v > table[y];
    return y;
}
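For reference, the table-lookup routine can be sanity-checked against a naive digit counter at every power-of-ten boundary, which is exactly where the 9/32 approximation of log10(2) needs the comparison to correct it. This is just a standalone sketch (using the GCC/Clang builtin directly instead of the portable ctlz wrapper), not part of the patch:

```cpp
#include <cassert>
#include <cstdint>

// Straightforward reference: count digits minus one by repeated division.
static std::uint32_t ref_length_minus_1(std::uint32_t v) {
    std::uint32_t n = 0;
    while (v >= 10) { v /= 10; ++n; }
    return n;
}

// The proposed routine, inlined with __builtin_clz for brevity.
static std::uint32_t fast_length_minus_1(std::uint32_t v) {
    static constexpr std::uint32_t table[] = {9, 99, 999, 9999, 99999,
                                              999999, 9999999, 99999999, 999999999};
    std::uint32_t log2 = 31 - std::uint32_t(__builtin_clz(v | 1));
    std::uint32_t y = (9 * log2) >> 5;   // approximate log10
    y += v > table[y];                   // correct the off-by-one
    return y;
}

// Check agreement just around every power-of-ten boundary.
static bool boundaries_agree() {
    for (std::uint32_t p = 10; ; p *= 10) {
        if (fast_length_minus_1(p - 1) != ref_length_minus_1(p - 1)) return false;
        if (p == 1000000000u) break;     // 10^9 - 1 is the largest valid input
        if (fast_length_minus_1(p) != ref_length_minus_1(p)) return false;
    }
    return true;
}
```

Since (9 * log2(v)) / 32 underestimates log10(v) by at most one, the single table comparison is enough to land on the exact answer.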

Most of the bloat is just to ensure we have a portable way to count leading zeros. And this optimizes quite nicely with the built-in optimizations:

decimal_length_minus_1(unsigned int):
        mov     eax, edi
        mov     edx, 31
        or      eax, 1
        bsr     eax, eax
        xor     eax, 31
        sub     edx, eax
        lea     eax, [rdx+rdx*8]
        shr     eax, 5
        mov     edx, eax
        cmp     DWORD PTR decimal_length_minus_1(unsigned int)::table[0+rdx*4], edi
        adc     eax, 0
        ret

For MSVC on x86_64, we have:

unsigned int decimal_length_minus_1(unsigned int) PROC                    ; decimal_length_minus_1, COMDAT
        xor     eax, eax
        mov     edx, ecx
        or      edx, 1
        bsr     r8d, edx
        lea     r9d, DWORD PTR [r8+r8*8]
        shr     r9d, 5
        lea     r8, OFFSET FLAT:unsigned int const * const `unsigned int decimal_length_minus_1(unsigned int)'::`2'::table
        cmp     ecx, DWORD PTR [r8+r9*4]
        seta    al
        add     eax, r9d
        ret     0
unsigned int decimal_length_minus_1(unsigned int) ENDP  

And for MSVC on ARM64, we have:

|unsigned int decimal_length_minus_1(unsigned int)| PROC              ; decimal_length_minus_1
        orr         w8,w0,#1
        clz         w9,w8
        mov         w10,#0x1F
        sub         w10,w10,w9
        add         w8,w10,w10,lsl #3
        adrp        x9,|unsigned int const * const `unsigned int decimal_length_minus_1(unsigned int)'::`2'::table|
        add         x9,x9,|unsigned int const * const `unsigned int decimal_length_minus_1(unsigned int)'::`2'::table|
        lsr         w11,w8,#5
        ldr         w8,[x9,w11 uxtw #2]
        cmp         w0,w8
        csethi      w10
        add         w0,w10,w11
        ret

        ENDP  ; |unsigned int decimal_length_minus_1(unsigned int)|, decimal_length_minus_1

Since the fallback ctlz algorithm isn't very good, we could do a simple optimization for MSVC and GCC/Clang, and keep the old algorithm for any other compiler.

Alcaro commented 3 years ago

Good idea, but may I propose another optimization?

JKJ_FORCEINLINE static std::uint32_t fast_log2(std::uint32_t const v) {
    return 31 - ctlz(v | 1);
}

Change - to ^, and GCC will emit three fewer opcodes.

...though MSVC prefers -, so we may want to ifdef it. (Clang optimizes both equally.)

e: turns out MSVC prefers - only because you use - in that ctlz function. Easy fix.
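The identity behind the suggestion: ctlz(v | 1) always lies in [0, 31], and for any n in that range, 31 - n equals 31 ^ n, because subtracting from the all-ones pattern 0b11111 can never borrow. A standalone sketch of the XOR variant (assuming the GCC/Clang builtin):

```cpp
#include <cstdint>

// Since v | 1 is nonzero, __builtin_clz(v | 1) lies in [0, 31]. For any n
// in that range, 31 - n == 31 ^ n: subtracting from an all-ones bit
// pattern clears bits without borrowing, so the subtraction is an XOR.
// Writing it as XOR hands the compiler that fact for free.
static std::uint32_t fast_log2_xor(std::uint32_t v) {
    return 31 ^ static_cast<std::uint32_t>(__builtin_clz(v | 1));
}
```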

Alexhuszagh commented 3 years ago

Ooh that could work nicely too (sorry, my initial reply assumed you meant ctlz(v ^ 1), which would not work and fail spectacularly when v == 0).

jk-jeon commented 3 years ago

I'm actually aware of this trick. The reason why I'm hesitant to apply it is that it actually results in slower (even if shorter) code in terms of the uniform average performance, b/c most numbers are long (i.e., have many digits). Of course in practice short numbers occur more often than long numbers, so uniform average performance is not the best metric, but it is not a negligible metric either. Also, I think the resulting performance boost/regression may not be in the robustly measurable range.

However, I think at least the code size benefit is obvious, and I guess it does have a meaningful impact on inlinability.

Currently I cannot afford enough time to carefully analyze the benefit/cost of this. Maybe later someday.

Alexhuszagh commented 3 years ago

I'm actually aware of this trick. The reason why I'm hesitant to apply it is that it actually results in slower (even if shorter) code in terms of the uniform average performance, b/c most numbers are long (i.e., have many digits). Of course in practice short numbers occur more often than long numbers, so uniform average performance is not the best metric, but it is not a negligible metric either.

Just from the implementation standpoint, this is definitely true for how binary64 calls the algorithm, since by default the policy isn't to trim trailing zeros, so even common floats are likely to carry a large number of trailing zeros. But this isn't necessarily true for binary32 representations, which trim trailing zeros before this is called, or for common floats when trailing-zero trimming is enabled for binary64.

Also, this effectively amortizes the time of calculating the number of digits, while providing good performance for common cases, which I believe should be a desirable property.

Also, I think the resulting performance boost/regression may not be in the robustly measurable range.

This probably isn't a bottleneck, so I'd have to agree.

Currently I cannot afford enough time to carefully analyze the benefit/cost of this. Maybe later someday.

I look forward to it. This probably has minimal impact on overall performance, given the other bottlenecks in the algorithm, so I understand this is very low priority.

jk-jeon commented 3 years ago

Another candidate: https://godbolt.org/z/czrK918KP.

(Note that the input number is never 0 in our case so we don't need to take care of that special case.)

I think this implementation will be slightly better than what @Alexhuszagh proposed, as it uses one fewer table entry. It uses one or two more instructions, but as adc is usually quite slow, I expect this one to actually perform better than the one with adc.

But anyway, I'm leaning towards the side of leaving it as it is, since the suggested changes don't seem to provide a better performance in general. But I will leave this issue open until someone provides a more comprehensive analysis.

EDIT: Sorry, the number of table entries must actually be the same. Thus the implementation above should be modified a little bit.

Alexhuszagh commented 3 years ago

Another candidate: https://godbolt.org/z/czrK918KP.

(Note that the input number is never 0 in our case so we don't need to take care of that special case.)

I think this implementation will be slightly better than what @Alexhuszagh proposed, as it uses one fewer table entry. It uses one or two more instructions, but as adc is usually quite slow, I expect this one to actually perform better than the one with adc.

But anyway, I'm leaning towards the side of leaving it as it is, since the suggested changes don't seem to provide a better performance in general. But I will leave this issue open until someone provides a more comprehensive analysis.

Looks good, I'll probably benchmark this with a few contrived examples to see if it makes any difference. If not, I'll close this.

Alcaro commented 3 years ago

not if i benchmark them first

I took five ilog10 designs, reimplemented them, and tossed them into Quick-Bench: https://quick-bench.com/q/ohRRzsfTxtVf7sAFLZ_D6q8fcWU https://quick-bench.com/q/pwAl0hVZ6wpxe5c6W4J8cAOrWck (Quick-Bench glitches up if there are too many tests, had to split it)

ilog10a is the current one as quoted in OP; it simply compares to powers of 10 in descending order. ilog10b is a trivial variant of ilog10a, which compares to 100000 first, halving the worst-case number of comparisons (but adding one more comparison for large numbers). ilog10c is OP's proposal, using log2 to find which power of 10 to compare to (except I replaced 9>>5 with 9/32 to remove a set of parentheses - it optimizes back to bitshift). ilog10d is ilog10c, but another lookup table instead of *9/32. ilog10e is jeon's counterproposal, which uses a different table, such that the conditional can be replaced with an addition and a few bitshifts. A quite clever design.

The functions are benchmarked six times each. First with randomly distributed odd numbers 1 through 0x1FFFFFFF = 536870911 (it stops below 4294967295 to not hit ilog10a's assert), second with numbers 1 through 255, third with each number length picked sequentially (length 1 then 2 then 3 then 4...), then numbers of uniformly distributed random length, and the last two exclusively test the numbers 5 and 55555555.

The answers are quite conclusive: jeon's proposal, E, has vastly superior performance characteristics (though D wins on code size ignoring tables, and A wins on readability for beginners, if speed isn't your top priority).

A and B do an excellent job on many input distributions, often finishing 2x faster than C and D, but they also fall apart completely if the branch predictor can't do its job. And, as mentioned, their code size is rather unimpressive. B wins over A for short inputs, but A wins for longer ones, as expected. (Oddly enough, B remains the slowest for random lengths.)

C and D have slower best case (jeon is right, adc is slow), but their speed remains constant across the board, and they easily defeat A and B at their worst. D is uniformly slightly slower than C. (Their speed does vary a little; I suspect either it's just noise, or the branch predictor does something to the adc.)

E is, like C and D, constant time, and it's about 3x faster than them. It wins over everything else on five of six input distributions (sometimes by a large margin, sometimes narrowly), only losing to A and B on the large constant, where both are already fast in absolute numbers.

e: part of this analysis was written for an earlier version of the benchmark, which didn't shuffle the inputs. Fixed. e2: Added two more benchmarks, repeatedly testing the constants 5 and 55555555. Also added A vs B, and C vs D. e3: Created a sixth implementation. It's inferior to E in every way, not gonna analyze it. https://quick-bench.com/q/qZ7CimPBeYMEA7_ZT5bzF5J-dqA

jk-jeon commented 3 years ago

@Alcaro Thanks a lot.

D is uniformly slightly slower than C. (Their speed does vary a little; I suspect either it's just noise, or the branch predictor does something to the adc.)

This is expected. Multiplying by 9 can be done very fast by abusing the lea instruction, so *9/32 is nothing but lea and then shr. Definitely better than a table lookup.

It seems that GCC actually favors C over E. Don't know why. Looking at the assembly, I think Clang actually does a better job here, so it's likely GCC's fault.

So it sounds reasonable to actually adapt ilog10e. I'll update the code shortly.

ilog10e is jeon's counterproposal, which uses a different table, such that the conditional can be replaced with an addition and a few bitshifts. A quite clever design.

This is not my idea. I recall seeing it when I was reading Dr. Lemire's blog. I'll find the one who suggested it and attribute the inventor correctly in a code comment.

Alcaro commented 3 years ago

That'd be George Spelvin, and everyone who contributed to the versions his idea is based on (Josh Bleecher Snyder, Kendall Willets, and of course Daniel Lemire himself). https://lemire.me/blog/2021/06/03/computing-the-number-of-digits-of-an-integer-even-faster/#comment-586067

Willets' idea (the blog OP, henceforth known as G) is also worth testing. While it needs a bigger table (256 bytes for 32bit inputs, 1KB for 64bit - E only needs 36 or 152, respectively), it needs fewer instructions, at least for 32bit inputs on 64bit hardware. On 32bit hardware or 64bit inputs, it gains a few more instructions, but it's still worth investigating both variants. https://quick-bench.com/q/-Ar6_ubBXxxUeesHRZB1yQdgf0k (Quick-Bench doesn't offer 32bit hardware, so only the latter is tested.)
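For concreteness, the Willets-style table can be reconstructed as follows. This is a hypothetical sketch built purely from the description above (one 64-bit entry per floor(log2) value, 256 bytes total, assuming the GCC/Clang builtin), not the exact code that was benchmarked:

```cpp
#include <array>
#include <cstdint>

// Build the per-log2 table at runtime so the construction is explicit.
// For each floor(log2) value l, let d be the digit count of 2^l (the
// smallest value with that log2). Every x with that log2 has d digits,
// or d + 1 once x reaches 10^d. Packing d + 1 into the high 32 bits and
// subtracting 10^d makes "(x + entry) >> 32" yield the digit count in a
// single 64-bit add and shift; when 10^d doesn't fit in 32 bits the
// threshold is unreachable and the entry is simply d << 32.
static std::array<std::uint64_t, 32> make_digit_table() {
    std::array<std::uint64_t, 32> t{};
    for (int l = 0; l < 32; ++l) {
        std::uint32_t smallest = std::uint32_t(1) << l;
        std::uint64_t d = 1, pow10 = 10;
        while (pow10 <= smallest) { ++d; pow10 *= 10; }
        t[std::size_t(l)] = (pow10 <= 0xFFFFFFFFull) ? ((d + 1) << 32) - pow10
                                                     : d << 32;
    }
    return t;
}

// Digit count of x; x == 0 is treated as one digit via the x | 1 trick.
static std::uint32_t decimal_digit_count(std::uint32_t x) {
    static const std::array<std::uint64_t, 32> table = make_digit_table();
    int l = 31 - __builtin_clz(x | 1);
    return std::uint32_t((std::uint64_t(x) + table[std::size_t(l)]) >> 32);
}
```

A production version would of course bake the 32 constants in as a static table rather than computing them on first use; the loop here is only to show where the entries come from.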

Turns out G is the fastest; G64 is identical to E, but G32 defeats both. And on GCC, the two Gs are identical, both being a fair bit faster than E. I don't know how the extra adc is free; I vaguely suspect instruction scheduling shenanigans, possibly involving false dependencies. https://quick-bench.com/q/07u1DKbzK9MOD8HH7Nj8QGUK7K8

Considering how rarely it's called (once per formatted number), I'm personally leaning towards E, but it's your choice.

(C++20 added ilog2 to the standard library, as std::countl_zero / std::bit_width. C++26 should totally add an ilog10.)

jk-jeon commented 2 years ago

Since I implemented a faster jeaiii-style printing algorithm (692f67f4baa76cda7327c1cf43e0aa8277950a8c) which does not compute the length upfront (rather, it does a binary search and specializes each branch), there is no need to implement the trick discussed in this issue. So I am closing this.