microsoft / STL

MSVC's implementation of the C++ Standard Library.

<bit>: Is the __isa_available check for lzcnt worth the cost? #2849

Open lhecker opened 2 years ago

lhecker commented 2 years ago

std::bit_ceil emits quite a bit of assembly on x64. This seems to occur mostly due to the branch in _Checked_x86_x64_countl_zero and, to a lesser extent, due to the branch in bit_ceil itself.

I've written a variant which produces a more compact result: https://godbolt.org/z/q4EEz83aW (It also removes the extra branch on ARM64 by using conditional assignments.)

I've checked the PR that introduced the code (https://github.com/microsoft/STL/pull/795) and it appears as if the cost of this if condition wasn't discussed. The if condition generally makes sense though: bsr is costly on AMD CPUs (up to Zen3, 4 cycles/op latency), whereas lzcnt is very fast on any architecture (<= 1 cycle). But the check takes up 3 slots in the CPU's branch target buffer (contemporary hardware has ~4096 slots, and newly added branches incur an extra 5-20 cycle latency), generates larger binaries after inlining, and unfortunately the added instructions themselves seem to add about ~5 cycles of latency, offsetting the cost of bsr.
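For context, the dispatch in question looks roughly like this; a paraphrased sketch with the helper names used in this thread, not the exact STL source:

#include <intrin.h>           // __lzcnt, _BitScanReverse
#include <isa_availability.h> // __ISA_AVAILABLE_AVX2

extern "C" int __isa_available; // CPU feature level, set by the CRT at startup

// lzcnt path: defined result (32) even for zero input, fast on every CPU that has lzcnt.
inline int _Countl_zero_lzcnt(unsigned int _Val) noexcept {
    return static_cast<int>(__lzcnt(_Val));
}

// bsr path: bsr leaves its destination undefined for zero input, so a branch guards that case.
inline int _Countl_zero_bsr(unsigned int _Val) noexcept {
    unsigned long _Result;
    if (!_BitScanReverse(&_Result, _Val)) {
        return 32;
    }
    return static_cast<int>(31 - _Result);
}

// The __isa_available check under discussion: lzcnt availability is inferred from the AVX2 tier.
inline int _Checked_x86_x64_countl_zero(unsigned int _Val) noexcept {
    if (__isa_available >= __ISA_AVAILABLE_AVX2) {
        return _Countl_zero_lzcnt(_Val);
    }
    return _Countl_zero_bsr(_Val);
}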

This makes me wonder: Should we drop the __isa_available check?

lhecker commented 2 years ago

I've benchmarked this on a Zen3 CPU (5950X) with google/benchmark, using the following simple code:

#include <benchmark/benchmark.h>
#include <bit>

volatile unsigned int init;

// bit_ceil here is the compact custom variant from the godbolt link above (not std::bit_ceil).
static void BM_custom(benchmark::State& state) {
    auto v = init;
    for (auto _ : state) {
        benchmark::DoNotOptimize(bit_ceil(v++));
    }
}
BENCHMARK(BM_custom);
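BM_std isn't shown above; presumably it's the same loop calling std::bit_ceil, along these lines:

// Presumed baseline (reconstructed for context, not part of the original comment).
static void BM_std(benchmark::State& state) {
    auto v = init;
    for (auto _ : state) {
        benchmark::DoNotOptimize(std::bit_ceil(v++));
    }
}
BENCHMARK(BM_std);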

Here are the results:

-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_std           1.50 ns         1.51 ns    560000000
BM_custom        1.06 ns         1.05 ns    640000000

It appears that such a change would be beneficial even on the Zen3 architecture, despite bsr being implemented in microcode (source).

AlexGuteniev commented 2 years ago

#2133 — previous investigation

lhecker commented 2 years ago

Ah, duh, I only searched for existing, open issues. 🤦‍♂️

The benchmark you posted there results in some "weird" numbers for me. This is fine so far:

std::countl_zero: 1.360940
std::countr_zero: 0.896549

// = AVX2 ❌
// Expected to be slower on AMD, due to bsf/bsr being implemented in microcode.
std::_Countl_zero_bsr: 2.15933
std::_Countr_zero_bsf: 1.71853

// = AVX2 ✅
std::_Countl_zero_lzcnt: 0.439769    <-- 3x as fast!
std::_Countr_zero_tzcnt: 0.727553    <-- only 1.2x as fast - why?

It gets weird when I mix & match the functions, for instance:

std::_Countl_zero_bsr: 2.15463    <-- matched 1:1 with the expected value above
std::countr_zero:      1.29592    <-- why is this suddenly slower?

The solution here appears to be to add a std::this_thread::sleep_for(std::chrono::milliseconds(1000)); in between the two test runs so that the behavior of the former loop doesn't affect the latter. (The benchmark only measures throughput after all.)

lzcnt/tzcnt are a lot faster on AMD as expected, but why does it affect countl_zero so much more than countr_zero, despite both instructions being microcoded? It's not the branch in _Countl_zero_bsr. If I remove that one, the duration stays the same (thanks, branch predictor!).


So I plugged the test into google/benchmark: https://gist.github.com/lhecker/5ea8322a8c920312b5c2f3a7bedf06e4 (here's the equivalent godbolt link: https://godbolt.org/z/nojnYsx83). Since my question isn't just about raw performance, but also about binary size and about how this would behave in a larger code base, I added bit_ceil tests to the benchmark as well.

This results in more stable numbers (at approx. 5GHz, edited for clarity):

Run on (32 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
std::countl_zero                0.440 ns        0.422 ns   1000000000
std::_Countl_zero_bsr            1.25 ns         1.26 ns    497777778
std::_Countl_zero_lzcnt         0.427 ns        0.391 ns   1000000000

_Countl_zero_bsr                 1.03 ns         1.03 ns    746666667   <-- branchless

std::countr_zero                0.428 ns        0.406 ns   1000000000
std::_Countr_zero_bsf           0.845 ns        0.837 ns    896000000   <-- already branchless
std::_Countr_zero_tzcnt         0.634 ns        0.625 ns   1000000000 *

std::bit_ceil                   0.843 ns        0.854 ns    896000000
bit_ceil_optimized              0.846 ns        0.837 ns    896000000   <-- removed 1 of 3 branches
bit_ceil_optimized_checked      0.852 ns        0.837 ns    896000000   <-- removed 2 of 3 branches
bit_ceil_optimized_bsr           1.21 ns         1.17 ns    560000000   <-- removed 3 of 3 branches

* consistent outlier, should be better than countr_zero 🤔

(I love how std::_Countr_zero_bsf matches the new _Countl_zero_bsr down to 10 picoseconds. 😅)

This makes me believe that we should make _Countl_zero_bsr branchless either way - if anything it can only have a positive impact. I suspect the same is true on Intel. But a branchy std::bit_ceil is still about 40% faster than the branchless bit_ceil_optimized_bsr variant on Zen3. I'd be curious what the difference is on Intel, but I suspect it will be much smaller there, since Intel has fast bsf/bsr.

@AlexGuteniev What would your choice be? Given that the __isa_available check roughly doubles the size of the generated assembly, is it worth the 0.3ns throughput improvement on AMD? 🤔

AlexGuteniev commented 2 years ago

On an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz

2022-07-07T16:39:34+03:00
Running C:\Users\Aleksandr.Gutenev\source\repos\ConsoleApplication15\x64\Release\ConsoleApplication15.exe
Run on (12 X 2208 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)
-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
std::countl_zero                0.650 ns        0.656 ns   1000000000 items_per_second=1.52381G/s
std::_Countl_zero_bsr           0.836 ns        0.837 ns    896000000 items_per_second=1.19467G/s
std::_Countl_zero_lzcnt         0.682 ns        0.688 ns   1000000000 items_per_second=1.45455G/s
_Countl_zero_bsr                 1.11 ns         1.10 ns    640000000 items_per_second=910.222M/s
std::countr_zero                0.675 ns        0.663 ns    896000000 items_per_second=1.50905G/s
std::_Countr_zero_bsf            1.04 ns         1.03 ns    640000000 items_per_second=975.238M/s
std::_Countr_zero_tzcnt         0.638 ns        0.625 ns   1000000000 items_per_second=1.6G/s
std::bit_ceil                    1.78 ns         1.76 ns    407272727 items_per_second=566.64M/s
bit_ceil_optimized               1.36 ns         1.35 ns    497777778 items_per_second=740.879M/s
bit_ceil_optimized_checked       1.42 ns         1.41 ns    497777778 items_per_second=707.951M/s
bit_ceil_optimized_bsr           1.31 ns         1.32 ns    497777778 items_per_second=758.519M/s
lhecker commented 2 years ago

A colleague kindly ran the benchmark on an i7-9700k PC for me with similar results:

Run on (8 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 12288 KiB (x1)
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
std::countl_zero                0.591 ns        0.547 ns   1000000000
std::_Countl_zero_bsr           0.704 ns        0.203 ns   1000000000
std::_Countl_zero_lzcnt         0.675 ns        0.141 ns   1000000000
_Countl_zero_bsr                0.942 ns        0.219 ns   1000000000
std::countr_zero                0.575 ns        0.172 ns   1000000000
std::_Countr_zero_bsf           0.909 ns        0.281 ns   1000000000
std::_Countr_zero_tzcnt         0.527 ns        0.234 ns   1000000000
std::bit_ceil                    1.15 ns        0.328 ns   1000000000
bit_ceil_optimized               1.18 ns        0.266 ns   1000000000
bit_ceil_optimized_checked       1.18 ns        0.391 ns   1000000000
bit_ceil_optimized_bsr           1.12 ns        0.203 ns   1000000000
AlexGuteniev commented 2 years ago

So it looks like the micro-benchmark still shows that the branch doesn't cost a lot (its cost is comparable to the unexplained variations), yet the bsr/lzcnt and bsf/tzcnt difference is clearly significant and worth a branch.

sylveon commented 2 years ago

Plus, speculation will eventually make that branch "essentially free" because the value of __isa_available will never change.

lhecker commented 2 years ago

> So it looks like the micro-benchmark still shows that the branch doesn't cost a lot (its cost is comparable to the unexplained variations)

It does cost a lot: in binary size. To be specific:

Function                     Binary size (bytes)
std::bit_ceil                75
bit_ceil_optimized           87
bit_ceil_optimized_checked   58
bit_ceil_optimized_bsr       31

This is practically the same argument for me as for the branchless clamp issue I opened: Should we really optimize the STL for hot paths to the detriment of everything else? bit_ceil_optimized in particular shows this: By removing a branch in std::bit_ceil we don't reduce binary size - we increase it even further, because the optimizer at /O2 forces even more redundant code into the two __isa_available branches.

So I ask: Is a 0.3-0.5 ns (1-2 cycles) improvement really "worth a branch"? Is it worth doubling the binary size of a function? Wouldn't a user of the STL have to run a few million countl_zeros at least before the difference is noticeable? Speculation will make the branch essentially free, that's true. But that only holds for small-ish hot paths like the microbenchmarks we're running. Those 2 countl_zero branches will force at least 2 other branches out of the target buffer. If those were important, it'll cost an extra 5-20 cycles of latency to get them back in. Is the __isa_available check still cost-free then? You'd have to run at least 10-40 countl_zeros in a loop (!) before it offsets that cost. It's certainly not cost-free if you don't loop.

Basically, to summarize, my argument is identical to clamp: Since the difference in performance is so minuscule, we should in my personal opinion err on the side of compactness, because that's the safer choice.

barcharcraz commented 2 years ago

> Plus, speculation will eventually make that branch "essentially free" because the value of __isa_available will never change.

even so, it will take up a slot in the BTB, and it may be cleared or evicted frequently. We could at least annotate the branch to indicate it's likely the CPU has lzcnt. It's really, really, really hard to benchmark the effects of code size.
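A minimal sketch of that annotation, assuming C++20 [[likely]] applied to the paraphrased dispatch from the top of the thread (whether MSVC's code generator actually acts on the hint here is not guaranteed):

// Hypothetical sketch: hint that the lzcnt path is the common case.
if (__isa_available >= __ISA_AVAILABLE_AVX2) [[likely]] {
    return _Countl_zero_lzcnt(_Val);
}
return _Countl_zero_bsr(_Val);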

As a practical matter the additional code size could also prevent the function from being inlined at all, which would definitely be worse than the slow version had inlining happened (esp if that then goes and prevents autovectorization)

if we get really creative we could eliminate the btb entry by making the check a static branch with a "JUMP_LABEL" type construct, but that could make code size worse since it may break page sharing for that page.

We could also move the bsr version into a function, so it won't be inlined with the lzcnt code (ofc if you actually need bsr that is rather painful).

llvm-mca results for CPUs that are slow with bsr would be interesting, since that could show some estimate of the impact of the btb entry and I$ difference.

I'm pretty warm to this idea.

AlexGuteniev commented 2 years ago

> Basically, to summarize, my argument is identical to clamp: Since the difference in performance is so minuscule, we should in my personal opinion err on the side of compactness, because that's the safer choice.

My opinion is that size impact is worth the optimization. The default /O2 implies /Ot, not /Os, and while /Ot makes similar sacrifices of space for even smaller gain, it is still preferred.

Also, the main reason to prefer cmov clamp is stable perf for unpredicted input. Here, the checked version has this property, whereas bsr has a branch to check for zero input, so the unconditional bsr is the one which has surprises.

> We could also move the bsr version into a function, so it won't be inlined with the lzcnt code (ofc if you actually need bsr that is rather painful).

Might get us something, but not too much. The call instruction itself still has a significant size.

lhecker commented 2 years ago

> Also, the main reason to prefer cmov clamp is stable perf for unpredicted input. Here, the checked version has this property, whereas bsr has a branch to check for zero input, so the unconditional bsr is the one which has surprises.

I have wonderful news then! 🙂 Check out my godbolt link above - it features a branchless bsr variant (_Countl_zero_bsr) which is the foundation of this issue. Without it, making std::bit_ceil as compact (entirely branchless) and performant as I did wouldn't have been possible (resulting in bit_ceil_optimized_bsr).

Here's how it works: _BitScanReverse doesn't return the bit index directly; it writes the index to an out parameter and returns a flag (reflecting ZF) that indicates whether the input was nonzero. x86 has no branchless return, but it does have plenty of branchless assignments. So we only need a one-line change to make it branchless: https://godbolt.org/z/j5xPxGoEo
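Roughly, the branchless form looks like this for the 32-bit case (a simplified sketch, not the exact code from the godbolt link):

#include <intrin.h>

inline int countl_zero_bsr_branchless(unsigned int value) noexcept {
    unsigned long index = 0; // initialized so the zero-input path reads a defined value
    const unsigned char nonzero = _BitScanReverse(&index, value);
    // Nonzero input: 31 - index leading zeros. Zero input: all 32 bits are clear.
    // MSVC can lower this ternary to a conditional assignment (cmov) instead of a branch.
    return nonzero ? static_cast<int>(31 - index) : 32;
}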

By the way, I just noticed that the benchmark gist I posted had a bug - it used bsf instead of bsr. I've updated the gist and 2 benchmark results accordingly. (Some impact to _Countl_zero_bsr on AMD and no impact to anything else, not even on AMD. Weird. Goes to show that microbenchmarks of single instructions like that aren't that meaningful, huh?)

StephanTLavavej commented 2 years ago

We talked about this at the weekly maintainer meeting. We don't yet have sufficient information to reach a conclusion - what we'd like to see is:

lhecker commented 2 years ago

@StephanTLavavej I hope the summary below is alright. Let me know if I should expand on anything in particular. 🙂

> Comparisons on a CPU where BSR is slow (maybe Zen 2, we're not sure?)

All AMD CPUs have slow BSR (including Zen 3). All Intel CPUs have fast BSR (at least on par with lzcnt).


> A simplified comparison of the status quo versus the recommended branchless BSR technique

std::countl_zero:

CPU              Is bsr fast?   Before (ns/op)   After (ns/op)          Before (int/s)   After (int/s)
AMD 5950X        No             0.440            1.030 (+134.09% 🟥)    2272.73M         970.87M (-57.28% 🟥)
Intel i7 8750H   Yes            0.650            1.110 (+70.77% 🟥)     1538.46M         900.90M (-41.44% 🟥)
Intel i7 9700k   Yes            0.591            0.942 (+59.39% 🟥)     1692.05M         1061.57M (-37.26% 🟥)

std::bit_ceil (std::countl_zero + 2 arithmetic operations):

a) Real world time (inconsistent results for 9700k):

CPU              Is bsr fast?   Before (ns/op)   After (ns/op)         Before (int/s)   After (int/s)
AMD 5950X        No             0.843            1.21 (+43.53% 🟥)     1186.24M         826.45M (-30.33% 🟥)
Intel i7 8750H   Yes            1.780            1.31 (-26.40% 🟩)     561.80M          763.36M (+35.88% 🟩)
Intel i7 9700k   Yes            1.150            1.12 (-2.61% 🟩)      869.57M          892.86M (+2.68% 🟩)

b) CPU time (consistent results for 9700k):

CPU              Is bsr fast?   Before (ns/op)   After (ns/op)         Before (int/s)   After (int/s)
AMD 5950X        No             0.854            1.170 (+37.00% 🟥)    1170.96M         854.70M (-27.01% 🟥)
Intel i7 8750H   Yes            1.760            1.320 (-25.00% 🟩)    568.18M          757.58M (+33.33% 🟩)
Intel i7 9700k   Yes            0.328            0.203 (-38.11% 🟩)    3048.78M         4926.11M (+61.58% 🟩)

Binary size:

Function           Before (bytes)   After (bytes)
std::countl_zero   24               18 (-25.00% 🟩)
std::bit_ceil      75               31 (-58.67% 🟩)

> Possible results from llvm-mca as @barcharcraz mentioned

I'm not familiar with how to run llvm-mca with MSVC as the compiler unfortunately. So I went ahead, compiled it with MSVC, ported the assembly over to clang and ran it for std::bit_ceil on Zen 2:

         Throughput (lower is better)   llvm-mca output
Before   9.06                            https://godbolt.org/z/e95MrPTqn
After    6.79                            https://godbolt.org/z/Ycfca1T4n

My thoughts given the above:

sylveon commented 2 years ago

According to these numbers, on newer Intel CPUs the branchless BSR version is only 2.68% faster, while it results in a 30.33% throughput loss on new AMD CPUs. I do not really see a major win in going branchless, because it makes Intel processors only a sliver faster, but makes AMD CPUs a third slower.

The win in code size is nice, but if code size is that important, code should be compiled with /O1. Adding a check in the implementation for /O1 and skipping the lzcnt detection in that case makes a lot more sense than entirely nuking it, IMO.

lhecker commented 2 years ago

FYI, the 9700k isn't newer than the 8750H. They're both the same architecture, just a different stepping, with the same performance characteristics (Coffee Lake H and S). My understanding is that the difference is caused by the desktop vs. mobile power envelope and frequency scaling. If we measure the actual CPU time spent running the benchmark, the improvement on Intel is consistent:

CPU              Is bsr fast?   Before (CPU time)   After (CPU time)
AMD 5950X        No             0.854               1.170 (+37.00% 🟥)
Intel i7 8750H   Yes            1.760               1.320 (-25.00% 🟩)
Intel i7 9700k   Yes            0.328               0.203 (-38.11% 🟩)

Was there ever a decision on what this STL is optimized for? Real (wall-clock) time throughput or CPU time throughput (or: workstations vs. laptops, unconstrained vs. constrained thermal/power envelopes)? As seen above, they aren't always identical, and optimizing for one might sometimes be to the detriment of the other. I picked real time throughput as that's consistent with benchmarks done on this GitHub so far (which use std::chrono), but I'd personally go with CPU time throughput in the future (using GetThreadTimes). If we want to know how it performs on modern architectures, we'd still have to test that, preferably on 11th or 12th gen Intel, since those are truly different (but I suspect the results would be similar).
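Measuring CPU time is straightforward; a minimal sketch of a helper using GetThreadTimes (hypothetical, not part of the benchmark gist):

#include <windows.h>
#include <cstdint>

// Returns the calling thread's combined kernel + user CPU time, in 100 ns units.
inline uint64_t thread_cpu_time_100ns() {
    FILETIME creation{}, exit_time{}, kernel{}, user{};
    GetThreadTimes(GetCurrentThread(), &creation, &exit_time, &kernel, &user);
    const auto to_u64 = [](const FILETIME& ft) {
        return (static_cast<uint64_t>(ft.dwHighDateTime) << 32) | ft.dwLowDateTime;
    };
    return to_u64(kernel) + to_u64(user);
}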

RSpliet commented 1 year ago

Came here to see if there was already a thread on optimising the BSR-based fallback, and I'm not disappointed. I happen to have spent some time optimising similar code recently, given that I'm working on a C++17 project that doesn't have <bit> yet. I found that @lhecker's suggested implementation can probably be slightly improved upon by substituting the subtraction with an XOR. Since XOR has an immediate-value encoding, we save ourselves a register and an instruction. I invite you to take a look at the proposed implementation "branchless_xor()" on https://godbolt.org/z/E9d8KW4sa . You can imagine a similar implementation for the 16- and 64-bit variants. One caveat is that we only get a CMOV substitution if at least /O1 is provided. On the surface the /O0 implementation looks no worse than the previously proposed "branchless()" implementation.
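The identity behind the trick: for a 32-bit input, bsr yields an index in [0, 31], and for any such index 31 - index == index ^ 31, so the subtraction can become an XOR with an immediate. In essence, the idea is (simplified sketch; the exact branchless_xor() is on the godbolt link):

#include <intrin.h>

inline int countl_zero_bsr_xor(unsigned int value) noexcept {
    unsigned long index = 0;
    const unsigned char nonzero = _BitScanReverse(&index, value);
    // For index in [0, 31], (31 - index) == (index ^ 31); the XOR encodes its
    // operand as an immediate, saving a register compared to the subtraction.
    return nonzero ? static_cast<int>(index ^ 31) : 32;
}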

I'm afraid I don't have the right tools set up to benchmark this, so apologies for not taking further steps to try and get this over the line. That said, I hope you can use this suggestion to your advantage!