lhecker opened 2 years ago
I've benchmarked this on a Zen 3 CPU (5950X) with google/benchmark, using the following simple code:
```cpp
#include <benchmark/benchmark.h>

volatile unsigned int init;

static void BM_custom(benchmark::State& state) {
    auto v = init;
    for (auto _ : state) {
        benchmark::DoNotOptimize(bit_ceil(v++));
    }
}
BENCHMARK(BM_custom);
```
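The results below also list a BM_std case that isn't shown above; here's a hedged sketch of what it presumably looks like, assuming it measures std::bit_ceil with the same loop shape:

```cpp
#include <bit>  // std::bit_ceil (C++20)

// Assumption: BM_std exercises std::bit_ceil exactly like BM_custom above.
static void BM_std(benchmark::State& state) {
    auto v = init;
    for (auto _ : state) {
        benchmark::DoNotOptimize(std::bit_ceil(v++));
    }
}
BENCHMARK(BM_std);
```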
Here are the results:
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_std 1.50 ns 1.51 ns 560000000
BM_custom 1.06 ns 1.05 ns 640000000
It appears as if such a change would have a benefit even on the Zen 3 architecture, despite bsr being implemented in microcode (source).
Ah, duh, I only searched for existing, open issues. 🤦♂️
The benchmark you posted there results in some "weird" numbers for me. This is fine so far:
std::countl_zero: 1.360940
std::countr_zero: 0.896549
// = AVX2 ❌
// Expected to be slower on AMD, due to bsf/bsr being implemented in microcode.
std::_Countl_zero_bsr: 2.15933
std::_Countr_zero_bsf: 1.71853
// = AVX2 ✅
std::_Countl_zero_lzcnt: 0.439769 <-- 3x as fast!
std::_Countr_zero_tzcnt: 0.727553 <-- only 1.2x as fast - why?
It gets weird when I mix & match the functions, for instance:
std::_Countl_zero_bsr: 2.15463 <-- matched 1:1 with the expected value above
std::countr_zero: 1.29592 <-- why is this suddenly slower?
The solution here appears to be to add a std::this_thread::sleep_for(std::chrono::milliseconds(1000));
in between the two test runs so that the behavior of the former loop doesn't affect the latter. (The benchmark only measures throughput after all.)
lzcnt/tzcnt are a lot faster on AMD as expected, but why does it affect countl_zero so much more than countr_zero, despite both bsr and bsf being microcoded? It's not the branch in _Countl_zero_bsr. If I remove that one, the duration stays the same (thanks, branch predictor!).
So I plugged the test into google/benchmark: https://gist.github.com/lhecker/5ea8322a8c920312b5c2f3a7bedf06e4
(Here's the equivalent godbolt link: https://godbolt.org/z/nojnYsx83)
Since my question isn't just about raw performance, but also about binary size and how it would work if used in a larger code base, I at least added bit_ceil tests to the above benchmark.
This results in more stable numbers (at approx. 5GHz, edited for clarity):
Run on (32 X 3400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 512 KiB (x16)
L3 Unified 32768 KiB (x2)
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
std::countl_zero 0.440 ns 0.422 ns 1000000000
std::_Countl_zero_bsr 1.25 ns 1.26 ns 497777778
std::_Countl_zero_lzcnt 0.427 ns 0.391 ns 1000000000
_Countl_zero_bsr 1.03 ns 1.03 ns 746666667 <-- branchless
std::countr_zero 0.428 ns 0.406 ns 1000000000
std::_Countr_zero_bsf 0.845 ns 0.837 ns 896000000 <-- already branchless
std::_Countr_zero_tzcnt 0.634 ns 0.625 ns 1000000000 *
std::bit_ceil 0.843 ns 0.854 ns 896000000
bit_ceil_optimized 0.846 ns 0.837 ns 896000000 <-- removed 1 of 3 branches
bit_ceil_optimized_checked 0.852 ns 0.837 ns 896000000 <-- removed 2 of 3 branches
bit_ceil_optimized_bsr 1.21 ns 1.17 ns 560000000 <-- removed 3 of 3 branches
* consistent outlier, should be better than countr_zero 🤔
(I love how std::_Countr_zero_bsf matches the new _Countl_zero_bsr down to 10 picoseconds. 😅)
This makes me believe that we should make the change to _Countl_zero_bsr (to make it branchless) in either case - if anything it can only have a positive impact. I suspect the same to be true on Intel.
But a branchy std::bit_ceil is still about 40% faster than the branchless bit_ceil_optimized_bsr variant on Zen 3. I'd be curious what the difference is on Intel, but I suspect it will be a lot less there, since it has fast bsf/bsr.
@AlexGuteniev What would your choice be? Given that the __isa_available check bloats the size of the assembly by about 2x, is it worth the 0.3 ns throughput improvement on AMD? 🤔
On an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz:
2022-07-07T16:39:34+03:00
Running C:\Users\Aleksandr.Gutenev\source\repos\ConsoleApplication15\x64\Release\ConsoleApplication15.exe
Run on (12 X 2208 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 9216 KiB (x1)
-------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------
std::countl_zero 0.650 ns 0.656 ns 1000000000 items_per_second=1.52381G/s
std::_Countl_zero_bsr 0.836 ns 0.837 ns 896000000 items_per_second=1.19467G/s
std::_Countl_zero_lzcnt 0.682 ns 0.688 ns 1000000000 items_per_second=1.45455G/s
_Countl_zero_bsr 1.11 ns 1.10 ns 640000000 items_per_second=910.222M/s
std::countr_zero 0.675 ns 0.663 ns 896000000 items_per_second=1.50905G/s
std::_Countr_zero_bsf 1.04 ns 1.03 ns 640000000 items_per_second=975.238M/s
std::_Countr_zero_tzcnt 0.638 ns 0.625 ns 1000000000 items_per_second=1.6G/s
std::bit_ceil 1.78 ns 1.76 ns 407272727 items_per_second=566.64M/s
bit_ceil_optimized 1.36 ns 1.35 ns 497777778 items_per_second=740.879M/s
bit_ceil_optimized_checked 1.42 ns 1.41 ns 497777778 items_per_second=707.951M/s
bit_ceil_optimized_bsr 1.31 ns 1.32 ns 497777778 items_per_second=758.519M/s
A colleague kindly ran the benchmark on an i7-9700k PC for me with similar results:
Run on (8 X 3600 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 12288 KiB (x1)
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
std::countl_zero 0.591 ns 0.547 ns 1000000000
std::_Countl_zero_bsr 0.704 ns 0.203 ns 1000000000
std::_Countl_zero_lzcnt 0.675 ns 0.141 ns 1000000000
_Countl_zero_bsr 0.942 ns 0.219 ns 1000000000
std::countr_zero 0.575 ns 0.172 ns 1000000000
std::_Countr_zero_bsf 0.909 ns 0.281 ns 1000000000
std::_Countr_zero_tzcnt 0.527 ns 0.234 ns 1000000000
std::bit_ceil 1.15 ns 0.328 ns 1000000000
bit_ceil_optimized 1.18 ns 0.266 ns 1000000000
bit_ceil_optimized_checked 1.18 ns 0.391 ns 1000000000
bit_ceil_optimized_bsr 1.12 ns 0.203 ns 1000000000
So it looks like that micro-benchmark still shows that the branch doesn't cost a lot, and its cost is comparable to the unexplained variations, yet the bsr/lzcnt and bsf/tzcnt difference is clearly significant and worth a branch.
Plus, speculation will eventually make that branch "essentially free" because the value of __isa_available will never change.
> So it looks like that micro-benchmark still shows that the branch doesn't cost a lot, and its cost is comparable to the unexplained variations
It does cost a lot: in binary size. To be specific:

| Function | Binary size (bytes) |
|---|---|
| std::bit_ceil | 75 |
| bit_ceil_optimized | 87 |
| bit_ceil_optimized_checked | 58 |
| bit_ceil_optimized_bsr | 31 |
This is practically the same argument for me as for the branchless clamp issue I opened: Should we really optimize the STL for hot paths to the detriment of everything else? bit_ceil_optimized in particular shows this: By removing a branch in std::bit_ceil we don't reduce binary size - we increase it even further, because the optimizer at /O2 forces even more redundant code into the two __isa_available branches.
So I ask: Is a 0.3-0.5 ns (1-2 cycles) improvement really "worth a branch"? Is it worth doubling the binary size of a function? Wouldn't a user of the STL have to be running a few million countl_zero calls at least before the difference is noticeable?
Speculation will make the branch essentially free, that's true. But that's only true for small-ish hot paths like the ones in the microbenchmarks we're running. Those 2 countl_zero branches will force at least 2 other branches out of the branch target buffer. If those were important, it'll cost an extra 5-20 cycles of latency to get them back in. Is the __isa_available check still cost-free then? You'd have to be running at least 10-40 countl_zero calls in a loop (!) before it offsets that cost. It's certainly not cost-free if you don't loop.
Basically, to summarize, my argument is identical to the clamp one: Since the difference in performance is so minuscule, we should in my personal opinion err on the side of compactness, because that's the safer choice.
> Plus, speculation will eventually make that branch "essentially free" because the value of __isa_available will never change.
Even so, it will take up space in the BTB, and it may be cleared or evicted frequently. We could at least annotate the branch to indicate that the CPU likely has lzcnt. It's really, really, really hard to benchmark the effects of code size.
As a practical matter, the additional code size could also prevent the function from being inlined at all, which would definitely be worse than the slow version with inlining (especially if that then goes on to prevent autovectorization).
If we get really creative, we could eliminate the BTB entry by making the check a static branch with a "JUMP_LABEL"-type construct, but that could make code size worse since it may break page sharing for that page.
We could also move the bsr version into a separate function, so it won't be inlined alongside the lzcnt code (of course, if you actually need bsr, that is rather painful).
llvm-mca results for CPUs that are slow with bsr would be interesting, since that could give some estimate of the impact of the BTB entry and the I$ difference.
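For what it's worth, here's a hedged sketch of what annotating that branch could look like with C++20's [[likely]] attribute; the __isa_available >= __ISA_AVAILABLE_AVX2 comparison mirrors the STL's existing lzcnt check, while the function name and surrounding structure are purely illustrative:

```cpp
#include <intrin.h>            // _BitScanReverse, __lzcnt
#include <isa_availability.h>  // __ISA_AVAILABLE_AVX2 (MSVC-internal header)

extern "C" int __isa_available; // set by the CRT at startup

inline int countl_zero_dispatch(unsigned int value) noexcept {
    // Hint that the lzcnt path is the common case on contemporary hardware.
    if (__isa_available >= __ISA_AVAILABLE_AVX2) [[likely]] {
        return static_cast<int>(__lzcnt(value));
    }
    unsigned long index;
    if (!_BitScanReverse(&index, value)) { // bsr fallback when lzcnt can't be assumed
        return 32;
    }
    return static_cast<int>(31 - index);
}
```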
I'm pretty warm to this idea.
> Basically, to summarize, my argument is identical to the clamp one: Since the difference in performance is so minuscule, we should in my personal opinion err on the side of compactness, because that's the safer choice.
My opinion is that the size impact is worth the optimization. The default /O2 implies /Ot, not /Os, and while /Ot makes similar sacrifices of space for even smaller gains, it is still preferred.
Also, the main reason to prefer a cmov-based clamp is stable performance for unpredicted input. Here, the checked version has this property, whereas bsr has a branch to check for zero input, so the unconditional bsr is the one that has surprises.
> We could also move the bsr version into a separate function, so it won't be inlined alongside the lzcnt code (of course, if you actually need bsr, that is rather painful).
Might get us something, but not too much. The call instruction itself still has a significant size.
> Also, the main reason to prefer a cmov-based clamp is stable performance for unpredicted input. Here, the checked version has this property, whereas bsr has a branch to check for zero input, so the unconditional bsr is the one that has surprises.
I have wonderful news then! 🙂
Check out my godbolt link above - it features a branchless bsr variant (_Countl_zero_bsr) which is the foundation of this issue. Without it, making std::bit_ceil as compact (entirely branchless) and performant as I did wouldn't have been possible (resulting in bit_ceil_optimized_bsr).
Here's how it works: _BitScanReverse doesn't actually compute a proper return value; its return value is just the ZF flag (i.e. whether the input was zero). x86 has no branchless return, but it does have plenty of branchless assignments. So we only need to make a one-line change for it to turn branchless: https://godbolt.org/z/j5xPxGoEo
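For illustration, here's a minimal sketch of that technique using MSVC's _BitScanReverse intrinsic (the function name is mine, not the actual STL source):

```cpp
#include <intrin.h>

inline int countl_zero_bsr_branchless(unsigned int value) noexcept {
    unsigned long index = 0;
    const unsigned char nonzero = _BitScanReverse(&index, value); // 0 iff value == 0
    int result = static_cast<int>(31 - index);
    if (!nonzero) {
        result = 32; // conditional assignment instead of an early return
    }
    return result;
}
```

At /O1 and above the compiler can turn that trailing assignment into a cmov, which is the whole point.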
By the way, I just noticed that the benchmark gist I posted had a bug - it used bsf instead of bsr. I've updated the gist and 2 benchmark results accordingly. (Some impact to _Countl_zero_bsr on AMD and no impact to anything else, not even on AMD. Weird. Goes to show that microbenchmarks of single instructions like that aren't all that meaningful, huh?)
We talked about this at the weekly maintainer meeting. We don't yet have sufficient information to reach a conclusion - what we'd like to see is:

- Comparisons on a CPU where BSR is slow (maybe Zen 2, we're not sure?)
- A simplified comparison of the status quo versus the recommended branchless BSR technique
- Possible results from llvm-mca as @barcharcraz mentioned
@StephanTLavavej I hope the summary below is alright. Let me know if I should expand on anything in particular. 🙂
> Comparisons on a CPU where BSR is slow (maybe Zen 2, we're not sure?)

All AMD CPUs have slow BSR (including Zen 3). All Intel CPUs have fast BSR (at least on par with lzcnt).
> A simplified comparison of the status quo versus the recommended branchless BSR technique

std::countl_zero:
| CPU | Is bsr fast? | Before (ns/op) | After (ns/op) | Before (int/s) | After (int/s) |
|---|---|---|---|---|---|
| AMD 5950X | ❌ | 0.440 | 1.030 (+134.09% 🟥) | 2272.73M | 970.87M (-57.28% 🟥) |
| Intel i7 8750H | ✅ | 0.650 | 1.110 (+70.77% 🟥) | 1538.46M | 900.90M (-41.44% 🟥) |
| Intel i7 9700k | ✅ | 0.591 | 0.942 (+59.39% 🟥) | 1692.05M | 1061.57M (-37.26% 🟥) |
std::bit_ceil (std::countl_zero + 2 arithmetic operations):
a) Real world time (inconsistent results for 9700k):

| CPU | Is bsr fast? | Before (ns/op) | After (ns/op) | Before (int/s) | After (int/s) |
|---|---|---|---|---|---|
| AMD 5950X | ❌ | 0.843 | 1.21 (+43.53% 🟥) | 1186.24M | 826.45M (-30.33% 🟥) |
| Intel i7 8750H | ✅ | 1.780 | 1.31 (-26.40% 🟩) | 561.80M | 763.36M (+35.88% 🟩) |
| Intel i7 9700k | ✅ | 1.150 | 1.12 (-2.61% 🟩) | 869.57M | 892.86M (+2.68% 🟩) |
b) CPU time (consistent results for 9700k):

| CPU | Is bsr fast? | Before (ns/op) | After (ns/op) | Before (int/s) | After (int/s) |
|---|---|---|---|---|---|
| AMD 5950X | ❌ | 0.854 | 1.170 (+37.00% 🟥) | 1170.96M | 854.70M (-27.01% 🟥) |
| Intel i7 8750H | ✅ | 1.760 | 1.320 (-25.00% 🟩) | 568.18M | 757.58M (+33.33% 🟩) |
| Intel i7 9700k | ✅ | 0.328 | 0.203 (-38.11% 🟩) | 3048.78M | 4926.11M (+61.58% 🟩) |
Binary size:

| Function | Before (bytes) | After (bytes) |
|---|---|---|
| std::countl_zero | 24 | 18 (-25.00% 🟩) |
| std::bit_ceil | 75 | 31 (-58.67% 🟩) |
> Possible results from llvm-mca as @barcharcraz mentioned

I'm not familiar with how to run llvm-mca with MSVC as the compiler, unfortunately. So I went ahead, compiled it with MSVC, ported the assembly over to clang, and ran it for std::bit_ceil on Zen 2:
| | Throughput (lower is better) | llvm-mca output |
|---|---|---|
| Before | 9.06 | https://godbolt.org/z/e95MrPTqn |
| After | 6.79 | https://godbolt.org/z/Ycfca1T4n |
My thoughts given the above: I'd go with bsr over lzcnt for std::countl_zero.

According to these numbers, the branchless BSR version is only 2.68% faster on newer Intel CPUs, while it results in a 30.33% throughput loss on new AMD CPUs. I do not really see a major win in going branchless, because it makes Intel processors only a sliver faster, but makes AMD CPUs a third slower.
The win in code size is nice, but if code size is that important, code should be compiled with /O1. Adding a check in the implementation for /O1 and skipping the lzcnt detection in that case makes a lot more sense than entirely nuking it, IMO.
FYI: The 9700k isn't newer than the 8750H. They're both the same architecture in a different stepping with the same performance characteristics (Coffee Lake H and S). My understanding is that the difference is caused by the desktop vs. mobile power envelope and frequency scaling. If we measure the actual CPU time spent running the benchmark, the improvement on Intel is consistent:
| CPU | Is bsr fast? | Before (CPU time, ns/op) | After (CPU time, ns/op) |
|---|---|---|---|
| AMD 5950X | ❌ | 0.854 | 1.170 (+37.00% 🟥) |
| Intel i7 8750H | ✅ | 1.760 | 1.320 (-25.00% 🟩) |
| Intel i7 9700k | ✅ | 0.328 | 0.203 (-38.11% 🟩) |
Was there ever a decision on what this STL is optimized for? Real-time throughput or CPU-time throughput (or: workstations vs. laptops, unconstrained vs. constrained thermal/power envelope)? As seen above, they aren't always identical, and optimizing for one might sometimes be to the detriment of the other.
I picked real-time throughput as that's consistent with benchmarks done on this GitHub so far (which use std::chrono), but I'd personally go with CPU-time throughput in the future (using GetThreadTimes).
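For reference, a minimal sketch of what measuring per-thread CPU time with GetThreadTimes could look like (the helper name is mine; this is not part of the posted benchmark):

```cpp
#include <windows.h>
#include <cstdint>

// Kernel + user time of the calling thread, in 100 ns ticks.
uint64_t thread_cpu_time_100ns() {
    FILETIME creation, exit_time, kernel, user;
    GetThreadTimes(GetCurrentThread(), &creation, &exit_time, &kernel, &user);
    const auto to_u64 = [](const FILETIME& ft) {
        return (static_cast<uint64_t>(ft.dwHighDateTime) << 32) | ft.dwLowDateTime;
    };
    return to_u64(kernel) + to_u64(user);
}
```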
If we want to know how it performs on modern architectures, we'd still have to test that, preferably on 11th or 12th gen Intel, since those are truly different (but I suspect the results to be similar).
Came here to see if there was already a thread on optimising the BSR-based fallback, and am not disappointed. I happen to have spent some time optimising similar code recently, given that I'm working on a C++17 project that doesn't have <bit> yet. I found that @lhecker's suggested implementation can probably be slightly improved upon by substituting the subtraction with an XOR. Since XOR has an immediate-value encoding, we save ourselves a register and an instruction. I invite you to take a look at the proposed implementation "branchless_xor()" on https://godbolt.org/z/E9d8KW4sa . You can imagine a similar implementation for the 16- and 64-bit variants. One caveat is that we only get a CMOV substitution if at least /O1 is provided. On the surface, the /O0 implementation looks no worse than the previously proposed "branchless()" implementation.
I'm afraid I don't have the right tools set up to benchmark this, so apologies for not taking further steps to try and get this over the line. That said, I hope you can use this suggestion to your advantage!
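For readers who don't open the link, here's a hedged sketch of the idea (not necessarily the exact branchless_xor() from the godbolt link). For index in [0, 31], 31 - index equals index ^ 31, and the xor can encode the 31 as an immediate:

```cpp
#include <intrin.h>

inline int countl_zero_bsr_xor(unsigned int value) noexcept {
    unsigned long index = 0;
    const unsigned char nonzero = _BitScanReverse(&index, value);
    int result = static_cast<int>(index ^ 31u); // xor with an immediate, no extra register
    if (!nonzero) {
        result = 32; // per the caveat above, this becomes a cmov only at /O1 and above
    }
    return result;
}
```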
std::bit_ceil emits quite a bit of assembly on x64. This seems to occur mostly due to the branch in _Checked_x86_x64_countl_zero and to a lesser extent due to the branch in bit_ceil itself. I've written a variant which produces a more compact result: https://godbolt.org/z/q4EEz83aW (It also removes the extra branch on ARM64 by using conditional assignments.)
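As a rough illustration, here's a hedged sketch of what such a compact variant can look like when built on the branchless countl_zero idea (names are illustrative; this is not necessarily the exact code behind the godbolt link):

```cpp
#include <intrin.h>

// bit_ceil(v) == 1 << (32 - countl_zero(v - 1)) for v >= 2, and 1 for v <= 1.
inline unsigned int bit_ceil_sketch(unsigned int value) noexcept {
    unsigned long index = 0;
    _BitScanReverse(&index, value - 1);  // highest set bit of value - 1
    unsigned int shift = static_cast<unsigned int>(index) + 1;
    if (value <= 1) {
        shift = 0;                       // conditional assignment, intended to become cmov
    }
    return 1u << shift;                  // like std::bit_ceil, UB if the result isn't representable
}
```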
I've checked the PR that introduced the code (https://github.com/microsoft/STL/pull/795) and it appears as if the cost of this if condition wasn't discussed. The if condition generally makes sense though: bsr is costly on AMD CPUs (up to Zen 3, 4 cycles/op latency), whereas lzcnt is very fast on any architecture (<= 1 cycle). But it takes up 3 slots in the CPU's branch target buffer (contemporary hardware has ~4096 slots; newly added branches incur an extra 5-20 cycle latency), generates larger binaries after inlining, and unfortunately the added instructions seem to add about ~5 cycles of latency themselves, offsetting the cost of bsr.
This makes me wonder: Should we drop the __isa_available check?