golang / go

The Go programming language
https://go.dev

cmd/compile: greedy basic block layout #66420

Open y1yang0 opened 4 months ago

y1yang0 commented 4 months ago

Proposal Details

Fine-grained performance metrics (such as L1-iTLB, IPC, branch-load, branch-load-miss, etc.) and optimization experiments (#63670) have repeatedly shown that block layout can noticeably impact code execution efficiency.

In contrast to the current case-by-case tuning approach, I propose adopting the traditional and time-tested Pettis-Hansen (PH) basic block layout algorithm. Its core concept is to place hot blocks as close together as possible, allowing basic blocks to fall through whenever feasible and thereby reducing jump instructions. This principle is considered the golden rule in the field of block layout; state-of-the-art algorithms like extTSP and almost all variants are based on this idea, incorporating more advanced heuristics.

The PH algorithm relies on a weighted chain graph, where the weights represent the frequency of edges. In the absence of PGO information, we can only resort to the branch prediction results from the likelyadjust pass. In the future, we can incorporate PGO data as weights to make the algorithm even more effective.

Experiment Results

image

The x-axis represents testcase id, the y-axis indicates the performance change in percentage points, and negative values denote performance improvement.

gopherbot commented 4 months ago

Change https://go.dev/cl/572975 mentions this issue: cmd/compile: greedy basic block layout

mdempsky commented 4 months ago

What does the graph represent? What tests were run? What is a "testcase ID"? What's the difference between the orange and blue lines? How do you interpret the graph to conclude the CL is net positive? Thanks.

mknyszek commented 4 months ago

https://go.dev/cl/c/go/+/571535 is another CL related to basic block reordering.

y1yang0 commented 4 months ago

@mknyszek Thanks for your reminder, I'll take a closer look at it.

@mdempsky Hi,

What does the graph represent?

A Chain Graph is composed of chains and edges. It aggregates a series of blocks into chains, which are then interconnected by edges. The purpose of this is to minimize jmp instructions and maximize fallthrough occurrences as much as possible.

image

A vivid and real-life example can be referenced in the image below.

image

What tests were run? What is a "testcase ID"? What's the difference between the orange and blue lines? How do you interpret the graph to conclude the CL is net positive?

The ID is an index into the list of packages: testcase 1 is pkg archive/tar, testcase 2 is pkg archive/zip, and so on. Because the package names are long, I've used IDs to represent them on the x-axis. Orange means round1 test and blue indicates round2.

Each round runs all Go tests with count=10 on a given CPU set:

taskset -c 0-32 $GOROOT/bin/go test -bench=. -timeout=99h -count=10  ./...

Changes within 2% are considered reasonable and tolerable fluctuations, which allows us to disregard the majority of cases. However, it has notably improved several cases, specifically those with sharp fluctuations exceeding 6%. On the other hand, this is a simple and time-tested algorithm; variations of it are used worldwide for block layout implementation, and I believe Go is also an excellent candidate for its application. Furthermore, the current graph does not contain PGO information; each edge is either 100% taken or 0% taken. Yet, it still manages to achieve such results. We can optimistically and reasonably expect that a graph augmented with PGO information will yield very favorable outcomes.
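The 2% rule of thumb described above can be written down as a small filter. This is only a sketch of that reading of the graph: `classify`, the threshold parameter, and the sample deltas are hypothetical, and a fixed threshold is not a substitute for a statistical comparison.

```go
package main

import (
	"fmt"
	"math"
)

// classify labels a per-package performance change in percent.
// Negative values mean the new build is faster; anything within
// +/- noisePct is treated as a tolerable fluctuation.
func classify(deltaPct, noisePct float64) string {
	switch {
	case math.Abs(deltaPct) <= noisePct:
		return "noise"
	case deltaPct < 0:
		return "improvement"
	default:
		return "regression"
	}
}

func main() {
	// Hypothetical deltas, not taken from the measurements in this thread.
	for _, d := range []float64{-6.5, 1.2, 3.0} {
		fmt.Printf("%+.1f%% -> %s\n", d, classify(d, 2))
	}
}
```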

mdempsky commented 4 months ago

Sorry, I was asking what the graph in your "experimental results" indicated. Your argument seems to be that the graph demonstrates an experimental improvement on the benchmarks, but it looks like noise to me.

Typically we look at the results from running benchstat.

orange means round1 test and blue indicates round2.

Okay. What are "round1" and "round2"?

Can you please instead use "old" and "new"? Or "baseline" and "experiment"? These are unambiguous terms.

Thanks.

alexanius commented 4 months ago

In this CL https://go.dev/cl/c/go/+/ the basic block counters are implemented and the likely information is corrected. You can use it in your algorithm.

y1yang0 commented 4 months ago

Can you please instead use "old" and "new"? Or "baseline" and "experiment"? These are unambiguous terms.

The raw benchstat results are as follows:

round2.log round1.log

They are too big for inspection, which is why I drew the line chart. Each point on the x-axis represents the geomean of the benchmark results for one package. "Round" indicates one complete run of all Go tests. Please let me know if there is anything else that is unclear to you. Thanks!

In this CL https://go.dev/cl/c/go/+/ the basic block counters are implemented and the likely information is corrected. You can use it in your algorithm.

Thanks, this could be a follow-up enhancement for PH block layout IMHO. I'll take a closer look on Monday.

y1yang0 commented 4 months ago

What does the graph represent? What tests were run? What is a "testcase ID"? What's the difference between the orange and blue lines? How do you interpret the graph to conclude the CL is net positive? Thanks.

Hi @mdempsky, do you have any plans to review this patch? Thanks

thanm commented 4 months ago

For performance testing, the current best practice is to run the "sweet" and "bent" benchmarks, e.g.

git clone https://go.googlesource.com/benchmarks
cd benchmarks/cmd/bench
go build -o bench.exe .
./bench.exe -goroot `go env GOROOT`

then capture the output of the "bench.exe" run. Once you have two runs' worth (one with your change and one without), create a final report based on

benchstat base-output.txt new-output.txt

This will give you a better picture overall than running the std library package benchmarks (which is what it sounds like you are doing?).

y1yang0 commented 2 months ago

Hi @mdempsky @thanm , sorry for the delay.

This will give you a better picture overall than running the std library package benchmarks (which is what it sounds like you are doing?).

Yes, the above benchmark results are all std library package benchmarks.

The golang/benchmarks results are as follows:

goos: linux
goarch: arm64
pkg: 
                       │   old1.log   │           new1.log           │
                       │ total-bytes  │ total-bytes   vs base        │
Ethereum_bitutil         3.696Mi ± 0%   3.808Mi ± 0%  +3.02% (n=1)
Uber_zap                 6.572Mi ± 0%   6.765Mi ± 0%  +2.94% (n=1)
Aws_jsonutil             7.922Mi ± 0%   8.092Mi ± 0%  +2.14% (n=1)
Gonum_community          4.263Mi ± 0%   4.392Mi ± 0%  +3.04% (n=1)
Gonum_lapack_native      5.798Mi ± 0%   5.979Mi ± 0%  +3.12% (n=1)
Dustin_humanize          4.013Mi ± 0%   4.136Mi ± 0%  +3.08% (n=1)
Commonmark_markdown      5.058Mi ± 0%   5.210Mi ± 0%  +2.99% (n=1)
Kanzi                    4.030Mi ± 0%   4.141Mi ± 0%  +2.75% (n=1)
Gtank_blake2s            3.909Mi ± 0%   4.027Mi ± 0%  +3.02% (n=1)
Cespare_mph              3.522Mi ± 0%   3.621Mi ± 0%  +2.80% (n=1)
Gonum_blas_native        4.749Mi ± 0%   4.865Mi ± 0%  +2.44% (n=1)
Spexs2                   3.985Mi ± 0%   4.105Mi ± 0%  +3.00% (n=1)
Shopify_sarama           10.90Mi ± 0%
Ajstarks_deck_generate   3.595Mi ± 0%   3.694Mi ± 0%  +2.76% (n=1)
Aws_restjson             12.42Mi ± 0%   12.73Mi ± 0%  +2.52% (n=1)
K8s_cache                14.74Mi ± 0%   15.14Mi ± 0%  +2.67% (n=1)
Aws_restxml              12.49Mi ± 0%   12.81Mi ± 0%  +2.54% (n=1)
Ethereum_ecies           4.336Mi ± 0%   4.441Mi ± 0%  +2.43% (n=1)
Ericlagergren_decimal    4.030Mi ± 0%   4.152Mi ± 0%  +3.03% (n=1)
Gonum_traverse           3.676Mi ± 0%   3.780Mi ± 0%  +2.83% (n=1)
Aws_jsonrpc              12.14Mi ± 0%   12.44Mi ± 0%  +2.52% (n=1)
Uber_tally               5.388Mi ± 0%   5.542Mi ± 0%  +2.85% (n=1)
Gonum_path               4.090Mi ± 0%   4.204Mi ± 0%  +2.79% (n=1)
Ethereum_ethash          9.273Mi ± 0%   9.529Mi ± 0%  +2.76% (n=1)
Gonum_mat                6.436Mi ± 0%   6.651Mi ± 0%  +3.35% (n=1)
Bindata                  10.98Mi ± 0%   11.11Mi ± 0%  +1.19% (n=1)
Ethereum_trie            6.619Mi ± 0%   6.810Mi ± 0%  +2.88% (n=1)
Ethereum_corevm          6.240Mi ± 0%   6.395Mi ± 0%  +2.48% (n=1)
Dustin_broadcast         3.572Mi ± 0%   3.672Mi ± 0%  +2.79% (n=1)
Capnproto2               4.617Mi ± 0%   4.753Mi ± 0%  +2.95% (n=1)
K8s_workqueue            7.509Mi ± 0%   7.737Mi ± 0%  +3.04% (n=1)
Semver                   3.991Mi ± 0%   4.109Mi ± 0%  +2.95% (n=1)
Gonum_topo               3.950Mi ± 0%   4.066Mi ± 0%  +2.94% (n=1)
Hugo_hugolib             45.51Mi ± 0%   46.96Mi ± 0%  +3.20% (n=1)
Ethereum_core            13.46Mi ± 0%   13.85Mi ± 0%  +2.91% (n=1)
Cespare_xxhash           3.533Mi ± 0%   3.634Mi ± 0%  +2.88% (n=1)
Benhoyt_goawk_1_18       5.529Mi ± 0%   5.700Mi ± 0%  +3.09% (n=1)
Bloom_bloom              4.377Mi ± 0%   4.503Mi ± 0%  +2.88% (n=1)
geomean                  6.024Mi        6.094Mi       +2.80%       ¹
¹ benchmark set differs from baseline; geomeans may not be comparable

                       │   old1.log    │           new1.log            │
                       │  text-bytes   │  text-bytes    vs base        │
Ethereum_bitutil          1.051Mi ± 0%    1.066Mi ± 0%  +1.47% (n=1)
Uber_zap                  2.013Mi ± 0%    2.041Mi ± 0%  +1.43% (n=1)
Aws_jsonutil              2.346Mi ± 0%    2.375Mi ± 0%  +1.22% (n=1)
Gonum_community           1.268Mi ± 0%    1.284Mi ± 0%  +1.30% (n=1)
Gonum_lapack_native       1.883Mi ± 0%    1.899Mi ± 0%  +0.86% (n=1)
Dustin_humanize           1.171Mi ± 0%    1.188Mi ± 0%  +1.44% (n=1)
Commonmark_markdown       1.479Mi ± 0%    1.500Mi ± 0%  +1.46% (n=1)
Kanzi                     1.148Mi ± 0%    1.163Mi ± 0%  +1.32% (n=1)
Gtank_blake2s             1.140Mi ± 0%    1.158Mi ± 0%  +1.53% (n=1)
Cespare_mph              1014.7Ki ± 0%   1030.0Ki ± 0%  +1.51% (n=1)
Gonum_blas_native         1.508Mi ± 0%    1.523Mi ± 0%  +0.99% (n=1)
Spexs2                    1.163Mi ± 0%    1.181Mi ± 0%  +1.53% (n=1)
Shopify_sarama            3.610Mi ± 0%
Ajstarks_deck_generate    1.012Mi ± 0%    1.027Mi ± 0%  +1.49% (n=1)
Aws_restjson              3.928Mi ± 0%    3.981Mi ± 0%  +1.35% (n=1)
K8s_cache                 5.121Mi ± 0%    5.170Mi ± 0%  +0.95% (n=1)
Aws_restxml               3.959Mi ± 0%    4.013Mi ± 0%  +1.36% (n=1)
Ethereum_ecies            1.257Mi ± 0%    1.274Mi ± 0%  +1.34% (n=1)
Ericlagergren_decimal     1.158Mi ± 0%    1.175Mi ± 0%  +1.48% (n=1)
Gonum_traverse            1.028Mi ± 0%    1.043Mi ± 0%  +1.50% (n=1)
Aws_jsonrpc               3.808Mi ± 0%    3.857Mi ± 0%  +1.31% (n=1)
Uber_tally                1.559Mi ± 0%    1.580Mi ± 0%  +1.37% (n=1)
Gonum_path                1.185Mi ± 0%    1.200Mi ± 0%  +1.32% (n=1)
Ethereum_ethash           2.942Mi ± 0%    2.982Mi ± 0%  +1.37% (n=1)
Gonum_mat                 2.200Mi ± 0%    2.222Mi ± 0%  +1.01% (n=1)
Bindata                   1.345Mi ± 0%    1.364Mi ± 0%  +1.44% (n=1)
Ethereum_trie             2.116Mi ± 0%    2.144Mi ± 0%  +1.30% (n=1)
Ethereum_corevm           1.922Mi ± 0%    1.948Mi ± 0%  +1.33% (n=1)
Dustin_broadcast          1.007Mi ± 0%    1.023Mi ± 0%  +1.51% (n=1)
Capnproto2                1.379Mi ± 0%    1.399Mi ± 0%  +1.47% (n=1)
K8s_workqueue             2.347Mi ± 0%    2.379Mi ± 0%  +1.35% (n=1)
Semver                    1.166Mi ± 0%    1.184Mi ± 0%  +1.55% (n=1)
Gonum_topo                1.131Mi ± 0%    1.147Mi ± 0%  +1.45% (n=1)
Hugo_hugolib              18.35Mi ± 0%    18.70Mi ± 0%  +1.91% (n=1)
Ethereum_core             4.447Mi ± 0%    4.502Mi ± 0%  +1.24% (n=1)
Cespare_xxhash           1017.5Ki ± 0%   1032.8Ki ± 0%  +1.51% (n=1)
Benhoyt_goawk_1_18        1.715Mi ± 0%    1.741Mi ± 0%  +1.52% (n=1)
Bloom_bloom               1.308Mi ± 0%    1.328Mi ± 0%  +1.48% (n=1)
geomean                   1.789Mi         1.779Mi       +1.38%       ¹
¹ benchmark set differs from baseline; geomeans may not be comparable

                       │   old1.log   │           new1.log           │
                       │  data-bytes  │  data-bytes   vs base        │
Ethereum_bitutil         39.92Ki ± 0%   39.92Ki ± 0%   0.00% (n=1)
Uber_zap                 57.95Ki ± 0%   57.95Ki ± 0%   0.00% (n=1)
Aws_jsonutil             48.14Ki ± 0%   48.14Ki ± 0%   0.00% (n=1)
Gonum_community          66.45Ki ± 0%   66.45Ki ± 0%   0.00% (n=1)
Gonum_lapack_native      42.45Ki ± 0%   42.45Ki ± 0%   0.00% (n=1)
Dustin_humanize          40.64Ki ± 0%   40.64Ki ± 0%   0.00% (n=1)
Commonmark_markdown      57.23Ki ± 0%   57.23Ki ± 0%   0.00% (n=1)
Kanzi                    39.36Ki ± 0%   39.36Ki ± 0%   0.00% (n=1)
Gtank_blake2s            39.98Ki ± 0%   39.98Ki ± 0%   0.00% (n=1)
Cespare_mph              39.08Ki ± 0%   39.08Ki ± 0%   0.00% (n=1)
Gonum_blas_native        78.65Ki ± 0%   78.65Ki ± 0%   0.00% (n=1)
Spexs2                   39.77Ki ± 0%   39.77Ki ± 0%   0.00% (n=1)
Shopify_sarama           87.14Ki ± 0%
Ajstarks_deck_generate   39.14Ki ± 0%   39.14Ki ± 0%   0.00% (n=1)
Aws_restjson             76.05Ki ± 0%   76.05Ki ± 0%   0.00% (n=1)
K8s_cache                72.20Ki ± 0%   72.20Ki ± 0%   0.00% (n=1)
Aws_restxml              76.20Ki ± 0%   76.20Ki ± 0%   0.00% (n=1)
Ethereum_ecies           40.08Ki ± 0%   40.08Ki ± 0%   0.00% (n=1)
Ericlagergren_decimal    41.48Ki ± 0%   41.48Ki ± 0%   0.00% (n=1)
Gonum_traverse           41.33Ki ± 0%   41.33Ki ± 0%   0.00% (n=1)
Aws_jsonrpc              74.70Ki ± 0%   74.70Ki ± 0%   0.00% (n=1)
Uber_tally               49.73Ki ± 0%   49.73Ki ± 0%   0.00% (n=1)
Gonum_path               60.75Ki ± 0%   60.75Ki ± 0%   0.00% (n=1)
Ethereum_ethash          71.48Ki ± 0%   71.48Ki ± 0%   0.00% (n=1)
Gonum_mat                72.11Ki ± 0%   72.11Ki ± 0%   0.00% (n=1)
Bindata                  45.36Ki ± 0%   45.36Ki ± 0%   0.00% (n=1)
Ethereum_trie            52.67Ki ± 0%   52.67Ki ± 0%   0.00% (n=1)
Ethereum_corevm          51.80Ki ± 0%   51.80Ki ± 0%   0.00% (n=1)
Dustin_broadcast         38.70Ki ± 0%   38.70Ki ± 0%   0.00% (n=1)
Capnproto2               44.92Ki ± 0%   44.92Ki ± 0%   0.00% (n=1)
K8s_workqueue            49.39Ki ± 0%   49.39Ki ± 0%   0.00% (n=1)
Semver                   40.58Ki ± 0%   40.58Ki ± 0%   0.00% (n=1)
Gonum_topo               48.55Ki ± 0%   48.55Ki ± 0%   0.00% (n=1)
Hugo_hugolib             272.7Ki ± 0%   272.7Ki ± 0%   0.00% (n=1)
Ethereum_core            90.55Ki ± 0%   90.55Ki ± 0%   0.00% (n=1)
Cespare_xxhash           38.58Ki ± 0%   38.58Ki ± 0%   0.00% (n=1)
Benhoyt_goawk_1_18       93.44Ki ± 0%   93.44Ki ± 0%   0.00% (n=1)
Bloom_bloom              41.27Ki ± 0%   41.27Ki ± 0%   0.00% (n=1)
geomean                  54.86Ki        54.18Ki       +0.00%       ¹
¹ benchmark set differs from baseline; geomeans may not be comparable

                       │   old1.log   │           new1.log           │
                       │ rodata-bytes │ rodata-bytes  vs base        │
Ethereum_bitutil         456.2Ki ± 0%   456.6Ki ± 0%  +0.09% (n=1)
Uber_zap                 965.3Ki ± 0%   965.9Ki ± 0%  +0.07% (n=1)
Aws_jsonutil             2.398Mi ± 0%   2.398Mi ± 0%  +0.02% (n=1)
Gonum_community          534.8Ki ± 0%   535.2Ki ± 0%  +0.07% (n=1)
Gonum_lapack_native      739.4Ki ± 0%   740.2Ki ± 0%  +0.12% (n=1)
Dustin_humanize          511.8Ki ± 0%   512.2Ki ± 0%  +0.08% (n=1)
Commonmark_markdown      780.7Ki ± 0%   781.2Ki ± 0%  +0.06% (n=1)
Kanzi                    478.2Ki ± 0%   478.5Ki ± 0%  +0.06% (n=1)
Gtank_blake2s            478.4Ki ± 0%   478.8Ki ± 0%  +0.08% (n=1)
Cespare_mph              437.4Ki ± 0%   437.7Ki ± 0%  +0.07% (n=1)
Gonum_blas_native        575.0Ki ± 0%   575.7Ki ± 0%  +0.14% (n=1)
Spexs2                   493.4Ki ± 0%   493.7Ki ± 0%  +0.06% (n=1)
Shopify_sarama           1.606Mi ± 0%
Ajstarks_deck_generate   462.7Ki ± 0%   463.0Ki ± 0%  +0.07% (n=1)
Aws_restjson             3.107Mi ± 0%   3.107Mi ± 0%  +0.03% (n=1)
K8s_cache                2.046Mi ± 0%   2.047Mi ± 0%  +0.07% (n=1)
Aws_restxml              3.110Mi ± 0%   3.111Mi ± 0%  +0.03% (n=1)
Ethereum_ecies           612.0Ki ± 0%   612.5Ki ± 0%  +0.07% (n=1)
Ericlagergren_decimal    498.5Ki ± 0%   498.9Ki ± 0%  +0.09% (n=1)
Gonum_traverse           469.0Ki ± 0%   469.3Ki ± 0%  +0.08% (n=1)
Aws_jsonrpc              3.041Mi ± 0%   3.041Mi ± 0%  +0.02% (n=1)
Uber_tally               788.4Ki ± 0%   788.8Ki ± 0%  +0.05% (n=1)
Gonum_path               525.5Ki ± 0%   525.8Ki ± 0%  +0.06% (n=1)
Ethereum_ethash          1.390Mi ± 0%   1.391Mi ± 0%  +0.06% (n=1)
Gonum_mat                764.8Ki ± 0%   766.1Ki ± 0%  +0.16% (n=1)
Bindata                  559.6Ki ± 0%   560.2Ki ± 0%  +0.10% (n=1)
Ethereum_trie            936.5Ki ± 0%   937.7Ki ± 0%  +0.12% (n=1)
Ethereum_corevm          905.2Ki ± 0%   905.9Ki ± 0%  +0.09% (n=1)
Dustin_broadcast         445.4Ki ± 0%   445.7Ki ± 0%  +0.07% (n=1)
Capnproto2               628.0Ki ± 0%   628.9Ki ± 0%  +0.15% (n=1)
K8s_workqueue            1.010Mi ± 0%   1.011Mi ± 0%  +0.12% (n=1)
Semver                   504.4Ki ± 0%   504.7Ki ± 0%  +0.06% (n=1)
Gonum_topo               521.5Ki ± 0%   521.9Ki ± 0%  +0.06% (n=1)
Hugo_hugolib             8.921Mi ± 0%   8.927Mi ± 0%  +0.06% (n=1)
Ethereum_core            2.163Mi ± 0%   2.165Mi ± 0%  +0.10% (n=1)
Cespare_xxhash           439.3Ki ± 0%   439.7Ki ± 0%  +0.07% (n=1)
Benhoyt_goawk_1_18       764.3Ki ± 0%   765.2Ki ± 0%  +0.12% (n=1)
Bloom_bloom              549.8Ki ± 0%   550.2Ki ± 0%  +0.07% (n=1)
geomean                  850.4Ki        836.1Ki       +0.08%       ¹
¹ benchmark set differs from baseline; geomeans may not be comparable

                       │   old1.log    │           new1.log            │
                       │ pclntab-bytes │ pclntab-bytes  vs base        │
Ethereum_bitutil          728.0Ki ± 0%    750.4Ki ± 0%  +3.08% (n=1)
Uber_zap                  1.308Mi ± 0%    1.348Mi ± 0%  +3.00% (n=1)
Aws_jsonutil              1.145Mi ± 0%    1.173Mi ± 0%  +2.51% (n=1)
Gonum_community           830.9Ki ± 0%    855.5Ki ± 0%  +2.96% (n=1)
Gonum_lapack_native       1.004Mi ± 0%    1.028Mi ± 0%  +2.47% (n=1)
Dustin_humanize           789.7Ki ± 0%    813.8Ki ± 0%  +3.05% (n=1)
Commonmark_markdown      1018.1Ki ± 0%   1046.6Ki ± 0%  +2.80% (n=1)
Kanzi                     768.8Ki ± 0%    790.9Ki ± 0%  +2.87% (n=1)
Gtank_blake2s             773.2Ki ± 0%    796.8Ki ± 0%  +3.06% (n=1)
Cespare_mph               689.9Ki ± 0%    710.8Ki ± 0%  +3.03% (n=1)
Gonum_blas_native         893.0Ki ± 0%    914.6Ki ± 0%  +2.41% (n=1)
Spexs2                    796.0Ki ± 0%    820.4Ki ± 0%  +3.07% (n=1)
Shopify_sarama            2.250Mi ± 0%
Ajstarks_deck_generate    704.0Ki ± 0%    725.0Ki ± 0%  +2.99% (n=1)
Aws_restjson              2.106Mi ± 0%    2.171Mi ± 0%  +3.04% (n=1)
K8s_cache                 3.120Mi ± 0%    3.186Mi ± 0%  +2.11% (n=1)
Aws_restxml               2.115Mi ± 0%    2.180Mi ± 0%  +3.06% (n=1)
Ethereum_ecies            837.9Ki ± 0%    861.8Ki ± 0%  +2.85% (n=1)
Ericlagergren_decimal     803.3Ki ± 0%    828.2Ki ± 0%  +3.10% (n=1)
Gonum_traverse            721.2Ki ± 0%    742.9Ki ± 0%  +3.02% (n=1)
Aws_jsonrpc               2.049Mi ± 0%    2.111Mi ± 0%  +2.99% (n=1)
Uber_tally                1.045Mi ± 0%    1.076Mi ± 0%  +2.96% (n=1)
Gonum_path                809.5Ki ± 0%    832.8Ki ± 0%  +2.89% (n=1)
Ethereum_ethash           1.897Mi ± 0%    1.957Mi ± 0%  +3.15% (n=1)
Gonum_mat                 1.212Mi ± 0%    1.247Mi ± 0%  +2.84% (n=1)
Bindata                   888.9Ki ± 0%    916.4Ki ± 0%  +3.09% (n=1)
Ethereum_trie             1.363Mi ± 0%    1.406Mi ± 0%  +3.12% (n=1)
Ethereum_corevm           1.291Mi ± 0%    1.326Mi ± 0%  +2.71% (n=1)
Dustin_broadcast          702.7Ki ± 0%    724.0Ki ± 0%  +3.04% (n=1)
Capnproto2                921.6Ki ± 0%    950.8Ki ± 0%  +3.16% (n=1)
K8s_workqueue             1.585Mi ± 0%    1.629Mi ± 0%  +2.77% (n=1)
Semver                    791.4Ki ± 0%    815.5Ki ± 0%  +3.05% (n=1)
Gonum_topo                771.7Ki ± 0%    795.4Ki ± 0%  +3.08% (n=1)
Hugo_hugolib              8.440Mi ± 0%    8.767Mi ± 0%  +3.88% (n=1)
Ethereum_core             2.857Mi ± 0%    2.951Mi ± 0%  +3.30% (n=1)
Cespare_xxhash            693.4Ki ± 0%    714.3Ki ± 0%  +3.02% (n=1)
Benhoyt_goawk_1_18        1.079Mi ± 0%    1.113Mi ± 0%  +3.18% (n=1)
Bloom_bloom               881.5Ki ± 0%    908.0Ki ± 0%  +3.00% (n=1)
geomean                   1.124Mi         1.136Mi       +2.97%       ¹
¹ benchmark set differs from baseline; geomeans may not be comparable

                       │   old1.log   │            new1.log            │
                       │ zdebug-bytes │ zdebug-bytes  vs base          │
Ethereum_bitutil         0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Uber_zap                 0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Aws_jsonutil             0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Gonum_community          0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Gonum_lapack_native      0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Dustin_humanize          0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Commonmark_markdown      0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Kanzi                    0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Gtank_blake2s            0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Cespare_mph              0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Gonum_blas_native        0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Spexs2                   0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Shopify_sarama           0.000 ± 0%
Ajstarks_deck_generate   0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Aws_restjson             0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
K8s_cache                0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Aws_restxml              0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Ethereum_ecies           0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Ericlagergren_decimal    0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Gonum_traverse           0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Aws_jsonrpc              0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Uber_tally               0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Gonum_path               0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Ethereum_ethash          0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Gonum_mat                0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Bindata                  0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Ethereum_trie            0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Ethereum_corevm          0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Dustin_broadcast         0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Capnproto2               0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
K8s_workqueue            0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Semver                   0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Gonum_topo               0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Hugo_hugolib             0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Ethereum_core            0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Cespare_xxhash           0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Benhoyt_goawk_1_18       0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
Bloom_bloom              0.000 ± 0%       0.000 ± 0%   0.00% (n=1)
geomean                             ¹                 +0.00%       ² ¹
¹ summaries must be >0 to compute geomean
² benchmark set differs from baseline; geomeans may not be comparable

pkg: github.com/Masterminds/semver
                            │  old1.log   │              new1.log              │
                            │   sec/op    │   sec/op     vs base               │
ValidateVersionTildeFail-32   613.2n ± 2%   591.8n ± 1%  -3.49% (p=0.000 n=10)

pkg: github.com/Shopify/sarama
                          │  old1.log   │
                          │   sec/op    │
Broker_Open-32              109.4µ ± 4%
Broker_No_Metrics_Open-32   48.43µ ± 4%
geomean                     72.77µ

pkg: github.com/ajstarks/deck/generate
           │  old1.log   │              new1.log              │
           │   sec/op    │   sec/op     vs base               │
Arc-32       2.153µ ± 0%   2.131µ ± 2%  -1.00% (p=0.001 n=10)
Polygon-32   4.593µ ± 0%   4.624µ ± 1%  +0.66% (p=0.001 n=10)
geomean      3.145µ        3.139µ       -0.17%

pkg: github.com/aws/aws-sdk-go/private/protocol/json/jsonutil
              │  old1.log   │              new1.log              │
              │   sec/op    │   sec/op     vs base               │
BuildJSON-32    4.209µ ± 0%   4.232µ ± 0%  +0.53% (p=0.000 n=10)
StdlibJSON-32   2.487µ ± 0%   2.508µ ± 0%  +0.86% (p=0.000 n=10)
geomean         3.235µ        3.258µ       +0.70%

pkg: github.com/benhoyt/goawk/interp
                       │  old1.log   │              new1.log              │
                       │   sec/op    │   sec/op     vs base               │
RecursiveFunc-32         11.37µ ± 1%   11.82µ ± 1%  +3.97% (p=0.000 n=10)
RegexMatch-32            910.0n ± 0%   903.9n ± 1%       ~ (p=0.052 n=10)
RepeatExecProgram-32     11.39µ ± 1%   11.40µ ± 0%       ~ (p=0.481 n=10)
RepeatNew-32             68.09n ± 2%   67.80n ± 1%       ~ (p=0.481 n=10)
RepeatIOExecProgram-32   22.31µ ± 1%   22.45µ ± 1%       ~ (p=0.123 n=10)
RepeatIONew-32           818.9n ± 0%   813.9n ± 1%  -0.62% (p=0.009 n=10)
geomean                  2.296µ        2.307µ       +0.48%

pkg: github.com/bits-and-blooms/bloom/v3
                      │  old1.log   │              new1.log              │
                      │   sec/op    │   sec/op     vs base               │
SeparateTestAndAdd-32   342.1n ± 2%   347.9n ± 2%       ~ (p=0.143 n=10)
CombinedTestAndAdd-32   346.6n ± 3%   355.1n ± 3%       ~ (p=0.093 n=10)
geomean                 344.4n        351.5n       +2.07%

pkg: github.com/dustin/go-broadcast
                      │  old1.log   │              new1.log              │
                      │   sec/op    │   sec/op     vs base               │
DirectSend-32           385.2n ± 2%   381.5n ± 1%       ~ (p=0.159 n=10)
ParallelDirectSend-32   395.9n ± 2%   396.1n ± 1%       ~ (p=0.987 n=10)
ParallelBrodcast-32     547.0n ± 1%   546.2n ± 1%       ~ (p=0.868 n=10)
MuxBrodcast-32          543.8n ± 3%   557.6n ± 6%       ~ (p=0.210 n=10)
geomean                 461.5n        463.2n       +0.37%

pkg: github.com/dustin/go-humanize
                 │  old1.log   │           new1.log            │
                 │   sec/op    │   sec/op     vs base          │
ParseBigBytes-32   1.536µ ± 3%   1.532µ ± 2%  ~ (p=0.782 n=10)

pkg: github.com/egonelbre/spexs2/_benchmark
              │  old1.log  │             new1.log              │
              │   sec/op   │   sec/op    vs base               │
Run/10k/1-32    19.18 ± 1%   18.86 ± 1%  -1.68% (p=0.000 n=10)
Run/10k/16-32   3.666 ± 5%   3.630 ± 3%       ~ (p=0.393 n=10)
geomean         8.386        8.274       -1.33%

pkg: github.com/ericlagergren/decimal/benchmarks
                                       │  old1.log   │              new1.log              │
                                       │   sec/op    │   sec/op     vs base               │
Pi/foo=ericlagergren_(Go)/prec=100-32    124.7µ ± 0%   125.2µ ± 0%  +0.36% (p=0.006 n=10)
Pi/foo=ericlagergren_(GDA)/prec=100-32   258.2µ ± 0%   257.8µ ± 0%       ~ (p=0.218 n=10)
Pi/foo=shopspring/prec=100-32            457.1µ ± 1%   459.4µ ± 1%       ~ (p=0.393 n=10)
Pi/foo=apmckinlay/prec=100-32            2.345µ ± 0%   2.269µ ± 0%  -3.24% (p=0.000 n=10)
Pi/foo=go-inf/prec=100-32                120.3µ ± 1%   119.3µ ± 1%       ~ (p=0.190 n=10)
Pi/foo=float64/prec=100-32               3.314µ ± 0%   3.316µ ± 0%  +0.05% (p=0.014 n=10)
geomean                                  48.96µ        48.68µ       -0.57%

pkg: github.com/ethereum/go-ethereum/common/bitutil
                         │  old1.log   │              new1.log               │
                         │   sec/op    │   sec/op     vs base                │
FastTest2KB-32             690.5n ± 0%   690.4n ± 0%        ~ (p=1.000 n=10)
BaseTest2KB-32             690.6n ± 0%   689.5n ± 0%   -0.15% (p=0.021 n=10)
Encoding4KBVerySparse-32   12.45µ ± 0%   11.17µ ± 0%  -10.25% (p=0.000 n=10)
geomean                    1.810µ        1.745µ        -3.59%

                         │   old1.log   │             new1.log             │
                         │     B/op     │     B/op      vs base            │
Encoding4KBVerySparse-32   9.750Ki ± 0%   9.750Ki ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

                         │  old1.log  │            new1.log            │
                         │ allocs/op  │ allocs/op   vs base            │
Encoding4KBVerySparse-32   15.00 ± 0%   15.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

pkg: github.com/ethereum/go-ethereum/consensus/ethash
                  │  old1.log   │              new1.log              │
                  │   sec/op    │   sec/op     vs base               │
HashimotoLight-32   1.723m ± 8%   1.619m ± 7%  -6.06% (p=0.043 n=10)

pkg: github.com/ethereum/go-ethereum/core
                             │  old1.log   │              new1.log              │
                             │   sec/op    │   sec/op     vs base               │
PendingDemotion10000-32        112.0n ± 2%   114.0n ± 0%  +1.74% (p=0.012 n=10)
FuturePromotion10000-32        2.416n ± 0%   2.417n ± 0%       ~ (p=0.955 n=10)
PoolBatchInsert10000-32        741.6m ± 0%   744.8m ± 0%  +0.43% (p=0.000 n=10)
PoolBatchLocalInsert10000-32   822.1m ± 0%   826.2m ± 0%  +0.49% (p=0.000 n=10)
geomean                        113.3µ        114.1µ       +0.67%

pkg: github.com/ethereum/go-ethereum/core/vm
            │  old1.log   │              new1.log              │
            │   sec/op    │   sec/op     vs base               │
OpDiv128-32   91.33n ± 1%   90.87n ± 1%  -0.50% (p=0.027 n=10)

pkg: github.com/ethereum/go-ethereum/crypto/ecies
                    │  old1.log   │              new1.log              │
                    │   sec/op    │   sec/op     vs base               │
GenerateKeyP256-32    15.96µ ± 0%   15.95µ ± 0%       ~ (p=0.382 n=10)
GenSharedKeyP256-32   59.00µ ± 0%   59.03µ ± 0%       ~ (p=0.271 n=10)
GenSharedKeyS256-32   58.53µ ± 0%   58.49µ ± 0%  -0.05% (p=0.002 n=10)
geomean               38.05µ        38.04µ       -0.03%

pkg: github.com/ethereum/go-ethereum/trie
                                │  old1.log   │              new1.log              │
                                │   sec/op    │   sec/op     vs base               │
HashFixedSize/10K-32              3.663m ± 1%   3.688m ± 1%       ~ (p=0.190 n=10)
CommitAfterHashFixedSize/10K-32   14.41m ± 1%   14.41m ± 2%       ~ (p=0.684 n=10)
geomean                           7.267m        7.289m       +0.30%

                                │   old1.log   │              new1.log               │
                                │     B/op     │     B/op      vs base               │
HashFixedSize/10K-32              6.071Mi ± 0%   6.071Mi ± 0%       ~ (p=0.616 n=10)
CommitAfterHashFixedSize/10K-32   8.633Mi ± 0%   8.632Mi ± 0%       ~ (p=0.353 n=10)
geomean                           7.239Mi        7.239Mi       -0.00%

                                │  old1.log   │              new1.log              │
                                │  allocs/op  │  allocs/op   vs base               │
HashFixedSize/10K-32              77.30k ± 0%   77.30k ± 0%       ~ (p=0.471 n=10)
CommitAfterHashFixedSize/10K-32   79.99k ± 0%   79.99k ± 0%       ~ (p=0.362 n=10)
geomean                           78.63k        78.63k       -0.00%

pkg: github.com/flanglet/kanzi-go/benchmark
        │   old1.log   │              new1.log               │
        │    sec/op    │    sec/op     vs base               │
BWTS-32   0.4279n ± 2%   0.4297n ± 1%       ~ (p=0.684 n=10)
FPAQ-32    15.35m ± 0%    15.35m ± 0%       ~ (p=0.631 n=10)
LZ-32      838.2µ ± 2%    815.4µ ± 2%  -2.71% (p=0.001 n=10)
MTFT-32    900.0µ ± 0%    918.5µ ± 0%  +2.06% (p=0.000 n=10)
geomean    47.18µ         47.15µ       -0.07%

pkg: github.com/gohugoio/hugo/hugolib
                            │  old1.log   │              new1.log              │
                            │   sec/op    │   sec/op     vs base               │
MergeByLanguage-32            528.3n ± 1%   526.8n ± 1%       ~ (p=0.148 n=10)
ResourceChainPostProcess-32   43.84m ± 1%   44.25m ± 1%  +0.94% (p=0.035 n=10)
ReplaceShortcodeTokens-32     2.064µ ± 1%   2.082µ ± 0%  +0.87% (p=0.041 n=10)
geomean                       36.29µ        36.48µ       +0.51%

pkg: github.com/gtank/blake2s
          │  old1.log   │              new1.log              │
          │   sec/op    │   sec/op     vs base               │
Hash8K-32   21.19µ ± 0%   21.09µ ± 0%  -0.46% (p=0.000 n=10)

          │   old1.log   │              new1.log               │
          │     B/s      │     B/s       vs base               │
Hash8K-32   368.8Mi ± 0%   370.5Mi ± 0%  +0.46% (p=0.000 n=10)

pkg: github.com/kevinburke/go-bindata
           │  old1.log   │           new1.log            │
           │   sec/op    │   sec/op     vs base          │
Bindata-32   137.4m ± 1%   136.3m ± 2%  ~ (p=0.105 n=10)

           │   old1.log   │              new1.log               │
           │     B/op     │     B/op      vs base               │
Bindata-32   183.0Mi ± 0%   183.0Mi ± 0%  -0.00% (p=0.004 n=10)

           │  old1.log   │           new1.log            │
           │  allocs/op  │  allocs/op   vs base          │
Bindata-32   5.795k ± 0%   5.793k ± 0%  ~ (p=0.338 n=10)

           │   old1.log   │            new1.log            │
           │     B/s      │     B/s       vs base          │
Bindata-32   47.64Mi ± 1%   48.05Mi ± 1%  ~ (p=0.109 n=10)

pkg: github.com/uber-go/tally
                                │  old1.log   │              new1.log              │
                                │   sec/op    │   sec/op     vs base               │
ScopeTaggedNoCachedSubscopes-32   2.683µ ± 4%   2.673µ ± 2%       ~ (p=0.579 n=10)
HistogramAllocation-32            1.366µ ± 3%   1.352µ ± 3%       ~ (p=0.541 n=10)
geomean                           1.914µ        1.901µ       -0.68%

                       │   old1.log   │            new1.log            │
                       │     B/op     │     B/op      vs base          │
HistogramAllocation-32   1.164Ki ± 0%   1.165Ki ± 0%  ~ (p=0.158 n=10)

                       │  old1.log  │            new1.log            │
                       │ allocs/op  │ allocs/op   vs base            │
HistogramAllocation-32   20.00 ± 0%   20.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

pkg: gitlab.com/golang-commonmark/markdown
                          │  old1.log   │              new1.log              │
                          │   sec/op    │   sec/op     vs base               │
RenderSpecNoHTML-32         5.197m ± 1%   5.300m ± 1%  +1.97% (p=0.000 n=10)
RenderSpec-32               5.220m ± 1%   5.400m ± 1%  +3.43% (p=0.000 n=10)
RenderSpecBlackFriday2-32   3.744m ± 1%   3.760m ± 1%  +0.42% (p=0.023 n=10)
geomean                     4.666m        4.756m       +1.93%

pkg: go.uber.org/zap/zapcore
                                              │   old1.log   │              new1.log               │
                                              │    sec/op    │    sec/op     vs base               │
BufferedWriteSyncer/write_file_with_buffer-32   259.0n ±  4%   266.0n ±  4%       ~ (p=0.123 n=10)
MultiWriteSyncer/2_discarder-32                 7.301n ±  1%   7.285n ±  1%       ~ (p=0.671 n=10)
MultiWriteSyncer/4_discarder-32                 8.025n ±  2%   7.947n ±  2%       ~ (p=0.796 n=10)
MultiWriteSyncer/4_discarder_with_buffer-32     259.9n ±  4%   261.6n ±  4%       ~ (p=0.542 n=10)
WriteSyncer/write_file_with_no_buffer-32        981.5n ±  6%   923.2n ±  3%  -5.94% (p=0.030 n=10)
ZapConsole-32                                   698.6n ±  2%   694.1n ±  2%       ~ (p=0.631 n=10)
JSONLogMarshalerFunc-32                         847.7n ±  3%   841.3n ±  1%       ~ (p=0.353 n=10)
ZapJSON-32                                      424.4n ±  1%   427.7n ±  1%       ~ (p=0.225 n=10)
StandardJSON-32                                 630.3n ±  2%   631.3n ±  1%       ~ (p=0.494 n=10)
Sampler_Check/7_keys-32                         18.73n ± 10%   18.63n ±  5%       ~ (p=0.796 n=10)
Sampler_Check/50_keys-32                        5.141n ±  1%   5.143n ±  2%       ~ (p=0.955 n=10)
Sampler_Check/100_keys-32                       5.130n ±  6%   5.137n ±  8%       ~ (p=0.393 n=10)
Sampler_CheckWithHook/7_keys-32                 78.52n ± 13%   80.55n ± 11%       ~ (p=1.000 n=10)
Sampler_CheckWithHook/50_keys-32                78.57n ± 13%   80.58n ± 11%       ~ (p=0.971 n=10)
Sampler_CheckWithHook/100_keys-32               78.61n ± 13%   80.61n ± 11%       ~ (p=1.000 n=10)
TeeCheck-32                                     70.58n ±  4%   70.42n ±  7%       ~ (p=0.393 n=10)
geomean                                         86.80n         86.93n        +0.15%

pkg: golang.org/x/benchmarks/gc_latency
                                    │   old1.log   │              new1.log              │
                                    │ p99.999-sec  │ p99.999-sec   vs base              │
GCLatency/how=stack/fluff=false-32    104.6µ ± ∞ ¹   110.3µ ± ∞ ¹       ~ (p=0.310 n=5)
GCLatency/how=heap/fluff=false-32     104.6µ ± ∞ ¹   103.2µ ± ∞ ¹       ~ (p=0.690 n=5)
GCLatency/how=global/fluff=false-32   103.0µ ± ∞ ¹   103.6µ ± ∞ ¹       ~ (p=1.000 n=5)
geomean                               104.1µ         105.6µ        +1.50%
¹ need >= 6 samples for confidence interval at level 0.95

                                    │   old1.log   │              new1.log              │
                                    │ p99.9999-sec │ p99.9999-sec  vs base              │
GCLatency/how=stack/fluff=false-32    6.570m ± ∞ ¹   6.472m ± ∞ ¹       ~ (p=1.000 n=5)
GCLatency/how=heap/fluff=false-32     184.9µ ± ∞ ¹   171.0µ ± ∞ ¹       ~ (p=1.000 n=5)
GCLatency/how=global/fluff=false-32   428.9µ ± ∞ ¹   449.0µ ± ∞ ¹       ~ (p=0.151 n=5)
geomean                               804.6µ         792.1µ        -1.55%
¹ need >= 6 samples for confidence interval at level 0.95

pkg: golang.org/x/benchmarks/sweet/benchmarks/biogo-igor
          │  old1.log  │             new1.log              │
          │   sec/op   │   sec/op    vs base               │
BiogoIgor   9.539 ± 3%   9.779 ± 2%  +2.52% (p=0.029 n=10)

          │     old1.log      │              new1.log               │
          │ average-RSS-bytes │ average-RSS-bytes  vs base          │
BiogoIgor        64.62Mi ± 1%        65.49Mi ± 3%  ~ (p=0.218 n=10)

          │    old1.log    │             new1.log             │
          │ peak-RSS-bytes │ peak-RSS-bytes  vs base          │
BiogoIgor     85.62Mi ± 2%     86.64Mi ± 1%  ~ (p=0.123 n=10)

          │   old1.log    │            new1.log             │
          │ peak-VM-bytes │ peak-VM-bytes  vs base          │
BiogoIgor    1.243Gi ± 0%    1.243Gi ± 0%  ~ (p=0.897 n=10)

pkg: golang.org/x/benchmarks/sweet/benchmarks/biogo-krishna
             │  old1.log  │             new1.log              │
             │   sec/op   │   sec/op    vs base               │
BiogoKrishna   13.31 ± 1%   13.49 ± 1%  +1.35% (p=0.000 n=10)

             │     old1.log      │              new1.log               │
             │ average-RSS-bytes │ average-RSS-bytes  vs base          │
BiogoKrishna        3.697Gi ± 0%        3.699Gi ± 0%  ~ (p=0.353 n=10)

             │    old1.log    │             new1.log             │
             │ peak-RSS-bytes │ peak-RSS-bytes  vs base          │
BiogoKrishna     4.115Gi ± 0%     4.115Gi ± 0%  ~ (p=0.755 n=10)

             │   old1.log    │            new1.log             │
             │ peak-VM-bytes │ peak-VM-bytes  vs base          │
BiogoKrishna    5.305Gi ± 0%    5.304Gi ± 0%  ~ (p=0.376 n=10)

pkg: golang.org/x/benchmarks/sweet/benchmarks/bleve-index
                   │  old1.log  │           new1.log           │
                   │   sec/op   │   sec/op    vs base          │
BleveIndexBatch100   4.341 ± 2%   4.421 ± 1%  ~ (p=0.105 n=10)

                   │     old1.log      │              new1.log               │
                   │ average-RSS-bytes │ average-RSS-bytes  vs base          │
BleveIndexBatch100        185.0Mi ± 2%        185.0Mi ± 2%  ~ (p=0.481 n=10)

                   │    old1.log    │             new1.log             │
                   │ peak-RSS-bytes │ peak-RSS-bytes  vs base          │
BleveIndexBatch100     265.2Mi ± 3%     261.1Mi ± 4%  ~ (p=0.796 n=10)

                   │   old1.log    │            new1.log             │
                   │ peak-VM-bytes │ peak-VM-bytes  vs base          │
BleveIndexBatch100    3.844Gi ± 2%    3.844Gi ± 0%  ~ (p=0.670 n=10)

pkg: golang.org/x/benchmarks/sweet/benchmarks/etcd
        │  old1.log   │              new1.log              │
        │   sec/op    │   sec/op     vs base               │
EtcdPut   7.442m ± 2%   7.482m ± 2%       ~ (p=0.739 n=10)
EtcdSTM   107.9m ± 3%   108.1m ± 2%       ~ (p=0.912 n=10)
geomean   28.33m        28.44m       +0.40%

        │     old1.log      │                 new1.log                 │
        │ average-RSS-bytes │ average-RSS-bytes  vs base               │
EtcdPut        112.4Mi ± 2%        111.6Mi ± 5%       ~ (p=0.631 n=10)
EtcdSTM        98.31Mi ± 4%        98.33Mi ± 5%       ~ (p=0.190 n=10)
geomean        105.1Mi             104.8Mi       -0.34%

        │    old1.log    │               new1.log                │
        │ peak-RSS-bytes │ peak-RSS-bytes  vs base               │
EtcdPut     152.9Mi ± 2%     153.6Mi ± 4%       ~ (p=1.000 n=10)
EtcdSTM     128.4Mi ± 3%     128.9Mi ± 3%       ~ (p=0.853 n=10)
geomean     140.1Mi          140.7Mi       +0.43%

        │   old1.log    │               new1.log               │
        │ peak-VM-bytes │ peak-VM-bytes  vs base               │
EtcdPut    11.32Gi ± 0%    11.32Gi ± 0%  +0.00% (p=0.040 n=10)
EtcdSTM    11.25Gi ± 0%    11.25Gi ± 0%  +0.00% (p=0.001 n=10)
geomean    11.29Gi         11.29Gi       +0.00%

        │    old1.log     │                new1.log                │
        │ p50-latency-sec │ p50-latency-sec  vs base               │
EtcdPut       6.333m ± 2%       6.404m ± 3%       ~ (p=0.353 n=10)
EtcdSTM       81.95m ± 3%       82.08m ± 1%       ~ (p=0.912 n=10)
geomean       22.78m            22.93m       +0.64%

        │    old1.log     │                new1.log                │
        │ p90-latency-sec │ p90-latency-sec  vs base               │
EtcdPut       13.48m ± 3%       13.45m ± 5%       ~ (p=0.796 n=10)
EtcdSTM       220.1m ± 3%       217.8m ± 2%       ~ (p=0.631 n=10)
geomean       54.47m            54.12m       -0.64%

        │    old1.log     │                new1.log                │
        │ p99-latency-sec │ p99-latency-sec  vs base               │
EtcdPut       18.91m ± 4%       19.25m ± 5%       ~ (p=0.912 n=10)
EtcdSTM       472.6m ± 4%       467.3m ± 7%       ~ (p=0.684 n=10)
geomean       94.53m            94.84m       +0.33%

        │  old1.log   │              new1.log              │
        │    ops/s    │    ops/s     vs base               │
EtcdPut   131.1k ± 2%   131.1k ± 2%       ~ (p=0.739 n=10)
EtcdSTM   9.229k ± 3%   9.205k ± 2%       ~ (p=0.957 n=10)
geomean   34.78k        34.73k       -0.13%

pkg: golang.org/x/benchmarks/sweet/benchmarks/gopher-lua
                     │  old1.log  │           new1.log           │
                     │   sec/op   │   sec/op    vs base          │
GopherLuaKNucleotide   22.01 ± 0%   22.03 ± 0%  ~ (p=0.912 n=10)

                     │     old1.log      │              new1.log               │
                     │ average-RSS-bytes │ average-RSS-bytes  vs base          │
GopherLuaKNucleotide        36.66Mi ± 1%        36.81Mi ± 1%  ~ (p=0.631 n=10)

                     │    old1.log    │             new1.log             │
                     │ peak-RSS-bytes │ peak-RSS-bytes  vs base          │
GopherLuaKNucleotide     40.60Mi ± 4%     39.91Mi ± 5%  ~ (p=0.197 n=10)

                     │   old1.log    │            new1.log             │
                     │ peak-VM-bytes │ peak-VM-bytes  vs base          │
GopherLuaKNucleotide    1.176Gi ± 0%    1.176Gi ± 0%  ~ (p=0.265 n=10)

pkg: golang.org/x/benchmarks/sweet/benchmarks/markdown
                    │  old1.log   │           new1.log            │
                    │   sec/op    │   sec/op     vs base          │
MarkdownRenderXHTML   186.7m ± 1%   188.2m ± 1%  ~ (p=0.143 n=10)

                    │     old1.log      │                 new1.log                 │
                    │ average-RSS-bytes │ average-RSS-bytes  vs base               │
MarkdownRenderXHTML        22.27Mi ± 4%        22.50Mi ± 2%  +1.02% (p=0.022 n=10)

                    │    old1.log    │               new1.log                │
                    │ peak-RSS-bytes │ peak-RSS-bytes  vs base               │
MarkdownRenderXHTML     22.29Mi ± 1%     22.79Mi ± 7%  +2.21% (p=0.001 n=10)

                    │   old1.log    │            new1.log             │
                    │ peak-VM-bytes │ peak-VM-bytes  vs base          │
MarkdownRenderXHTML    1.175Gi ± 0%    1.175Gi ± 0%  ~ (p=0.376 n=10)

pkg: gonum.org/v1/gonum/blas/gonum
                         │  old1.log   │              new1.log              │
                         │   sec/op    │   sec/op     vs base               │
Dnrm2MediumPosInc-32       1.970µ ± 0%   1.827µ ± 0%  -7.26% (p=0.000 n=10)
DasumMediumUnitaryInc-32   666.5n ± 0%   667.3n ± 0%  +0.11% (p=0.000 n=10)
geomean                    1.146µ        1.104µ       -3.64%

pkg: gonum.org/v1/gonum/graph/community
                            │  old1.log   │           new1.log            │
                            │   sec/op    │   sec/op     vs base          │
LouvainDirectedMultiplex-32   16.10m ± 2%   16.05m ± 1%  ~ (p=0.190 n=10)

pkg: gonum.org/v1/gonum/graph/topo
                          │  old1.log   │              new1.log              │
                          │   sec/op    │   sec/op     vs base               │
TarjanSCCGnp_10_tenth-32    5.919µ ± 1%   5.884µ ± 1%  -0.59% (p=0.043 n=10)
TarjanSCCGnp_1000_half-32   60.74m ± 0%   58.22m ± 0%  -4.15% (p=0.000 n=10)
geomean                     599.6µ        585.3µ       -2.39%

pkg: gonum.org/v1/gonum/graph/traverse
                                     │  old1.log   │              new1.log               │
                                     │   sec/op    │    sec/op     vs base               │
WalkAllBreadthFirstGnp_10_tenth-32     3.053µ ± 0%    3.055µ ± 0%       ~ (p=0.590 n=10)
WalkAllBreadthFirstGnp_1000_tenth-32   9.615m ± 0%   10.025m ± 0%  +4.26% (p=0.000 n=10)
geomean                                171.3µ         175.0µ       +2.15%

pkg: gonum.org/v1/gonum/lapack/gonum
                      │  old1.log   │              new1.log              │
                      │   sec/op    │   sec/op     vs base               │
Dgeev/Circulant10-32    25.50µ ± 1%   25.39µ ± 0%  -0.46% (p=0.000 n=10)
Dgeev/Circulant100-32   8.195m ± 0%   8.259m ± 0%  +0.79% (p=0.000 n=10)
geomean                 457.2µ        457.9µ       +0.16%

pkg: gonum.org/v1/gonum/mat
                                  │   old1.log   │              new1.log               │
                                  │    sec/op    │    sec/op     vs base               │
MulWorkspaceDense1000Hundredth-32   16.87m ±  0%   16.85m ±  0%  -0.11% (p=0.035 n=10)
ScaleVec10000Inc20-32               23.78µ ± 10%   23.22µ ± 17%       ~ (p=0.912 n=10)
geomean                             633.4µ         625.6µ        -1.23%

pkg: k8s.io/client-go/tools/cache
                           │  old1.log   │              new1.log              │
                           │   sec/op    │   sec/op     vs base               │
Listener-32                  1.151µ ± 3%   1.129µ ± 4%       ~ (p=0.149 n=10)
ReflectorResyncChanMany-32   551.7n ± 1%   548.8n ± 1%       ~ (p=0.280 n=10)
geomean                      796.9n        787.3n       -1.20%

            │  old1.log  │            new1.log            │
            │    B/op    │    B/op     vs base            │
Listener-32   16.00 ± 0%   16.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

            │  old1.log  │            new1.log            │
            │ allocs/op  │ allocs/op   vs base            │
Listener-32   1.000 ± 0%   1.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

pkg: k8s.io/client-go/util/workqueue
                                                         │  old1.log   │              new1.log              │
                                                         │   sec/op    │   sec/op     vs base               │
ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-32     317.2µ ± 3%   323.8µ ± 1%  +2.09% (p=0.005 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:10-32    38.90µ ± 2%   38.31µ ± 1%  -1.53% (p=0.003 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:100-32   17.60µ ± 1%   17.53µ ± 1%       ~ (p=0.079 n=10)
ParallelizeUntil/pieces:999,workers:10,chunkSize:13-32     31.58µ ± 1%   30.88µ ± 1%  -2.23% (p=0.000 n=10)
geomean                                                    51.17µ        50.90µ       -0.53%

pkg: zombiezen.com/go/capnproto2
                               │  old1.log   │              new1.log              │
                               │   sec/op    │   sec/op     vs base               │
TextMovementBetweenSegments-32   423.0µ ± 0%   421.0µ ± 1%  -0.47% (p=0.035 n=10)
Growth_MultiSegment-32           15.19m ± 0%   15.09m ± 0%  -0.65% (p=0.000 n=10)
geomean                          2.534m        2.520m       -0.56%

                       │   old1.log   │            new1.log            │
                       │     B/op     │     B/op      vs base          │
Growth_MultiSegment-32   1.572Mi ± 0%   1.572Mi ± 0%  ~ (p=0.239 n=10)

                       │  old1.log  │            new1.log            │
                       │ allocs/op  │ allocs/op   vs base            │
Growth_MultiSegment-32   21.00 ± 0%   21.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

                       │   old1.log   │              new1.log               │
                       │     B/s      │     B/s       vs base               │
Growth_MultiSegment-32   65.85Mi ± 0%   66.29Mi ± 0%  +0.66% (p=0.000 n=10)

I think the current performance results for both std and golang/benchmarks are sufficient to drive this patch forward. It has a slight overall advantage over the current algorithm and offers a significant benefit for functions with many blocks. Furthermore, to optimize code layout using PGO information in the future, it is reasonable to replace the case-by-case tuning algorithm with the time-tested PH algorithm. I look forward to your comments and reviews.

y1yang0 commented 2 months ago

Gentle PING: @mdempsky @thanm

y1yang0 commented 1 month ago

PING: Anyone?

randall77 commented 1 month ago

When we last left the CL (572975, right?) it had trybot failures. Those need to be fixed at some point. (Generally it is probably better to ping on the CL than the issue.)

More generally, it is unclear to me at least that this is actually a performance improvement, as opposed to just noise. I think it would be good to take 2 benchmarks, one which improves by a bunch (perhaps Encoding4KBVerySparse) and one which gets worse (perhaps WalkAllBreadthFirstGnp_1000_tenth) and investigate whether those performance changes are real, and if so, why. What does the different block ordering do that makes it better/worse?

I think we're still waiting for an answer for how this compares to CL 571535. That CL uses pgo information, but otherwise is it the same algorithm? A different one? If different, how do we choose which one to use?

y1yang0 commented 1 month ago

@randall77

Thank you for the reply.

I hope for a positive start to the review process, and I am happy to fix the errors that the trybot points out.

More generally, it is unclear to me at least that this is actually a performance improvement, as opposed to just noise.

In the first chart of performance results I provided, I ran all std benchmarks (archive, bufio, bytes, crypto, ...) twice, and the results (blue line and orange line) show a high degree of similarity. This means that the same packages consistently receive similar proportions of performance improvements, which is no mere coincidence. I believe it sufficiently demonstrates that the performance results are not noise.

I think we're still waiting for an answer for how this compares to CL 571535. That CL uses pgo information, but otherwise is it the same algorithm? A different one? If different, how do we choose which one to use?

Nearly all basic block layout algorithms (such as exttsp, cache) are variations of the PH algorithm and can be derived by altering its weight computation formula. Another rationale for proposing the current CL is its modest size and clarity, which could simplify the review process. This mitigates one concern I have perceived: many large patches from external contributors do not get merged or move forward. If we clear the first hurdle, that is, if PH gets integrated, then we can proceed to incorporate PGO information into the PH algorithm.

randall77 commented 1 month ago

In the first chart of performance results I provided, I ran all std benchmarks (archive, bufio, bytes, crypto, ...) twice, and the results (blue line and orange line) show a high degree of similarity. This means that the same packages consistently receive similar proportions of performance improvements, which is no mere coincidence. I believe it sufficiently demonstrates that the performance results are not noise.

Unfortunately just running twice is not enough to avoid noise. Noise comes from many sources, some of which are repeatable given the same binary. See, for example, https://go-review.googlesource.com/c/go/+/562157

I really need to see a few examples of where this makes a difference. A particular change this CL makes in this function makes this benchmark faster. And here's the difference in assembly, note the layout changes make this jump instruction go away. That sort of thing.

y1yang0 commented 1 month ago

I really need to see a few examples of where this makes a difference. A particular change this CL makes in this function makes this benchmark faster. And here's the difference in assembly, note the layout changes make this jump instruction go away. That sort of thing.

@randall77 Sure. One example is benchmarkMapAssignInt32 in src/runtime/map_test.go, which shows a reproducible improvement.

bench source

func benchmarkMapAssignInt32(b *testing.B, n int) {
    a := make(map[int32]int)
    for i := 0; i < b.N; i++ {
        a[int32(i&(n-1))] = i // mapassign_fast32
    }
}

benchstat diff

goos: linux
goarch: arm64
pkg: runtime
                         │ baseline2.log │              opt2.log               │
                         │    sec/op     │   sec/op     vs base                │
MapAssign/Int32/256-32       12.91n ± 3%   10.69n ± 3%  -17.20% (p=0.000 n=10)

perf diff

## opt
 Performance counter stats for './runtime.test -test.bench BenchmarkMapAssign -test.run ^$ -test.count 10':

           949,493      L1-icache-load-misses     #    0.00% of all L1-icache accesses
    30,013,349,581      L1-icache-loads                                             
     1,876,125,523      iTLB-loads                                                  
           107,861      iTLB-load-misses          #    0.01% of all iTLB cache accesses
    29,201,122,034      branch-loads                                                
       200,589,891      branch-load-misses                                          

      10.754113684 seconds time elapsed

      10.687244000 seconds user
       0.266782000 seconds sys

0.686% branch load miss

## baseline
 Performance counter stats for './runtime.test -test.bench BenchmarkMapAssign -test.run ^$ -test.count 10':

           962,545      L1-icache-load-misses     #    0.00% of all L1-icache accesses
    33,098,634,858      L1-icache-loads                                             
     3,331,798,155      iTLB-loads                                                  
           113,829      iTLB-load-misses          #    0.00% of all iTLB cache accesses
    27,895,114,371      branch-loads                                                
       401,172,794      branch-load-misses                                          

      11.904014754 seconds time elapsed

      11.924230000 seconds user
       0.168739000 seconds sys

1.438% branch load miss
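
The two miss percentages can be recomputed from the raw counters as branch-load-misses divided by branch-loads. The small sketch below does that arithmetic; `missRate` is a hypothetical helper and the counter values are copied from the perf output above.

```go
package main

import "fmt"

// missRate returns misses as a percentage of loads.
func missRate(misses, loads float64) float64 {
	return misses / loads * 100
}

func main() {
	// branch-load-misses and branch-loads from the two perf runs above.
	opt := missRate(200_589_891, 29_201_122_034)
	base := missRate(401_172_794, 27_895_114_371)

	fmt.Printf("opt:      %.3f%% branch-load misses\n", opt)
	fmt.Printf("baseline: %.3f%% branch-load misses\n", base)
	fmt.Printf("baseline/opt ratio: %.2f\n", base/opt)
}
```

The results are about 0.69% for opt and 1.44% for baseline, so the baseline misses roughly twice as often per branch load.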

ssa diff

The victim of the branch-load-miss is: [image]
The corresponding SSA is: [image]

The baseline handles b37's 'likely' and 'unlikely' branches without distinction, while 'opt' shows a preference for the 'likely' one. This is precisely the concern that PH addresses. See more traces in comment#2

ssa.zip

randall77 commented 1 month ago

Hmm, I don't see that improvement, but my work machine is x86:

$ benchstat ./old ~/sandbox/tmp/src/new 
goos: linux
goarch: amd64
pkg: runtime
cpu: 12th Gen Intel(R) Core(TM) i7-12700
                       │    ./old    │ /usr/local/google/home/khr/sandbox/tmp/src/new │
                       │   sec/op    │            sec/op             vs base          │
MapAssign/Int32/256-20   5.643n ± 2%                    5.572n ± 2%  ~ (p=0.240 n=10)

I will try again on my arm (M2 ultra) when I get home.

randall77 commented 1 month ago

Looking at the generated arm64 assembly for runtime.mapassign_fast32, with your CL the code grows from 808 bytes to 824 bytes. That seems to be a general trend: for example, the binary generated by go test -c runtime is 3.2% bigger. My suspicion is the following. For the code

if c {
    f()
}
g()
return

The tip compiler does

if !c goto skip
call f
skip:
call g
ret

Whereas with your CL it does

if c goto ifbody
rest:
call g
ret
ifbody:
call f
goto rest

The latter is larger (by 1 unconditional jump) but is probably better (forward branches are default predicted not taken?) when c is seldom true.

3+% is a lot of binary size increase. On microbenchmarks binary size doesn't really matter, but on real applications that extra size can lead to more general slowdowns. It's not clear if that would end up countering the observed speedup or not. Maybe it would be worth doing under PGO when we know a function is hot.

y1yang0 commented 1 month ago

Whereas with your CL it does

if c goto ifbody
rest:
call g
ret
ifbody:
call f
goto rest

Yes

but is probably better (forward branches are default predicted not taken?)

Yes

3+% is a lot of binary size increase. On microbenchmarks binary size doesn't really matter, but on real applications that extra size can lead to more general slowdowns. It's not clear if that would end up countering the observed speedup or not. Maybe it would be worth doing under PGO when we know a function is hot.

Based on the test results, the trade-off for the increased binary size is that a few cases see significant performance improvements (10%+), while most cases show no obvious change. With PGO, the majority of cases may see noticeable improvements, which is exactly what I want to work on next. Perhaps in the future we could consider a configurable block layout algorithm, like GOEXPERIMENT=ph/basic. Additionally, the fact that a large number of compilers apply this algorithm strengthens our confidence to some extent.

randall77 commented 1 month ago

On my M2 ultra, the performance of your CL is quite good.

goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M2 Ultra
                       │ /Users/khr/sandbox/tmp/src/old │                ./new                │
                       │             sec/op             │   sec/op     vs base                │
MapAssign/Int32/256-24                      8.300n ± 3%   6.490n ± 2%  -21.81% (p=0.000 n=10)

y1yang0 commented 1 month ago

So, considering all the information given, is it enough to start our review process? Or is there any additional information required from me? Thanks.

randall77 commented 1 month ago

I will take a look at your CL today. I think we still have to figure out the right way to decide when to turn it on.

y1yang0 commented 4 weeks ago

I've found a spot where a heuristic helps: we could sort adjacent chains together (9fd9a54). For the given chains:

before:

== Chains:
id:1 priority:1 blocks:[b1 b19]
id:12 priority:2 blocks:[b12 b13 b21]
id:10 priority:2 blocks:[b10 b11 b20]
id:16 priority:2 blocks:[b16 b17 b23]
id:14 priority:2 blocks:[b14 b15 b22]
id:2 priority:2 blocks:[b2 b18 b5]
id:3 priority:2 blocks:[b4 b7 b8]
id:5 priority:1 blocks:[b6 b24]
id:8 priority:1 blocks:[b3 b9]
== BlockOrder:
[b1 b19 b12 b13 b21 b10 b11 b20 b16 b17 b23 b14 b15 b22 b2 b18 b5 b4 b7 b8 b6 b24 b3 b9]

Even though [b1 b19] and [b2 b18 b5] are very close to each other, the generated block order still places them far apart. If we arrange adjacent chains together, we can generate a better block order:

after:

== Chains:
id:1 priority:1 blocks:[b1 b19]
id:2 priority:2 blocks:[b2 b18 b5]
id:3 priority:2 blocks:[b4 b7 b8]
id:12 priority:2 blocks:[b12 b13 b21]
id:10 priority:2 blocks:[b10 b11 b20]
id:16 priority:2 blocks:[b16 b17 b23]
id:14 priority:2 blocks:[b14 b15 b22]
id:5 priority:1 blocks:[b6 b24]
id:8 priority:1 blocks:[b3 b9]
== BlockOrder:
[b1 b19 b2 b18 b5 b4 b7 b8 b12 b13 b21 b10 b11 b20 b16 b17 b23 b14 b15 b22 b6 b24 b3 b9]

--bench

before:
FindBitRange64/Pattern00Size2-32                   0.6684n ±  0%    0.8384n ±  0%  +25.43% (p=0.000 n=10)
FindBitRange64/Pattern00Size8-32                    1.003n ±  0%     1.003n ±  0%        ~ (p=1.000 n=10)
FindBitRange64/Pattern00Size32-32                   1.003n ±  0%     1.003n ±  0%        ~ (p=0.628 n=10)
FindBitRange64/PatternFFFFFFFFFFFFFFFFSize2-32     0.6684n ±  0%    0.8375n ±  0%  +25.29% (p=0.000 n=10)
FindBitRange64/PatternFFFFFFFFFFFFFFFFSize8-32      1.675n ±  0%     1.854n ±  0%  +10.69% (p=0.000 n=10)
FindBitRange64/PatternFFFFFFFFFFFFFFFFSize32-32     2.773n ±  0%     2.937n ±  0%   +5.90% (p=0.000 n=10)
FindBitRange64/PatternAASize2-32                   0.6683n ±  0%    0.8365n ±  0%  +25.16% (p=0.000 n=10)
FindBitRange64/PatternAASize8-32                    1.004n ±  0%     1.003n ±  0%   -0.10% (p=0.017 n=10)
FindBitRange64/PatternAASize32-32                   1.004n ±  0%     1.003n ±  0%   -0.10% (p=0.000 n=10)
FindBitRange64/PatternAAAAAAAAAAAAAAAASize2-32     0.6685n ±  0%    0.8376n ±  0%  +25.28% (p=0.000 n=10)
FindBitRange64/PatternAAAAAAAAAAAAAAAASize8-32      1.003n ±  0%     1.003n ±  0%        ~ (p=0.303 n=10)
FindBitRange64/PatternAAAAAAAAAAAAAAAASize32-32     1.003n ±  0%     1.003n ±  0%        ~ (p=0.263 n=10)
FindBitRange64/Pattern80000000AAAAAAAASize2-32     0.6684n ±  0%    0.8370n ±  0%  +25.22% (p=0.000 n=10)
FindBitRange64/Pattern80000000AAAAAAAASize8-32      1.003n ±  0%     1.003n ±  0%        ~ (p=0.139 n=10)
FindBitRange64/Pattern80000000AAAAAAAASize32-32     1.003n ±  0%     1.003n ±  0%        ~ (p=0.474 n=10)
FindBitRange64/PatternAAAAAAAA00000001Size2-32     0.6684n ±  0%    0.8361n ±  0%  +25.09% (p=0.000 n=10)
FindBitRange64/PatternAAAAAAAA00000001Size8-32      1.003n ±  0%     1.003n ±  0%        ~ (p=0.450 n=10)
FindBitRange64/PatternAAAAAAAA00000001Size32-32     1.003n ±  0%     1.003n ±  0%        ~ (p=0.365 n=10)
FindBitRange64/PatternBBBBBBBBBBBBBBBBSize2-32     0.6686n ±  0%    0.8376n ±  0%  +25.28% (p=0.000 n=10)
FindBitRange64/PatternBBBBBBBBBBBBBBBBSize8-32      1.506n ±  0%     1.602n ±  1%   +6.37% (p=0.000 n=10)
FindBitRange64/PatternBBBBBBBBBBBBBBBBSize32-32     1.505n ±  0%     1.600n ±  1%   +6.31% (p=0.000 n=10)
FindBitRange64/Pattern80000000BBBBBBBBSize2-32     0.6684n ±  0%    0.8360n ±  0%  +25.07% (p=0.000 n=10)
FindBitRange64/Pattern80000000BBBBBBBBSize8-32      1.505n ±  0%     1.629n ±  1%   +8.27% (p=0.000 n=10)
FindBitRange64/Pattern80000000BBBBBBBBSize32-32     1.506n ±  0%     1.631n ±  1%   +8.30% (p=0.000 n=10)
FindBitRange64/PatternBBBBBBBB00000001Size2-32     0.6686n ±  0%    0.8370n ±  0%  +25.19% (p=0.000 n=10)
FindBitRange64/PatternBBBBBBBB00000001Size8-32      1.505n ±  0%     1.632n ±  1%   +8.47% (p=0.000 n=10)
FindBitRange64/PatternBBBBBBBB00000001Size32-32     1.505n ±  0%     1.628n ±  2%   +8.17% (p=0.000 n=10)
FindBitRange64/PatternCCCCCCCCCCCCCCCCSize2-32     0.6684n ±  0%    0.8364n ±  0%  +25.13% (p=0.000 n=10)
FindBitRange64/PatternCCCCCCCCCCCCCCCCSize8-32      1.506n ±  0%     1.629n ±  1%   +8.16% (p=0.000 n=10)
FindBitRange64/PatternCCCCCCCCCCCCCCCCSize32-32     1.506n ±  0%     1.631n ±  2%   +8.37% (p=0.000 n=10)
FindBitRange64/Pattern4444444444444444Size2-32     0.6686n ±  0%    0.8373n ±  0%  +25.23% (p=0.000 n=10)
FindBitRange64/Pattern4444444444444444Size8-32      1.003n ±  0%     1.003n ±  0%        ~ (p=1.000 n=10)
FindBitRange64/Pattern4444444444444444Size32-32     1.003n ±  0%     1.003n ±  0%        ~ (p=0.628 n=10)
FindBitRange64/Pattern4040404040404040Size2-32     0.6684n ±  0%    0.8375n ±  0%  +25.28% (p=0.000 n=10)
FindBitRange64/Pattern4040404040404040Size8-32      1.003n ±  0%     1.002n ±  0%        ~ (p=0.237 n=10)
FindBitRange64/Pattern4040404040404040Size32-32     1.003n ±  0%     1.003n ±  0%        ~ (p=0.365 n=10)
FindBitRange64/Pattern4000400040004000Size2-32     0.6684n ±  0%    0.8373n ±  0%  +25.27% (p=0.000 n=10)
FindBitRange64/Pattern4000400040004000Size8-32      1.003n ±  0%     1.003n ±  0%        ~ (p=0.582 n=10)
FindBitRange64/Pattern4000400040004000Size32-32     1.003n ±  0%     1.003n ±  0%        ~ (p=0.056 n=10)

after:
FindBitRange64/Pattern00Size2-32                   0.6687n ±  0%    0.6687n ±  0%        ~ (p=0.610 n=10)
FindBitRange64/Pattern00Size8-32                   1.0030n ±  0%    0.8695n ±  0%  -13.31% (p=0.000 n=10)
FindBitRange64/Pattern00Size32-32                  1.0030n ±  0%    0.8695n ±  0%  -13.32% (p=0.000 n=10)
FindBitRange64/PatternFFFFFFFFFFFFFFFFSize2-32     0.6687n ±  0%    0.6688n ±  0%        ~ (p=0.115 n=10)
FindBitRange64/PatternFFFFFFFFFFFFFFFFSize8-32      1.674n ±  0%     2.006n ±  0%  +19.80% (p=0.000 n=10)
FindBitRange64/PatternFFFFFFFFFFFFFFFFSize32-32     2.773n ±  0%     3.344n ±  0%  +20.59% (p=0.000 n=10)
FindBitRange64/PatternAASize2-32                   0.6686n ±  0%    0.6688n ±  0%   +0.02% (p=0.016 n=10)
FindBitRange64/PatternAASize8-32                   1.0025n ±  0%    0.8695n ±  0%  -13.27% (p=0.000 n=10)
FindBitRange64/PatternAASize32-32                  1.0030n ±  0%    0.8695n ±  0%  -13.31% (p=0.000 n=10)
FindBitRange64/PatternAAAAAAAAAAAAAAAASize2-32     0.6685n ±  0%    0.6687n ±  0%   +0.03% (p=0.005 n=10)
FindBitRange64/PatternAAAAAAAAAAAAAAAASize8-32     1.0030n ±  0%    0.8692n ±  0%  -13.33% (p=0.000 n=10)
FindBitRange64/PatternAAAAAAAAAAAAAAAASize32-32    1.0030n ±  0%    0.8694n ±  0%  -13.32% (p=0.000 n=10)
FindBitRange64/Pattern80000000AAAAAAAASize2-32     0.6688n ±  0%    0.6687n ±  0%   -0.01% (p=0.044 n=10)
FindBitRange64/Pattern80000000AAAAAAAASize8-32     1.0030n ±  0%    0.8694n ±  0%  -13.32% (p=0.000 n=10)
FindBitRange64/Pattern80000000AAAAAAAASize32-32    1.0030n ±  0%    0.8693n ±  0%  -13.33% (p=0.000 n=10)
FindBitRange64/PatternAAAAAAAA00000001Size2-32     0.6688n ±  0%    0.6687n ±  0%        ~ (p=0.495 n=10)
FindBitRange64/PatternAAAAAAAA00000001Size8-32     1.0040n ±  0%    0.8692n ±  0%  -13.43% (p=0.000 n=10)
FindBitRange64/PatternAAAAAAAA00000001Size32-32    1.0050n ±  0%    0.8693n ±  0%  -13.50% (p=0.000 n=10)
FindBitRange64/PatternBBBBBBBBBBBBBBBBSize2-32     0.6688n ±  0%    0.6689n ±  0%        ~ (p=0.176 n=10)
FindBitRange64/PatternBBBBBBBBBBBBBBBBSize8-32      1.506n ±  0%     1.507n ±  1%        ~ (p=0.143 n=10)
FindBitRange64/PatternBBBBBBBBBBBBBBBBSize32-32     1.506n ±  0%     1.506n ±  0%        ~ (p=0.268 n=10)
FindBitRange64/Pattern80000000BBBBBBBBSize2-32     0.6688n ±  0%    0.6688n ±  0%        ~ (p=0.413 n=10)
FindBitRange64/Pattern80000000BBBBBBBBSize8-32      1.506n ±  0%     1.505n ±  1%        ~ (p=0.720 n=10)
FindBitRange64/Pattern80000000BBBBBBBBSize32-32     1.506n ±  0%     1.505n ±  0%        ~ (p=0.375 n=10)
FindBitRange64/PatternBBBBBBBB00000001Size2-32     0.6688n ±  0%    0.6686n ±  0%        ~ (p=0.238 n=10)
FindBitRange64/PatternBBBBBBBB00000001Size8-32      1.506n ±  0%     1.505n ±  0%        ~ (p=0.159 n=10)
FindBitRange64/PatternBBBBBBBB00000001Size32-32     1.506n ±  0%     1.505n ±  0%        ~ (p=0.507 n=10)
FindBitRange64/PatternCCCCCCCCCCCCCCCCSize2-32     0.6688n ±  0%    0.6686n ±  0%        ~ (p=0.309 n=10)
FindBitRange64/PatternCCCCCCCCCCCCCCCCSize8-32      1.505n ±  0%     1.506n ±  0%        ~ (p=0.641 n=10)
FindBitRange64/PatternCCCCCCCCCCCCCCCCSize32-32     1.505n ±  0%     1.505n ±  0%        ~ (p=0.926 n=10)
FindBitRange64/Pattern4444444444444444Size2-32     0.6687n ±  0%    0.6687n ±  0%        ~ (p=1.000 n=10)
FindBitRange64/Pattern4444444444444444Size8-32     1.0040n ±  0%    0.8695n ±  0%  -13.40% (p=0.000 n=10)
FindBitRange64/Pattern4444444444444444Size32-32    1.0040n ±  0%    0.8694n ±  0%  -13.41% (p=0.000 n=10)
FindBitRange64/Pattern4040404040404040Size2-32     0.6689n ±  0%    0.6687n ±  0%        ~ (p=0.176 n=10)
FindBitRange64/Pattern4040404040404040Size8-32     1.0030n ±  0%    0.8694n ±  0%  -13.32% (p=0.000 n=10)
FindBitRange64/Pattern4040404040404040Size32-32    1.0030n ±  0%    0.8695n ±  0%  -13.31% (p=0.000 n=10)
FindBitRange64/Pattern4000400040004000Size2-32     0.6688n ±  0%    0.6687n ±  0%   -0.01% (p=0.025 n=10)
FindBitRange64/Pattern4000400040004000Size8-32     1.0030n ±  0%    0.8692n ±  0%  -13.34% (p=0.000 n=10)
FindBitRange64/Pattern4000400040004000Size32-32    1.0030n ±  0%    0.8693n ±  0%  -13.33% (p=0.000 n=10)

7.1 update: I also found that the "before" precedence relation was not completely correct. If "a" comes before "b" and "b" comes before "c", we expect the relation to be transitive, i.e. "a" also comes before "c". This has been fixed in 36f564c, and the block order is now as follows:

== Chains:
id:1 priority:1 blocks:[b1 b19]
id:2 priority:2 blocks:[b2 b18 b5]
id:3 priority:2 blocks:[b4 b7 b8]
id:5 priority:1 blocks:[b6 b24]
id:16 priority:2 blocks:[b16 b17 b23]
id:14 priority:2 blocks:[b14 b15 b22]
id:12 priority:2 blocks:[b12 b13 b21]
id:10 priority:2 blocks:[b10 b11 b20]
id:8 priority:1 blocks:[b3 b9]
== BlockOrder:
[b1 b19 b2 b18 b5 b4 b7 b8 b6 b24 b16 b17 b23 b14 b15 b22 b12 b13 b21 b10 b11 b20 b3 b9]

image