cmd/compile: Fannkuch11 on AMD64 slow down 6% after removing assembler backend instruction reordering

golang / go

The Go programming language

https://go.dev

BSD 3-Clause "New" or "Revised" License

122.88k stars 17.52k forks source link

cmd/compile: Fannkuch11 on AMD64 slow down 6% after removing assembler backend instruction reordering #18977

Open cherrymui opened 7 years ago

cherrymui commented 7 years ago

CL https://go-review.googlesource.com/c/36205/ deletes assembler backend instruction reordering. This makes Fannkuch-11 on AMD64 slow down 6%. Figure out why and improve the compiler.

gopherbot commented 7 years ago

CL https://golang.org/cl/38431 mentions this issue.

josharian commented 7 years ago

Tip (6897030fe3de43bbed48adb72f21a6c2d00042cd) looks like a net win vs 1.8.1. Code size is smaller:

1.8.1: "".fannkuch t=1 size=844 args=0x10 locals=0x80
tip: "".fannkuch STEXT size=829 args=0x10 locals=0x80

Execution time is faster, at least on my machine:

name          old time/op  new time/op  delta
Fannkuch11-8   2.74s ± 2%   2.68s ± 2%  -2.32%  (p=0.000 n=20+19)

So whatever regression occurred has at the least been compensated for, if not fixed by CL 38431.

I do see one notable regression, though, which is in the number of panicindex calls. I've filed #20332 for that.

I'm going to close this as fixed. @cherrymui or @randall77, please reopen if you disagree.

navytux commented 7 years ago

@josharian, sad to bring bad news but here is how it looks on my machine:

$ benchstat go18.txt gotip.txt
name          old time/op  new time/op  delta
Fannkuch11-4   2.68s ± 0%   2.94s ± 0%  +9.88%  (p=0.000 n=20+19)

in other words it is ~ 10% regression.

$ go18 version
go version go1.8.1 linux/amd64

$ go version
go version devel +ee69c21747 Thu May 11 18:28:40 2017 +0000 linux/amd64

Linux deco 4.9.0-2-amd64 #1 SMP Debian 4.9.18-1 (2017-03-30) x86_64 GNU/Linux Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz

josharian commented 7 years ago

@navytux will you re-measure including CL 43291?

navytux commented 7 years ago

let me try

navytux commented 7 years ago

Strangely CL 43291 make things only worse:

$ benchstat gotip.txt gotip+cl43291.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.94s ± 0%   3.01s ± 0%  +2.29%  (p=0.000 n=19+19)

and I've doble-checked it is reproducible via recompiling (make.bash) and benchmarking tip then tip+CL then tip again then tip+CL again.

josharian commented 7 years ago

OK, re-opening.

One other things I see in tip vs 1.8.1 is that in 1.8.1, all the panicindex calls are at the end of the function, so jumps to them are always unlikely (forward jumps). In tip, they're scattered throughout. Not sure whether that's responsible, but I'll see if I can fix it and check whether it helps.

navytux commented 7 years ago

@josharian thanks for feedback and reopening.

gopherbot commented 7 years ago

CL https://golang.org/cl/43293 mentions this issue.

josharian commented 7 years ago

@navytux will you see what effect CL 43293 has on your machine? Thanks.

navytux commented 7 years ago

@josharian sure. Today's tip is go version devel +482da51803 Fri May 12 01:10:11 2017 +0000 linux/amd64 and it already includes CL 43291. We are using it as the base:

$ benchstat gotip.txt gotip+cl43293.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   3.01s ± 0%   2.84s ± 1%  -5.66%  (p=0.000 n=17+19)

in other words it is partial improvement and we are now back to 6% regression over go18:

$ benchstat go18.txt gotip+cl43293.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.68s ± 0%   2.84s ± 1%  +6.17%  (p=0.000 n=20+19)

navytux commented 7 years ago

For the reference here is how Fannkuch11 looks under perf stat -d:

 Performance counter stats for './go1.test-18 -test.bench Fannkuch11' (10 runs):

       4008,809815      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,06% )
               834      context-switches          #    0,208 K/sec                    ( +-  1,04% )
                14      cpu-migrations            #    0,003 K/sec                    ( +- 10,89% )
            79 865      page-faults               #    0,020 M/sec                    ( +-  0,01% )
    13 398 886 442      cycles                    #    3,342 GHz                      ( +-  0,06% )  (49,99%)
    29 180 837 581      instructions              #    2,18  insn per cycle           ( +-  0,06% )  (62,57%)
     5 974 722 394      branches                  # 1490,398 M/sec                    ( +-  0,06% )  (62,62%)
       162 930 510      branch-misses             #    2,73% of all branches          ( +-  0,09% )  (62,73%)
     6 602 849 508      L1-dcache-loads           # 1647,085 M/sec                    ( +-  0,11% )  (62,84%)
        73 480 588      L1-dcache-load-misses     #    1,11% of all L1-dcache hits    ( +-  0,24% )  (62,61%)
         3 604 844      LLC-loads                 #    0,899 M/sec                    ( +-  3,96% )  (49,89%)
           316 385      LLC-load-misses           #    8,78% of all LL-cache hits     ( +-  5,10% )  (49,85%)

       4,003714952 seconds time elapsed                                          ( +-  0,07% )

 Performance counter stats for './go1.test-tip -test.bench Fannkuch11' (10 runs):

       4259,019276      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,09% )
               822      context-switches          #    0,193 K/sec                    ( +-  1,31% )
                15      cpu-migrations            #    0,004 K/sec                    ( +-  7,89% )
            79 928      page-faults               #    0,019 M/sec                    ( +-  0,01% )
    14 228 793 806      cycles                    #    3,341 GHz                      ( +-  0,08% )  (49,92%)
    29 011 760 342      instructions              #    2,04  insn per cycle           ( +-  0,08% )  (62,54%)
     6 812 768 092      branches                  # 1599,610 M/sec                    ( +-  0,05% )  (62,56%)
       169 436 247      branch-misses             #    2,49% of all branches          ( +-  0,07% )  (62,68%)
     6 617 477 028      L1-dcache-loads           # 1553,756 M/sec                    ( +-  0,07% )  (62,84%)
        72 643 321      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,21% )  (62,77%)
         3 792 842      LLC-loads                 #    0,891 M/sec                    ( +-  8,00% )  (49,95%)
           324 234      LLC-load-misses           #    8,55% of all LL-cache hits     ( +-  6,96% )  (49,88%)

       4,255140742 seconds time elapsed                                          ( +-  0,09% )

 Performance counter stats for './go1.test-tip+cl43293 -test.bench Fannkuch11' (10 runs):

       4076,691573      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,18% )
               809      context-switches          #    0,198 K/sec                    ( +-  1,07% )
                14      cpu-migrations            #    0,003 K/sec                    ( +-  7,07% )
            79 926      page-faults               #    0,020 M/sec                    ( +-  0,01% )
    13 622 118 634      cycles                    #    3,341 GHz                      ( +-  0,10% )  (49,99%)
    28 659 892 607      instructions              #    2,10  insn per cycle           ( +-  0,11% )  (62,64%)
     6 412 523 320      branches                  # 1572,972 M/sec                    ( +-  0,07% )  (62,67%)
       175 939 383      branch-misses             #    2,74% of all branches          ( +-  0,09% )  (62,74%)
     6 634 137 889      L1-dcache-loads           # 1627,334 M/sec                    ( +-  0,10% )  (62,81%)
        72 791 290      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,52% )  (62,68%)
         3 613 857      LLC-loads                 #    0,886 M/sec                    ( +-  3,46% )  (49,88%)
           315 113      LLC-load-misses           #    8,72% of all LL-cache hits     ( +-  5,94% )  (49,86%)

       4,072909978 seconds time elapsed                                          ( +-  0,19% )

highlights:

amount of instructions executed is less for tip (99.4%) and tip+cl43293 (98.2%) compared to go18.
however instructions per CPU cycle is lower at tip (2,04 insn/cycle) and tip+cl43293 (2,10 insn/cycle) compared to go18 (2,18 insn/cycle). That means there are probably more stalls due to instructions mischeduling or something. That results for more cycles being spent even though there are less instructions to execute.
total amount of branches is higher at tip (114%) and tip+cl43293 (107%) compared to go18.
L1d loads is only a bit slightly more for tip (0.2%) and tip+cl43293 (0.4%) compared to go18. L1d miss-ratio is also almost the same amongst 3 versions.
LLC (last level cache - L3 in my case - that goes first before RAM) loads are more for tip (105.2%) and go almost back to go18 level at tip+cl43293 (100.2%). The variation in LLC-loads is more for tip over go18 and tip+cl43293. LLC mis ratio also is approximately back to go18 level at tip+cl43293.

To me all this suggests that after CL43293 or similar is applied the problem is in that:

there are more branches compared to go18, and
instructions are less well scheduled by compiler

Hope this helps a bit, Kirill

josharian commented 7 years ago

Thanks, Kirill, that's great. My guess was that CL 43293 was making the critical sections of code more compact, thus keeping more in cache; nice to have that confirmed. I'm working on making CL 43293 submittable now, and then will look again.

navytux commented 7 years ago

Josh thanks for feedback and patches. Offhand me thinks to get this regression gone we need to fix "Kinda surprisingly" in 39ce5907 (CL 38431) - in other words to understand the root cause and apply loop rotation only when it is more optimal judging on corresponding factors. Current state is, it seems, we are applying loop rotation always and it sometimes helps and sometimes harms.

@randall77, @TocarIP please see https://github.com/golang/go/issues/18977#issuecomment-301012785 for a bit of info about what is going on with Fannkuch11.

josharian commented 7 years ago

I'm going to dig a bit more, although I'll be mostly AFK for a while. In the meantime, Kirill, maybe you could confirm that disabling looprotate on your machine fixes the problem by adding "return" at the very beginning of func loopRotate in cmd/compile/internal/ssa/looprotate.go, to disable the pass.

navytux commented 7 years ago

Josh, thanks for feedback. Sure, staying at the same tip = 482da518 here is what we have:

going to CL 43293.2 (second patch in that CL) birings more speedup:

$ benchstat gotip+cl43293.txt gotip+cl43293.2.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.84s ± 1%   2.78s ± 0%  -2.09%  (p=0.000 n=19+16)

 Performance counter stats for './go1.test-tip+cl43293.2 -test.bench Fannkuch11' (10 runs):

       4029,642462      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,11% )
               814      context-switches          #    0,202 K/sec                    ( +-  1,00% )
                11      cpu-migrations            #    0,003 K/sec                    ( +-  8,11% )
             79921      page-faults               #    0,020 M/sec                    ( +-  0,01% )
       13482684046      cycles                    #    3,346 GHz                      ( +-  0,10% )  (50,05%)
       28731337641      instructions              #    2,13  insn per cycle           ( +-  0,17% )  (62,66%)
        6429703132      branches                  # 1595,601 M/sec                    ( +-  0,14% )  (62,72%)
         168958851      branch-misses             #    2,63% of all branches          ( +-  0,18% )  (62,74%)
        6634192402      L1-dcache-loads           # 1646,348 M/sec                    ( +-  0,14% )  (62,80%)
          73264431      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,66% )  (62,57%)
           4233228      LLC-loads                 #    1,051 M/sec                    ( +-  3,12% )  (49,84%)
            376635      LLC-load-misses           #    8,90% of all LL-cache hits     ( +-  9,29% )  (49,89%)

       4,026348487 seconds time elapsed                                          ( +-  0,11% )

some comments: there is a bit more (0.2%) instructions but insn/cycle grows from 2.10 to 2.13 and as the result total cpu cycles is ~ 1% less (which correllates with totall run time reported by perf and probably includes preparatory program actions -test.bench itself excludes from accounting). The amount of branches is too 0.2% more, however, if I read counters correctly, not all new instructions are branches as δ(insn) = 71445034 but δ(branches) = 17179812 (4x less).

disabling loopRotate over CL 43293.2 makes it slower:

$ benchstat gotip+cl43293.2.txt gotip+cl43293.2-nolooprot.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.78s ± 0%   2.87s ± 0%  +2.93%  (p=0.000 n=16+15)

 Performance counter stats for './go1.test-tip+cl43293.2-nolooprot -test.bench Fannkuch11' (10 runs):

       4103,320043      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,12% )
               822      context-switches          #    0,200 K/sec                    ( +-  0,82% )
                12      cpu-migrations            #    0,003 K/sec                    ( +-  9,02% )
             79925      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       13711752541      cycles                    #    3,342 GHz                      ( +-  0,06% )  (49,99%)
       29498486031      instructions              #    2,15  insn per cycle           ( +-  0,07% )  (62,58%)
        7212426730      branches                  # 1757,705 M/sec                    ( +-  0,08% )  (62,64%)
         174290760      branch-misses             #    2,42% of all branches          ( +-  0,04% )  (62,73%)
        6609550227      L1-dcache-loads           # 1610,781 M/sec                    ( +-  0,07% )  (62,79%)
          73869349      L1-dcache-load-misses     #    1,12% of all L1-dcache hits    ( +-  0,28% )  (62,67%)
           3816460      LLC-loads                 #    0,930 M/sec                    ( +-  2,51% )  (49,91%)
            429992      LLC-load-misses           #   11,27% of all LL-cache hits     ( +-  8,17% )  (49,89%)

       4,098764012 seconds time elapsed                                          ( +-  0,12% )

insn / cycle grows from 2.13 to 2.15 (good) but there are 2.7% more total instructions so overal cycles grow by 1.7% (correlates wrt total run time reported by perf). There is 12% more branches and looks like they are the only visible factor that contribute to instruction growth: δ(branches) / δ(insn) ≈ 1.

but disabling loopRotate over just tip makes it faster:

$ benchstat gotip.txt gotip-nolooprot.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   3.02s ± 1%   2.98s ± 0%  -1.29%  (p=0.000 n=19+16)

 Performance counter stats for './go1.test-tip-nolooprot -test.bench Fannkuch11' (10 runs):

       4215,739102      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,08% )
               827      context-switches          #    0,196 K/sec                    ( +-  0,81% )
                13      cpu-migrations            #    0,003 K/sec                    ( +- 10,58% )
             79932      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       14108860946      cycles                    #    3,347 GHz                      ( +-  0,08% )  (49,91%)
       29508075225      instructions              #    2,09  insn per cycle           ( +-  0,14% )  (62,49%)
        7220934286      branches                  # 1712,851 M/sec                    ( +-  0,09% )  (62,58%)
         174301863      branch-misses             #    2,41% of all branches          ( +-  0,06% )  (62,77%)
        6621367485      L1-dcache-loads           # 1570,630 M/sec                    ( +-  0,07% )  (62,86%)
          73684219      L1-dcache-load-misses     #    1,11% of all L1-dcache hits    ( +-  0,45% )  (62,71%)
           3886187      LLC-loads                 #    0,922 M/sec                    ( +-  3,16% )  (49,91%)
            369954      LLC-load-misses           #    9,52% of all LL-cache hits     ( +-  6,86% )  (49,82%)

       4,211567986 seconds time elapsed                                          ( +-  0,08% )

Here the number of instructions increases by 1.7% which is mostly due to 6% N(branches) increase (δ(branches) / δ(insn) ≈ 0.82) however the insn/cycle ratio also grows from 2.04 to 2.09 which overcompensate all that and resulting N(cycles) decrease by ~ 1%.

My observation:

above everywhere disabling loop rotation makes insn/cycle ↑ but go18 2.18 level is never reached; closest was tip+cl43293.2-nolooprot (2.15 vs 2.18 for go18).
the task here seems to be to optimize target function (Ncycles) via variating Ninstructions⋅(insn/cycle). Code transformations which minimize Ninstructions (loop rotation, etc) explicitly change only one part of the formula, but I'm not sure insn/cycle is modelled or taken into account anywhere.

( I've rebenchmarked previous runs from todays morning and reverified all timings several times )

navytux commented 7 years ago

I'm not an x86 guy (probably @TocarIP could help here) but I imagined: if there are several execution units inside cpu with somehow different properties, units overal load could depend on the order of instructions in instruction stream. With this thinking in mind I googled for "x86 add nop speedup" and found e.g.

https://research.google.com/pubs/pub37077.html https://stackoverflow.com/questions/17896714/why-would-introducing-useless-mov-instructions-speed-up-a-tight-loop-in-x86-64-a/17906589 ...

also if we simplify and assume data loads/stores are always hitting cache what affects insn/cycle is

N(branch-misses) / N(instructions)

(note not / N(branches)). The stackoverflow link and MAO article above contains some information on what factors could affect branch prediction and other loop related things.

Coincidence or not there the talking is usually too about 5% - 7% speedup or slowdown.

TocarIP commented 7 years ago

I've played a bit with https://github.com/andikleen/pmu-tools It suggests that this bench is bound by branch misprediction in various forms: toplev.py -l=3 -d ./go1.test -test.run=NONE -test.bench=Fan:

FE             Frontend_Bound.Frontend_Latency.Branch_Resteers:         17.59 % Clocks [ 16.66%]
        This metric represents cycles fraction the CPU was stalled
        due to Branch Resteers...
        Sampling events:  br_misp_retired.all_branches
BAD            Bad_Speculation:                                         22.39 % Slots [ 17.00%]
BAD            Bad_Speculation.Branch_Mispredicts:                      22.21 % Slots [ 17.00%] BN
        This metric represents slots fraction the CPU has wasted due
        to Branch Misprediction...
        Sampling events:  br_misp_retired.all_branches
               MUX:                                                     16.62 %

Running ocperf.py record -e br_misp_retired.all_branches_pebs Shows that guilty branches are caused by index checks:

884511   75ef95:       4d 39 e7                cmp    %r12,%r15
884512   75ef98:       73 33                   jae    75efcd <_/localdisk/itocar/golang/test/bench/go1.fannkuch+0x1ed>

and code layout:

75efcb:       eb 0e                   jmp    75efdb <_/localdisk/itocar/golang/test/bench/go1.fannkuch+0x1fb>
884526   75efcd:       e8 9e c8 cc ff          callq  42b870 <runtime.panicindex>

Note that in this case branch is unconditional! With 43293 at least all problems are with two conditional branches. Playing with MaxLoopPad in cmd/internal/obj/x86/asm6 produce no improvements, so the problem is not in branch targets alignment. I'd like to experiment with branch source alignment, but this isn't as simple as changing 1 constant, and I will be away for a ~week.

josharian commented 7 years ago

I haven't had a chance to dig into this excellent data, but see also #20355 for something that seems definitely wrong with looprotate. It seems that looprotate is doing both good work and bad for the benchmark.

josharian commented 7 years ago

Also #20356.

gopherbot commented 7 years ago

CL https://golang.org/cl/43491 mentions this issue.

josharian commented 7 years ago

@navytux sorry to ask this so many times, but at your leisure, would you give CL 43491 a try (atop CL 43293) on your machine? Thanks!

navytux commented 7 years ago

@josharian please don't be sorry - I think we are all interested to get this regression understood and fixed, and on my side thanks a lot for attacking it with your work.

First - not so good news. Today's tip is 1d44c4e378 and just cherry-picking CL 43293.5 on top of it is not making so much good as it was making for tip=482da51 in https://github.com/golang/go/issues/18977#issuecomment-301150387:

$ benchstat tip.txt tip+cl43293.5.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   3.02s ± 1%   2.90s ± 1%  -3.79%  (p=0.000 n=20+20)

To make sure this is really so I've rebenchmarked once again 482da51 results without and with CL 43293.5:

$ benchstat 482da51.txt 482da51+cl43293.5 
name          old time/op  new time/op  delta
Fannkuch11-4   3.02s ± 0%   2.79s ± 1%  -7.52%  (p=0.000 n=19+20)

it bisects reliably to 29e88d51 (CL 43310; /cc @aclements; dddd1dd4=29e88d51~):

$ benchstat dddd1dd4+cl43293.5 29e88d51+cl43293.5 
name          old time/op  new time/op  delta
Fannkuch11-4   2.79s ± 1%   2.96s ± 0%  +6.29%  (p=0.000 n=20+18)

 Performance counter stats for './go1.test-dddd1dd4+cl43293.5 -test.bench Fannkuch11' (10 runs):

       4021,889153      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,09% )
               792      context-switches          #    0,197 K/sec                    ( +-  1,13% )
                12      cpu-migrations            #    0,003 K/sec                    ( +-  9,71% )
             79922      page-faults               #    0,020 M/sec                    ( +-  0,00% )
       13435428885      cycles                    #    3,341 GHz                      ( +-  0,08% )  (49,99%)
       28714158450      instructions              #    2,14  insn per cycle           ( +-  0,09% )  (62,59%)
        6420251768      branches                  # 1596,327 M/sec                    ( +-  0,06% )  (62,65%)
         168561553      branch-misses             #    2,63% of all branches          ( +-  0,05% )  (62,73%)
        6625031295      L1-dcache-loads           # 1647,244 M/sec                    ( +-  0,06% )  (62,80%)
          73183843      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,28% )  (62,68%)
           3249203      LLC-loads                 #    0,808 M/sec                    ( +-  2,81% )  (49,89%)
            293228      LLC-load-misses           #    9,02% of all LL-cache hits     ( +-  5,94% )  (49,89%)

       4,017762542 seconds time elapsed                                          ( +-  0,09% )

 Performance counter stats for './go1.test-29e88d51+cl43293.5 -test.bench Fannkuch11' (10 runs):

       4203,344343      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,15% )
               816      context-switches          #    0,194 K/sec                    ( +-  1,16% )
                14      cpu-migrations            #    0,003 K/sec                    ( +-  9,36% )
             79925      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       14050595666      cycles                    #    3,343 GHz                      ( +-  0,10% )  (49,97%)
       28679710375      instructions              #    2,04  insn per cycle           ( +-  0,13% )  (62,58%)
        6415984175      branches                  # 1526,400 M/sec                    ( +-  0,11% )  (62,64%)
         193672052      branch-misses             #    3,02% of all branches          ( +-  0,06% )  (62,74%)
        6622698993      L1-dcache-loads           # 1575,579 M/sec                    ( +-  0,10% )  (62,81%)
          73076770      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,62% )  (62,61%)
           3249848      LLC-loads                 #    0,773 M/sec                    ( +-  2,15% )  (49,87%)
            311298      LLC-load-misses           #    9,58% of all LL-cache hits     ( +-  4,13% )  (49,86%)

       4,198850562 seconds time elapsed                                          ( +-  0,14% )

eventhough Austin's change only adds debugging print on runtime assert it somehow results in insn/cycle ruined from 2.14 to 2.04. There is ~ the same number of total branches, but branch misses differ by 15% and that's seems to be the reason for insn/cycle change probably due to branch predictor depending on branch source/destination address (my findings in https://github.com/golang/go/issues/18977#issuecomment-301170941 contain some info on this topic).

NOTE also that resultant 2.96s is varying later on commits after 29e88d51, due to probably once again changed addresses, and goes to 2.90s on base tip=1d44c4e.

NOTE unpatched (without CL 43293.5) performance stays almost the same 3.02s for all tried commits in between 482da518..1d44c4e3 .

Now let's got back to trying CL 43491.2 on top of CL 43293.5 over tip=1d44c4e3:

$ benchstat tip+cl43293.5 tip+cl43293.5+cl43491.2 
name          old time/op  new time/op  delta
Fannkuch11-4   2.90s ± 1%   2.96s ± 0%  +2.05%  (p=0.000 n=20+18)

 Performance counter stats for './go1.test-tip+cl43293.5+cl43491.2 -test.bench Fannkuch11' (10 runs):

       4210,387266      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,04% )
               823      context-switches          #    0,196 K/sec                    ( +-  1,00% )
                15      cpu-migrations            #    0,004 K/sec                    ( +-  5,54% )
             79934      page-faults               #    0,019 M/sec                    ( +-  0,00% )
       14055075868      cycles                    #    3,338 GHz                      ( +-  0,07% )  (50,02%)
       28674740454      instructions              #    2,04  insn per cycle           ( +-  0,19% )  (62,65%)
        6391778084      branches                  # 1518,097 M/sec                    ( +-  0,11% )  (62,72%)
         179951909      branch-misses             #    2,82% of all branches          ( +-  0,06% )  (62,75%)
        6624416028      L1-dcache-loads           # 1573,351 M/sec                    ( +-  0,09% )  (62,80%)
          73049997      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,37% )  (62,58%)
           3103098      LLC-loads                 #    0,737 M/sec                    ( +-  2,46% )  (49,88%)
            302447      LLC-load-misses           #    9,75% of all LL-cache hits     ( +-  5,02% )  (49,84%)

       4,205040158 seconds time elapsed                                          ( +-  0,05% )

However from this I would not judge that CL 43491 makes things worse as the timings looks to be very similar to regression after Austin's patch with insn/cycle dropping to exactly the same 2.04.

For the reference here is CL 43293.5 + CL 43491.2 on top of 482da51:

$ benchstat 482da51+cl43293.5 482da51+cl43293.5+cl43491.2 
name          old time/op  new time/op  delta
Fannkuch11-4   2.79s ± 1%   2.97s ± 1%  +6.33%  (p=0.000 n=20+18)

 Performance counter stats for './go1.test-482da51+cl43293.5+cl43491.2 -test.bench Fannkuch11' (10 runs):

       4212,850293      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,12% )
               826      context-switches          #    0,196 K/sec                    ( +-  0,82% )
                12      cpu-migrations            #    0,003 K/sec                    ( +-  9,89% )
             79917      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       14057709811      cycles                    #    3,337 GHz                      ( +-  0,08% )  (50,00%)
       28665317018      instructions              #    2,04  insn per cycle           ( +-  0,10% )  (62,61%)
        6398312057      branches                  # 1518,761 M/sec                    ( +-  0,08% )  (62,67%)
         179810476      branch-misses             #    2,81% of all branches          ( +-  0,05% )  (62,76%)
        6632395958      L1-dcache-loads           # 1574,325 M/sec                    ( +-  0,12% )  (62,81%)
          72718580      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,54% )  (62,65%)
           3950724      LLC-loads                 #    0,938 M/sec                    ( +-  3,43% )  (49,85%)
            358681      LLC-load-misses           #    9,08% of all LL-cache hits     ( +-  7,62% )  (49,86%)

       4,208376700 seconds time elapsed                                          ( +-  0,12% )

looks to be similar branch misprediction story once again.

All this makes me think: as the most major factor for now seems to be branch misprediction I would use dddd1dd4 as the base point and test patches on top of that with manually adding e.g. unrelated debug print from 29e88d51 . Once we understand well enough what makes branch predictor happy or otherwise we could maybe adjust compiler accordingly.

Also my intuition says that it is possible not only to catch up with go18, but to do better once related performance laws are discovered and understood and the compiler is taught to respect it.

Hope this helps a bit, Kirill

josharian commented 7 years ago

Very interesting! I have a slightly different set of responses to that data, in combination with reading (the relevant parts of) Agner Fog's paper on branch prediction; see also the beginning of this mechanical sympathy post.

cc @quentinmit because some of the comments below, if correct, are relevant to any performance benchmarking that includes microbenchmarks.

According to those docs, recent amd64 chips use hashing for global branch prediction, including the IP of the jump. Moving the code around a little bit could easily cause a hash collision where there was none before. And if my CPU does some other branch prediction (loop counters? different/bigger tables?) where yours does not, that would explain why you see big performance swings and I don't. And the fact that the performance regression for you seems to be binary (on/off) suggests a single hot (mis-)prediction being toggled.

If this variability is caused by hash collisions coming and going, I don't see much to learn from it that we can teach the compiler. And Agner Fog doesn't offer much to go off of either. In the near term, I'd say we should instead continue to work on clear wins: improving code layout for static branch prediction, making code more compact, reducing the number of jumps.

The two CLs I've sent so far fall (I think) in that category, and I believe there's more we can do fairly easily, like making the layout pass loop-aware and making trim more aggressive. (Or maybe use a queue instead of a stack for posdegree/negdegree in layout? Or...? Lots of possibilities to explore.)

The question remains: How do we interpret microbenchmarks and fannkuch in particular in the face of this variability? A few thoughts:

First, just having a possible diagnosis (branch prediction hash collision) is super exciting. Many of us have long been frustrated by unrelated changes causing substantial performance changes in microbenchmarks (e.g. #7920 among many others); Dmitry's performance dashboard was noisy and spammy and ultimately disabled. Various ideas have been suggested for why this is, mostly commonly alignment. Finally having a theory with some data behind it is really delightful.
If hashing of the IP is involved, we (the non-royal "we", I fear) should be able to hack up the linker to accept a different base address for laying out code. That should leave everything else the same (code compactness, distance of jumps and calls, etc) but vary the IP of jumps. We could then test the same microbenchmark with a variety of base addresses and combine the results. If this provides more stable results, we could build proper tooling around it.
I wonder whether 1.8.1 is getting lucky on fannkuch, just as some recent commits have gotten unlucky. Short of the linker hacking suggested above, maybe you could also add some printlns in various (unexecuted) places in the runtime in 1.8.1 to see whether you could provoke it to regress. If 1.8.1 is just getting lucky, maybe we're actually already done, at least for 1.9? (I see this area as a possible opportunity for good performance wins in 1.10 regardless.)
Obviously, we should worry more about macro benchmarks and average results on suites of microbenchmarks than individual microbenchmarks, unless insofar as the individual microbenchmarks have something to teach us (as this one has!). FWIW, CL 43491 speeds up the compiler (my favorite macro benchmark); wall time geomean impact of -0.72%.

navytux commented 7 years ago

Josh thanks for feedback and your observations. I agree that having undocumented hash function mixed into branch prediction makes it harder to reason about. However as you say we can try to to see how much this affects performance. Let me show you something before we begin:

aaa

Details: as you suggest I've cocked a hacky way to check Fannkuch11 with shifted entry address for fannkuch() for go18 and various tips:

https://lab.nexedi.com/kirr/go/blob/dd8c9753/test/bench/go1/fannkuch-entryshifted

it hooks something like

kirr@deco:~/src/tools/go/go$ cat src/testing/zzz.go 
package testing

// shifting by ~ 3·16
func zzz() {
        println()
        println()
        println()
}

--- a/src/testing/testing.go
+++ b/src/testing/testing.go
@@ -499,6 +499,7 @@ func (c *common) Fail() {
        defer c.mu.Unlock()
        // c.done needs to be locked to synchronize checks to c.done in parent tests.
        if c.done {
+               zzz()
                panic("Fail in goroutine after " + c.name + " has completed")
        }
        c.failed = true

into testing package which gets linked into go1 benchmark before benchmark objects and as the result entry point of fannkuch() changes, e.g.:

@0
  fannkuch_test.go:12   0x75fdd0                64488b0c25f8ffffff              MOVQ FS:0xfffffff8, CX
@1
  fannkuch_test.go:12   0x75fdf0                64488b0c25f8ffffff              MOVQ FS:0xfffffff8, CX
@2
  fannkuch_test.go:12   0x75fe00                64488b0c25f8ffffff              MOVQ FS:0xfffffff8, CX
@3
  fannkuch_test.go:12   0x75fe10                64488b0c25f8ffffff              MOVQ FS:0xfffffff8, CX
@4
  fannkuch_test.go:12   0x75fe20                64488b0c25f8ffffff              MOVQ FS:0xfffffff8, CX
...

This way 3 go versions were tested: go18 (go1.8.1-3-gdd8c975372), gotip (4b0d74f8 ; already has CL 43293 applied) and gotip + CL 43491.3.

From the picture above it could be seen that go18 still clearly wins: eventhough there are some seldom overlaps with tip, if we compare good-luck to good-luck and bad-luck to bad-luck go18 is faster. It could be also seen that CL 43491.3 makes things unfortunately worse, at least on my cpu.

Resume: there, it seems, is hash(IP) effect and we can measure its range for cases when unrelated changes are made and whole function entry changes. However that's not all and rest is all intra-function (Fannkuch11 benchmark runs only one fannkuch function) for which compiler can directly control code-layout locally.

Generally agreeing that more high-level issues could bring more goods (like e.g. teaching compiler to staically prove and eliminate unnecessary bounds checks) from the above it is hard for me to say that we are done with this regression for Go1.9 .

However I confirm that CL 43491 overally speedups the compiler:

$ benchstat -geomean 4b0d74f89d.txt 4b0d74f89d+cl43491.3.txt 
name        old time/op     new time/op     delta
Template        174ms ± 2%      171ms ± 2%  -1.27%  (p=0.000 n=38+37)
Unicode        80.4ms ± 4%     79.9ms ± 3%  -0.60%  (p=0.039 n=39+38)
GoTypes         565ms ± 1%      561ms ± 2%  -0.82%  (p=0.000 n=37+40)
Compiler        2.65s ± 1%      2.62s ± 1%  -1.13%  (p=0.000 n=37+40)
SSA             5.31s ± 1%      5.24s ± 1%  -1.47%  (p=0.000 n=40+38)
Flate           112ms ± 2%      111ms ± 2%  -1.21%  (p=0.000 n=39+37)
GoParser        141ms ± 2%      140ms ± 2%  -0.98%  (p=0.000 n=40+40)
Reflect         350ms ± 1%      345ms ± 2%  -1.17%  (p=0.000 n=34+38)
Tar            99.0ms ± 3%     97.9ms ± 2%  -1.10%  (p=0.000 n=40+39)
XML             196ms ± 2%      194ms ± 2%  -1.09%  (p=0.000 n=39+38)
[Geo mean]      322ms           318ms       -1.08%

name        old user-ns/op  new user-ns/op  delta
Template         221M ± 9%       219M ±10%    ~     (p=0.158 n=39+40)
Unicode          110M ± 9%       108M ±15%    ~     (p=0.334 n=38+39)
GoTypes          737M ± 5%       737M ± 3%    ~     (p=0.537 n=40+39)
Compiler        3.48G ± 3%      3.46G ± 2%  -0.59%  (p=0.017 n=40+40)
SSA             7.52G ± 2%      7.40G ± 2%  -1.53%  (p=0.000 n=37+39)
Flate            143M ± 8%       141M ± 9%  -1.89%  (p=0.023 n=40+40)
GoParser         182M ± 8%       180M ± 9%    ~     (p=0.072 n=40+40)
Reflect          448M ± 3%       439M ± 4%  -2.19%  (p=0.000 n=40+40)
Tar              127M ±12%       124M ±10%  -2.20%  (p=0.019 n=40+40)
XML              250M ± 6%       246M ± 4%  -1.54%  (p=0.002 n=40+39)
[Geo mean]       421M            415M       -1.40%

Speaking about particular regression in this issue I've made some observations which are imho better to share. First of all @TocarIP thanks for mentioning pmu-tools which is a way to observe at least somehow what happens inside cpu. On my system it is too branch mispredictions. While profiling with ocperf.py record -e br_misp_retired.all_branches:pp I've noticed:

go18 has different code structure for innermost for i, j := 1, k-1; i < j; i, j = i+1, j-1 loop. In particular before entering the loop it checks loop condition i < j and if not skips loop completely. The go19 version does unconditional jump to after loop, setup registers there and then jumps unconditionally inside loop to condition check entry. (btw why not setup registers before loop and directly jump to loop entry ?)
in go18 there are two jumps that are regularly mispredicted - pre-loop-entry and loop jmp-back. In go19 the only major jump that is mispredicted is loop jmp-back. That could be maybe making the difference (in go19 there is no pre-loop conditional check jump at all so in profiler go18 looks like a molecule with 2 major spectral lines while go19 with only 1).
in go19 the inner loop conditional jump instruction starts at 0x...f address and occupies 2 bytes. That means it will overlap with 16 bytes line boundary which I think even Agner Fog advices against. @TocarIP could we please check somehow this is important or not?

Below are relevant screenshots for go18 and go19 with Josh's CLs. The inner loop is selected with loop arrow. Left column is ratio of mis-predicted events.

aaa aaa

Thanks, Kirill

josharian commented 7 years ago

Still digesting the above.

In parallel, I just found that the linker already supports changing the text address:

$ go test -bench=. -ldflags="-T=0x20000"
$ go test -bench=. -ldflags="-T=0x21000"

Can you check whether this has a similar effect to your fuzzer?

josharian commented 7 years ago

@navytux beautiful.

That means it will overlap with 16 bytes line boundary which I think even Agner Fog advices against.

I'll try to hack in an experimental fix for this now. (Not sure how easy/hard it will be.) Do you have a pointer to where in the docs Agner Fog discusses this, so that I can get the scope of my tweak correct?

josharian commented 7 years ago

@navytux at the top of asm6.go is:

const (
    // Loop alignment constants:
    // want to align loop entry to LoopAlign-byte boundary,
    // and willing to insert at most MaxLoopPad bytes of NOP to do so.
    // We define a loop entry as the target of a backward jump.
    //
    // gcc uses MaxLoopPad = 10 for its 'generic x86-64' config,
    // and it aligns all jump targets, not just backward jump targets.
    //
    // As of 6/1/2012, the effect of setting MaxLoopPad = 10 here
    // is very slight but negative, so the alignment is disabled by
    // setting MaxLoopPad = 0. The code is here for reference and
    // for future experiments.
    //
    LoopAlign  = 16
    MaxLoopPad = 0
)

This indicates the goal is to align the target of the conditional jumps, not the jumps themselves. If that's right, I think you could easily experiment with this--setting the value of MaxLoopPad to 1 should push elements at 0x...f up by a byte. If that's wrong, I could try fixing it. If it makes a difference, and the heuristic is wrong (just want to avoid splitting an instruction in two), we could also make the heuristic better.

One other thought. It seems like there ought to be tools to allow you to insert NOPs into a binary, automatically adjusting all jump targets etc. If so, that would be another quick way to test the theory.

josharian commented 7 years ago

go18 has different code structure for innermost for i, j := 1, k-1; i < j; i, j = i+1, j-1 loop.

Right. Keith mentioned that in the commit message for CL 38431:

This is a bit different than the peeling that the old obj
library did in that we don't duplicate the loop exit test.
We just jump to the test.

in go18 there are two jumps that are regularly mispredicted - pre-loop-entry and loop jmp-back. In go19 the only major jump that is mispredicted is loop jmp-back. That could be maybe making the difference

I'm not sure why going from 2 mispredicted jumps to 1 mispredicted jump would make things worse in and of itself. Anyway, the two mispredicted tests in 1.8 appear to be the same condition (i<j), with peeling.

With tip I see the commonly mispredicted jump laid out as a backward jump (jl) on my machine, as in your tip screenshot. With tip + CL 43491, I see it laid out as a forward jump (jge).

But I doubt it matters. I suspect that this branch is just hard to predict well. And if so, shuffling layout without adding 1.8-style peeling or something like it won't get us back to 1.8 levels.

You could try to confirm this thesis by manually reproducing the loop peeling from 1.8: Add if i >= j { ... } around the inner loop in the fannkuch function and see what affect it has.

If as it appears this is the decisive difference, we might just want to accept the regression for now and attempt to continue to improve things by other means. Or go ahead and reproduce the 1.8 loop peeling. That'd be a judgment call for @randall77. (Regardless, this has been a fruitful investigation.)

(Meanwhile, I'm trying yet again sigh to get my new UDOO x86 up and running, in the hopes that its CPU will allow me to locally reproduce some of this.)

navytux commented 7 years ago

Josh thanks for feedback. I'll try to try -ldflags=-T... in some time.

About branch targets I think @TocarIP already tried to play with MaxLoopPad in https://github.com/golang/go/issues/18977#issuecomment-301203346 however I decided to also check myself.

I'm taking tip = 4b0d74f8 as the base. If I just adjust MaxLoopPad whole program addresses become different and due to the hash(IP) effect everything is disrupted. Likewise if I adjust MaxLoopPad for whole fannkuch function only due to loops previous to inner loop changing addresses everything else is also disrupted. So I cooked the following patch to carefully-hacky adjust address of jmp instruction that was sitting at 0x...f by 1 byte :

diff --git a/src/cmd/internal/obj/x86/asm6.go b/src/cmd/internal/obj/x86/asm6.go
index bcf9318e2e..8b09bd58df 100644
--- a/src/cmd/internal/obj/x86/asm6.go
+++ b/src/cmd/internal/obj/x86/asm6.go
@@ -62,7 +62,7 @@ const (
        // for future experiments.
        //
        LoopAlign  = 16
-       MaxLoopPad = 0
+       MaxLoopPad = 6
 )

 type Optab struct {
@@ -1882,7 +1882,8 @@ func span6(ctxt *obj.Link, s *obj.LSym, newprog obj.ProgAlloc) {
                                // pad with NOPs
                                v := -c & (LoopAlign - 1)

-                               if v <= MaxLoopPad {
+                               if v <= MaxLoopPad && s.Name == `"".fannkuch` && c > 400 {
+                                       v = 1
                                        s.Grow(int64(c) + int64(v))
                                        fillnop(s.P[c:], int(v))
                                        c += v

this change shifts whole internal loop (and possible everything else after). Resulting program change is:

...
@@ -108,144 +108,151 @@ func fannkuch(n int) int {
   0x75ff62             4c8b2b                  MOVQ 0(BX), R13         
   0x75ff65             4531f6                  XORL R14, R14           
                        for {         // k!=0 ==> k>0
-  0x75ff68             eb6c                    JMP 0x75ffd6            
+  0x75ff68             eb6d                    JMP 0x75ffd7            
+  0x75ff6a             90                      NOPL                    
                                        perm[i], perm[j] = perm[j], perm[i]
-  0x75ff6a             4c39e7                  CMPQ R12, DI            
-  0x75ff6d             0f836f010000            JAE 0x7600e2            
-  0x75ff73             4d8b0cfb                MOVQ 0(R11)(DI*8), R9   
-  0x75ff77             4d39e7                  CMPQ R12, R15           
-  0x75ff7a             0f8362010000            JAE 0x7600e2            
-  0x75ff80             4b8b0cfb                MOVQ 0(R11)(R15*8), CX  
-  0x75ff84             49890cfb                MOVQ CX, 0(R11)(DI*8)   
-  0x75ff88             4f890cfb                MOVQ R9, 0(R11)(R15*8)  
+  0x75ff6b             4c39e7                  CMPQ R12, DI            
+  0x75ff6e             0f8372010000            JAE 0x7600e6            
+  0x75ff74             4d8b0cfb                MOVQ 0(R11)(DI*8), R9   
+  0x75ff78             4d39e7                  CMPQ R12, R15           
+  0x75ff7b             0f8365010000            JAE 0x7600e6            
+  0x75ff81             4b8b0cfb                MOVQ 0(R11)(R15*8), CX  
+  0x75ff85             49890cfb                MOVQ CX, 0(R11)(DI*8)   
+  0x75ff89             4f890cfb                MOVQ R9, 0(R11)(R15*8)  
                                for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
-  0x75ff8c             488d4f01                LEAQ 0x1(DI), CX        
-  0x75ff90             49ffcf                  DECQ R15                
-  0x75ff93             488b7c2440              MOVQ 0x40(SP), DI       
-  0x75ff98             4c8b4c2450              MOVQ 0x50(SP), R9       
-  0x75ff9d             48894c2448              MOVQ CX, 0x48(SP)       
-  0x75ffa2             488b4c2470              MOVQ 0x70(SP), CX       
-  0x75ffa7             488b7c2448              MOVQ 0x48(SP), DI       
-  0x75ffac             4c39ff                  CMPQ R15, DI            
-  0x75ffaf             7cb9                    JL 0x75ff6a             
+  0x75ff8d             488d4f01                LEAQ 0x1(DI), CX        
+  0x75ff91             49ffcf                  DECQ R15                
+  0x75ff94             488b7c2440              MOVQ 0x40(SP), DI       
+  0x75ff99             4c8b4c2450              MOVQ 0x50(SP), R9       
+  0x75ff9e             48894c2448              MOVQ CX, 0x48(SP)       
+  0x75ffa3             488b4c2470              MOVQ 0x70(SP), CX       
+  0x75ffa8             488b7c2448              MOVQ 0x48(SP), DI       
+  0x75ffad             4c39ff                  CMPQ R15, DI            
+  0x75ffb0             7cb9                    JL 0x75ff6b             
...

and it gives slight speedup:

$ benchstat tip.txt tip+mlp6fk400x1.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.79s ± 1%   2.77s ± 1%  -0.81%  (p=0.000 n=18+19)

I could not find a program to insert instructions directly into binary: after quick glance neither ht [1], nor radare2 [2] seem to be able to actually insert (not overwrite) instructions (but maybe I did not looked hard enough).

If analyzing properly I would try adjusting inner loop body and / or only last jump + look into how other jumps shift and try to adjust them there too. However my time for this issue is currently up and I'm not sure if I could spent something non-trivial in the upcoming days here (sorry about that).

[1] http://hte.sourceforge.net/ [2] https://www.radare.org/r/

You could try to confirm this thesis by manually reproducing the loop peeling from 1.8: Add if i >= j { ... } around the inner loop in the fannkuch function and see what affect it has

without and with the following patch

--- a/test/bench/go1/fannkuch_test.go
+++ b/test/bench/go1/fannkuch_test.go
@@ -41,8 +41,11 @@ func fannkuch(n int) int {
                        }
                        k := perm1[0] // cache perm[0] in k
                        for {         // k!=0 ==> k>0
-                               for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
-                                       perm[i], perm[j] = perm[j], perm[i]
+                               if i, j := 1, k-1; i < j {
+                                       for ; i < j; i, j = i+1, j-1 {
+                                       //for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
+                                               perm[i], perm[j] = perm[j], perm[i]
+                                       }
                                }
                                flips++
                                // Now exchange k (caching perm[0]) and perm[k]... with care!

we have

(g.env) kirr@deco:~/src/tools/go/go/test/bench/go1/loopalign$ benchstat tip.txt tip+peel.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.79s ± 1%   2.80s ± 1%  +0.24%  (p=0.003 n=18+19)

However the code is not exactly the same as in go18 because "peeling" still jumps to inside loop entry:

@@ -100,144 +100,149 @@ pc392:
        MOVQ       R9, didpr-40(SP)
        MOVQ       (BX), R13              // fannkuch_test.go:42
        MOVL       $0, R14
-       JMP        pc518                  // fannkuch_test.go:43
-pc410:
-       CMPQ       DI, R12                // fannkuch_test.go:45
-       JCC        pc786
+       JMP        pc543                  // fannkuch_test.go:43
+pc413:
+       MOVQ       AX, _autotmp_40-32(SP)
+       MOVL       $1, AX                                               // <~~ peeling jumps inside loop
+       MOVQ       AX, i-48(SP)
+       MOVQ       _autotmp_40-32(SP), AX
+       JMP        pc496                  // fannkuch_test.go:45

+pc435:                                                                 // <-- loop
+       CMPQ       DI, R12                // fannkuch_test.go:47
+       JCC        pc801
        MOVQ       (R11)(DI*8), R9
        CMPQ       R15, R12
-       JCC        pc786
+       JCC        pc801
        MOVQ       (R11)(R15*8), CX
        MOVQ       CX, (R11)(DI*8)
        MOVQ       R9, (R11)(R15*8)
-       LEAQ       1(DI), CX              // fannkuch_test.go:44
+       LEAQ       1(DI), CX              // fannkuch_test.go:45
        DECQ       R15
        MOVQ       n1-56(SP), DI
        MOVQ       didpr-40(SP), R9
        MOVQ       CX, i-48(SP)
        MOVQ       count_ptr-8(SP), CX
-pc471:
+pc496:                                                                 // <-- cmp inside loop
        MOVQ       i-48(SP), DI
        CMPQ       DI, R15
-       JLT        pc410
-       CMPQ       R13, R12               // fannkuch_test.go:49
-       JCC        pc779
+       JLT        pc435
+pc506:
+       CMPQ       R13, R12               // fannkuch_test.go:52
+       JCC        pc794
        MOVQ       (R11)(R13*8), DI
-       MOVQ       R13, (R11)(R13*8)      // fannkuch_test.go:50
-       LEAQ       1(R14), R13            // fannkuch_test.go:47
-       TESTQ      DI, DI                 // fannkuch_test.go:52
-       JEQ        pc544
+       MOVQ       R13, (R11)(R13*8)      // fannkuch_test.go:53
+       LEAQ       1(R14), R13            // fannkuch_test.go:50
+       TESTQ      DI, DI                 // fannkuch_test.go:55
+       JEQ        pc559
        MOVQ       R13, R14
        MOVQ       DI, R13
        MOVQ       n1-56(SP), DI
...

As you say I tend to believe this should not be very important (not 100% sure though). I just wanted to mention it for completness that go18 and go19 generates different code structure.

Regarding 16 lines boundary:

I'm talking about jump source address because this address is taken into the branch predictor and could be potentially bad for decoding/etc.

My references:

https://patchwork.kernel.org/patch/6194771/ (search for "Intel CPUs can decode only up to 16 bytes at a time" and Linus answer)
http://www.agner.org/optimize/microarchitecture.pdf p. 28

BTB organization The branch target buffer in Sandy Bridge is bigger than in Nehalem according to unofficial rumors. It is unknown whether it has one level, as in Core 2 and earlier processors, or two levels as in Nehalem. It can handle a maximum of four call instructions per 16 bytes of code. Conditional jumps are less efficient if there are more than 3 branch instructions per 16 bytes of code.

p. 29

3.8 Branch prediction in Intel Haswell, Broadwell and Skylake ... The high throughput for taken branches of one per clock was observed for up to 128 branches with no more than one branch per 16 bytes of code. If there is more than one branch per 16 bytes of code then the throughput is reduced to one jump per two clock cycles. If there are more than 128 branches in the critical part of the code, and if they are spaced by at least 16 bytes, then apparently the first 128 branches have the high throughput and the remaining have the low throughput.

I agree that's not exactly what I said, so maybe sorry for misinformation here. However from the above references the 16 lines and effects associated to them seems to be there and from this point of view having jump instruction scattering two lines in a hot loop is suspicious to me.

Maybe I'm wrong here but I would try to check it. It would be good if @TocarIP could provide more feedback for this topic.

Thanks, Kirill

josharian commented 7 years ago

However my time for this issue is currently up

Understood. Thanks for all the digging so far!

I'm just about out of ideas myself. I am very curious to know about whether the ldflags serves as a suitable fuzzer for this, though, because if so, we should add it to our auto benchmarks.

BTB organization

IIUC, the BTB is used to predict the target of indirect jumps and calls. All jumps in fannkuch are direct, though, so the BTB should not be used. @TocarIP might correct me, though.

maybe sorry for misinformation here

We're all muddling through together. :)

And I just noticed that @TocarIP also said:

I'd like to experiment with branch source alignment, but this isn't as simple as changing 1 constant, and I will be away for a ~week.

navytux commented 7 years ago

Josh thanks once again for feedback and for your work on this issue.

I am very curious to know about whether the ldflags serves as a suitable fuzzer for this, though, because if so, we should add it to our auto benchmarks.

First I have to say that via -ldflags we can shift only by multiple of page size (=4K) because otherwise programs just segfault (probably due to somewhere there is assumption that some addresses are page-aligned):

kirr@deco:~/tmp/trashme/t$ cat hello.go 
package main

func main() {
        println("hello")
}
kirr@deco:~/tmp/trashme/t$ go build -ldflags=-T=0x401000 hello.go 
kirr@deco:~/tmp/trashme/t$ ./hello 
hello
kirr@deco:~/tmp/trashme/t$ go build -ldflags=-T=0x401010 hello.go 
kirr@deco:~/tmp/trashme/t$ ./hello 
Segmentation fault

So it is not as covering as my above shift by println. (0x401000 is address which is regularly used for .text section start)

Then here is what happens with ldflags for the same go18 (go1.8.1-3-gdd8c975372), gotip (4b0d74f ; already has CL 43293 applied) and gotip + CL 43491.3. with 4K shifts:

aaa

in other words the effect is almost gone. (benchmarked with https://lab.nexedi.com/kirr/go/commit/f10506bb)

I think what is needed is to teach go test to link into binary before user objects some more fine-grained padding function and then it will work.

/cc @dvyukov, @aclements, @quentinmit (please see how much difference shifting by 16 bytes can do here: https://github.com/golang/go/issues/18977#issuecomment-301858002)

BTB and rest: you could be right, but what I feel is that processing instructions is heavily tied to 16 bytes entries probably due to how decoding is organized and how branch prediction info is stored etc. Besides BTB was described for "Sandy Bridge" and for newer ones Intel Haswell, Broadwell and Skylake (I have 6th-gen so Skylake) it is about regular branch predictions.

Anyway for now this are all guesses and it would help to get some feedback from inside Intel (@TocarIP ?).

The documentation is also not always correct and as you could see about branch predictors in Agner Fog there are many missing pieces about how it works. So offhand if instruction processing unit is 16 bytes I would say for branch predictor it should also be at least somehow relevant.

Btw about CL 43491 - maybe you could look into how to change something there so that inner loop is still organized with jump backwards not jump forward. According to Agner Fog docs, if I understand correctly, backward jumps are a bit better for loops as at least they are predicted to be taken the first time, loops are detected with only backward jumps etc.

Hope this helps a bit, Kirill

dvyukov commented 7 years ago

This reminds me of #8717. We've constantly seen ~+/-3% fluctuations after completely unrelated changes (like deleting some code in a function that does not even execute). I and Russ were not able to get to the bottom of this. If you will shed some light on it, it would be great.

navytux commented 7 years ago

@dvyukov thanks for feedback. I thought the light was already sched in https://github.com/golang/go/issues/18977#issuecomment-301513260 and https://github.com/golang/go/issues/18977#issuecomment-301858002 . In short we currently believe it is hashing inside cpu being used for branch prediction which takes source jump IP into account, which sometime could result on hash collisions and this way branch prediction state for two unrelated branches could be overwriting each other.

Please see

http://www.agner.org/optimize/microarchitecture.pdf p. 15 (agree predictor) for overview.

The particular hash function used is not documented and is seemingly different (as well as other predictor details) from cpu to cpu.

However as Josh proposed there is a way to mitigate at least unrelated changes from benchmarks: perform the timings with tested function starting from different addresses, and take e.g. best or average. On amd64 go currently aligns functions to 16 bytes. Once again in https://github.com/golang/go/issues/18977#issuecomment-301858002 I hackily did test for Fannkuch11 with changing its start in 16 bytes steps. It could be done more general and used to at least remove unrelated changes noise from the benchmarks.

I hope I understood Josh correct, Kirill

navytux commented 7 years ago

For the reference: Fannkuch11 can be speedup -33% via improving BCE:

https://github.com/golang/go/issues/20393#issuecomment-302756277

josharian commented 7 years ago

It can be sped up 33% with -B. That might or might not be the same thing.

navytux commented 7 years ago

Correct. However all runtime bounds checks in fannkuch can be statically proven to be unneccessary:

https://github.com/golang/go/issues/20393#issuecomment-302877947

That means the compiler can theoretically reach -B level with just clever BCE.

dvyukov commented 7 years ago

@josharian Dmitry's performance dashboard was noisy and spammy and ultimately disabled.

It wasn't spammy and it wasn't particularly noisy. #8717 lead to episodic noise, but otherwise it reliably detected changes in multiple metrics within fractions of percent. Dashboard wasn't disabled. It was broken by changes to the app and then never fixed.

dvyukov commented 7 years ago

@navytux @josharian I thought the light was already sched in #18977 (comment) and #18977 (comment) .

The analysis looks great! Thanks!

navytux commented 7 years ago

@dvyukov thanks for feedback.

randall77 commented 7 years ago

We're punting more loop rotation work to 1.10.

randall77 commented 5 years ago

Punting to 1.13, too late for anything major in 1.12.