dvyukov opened this issue 10 years ago (status: Open)
If you want to try to figure out how this could be loop alignment, please go ahead. I spent days on this a few years ago and got absolutely nowhere. I can't find any evidence that loop alignment matters. It may be something else entirely, but in general modern CPUs are black boxes that can misbehave on a whim and, at least for people not privy to the inner workings, simply cannot be fully understood. They are subtle and quick to anger.

If you want to try the loop alignment hypothesis, you could edit src/liblink/asm6.c. Look for LoopAlign (MaxLoopPad = 0 means alignment is turned off right now).

I am removing the Release-Go1.5 tag. If someone wants to work on this, great, but I am not going to promise to spend any more time on it myself. Long ago I made peace with the fact that this kind of variation is something we just have to live with sometimes.
Labels changed: added release-none, removed release-go1.5.
Status changed to Accepted.
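The LoopAlign/MaxLoopPad logic mentioned above can be sketched roughly as follows. This is a Go rendering with names of my own choosing; the real logic lives in C in asm6.c, so treat it as an illustration of the decision rule, not the actual implementation.

```go
package main

import "fmt"

// padToAlign sketches the decision behind LoopAlign/MaxLoopPad: pad a
// loop head at pc up to the next align boundary with NOPs, but only if
// the pad fits within maxPad bytes. With maxPad = 0 (as in the current
// tree, MaxLoopPad = 0) no pad ever fits, so alignment is disabled.
func padToAlign(pc, align, maxPad int) int {
	pad := (align - pc%align) % align
	if pad > maxPad {
		return 0
	}
	return pad
}

func main() {
	fmt.Println(padToAlign(0x1f, 16, 8)) // 1 NOP byte reaches 0x20
	fmt.Println(padToAlign(0x12, 16, 8)) // 14 bytes needed, exceeds maxPad
	fmt.Println(padToAlign(0x12, 16, 0)) // maxPad = 0: never pads
}
```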
I can confirm your conclusion. We need to wait until Go becomes important enough that processor manufacturers allocate engineers to optimize for it. I've tried aligning back-branch targets, and all branch targets, at 16 bytes (https://golang.org/cl/162890043) with no success. Aligning back-branch targets increased binary size by 5.1%; aligning all branch targets increased it by 8.3%. So if we do this, we need something smarter, e.g. aligning only within real loops. I've checked that in both binaries the stack segment address and the fs register have equal values, so we can rule those out. Since the code has moved, the data segment also has a different address, so maybe it's related to data. But I don't see any functions in the profile that heavily access global data...
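The size cost of aligning every branch target is easy to estimate with a back-of-envelope model. The uniform-offset assumption here is mine; the real overhead depends on how densely branch targets occur in the binary.

```go
package main

import "fmt"

func main() {
	// Assume branch targets land uniformly at random within a 16-byte
	// window. The pad needed to reach the next 16-byte boundary from
	// offset off is (16-off)%16, so the average pad per aligned target:
	total := 0
	for off := 0; off < 16; off++ {
		total += (16 - off) % 16
	}
	fmt.Printf("average pad per 16-byte-aligned target: %.1f bytes\n",
		float64(total)/16)
}
```

At 7.5 bytes of padding per target on average, it is unsurprising that padding every branch target costs noticeably more binary size than padding only back-branch targets.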
Just debugged another case, which turned out to be this issue. After inserting a dead function call, the program becomes 2.5% slower.
go version devel +b4538d7 Wed May 11 06:00:33 2016 +0000 linux/amd64 with this patch:
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -1028,12 +1028,41 @@ func startTheWorldWithSema() {
add := needaddgcproc()
lock(&sched.lock)
+ var p1 *p
procs := gomaxprocs
- if newprocs != 0 {
- procs = newprocs
- newprocs = 0
+ if true {
+ if newprocs != 0 {
+ procs = newprocs
+ newprocs = 0
+ }
+ p1 = procresize(procs)
+ } else {
+
+ if newprocs != 0 {
+ procs = newprocs
+ newprocs = 0
+ p1 = procresize(procs)
+ } else {
+ _g_.m.p.ptr().status = _Prunning
+ var runnablePs *p
+ for i := gomaxprocs - 1; i >= 0; i-- {
+ p := allp[i]
+ if _g_.m.p.ptr() == p {
+ continue
+ }
+ p.status = _Pidle
+ if runqempty(p) {
+ pidleput(p)
+ } else {
+ p.m.set(mget())
+ p.link.set(runnablePs)
+ runnablePs = p
+ }
+ }
+ p1 = runnablePs
+ }
}
- p1 := procresize(procs)
+
sched.gcwaiting = 0
if sched.sysmonwait != 0 {
sched.sysmonwait = 0
Then, depending on the presence of the following patch:
--- a/src/runtime/mgc.go
+++ b/src/runtime/mgc.go
@@ -1608,7 +1608,7 @@ func gcMark(start_time int64) {
traceGCScanDone()
}
- cachestats()
+ //cachestats()
// Update the reachable heap stat.
memstats.heap_reachable = work.bytesMarked
With the test program https://gist.githubusercontent.com/dvyukov/419e97e80f28cf7a0c656b03b99b7a49/raw/003bb12b73498f512514c0dad436e9da38b4f990/gistfile1.txt, performance changes by ±2.5%.
With the call commented out I consistently see:
$ nice -20 taskset -c 0-11 time /tmp/gif100
7.90user 0.05system 0:07.95elapsed 100%CPU
Without the call commented out:
$ nice -20 taskset -c 0-11 time /tmp/gif101
8.11user 0.05system 0:08.15elapsed 100%CPU
All time is spent in computations:
95.11% gif101 gif101 [.] image/draw.drawPaletted
0.92% gif101 gif101 [.] compress/lzw.(*encoder).Write
0.79% gif101 gif101 [.] image.(*Paletted).At
0.58% gif101 gif101 [.] compress/lzw.(*decoder).decode
0.37% gif101 gif101 [.] image/color.(*RGBA).RGBA
0.32% gif101 gif101 [.] compress/lzw.(*decoder).readLSB
0.23% gif101 gif101 [.] compress/lzw.(*encoder).writeLSB
drawPaletted magically becomes faster after the change.
The diff in the CPU profiles still does not make any sense to me; it looks like the percentages are just randomly shuffled. Faster version: https://gist.githubusercontent.com/dvyukov/b6ebfbae2b5bb7423e8e3b34b565de1d/raw/57fd60ef4220bd60445d713282cae03cda4f7fe6/gistfile1.txt Slower version: https://gist.githubusercontent.com/dvyukov/9d7a9dc9a19a3d2628d7a439c7c0aa9e/raw/69d411f7a42f732614ba05dbe500a3f42f74d429/gistfile1.txt
@aclements @RLH @randall77 @dr2chase
Loop alignment still makes the program slower.
In the fast version the function is aligned on 0x10:
/tmp/gif103:0000000000494a10 T image/draw.drawPaletted
and in the slow version it is aligned on 0x20:
/tmp/gif102:00000000004949a0 T image/draw.drawPaletted
If I set function alignment to 0x20 (which is broken due to asm functions, so it actually gives me 0x10 alignment for the function), it mostly fixes the problem:
7.96user 0.05system 0:08.01elapsed 100%CPU
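The alignment claim can be checked directly from the nm addresses above: the low bits are 0x10 in the fast binary and 0xa0 in the slow one.

```go
package main

import "fmt"

func main() {
	// Symbol addresses reported by nm for image/draw.drawPaletted
	// in the two binaries above.
	fast := uint64(0x494a10) // gif103
	slow := uint64(0x4949a0) // gif102
	for _, s := range []struct {
		name string
		addr uint64
	}{{"fast", fast}, {"slow", slow}} {
		fmt.Printf("%s: %#x  16-byte aligned=%v  32-byte aligned=%v\n",
			s.name, s.addr, s.addr%16 == 0, s.addr%32 == 0)
	}
}
```

So the fast binary's function is 16-byte but not 32-byte aligned, while the slow binary's is 32-byte aligned, which is the counterintuitive part: the "better"-aligned placement is the slower one.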
I'm reminded of: http://dl.acm.org/citation.cfm?id=1508275 https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf
From the second paper:
Mytkowicz et al. [the first paper] show that just changing the size of environment variables can trigger performance degradation as high as 300%; we find that simply changing the link order of object files can cause performance to decrease by as much as 57%.
Microsoft Research had a tool that would link a program multiple times, where each binary used a different (randomized) function order, then they'd run tests and pick the best function order. Unfortunately, my Google-fu is failing me and I cannot find a reference. (The closest I can find is VC's /ORDER linker option, which looks like it could be used to implement this feature.)
For reference, the effect is fully explained here:
https://github.com/golang/go/issues/18977#issuecomment-302163082 https://github.com/golang/go/issues/18977#issuecomment-301513260 https://github.com/golang/go/issues/18977#issuecomment-301858002
(including a way to denoise the timings)
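One common denoising approach (a generic sketch, not necessarily the exact method from the linked comments) is to repeat the measurement and report the minimum, since for CPU-bound code external noise such as interrupts, frequency scaling, and co-running processes can only ever add time.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// measure runs f `runs` times and returns the minimum and median wall
// times. The minimum is typically the least noisy estimator for
// CPU-bound workloads; the median is a useful sanity check.
func measure(f func(), runs int) (lo, median time.Duration) {
	ds := make([]time.Duration, runs)
	for i := range ds {
		start := time.Now()
		f()
		ds[i] = time.Since(start)
	}
	sort.Slice(ds, func(i, j int) bool { return ds[i] < ds[j] })
	return ds[0], ds[runs/2]
}

func main() {
	work := func() {
		s := 0
		for i := 0; i < 1e7; i++ {
			s += i
		}
		_ = s
	}
	lo, med := measure(work, 11)
	fmt.Println("min:", lo, "median:", med)
}
```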
Change https://golang.org/cl/332771 mentions this issue: strings,bytes: do not allocate memory in Trim/TrimLeft/TrimRight