dvyukov opened this issue 10 years ago (status: Open)
If you want to try to figure out how this could be loop alignment, please go ahead. I spent days on this a few years ago and got absolutely nowhere. I can't find any evidence that loop alignment matters. It may be something else entirely, but in general modern CPUs are black boxes that can misbehave on a whim and, at least for people not privy to the inner workings, simply cannot be fully understood. They are subtle and quick to anger.

If you want to try the loop alignment hypothesis, you could edit src/liblink/asm6.c. Look for LoopAlign (MaxLoopPad = 0 means alignment is turned off right now).

I am removing the Release-Go1.5 tag. If someone wants to work on this, great, but I am not going to promise to spend any more time on it myself. Long ago I made peace with the fact that this kind of variation is something we just have to live with sometimes.
Labels changed: added release-none, removed release-go1.5.
Status changed to Accepted.
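The LoopAlign/MaxLoopPad logic mentioned above can be sketched roughly as follows. This is a Go rendering with names of my own choosing; the real logic lives in C in asm6.c, so treat it as an illustration of the decision rule, not the actual implementation.

```go
package main

import "fmt"

// padToAlign sketches the decision behind LoopAlign/MaxLoopPad: pad a
// loop head at pc up to the next align boundary with NOPs, but only if
// the pad fits within maxPad bytes. With maxPad = 0 (as in the current
// tree, MaxLoopPad = 0) no pad ever fits, so alignment is disabled.
func padToAlign(pc, align, maxPad int) int {
	pad := (align - pc%align) % align
	if pad > maxPad {
		return 0
	}
	return pad
}

func main() {
	fmt.Println(padToAlign(0x1f, 16, 8)) // 1 NOP byte reaches 0x20
	fmt.Println(padToAlign(0x12, 16, 8)) // 14 bytes needed, exceeds maxPad
	fmt.Println(padToAlign(0x12, 16, 0)) // maxPad = 0: never pads
}
```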
I can confirm your conclusion. We need to wait until Go becomes important enough that processor manufacturers allocate engineers to optimize for it. I've tried aligning back-branch targets, and all branch targets, at 16 bytes (https://golang.org/cl/162890043) with no success. Aligning back-branch targets increased binary size by 5.1%; aligning all branch targets increased it by 8.3%. So if we do this, we need something smarter, e.g. aligning only within real loops. I've checked that in both binaries the stack segment address and the fs register have equal values, so we can rule those out. Since the code has moved, the data segment also has a different address, so maybe it's related to data. But I don't see any functions in the profile that heavily access global data...
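The size cost of aligning every branch target is easy to estimate with a back-of-envelope model. The uniform-offset assumption here is mine; the real overhead depends on how densely branch targets occur in the binary.

```go
package main

import "fmt"

func main() {
	// Assume branch targets land uniformly at random within a 16-byte
	// window. The pad needed to reach the next 16-byte boundary from
	// offset off is (16-off)%16, so the average pad per aligned target:
	total := 0
	for off := 0; off < 16; off++ {
		total += (16 - off) % 16
	}
	fmt.Printf("average pad per 16-byte-aligned target: %.1f bytes\n",
		float64(total)/16)
}
```

At 7.5 bytes of padding per target on average, it is unsurprising that padding every branch target costs noticeably more binary size than padding only back-branch targets.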
Just debugged another case, which turned out to be this issue. After inserting a dead function call, the program becomes 2.5% slower.
go version devel +b4538d7 Wed May 11 06:00:33 2016 +0000 linux/amd64 with this patch:
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -1028,12 +1028,41 @@ func startTheWorldWithSema() {
add := needaddgcproc()
lock(&sched.lock)
+ var p1 *p
procs := gomaxprocs
- if newprocs != 0 {
- procs = newprocs
- newprocs = 0
+ if true {
+ if newprocs != 0 {
+ procs = newprocs
+ newprocs = 0
+ }
+ p1 = procresize(procs)
+ } else {
+
+ if newprocs != 0 {
+ procs = newprocs
+ newprocs = 0
+ p1 = procresize(procs)
+ } else {
+ _g_.m.p.ptr().status = _Prunning
+ var runnablePs *p
+ for i := gomaxprocs - 1; i >= 0; i-- {
+ p := allp[i]
+ if _g_.m.p.ptr() == p {
+ continue
+ }
+ p.status = _Pidle
+ if runqempty(p) {
+ pidleput(p)
+ } else {
+ p.m.set(mget())
+ p.link.set(runnablePs)
+ runnablePs = p
+ }
+ }
+ p1 = runnablePs
+ }
}
- p1 := procresize(procs)
+
sched.gcwaiting = 0
if sched.sysmonwait != 0 {
sched.sysmonwait = 0
Then, depending on the presence of the following patch:
--- a/src/runtime/mgc.go
+++ b/src/runtime/mgc.go
@@ -1608,7 +1608,7 @@ func gcMark(start_time int64) {
traceGCScanDone()
}
- cachestats()
+ //cachestats()
// Update the reachable heap stat.
memstats.heap_reachable = work.bytesMarked
With the test program https://gist.githubusercontent.com/dvyukov/419e97e80f28cf7a0c656b03b99b7a49/raw/003bb12b73498f512514c0dad436e9da38b4f990/gistfile1.txt, performance changes by ±2.5%.
With the call commented out I consistently see:
$ nice -20 taskset -c 0-11 time /tmp/gif100
7.90user 0.05system 0:07.95elapsed 100%CPU
Without the call commented out:
$ nice -20 taskset -c 0-11 time /tmp/gif101
8.11user 0.05system 0:08.15elapsed 100%CPU
All time is spent in computations:
95.11% gif101 gif101 [.] image/draw.drawPaletted
0.92% gif101 gif101 [.] compress/lzw.(*encoder).Write
0.79% gif101 gif101 [.] image.(*Paletted).At
0.58% gif101 gif101 [.] compress/lzw.(*decoder).decode
0.37% gif101 gif101 [.] image/color.(*RGBA).RGBA
0.32% gif101 gif101 [.] compress/lzw.(*decoder).readLSB
0.23% gif101 gif101 [.] compress/lzw.(*encoder).writeLSB
drawPaletted magically becomes faster after the change.
The diff in the CPU profiles still does not make any sense to me; it looks like the percentages are just randomly shuffled. Faster version: https://gist.githubusercontent.com/dvyukov/b6ebfbae2b5bb7423e8e3b34b565de1d/raw/57fd60ef4220bd60445d713282cae03cda4f7fe6/gistfile1.txt Slower version: https://gist.githubusercontent.com/dvyukov/9d7a9dc9a19a3d2628d7a439c7c0aa9e/raw/69d411f7a42f732614ba05dbe500a3f42f74d429/gistfile1.txt
@aclements @RLH @randall77 @dr2chase
Loop alignment still makes the program slower.
In the fast version the function is aligned on 0x10:
/tmp/gif103:0000000000494a10 T image/draw.drawPaletted
and in the slow version it is aligned on 0x20:
/tmp/gif102:00000000004949a0 T image/draw.drawPaletted
If I set function alignment to 0x20 (which is broken due to asm functions, so it actually gives me 0x10 alignment for the function), it mostly fixes the problem:
7.96user 0.05system 0:08.01elapsed 100%CPU
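The alignment claim can be checked directly from the nm addresses above: the low bits are 0x10 in the fast binary and 0xa0 in the slow one.

```go
package main

import "fmt"

func main() {
	// Symbol addresses reported by nm for image/draw.drawPaletted
	// in the two binaries above.
	fast := uint64(0x494a10) // gif103
	slow := uint64(0x4949a0) // gif102
	for _, s := range []struct {
		name string
		addr uint64
	}{{"fast", fast}, {"slow", slow}} {
		fmt.Printf("%s: %#x  16-byte aligned=%v  32-byte aligned=%v\n",
			s.name, s.addr, s.addr%16 == 0, s.addr%32 == 0)
	}
}
```

So the fast binary's function is 16-byte but not 32-byte aligned, while the slow binary's is 32-byte aligned, which is the counterintuitive part: the "better"-aligned placement is the slower one.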
I'm reminded of: http://dl.acm.org/citation.cfm?id=1508275 https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf
From the second paper:
Mytkowicz et al. [the first paper] show that just changing the size of environment variables can trigger performance degradation as high as 300%; we find that simply changing the link order of object files can cause performance to decrease by as much as 57%.
Microsoft Research had a tool that would link a program multiple times, where each binary used a different (randomized) function order, then they'd run tests and pick the best function order. Unfortunately, my Google-fu is failing me and I cannot find a reference. (The closest I can find is VC's /ORDER linker option, which looks like it could be used to implement this feature.)
For reference, the effect is fully explained here:
https://github.com/golang/go/issues/18977#issuecomment-302163082 https://github.com/golang/go/issues/18977#issuecomment-301513260 https://github.com/golang/go/issues/18977#issuecomment-301858002
(including a way to denoise the timings)
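One common denoising approach (a generic sketch, not necessarily the exact method from the linked comments) is to repeat the measurement and report the minimum, since for CPU-bound code external noise such as interrupts, frequency scaling, and co-running processes can only ever add time.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// measure runs f `runs` times and returns the minimum and median wall
// times. The minimum is typically the least noisy estimator for
// CPU-bound workloads; the median is a useful sanity check.
func measure(f func(), runs int) (lo, median time.Duration) {
	ds := make([]time.Duration, runs)
	for i := range ds {
		start := time.Now()
		f()
		ds[i] = time.Since(start)
	}
	sort.Slice(ds, func(i, j int) bool { return ds[i] < ds[j] })
	return ds[0], ds[runs/2]
}

func main() {
	work := func() {
		s := 0
		for i := 0; i < 1e7; i++ {
			s += i
		}
		_ = s
	}
	lo, med := measure(work, 11)
	fmt.Println("min:", lo, "median:", med)
}
```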
Change https://golang.org/cl/332771 mentions this issue: strings,bytes: do not allocate memory in Trim/TrimLeft/TrimRight