golang / go

The Go programming language
https://go.dev

cmd/compile: SSA performance inconsistency/regression difference across amd64 CPUs. #16982

Closed. dmitshur closed this issue 8 years ago.

dmitshur commented 8 years ago

Disclaimer: This is not necessarily an issue; I'm opening this thread to provide information that I hope may be helpful. It contains a microbenchmark, which is not representative of real-world performance, just a tiny subset. But there's something unusual/strange about it, which is why I think it may be worth reporting. Please close if it's not helpful and nothing needs to be done.

I had a little microbenchmark snippet I previously used to compare gc and GopherJS performance, and I decided to try it with the SSA backend of Go 1.7. I found a surprise: one amd64 computer behaves very differently from all the others I've tried it on, and I'm wondering whether or not it's caused by an unintended bug somewhere.

What version of Go are you using (go version)?

go version go1.7 darwin/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/Dmitri/Dropbox/Work/2013/GoLanding:/Users/Dmitri/Dropbox/Work/2013/GoLand"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/g3/yrvdj9f55ll7jy7l5ygz82yc0000gn/T/go-build084371455=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"

What did you do?

I ran the following program with and without the SSA backend on two different computers (both with the amd64 CPU architecture).

https://play.golang.org/p/aAM1SuV6U4

Computer A is a MacBook Pro (15-inch, Late 2011), running OS X 10.11.6, with 2.4 GHz Intel Core i7-2760QM CPU @ 2.40GHz x 8.

$ go build -gcflags="-ssa=0" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.431564409s

$ go build -gcflags="-ssa=1" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.420316364s

Computer B is a MacBook (Retina, 12-inch, Early 2016), running OS X 10.11.6, with 1.1 GHz Intel Core m3-6Y30 CPU @ 0.90GHz x 4.

$ go build -gcflags="-ssa=0" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.564973583s

$ go build -gcflags="-ssa=1" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.771555271s

(There is a variance of about ±5% between individual runs.)

      | Computer A   | Computer B
------|--------------|-------------
SSA=0 | 6.431564409s | 2.564973583s
SSA=1 | 6.420316364s | 5.771555271s

What did you expect to see?

Given that the SSA backend generated code that performed roughly as well on computer A, I expected it to have a similar result on computer B.

What did you see instead?

Instead, I saw that on computer B (but not on computer A) enabling SSA reduces performance by a factor of more than two.

dmitshur commented 8 years ago

Additional findings

If I rewrite term to avoid the if branch, using this form:

return 4 / (2*float64(k) + 1) * float64(1-2*(k%2))

Then computer A performance becomes ~2x slower with SSA enabled, and computer B becomes ~4x slower.

      | Computer A    | Computer B
------|---------------|--------------
SSA=0 | 6.703508163s  | 3.067721176s
SSA=1 | 11.302787342s | 14.389823165s
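
Spelled out as a full function (a sketch using the same signature as term in the linked program), the branch-free form relies on 1-2*(k%2) being +1 for even k and -1 for odd k:

func term(k int32) float64 {
    // 1-2*(k%2) is +1 when k is even and -1 when k is odd,
    // so the sign of the term alternates without an if/else branch.
    return 4 / (2*float64(k) + 1) * float64(1-2*(k%2))
}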

However, if I simply move that float64(1-2*(k%2)) multiplier to the front of the expression like this:

return float64(1-2*(k%2)) * 4 / (2*float64(k) + 1)

      | Computer A    | Computer B
------|---------------|--------------
SSA=0 | 10.206198046s | 12.165551807s
SSA=1 | 9.072183098s  | 10.452724107s

Then suddenly ssa=1 becomes just as performant as ssa=0 on both computers A and B (or rather, ssa=0 becomes much slower).

Are these kinds of fluctuations normal and expected, or is this unexpected and caused by an unintended bug?

randall77 commented 8 years ago

Strange. I cannot reproduce on my desktop. Timing experiments can be tricky on laptops, with frequency scaling and all that other garbage. Maybe the SSA compiler heats up the chip so the chip scales back the clock frequency during the benchmark.

Please try building both binaries beforehand, and alternate running+timing them. Might also be worth trying the same binaries on both machines, just to make sure it's really the same code.
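
For more stable numbers, here is a minimal sketch of the same comparison as a standard Go benchmark (a hypothetical pi_test.go alongside the program, reusing its pi function with a smaller iteration count so go test -bench can take several samples):

package main

import "testing"

// sink keeps the result live so the compiler cannot optimize the pi call away.
var sink float64

func BenchmarkPi(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = pi(10 * 1000 * 1000) // 10M iterations per op instead of 1B
    }
}

Running go test -bench=Pi -count=5 under each compiler setting then gives several interleaved samples per binary.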

dmitshur commented 8 years ago

I cannot reproduce on my desktop.

That's not unexpected; I can only reproduce this on the 2016 12" Retina MacBook. Notably, it has an ultra-low-voltage Core m3-6Y30 CPU without a fan. It does not happen on this 2011 MBP, nor on a 2015 iMac that I tried it on (but didn't mention above).

I've built both binaries on the 2011 MBP just now, copied them over, ran them alternately on the 12" MacBook, and got the same results.

2011 MBP (where the binaries were built beforehand):

$ ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.491640868s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.483156137s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.42614708s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.449560988s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.463344901s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.455693056s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.441013487s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 7.03565282s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.977611407s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.725358761s

2016 12" Retina MacBook (copied the binaries that were built on the 2011 MBP):

$ ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.596706378s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.863147823s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.641448078s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.809569822s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.575924752s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.761018202s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.592914719s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.98196503s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.665577762s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.813804654s

If I run a CPU + GPU intensive load on the 2016 12" Retina MacBook in the background, just to make sure the CPU is both clocked high and not throttling due to load bursts, I can still see the same effect. Both times get around 30% slower because of the background load.

[screenshot: 2016-09-03 at 2:46 PM]

I suspect it'd be a good idea to look at the generated assembly difference for term with ssa=0 and ssa=1.
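
One way to do that (a sketch, reusing the ssa0/ssa1 binaries built above) is to disassemble package main from each binary and compare the listings:

$ go tool objdump -s 'main\.' ssa0 > ssa0.asm
$ go tool objdump -s 'main\.' ssa1 > ssa1.asm

Building with go build -gcflags="-ssa=0 -S" (or -ssa=1 -S) also prints the compiler's own assembly listing for the package at build time.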

TocarIP commented 8 years ago

@shurcooL can you upload the code somewhere? I get Forbidden from the playground. The 2016 12" Retina MacBook has a Skylake CPU, so I'd like to try to reproduce this on my Skylake desktop.

dmitshur commented 8 years ago

@TocarIP, would pasting the code here work for you? Here it is, taken from https://play.golang.org/p/aAM1SuV6U4:

// Play with benchmarking a tight loop with many iterations and a func call,
// compare gc vs GopherJS performance.
//
// An alternative more close-to-metal implementation that doesn't use math.Pow.
//
// Disclaimer: This is a microbenchmark and is very poorly representative of
//             overall general real world performance of larger applications.
//
package main

import (
    "fmt"
    "time"
)

func term(k int32) float64 {
    //       | Computer A   | Computer B
    // ------|--------------|-------------
    // SSA=0 | 6.431564409s | 2.564973583s
    // SSA=1 | 6.420316364s | 5.771555271s
    if k%2 == 0 {
        return 4 / (2*float64(k) + 1)
    } else {
        return -4 / (2*float64(k) + 1)
    }

    //       | Computer A    | Computer B
    // ------|---------------|--------------
    // SSA=0 | 6.703508163s  | 3.067721176s
    // SSA=1 | 11.302787342s | 14.389823165s
    //return 4 / (2*float64(k) + 1) * float64(1-2*(k%2))

    //       | Computer A    | Computer B
    // ------|---------------|--------------
    // SSA=0 | 10.206198046s | 12.165551807s
    // SSA=1 | 9.072183098s  | 10.452724107s
    //return float64(1-2*(k%2)) * 4 / (2*float64(k) + 1)
}

// pi performs n iterations to compute an approximation of pi.
func pi(n int32) float64 {
    f := 0.0
    for k := int32(0); k <= n; k++ {
        f += term(k)
    }
    return f
}

func main() {
    // Start measuring time from now.
    started := time.Now()

    const n = 1000 * 1000 * 1000
    fmt.Printf("approximating pi with %v iterations.\n", n)
    fmt.Println(pi(n))

    fmt.Printf("total time taken is: %v\n", time.Since(started))
}

TocarIP commented 8 years ago

Thanks, I've tested 1.6 vs tip. On Haswell both versions have the same speed, but on Broadwell (sic!) and Skylake I was able to verify a 2x slowdown. Looking at the asm, the main difference is the use of floating-point constants: 1.6 always reloads them, while 1.7 loads them once at the beginning of the pi function. This is due to both paths (k%2 == 0 and k%2 == 1) writing to the same register. In asm the loop looks like:

    jne q2
    cvtsi2sd %ebx,%xmm3
    mulsd %xmm0,%xmm3
    addsd %xmm1,%xmm3
    movsd 0xb817d(%rip),%xmm4        # 4b91c0 <$f64.4010000000000000>
    divsd %xmm5,%xmm4
    q4: lea 0x1(%rbx),%ecx
    addsd %xmm4,%xmm2
    cmp %eax,%ecx
    jle q3
    q1: movsd %xmm2,0x10(%rsp)
    retq
    q2: cvtsi2sd %ebx,%xmm3
    mulsd %xmm0,%xmm3
    addsd %xmm1,%xmm3
    movsd 0xb8193(%rip),%xmm4        # 4b9200 <$f64.c010000000000000>
    divsd %xmm3,%xmm4

If I change xmm3 into xmm5:

    jne q2
    cvtsi2sd %ebx,%xmm5
    mulsd %xmm0,%xmm5
    addsd %xmm1,%xmm5
    movsd 0xb817d(%rip),%xmm4        # 4b91c0 <$f64.4010000000000000>
    divsd %xmm5,%xmm4
    q4: lea 0x1(%rbx),%ecx
    addsd %xmm4,%xmm2
    cmp %eax,%ecx
    jle q3
    q1: movsd %xmm2,0x10(%rsp)
    retq
    q2: cvtsi2sd %ebx,%xmm3
    mulsd %xmm0,%xmm3
    addsd %xmm1,%xmm3
    movsd 0xb8193(%rip),%xmm4        # 4b9200 <$f64.c010000000000000>
    divsd %xmm3,%xmm4

Performance improves by 2x (back to the 1.6 level).

TocarIP commented 8 years ago

By introducing a register allocator heuristic, it is possible to avoid writing to the same register; see https://go-review.googlesource.com/#/c/28874/ for a prototype.

gopherbot commented 8 years ago

CL https://golang.org/cl/28874 mentions this issue.

gopherbot commented 8 years ago

CL https://golang.org/cl/31490 mentions this issue.