golang / go

dev.ssa: Performance regressions on BLAS benchmarks vs 1.6 #14511

Closed btracey closed 8 years ago

btracey commented 8 years ago

We are seeing some significant performance regressions for BLAS benchmarks vs. 1.6. These benchmarks are numeric, and consist almost entirely of []float64 indexing and assignment. While they may seem hyper-specialized, calls to Dgemm in particular can make up a significant fraction of runtime in codes we write.

Note that Dgemm is coded as a concurrent algorithm, while Dgemv is not.

Code: https://godoc.org/github.com/gonum/blas/native
Dgemv: https://github.com/gonum/blas/blob/master/native/level2double.go#L13
Dgemm: https://github.com/gonum/blas/blob/master/native/dgemm.go

The actual benchmark calls are in those packages, but the benchmark code itself is in the blas/testblas package.

Call:

go test -bench Dgem -tags noasm -cpu=1,8 -count 5 -timeout=60m

The noasm build tag makes the dot product inner loop call the pure Go version (https://github.com/gonum/internal/blob/master/asm/ddot.go) instead of the assembly version.
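
For readers without the repository at hand, here is a minimal sketch of what the pure Go inner loops plausibly look like. It is reconstructed from the symbol names, parameter names, and argument sizes (args=0x38 and args=0x60) visible in the assembly dumps later in this thread, not copied from ddot.go. The uintptr parameter types and the exact loop form are assumptions.

```go
package main

import "fmt"

// DdotUnitary returns the dot product of x and y over len(x) elements.
// y must be at least as long as x; otherwise indexing y panics.
func DdotUnitary(x, y []float64) (sum float64) {
	for i, v := range x {
		sum += y[i] * v
	}
	return sum
}

// DdotInc returns the dot product of n elements, starting at x[ix] and y[iy]
// and stepping by incX and incY. The uintptr types are an assumption.
func DdotInc(x, y []float64, n, incX, incY, ix, iy uintptr) (sum float64) {
	for i := 0; i < int(n); i++ {
		sum += y[iy] * x[ix]
		ix += incX
		iy += incY
	}
	return sum
}

func main() {
	x := []float64{1, 2, 3, 4}
	y := []float64{5, 6, 7, 8}
	fmt.Println(DdotUnitary(x, y))            // 1*5 + 2*6 + 3*7 + 4*8
	fmt.Println(DdotInc(x, y, 2, 2, 2, 0, 0)) // 1*5 + 3*7
}
```

Both loops do one bounds-checked load per slice and one multiply-add per iteration, so the quality of the generated loop body dominates the benchmark times.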

SSA Version: go version devel +fb54e03 Thu Feb 25 07:10:07 2016 +0000 darwin/amd64

Go env output:

brendan:~/Documents/mygo/src/github.com/gonum/blas/native$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/brendan/Documents/mygo"
GORACE=""
GOROOT="/Users/brendan/gover/go"
GOTOOLDIR="/Users/brendan/gover/go/pkg/tool/darwin_amd64"
GO15VENDOREXPERIMENT="1"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fno-common"
CXX="clang++"
CGO_ENABLED="1"

Benchstat comparison (old = Go 1.6, new = dev.ssa tip):

brendan:~/Documents/mygo$ benchstat blasonesix.txt ssatip.txt
name                      old time/op  new time/op   delta
DgemmSmSmSm               1.89µs ± 0%   2.19µs ± 2%   +16.04%  (p=0.008 n=5+5)
DgemmSmSmSm-8             1.90µs ± 0%   2.15µs ± 1%   +13.65%  (p=0.008 n=5+5)
DgemmMedMedMed            1.35ms ± 0%   1.68ms ± 1%   +24.92%  (p=0.008 n=5+5)
DgemmMedMedMed-8           581µs ± 2%    735µs ± 1%   +26.59%  (p=0.008 n=5+5)
DgemmMedLgMed             13.1ms ± 1%   16.4ms ± 1%   +25.41%  (p=0.008 n=5+5)
DgemmMedLgMed-8           3.96ms ± 2%   5.09ms ± 2%   +28.58%  (p=0.008 n=5+5)
DgemmLgLgLg                1.32s ± 0%    1.66s ± 0%   +26.52%  (p=0.016 n=5+4)
DgemmLgLgLg-8              374ms ± 1%    498ms ± 1%   +33.25%  (p=0.008 n=5+5)
DgemmLgSmLg               19.4ms ± 1%   22.2ms ± 1%   +14.56%  (p=0.008 n=5+5)
DgemmLgSmLg-8             5.28ms ± 3%   6.28ms ± 3%   +18.84%  (p=0.008 n=5+5)
DgemmLgLgSm               14.4ms ± 1%   17.9ms ± 1%   +23.88%  (p=0.008 n=5+5)
DgemmLgLgSm-8             4.11ms ± 5%   5.28ms ± 5%   +28.60%  (p=0.008 n=5+5)
DgemmHgHgSm                1.49s ± 0%    1.85s ± 2%   +24.57%  (p=0.008 n=5+5)
DgemmHgHgSm-8              383ms ± 1%    496ms ± 3%   +29.45%  (p=0.008 n=5+5)
DgemmMedMedMedTNT         1.35ms ± 2%   1.69ms ± 3%   +24.66%  (p=0.008 n=5+5)
DgemmMedMedMedTNT-8        589µs ± 3%    748µs ± 2%   +27.05%  (p=0.008 n=5+5)
DgemmMedMedMedNTT         1.08ms ± 1%   2.42ms ± 1%  +125.10%  (p=0.008 n=5+5)
DgemmMedMedMedNTT-8        480µs ± 2%   1040µs ± 0%  +116.96%  (p=0.008 n=5+5)
DgemmMedMedMedTT          1.61ms ± 0%   1.93ms ± 1%   +19.93%  (p=0.008 n=5+5)
DgemmMedMedMedTT-8         701µs ± 3%    851µs ± 1%   +21.47%  (p=0.008 n=5+5)
DgemvSmSmNoTransInc1       186ns ± 1%    273ns ± 1%   +46.77%  (p=0.008 n=5+5)
DgemvSmSmNoTransInc1-8     186ns ± 0%    274ns ± 0%   +47.31%  (p=0.016 n=4+5)
DgemvSmSmNoTransIncN       227ns ± 0%    317ns ± 1%   +39.74%  (p=0.008 n=5+5)
DgemvSmSmNoTransIncN-8     227ns ± 1%    315ns ± 1%   +38.82%  (p=0.008 n=5+5)
DgemvSmSmTransInc1         212ns ± 3%    240ns ± 1%   +13.29%  (p=0.008 n=5+5)
DgemvSmSmTransInc1-8       212ns ± 2%    240ns ± 1%   +13.61%  (p=0.008 n=5+5)
DgemvSmSmTransIncN         244ns ± 1%    272ns ± 1%   +11.38%  (p=0.008 n=5+5)
DgemvSmSmTransIncN-8       243ns ± 0%    272ns ± 2%   +11.85%  (p=0.016 n=4+5)
DgemvMedMedNoTransInc1    10.0µs ± 1%   24.7µs ± 0%  +147.68%  (p=0.008 n=5+5)
DgemvMedMedNoTransInc1-8  9.82µs ± 2%  24.67µs ± 1%  +151.26%  (p=0.008 n=5+5)
DgemvMedMedNoTransIncN    12.4µs ± 0%   25.6µs ± 0%  +107.05%  (p=0.008 n=5+5)
DgemvMedMedNoTransIncN-8  12.3µs ± 0%   25.8µs ± 2%  +109.16%  (p=0.008 n=5+5)
DgemvMedMedTransInc1      12.2µs ± 0%   15.5µs ± 1%   +26.54%  (p=0.008 n=5+5)
DgemvMedMedTransInc1-8    12.3µs ± 1%   15.5µs ± 1%   +26.48%  (p=0.008 n=5+5)
DgemvMedMedTransIncN      14.3µs ± 1%   17.9µs ± 1%   +25.29%  (p=0.008 n=5+5)
DgemvMedMedTransIncN-8    14.3µs ± 0%   17.9µs ± 0%   +25.42%  (p=0.008 n=5+5)
DgemvLgLgNoTransInc1       909µs ± 0%   2515µs ± 1%  +176.56%  (p=0.008 n=5+5)
DgemvLgLgNoTransInc1-8     906µs ± 1%   2520µs ± 1%  +178.23%  (p=0.008 n=5+5)
DgemvLgLgNoTransIncN      1.14ms ± 0%   2.52ms ± 1%  +120.18%  (p=0.008 n=5+5)
DgemvLgLgNoTransIncN-8    1.15ms ± 1%   2.52ms ± 1%  +119.76%  (p=0.008 n=5+5)
DgemvLgLgTransInc1        1.15ms ± 1%   1.53ms ± 0%   +33.05%  (p=0.008 n=5+5)
DgemvLgLgTransInc1-8      1.15ms ± 0%   1.53ms ± 0%   +32.95%  (p=0.008 n=5+5)
DgemvLgLgTransIncN        1.30ms ± 1%   1.84ms ±10%   +41.65%  (p=0.008 n=5+5)
DgemvLgLgTransIncN-8      1.30ms ± 1%   1.85ms ± 5%   +42.56%  (p=0.008 n=5+5)
DgemvLgSmNoTransInc1      15.4µs ± 0%   23.6µs ± 0%   +53.10%  (p=0.008 n=5+5)
DgemvLgSmNoTransInc1-8    15.4µs ± 1%   23.8µs ± 1%   +54.35%  (p=0.008 n=5+5)
DgemvLgSmNoTransIncN      19.9µs ± 2%   28.0µs ± 1%   +40.57%  (p=0.008 n=5+5)
DgemvLgSmNoTransIncN-8    19.8µs ± 0%   28.1µs ± 1%   +41.70%  (p=0.008 n=5+5)
DgemvLgSmTransInc1        16.8µs ± 1%   19.0µs ± 0%   +12.97%  (p=0.008 n=5+5)
DgemvLgSmTransInc1-8      16.9µs ± 0%   19.0µs ± 1%   +12.80%  (p=0.008 n=5+5)
DgemvLgSmTransIncN        20.3µs ± 0%   23.1µs ± 0%   +13.97%  (p=0.008 n=5+5)
DgemvLgSmTransIncN-8      20.3µs ± 1%   23.1µs ± 0%   +13.47%  (p=0.008 n=5+5)
DgemvSmLgNoTransInc1      8.59µs ± 1%  24.80µs ± 0%  +188.69%  (p=0.008 n=5+5)
DgemvSmLgNoTransInc1-8    8.61µs ± 0%  24.84µs ± 1%  +188.53%  (p=0.008 n=5+5)
DgemvSmLgNoTransIncN      11.2µs ± 1%   25.0µs ± 1%  +123.28%  (p=0.008 n=5+5)
DgemvSmLgNoTransIncN-8    11.3µs ± 3%   25.6µs ± 3%  +125.67%  (p=0.008 n=5+5)
DgemvSmLgTransInc1        12.2µs ± 3%   15.5µs ± 2%   +27.53%  (p=0.008 n=5+5)
DgemvSmLgTransInc1-8      12.0µs ± 1%   16.0µs ± 5%   +32.67%  (p=0.008 n=5+5)
DgemvSmLgTransIncN        13.7µs ± 0%   18.2µs ± 3%   +33.06%  (p=0.008 n=5+5)
DgemvSmLgTransIncN-8      13.6µs ± 0%   17.9µs ± 1%   +31.52%  (p=0.008 n=5+5)

josharian commented 8 years ago

Can you provide assembly dumps from tip and from ssa for some of the biggest slowdowns and/or simplest functions, as well as the corresponding function? That'd make digging into this a bit easier.

btracey commented 8 years ago

/cc @randall77 @tzneal @dr2chase @brtzsnr

Note that although the timeout is set to 60m, the benchmarks actually take only slightly more than 10m.

btracey commented 8 years ago

A smaller reproducer is the Ddot benchmark.
Code: https://github.com/gonum/blas/blob/master/native/level1double_ddot.go
Benchmark(s): https://github.com/gonum/blas/blob/master/native/level1doubleBench_auto_test.go#L38

brendan:~/Documents/mygo$ benchstat blasonesixddot.txt ssatipddot.txt 
name                   old time/op  new time/op   delta
DdotSmallBothUnitary   17.3ns ± 2%   26.6ns ± 6%   +53.46%  (p=0.008 n=5+5)
DdotSmallIncUni        21.4ns ± 1%   31.8ns ± 8%   +48.55%  (p=0.008 n=5+5)
DdotSmallUniInc        23.2ns ±10%   30.4ns ± 8%   +31.15%  (p=0.008 n=5+5)
DdotSmallBothInc       22.7ns ±12%   29.0ns ± 3%   +27.53%  (p=0.008 n=5+5)
DdotMediumBothUnitary  1.01µs ± 3%   2.68µs ± 8%  +165.29%  (p=0.008 n=5+5)
DdotMediumIncUni       1.38µs ± 2%   2.68µs ± 3%   +93.62%  (p=0.008 n=5+5)
DdotMediumUniInc       1.18µs ± 3%   2.56µs ± 3%  +116.45%  (p=0.008 n=5+5)
DdotMediumBothInc      1.32µs ± 8%   2.66µs ± 3%  +101.93%  (p=0.008 n=5+5)
DdotLargeBothUnitary   88.6µs ± 8%  262.6µs ± 2%  +196.36%  (p=0.008 n=5+5)
DdotLargeIncUni         169µs ± 1%    278µs ± 2%   +64.51%  (p=0.008 n=5+5)
DdotLargeUniInc         121µs ± 1%    252µs ± 1%  +108.23%  (p=0.008 n=5+5)
DdotLargeBothInc        237µs ± 0%    304µs ± 9%   +28.25%  (p=0.008 n=5+5)
DdotHugeBothUnitary    10.6ms ± 0%   27.6ms ± 3%  +161.88%  (p=0.016 n=4+5)
DdotHugeIncUni         25.8ms ± 2%   31.7ms ± 5%   +22.78%  (p=0.008 n=5+5)
DdotHugeUniInc         17.6ms ± 1%   28.7ms ± 4%   +63.27%  (p=0.008 n=5+5)
DdotHugeBothInc        32.9ms ± 0%   35.8ms ± 7%    +8.83%  (p=0.008 n=5+5)
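
For anyone who wants to reproduce the trend without checking out gonum, a standalone benchmark of roughly the following shape may be enough. This is a sketch under assumptions: ddotUnitary, benchDdot, sink, and the vector sizes are hypothetical stand-ins, not the gonum benchmark code. It uses testing.Benchmark so that it runs as an ordinary program.

```go
package main

import (
	"fmt"
	"testing"
)

// ddotUnitary is a stand-in for the pure Go inner loop under test.
func ddotUnitary(x, y []float64) (sum float64) {
	for i, v := range x {
		sum += y[i] * v
	}
	return sum
}

// sink keeps the result live so the compiler cannot discard the loop.
var sink float64

// benchDdot times ddotUnitary on two vectors of length n.
func benchDdot(n int) testing.BenchmarkResult {
	x := make([]float64, n)
	y := make([]float64, n)
	for i := range x {
		x[i] = float64(i)
		y[i] = float64(n - i)
	}
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = ddotUnitary(x, y)
		}
	})
}

func main() {
	// The sizes are illustrative, not the ones used by the gonum benchmarks.
	for _, n := range []int{10, 1000, 100000} {
		r := benchDdot(n)
		fmt.Printf("n=%-7d %d ns/op\n", n, r.NsPerOp())
	}
}
```

Building and running this once with each toolchain gives a rough side-by-side. For tables like the ones in this thread, save the `go test -bench` output from each toolchain to a file and pass both files to benchstat.
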

btracey commented 8 years ago

I'm not very experienced with assembler, but I think these are the most relevant outputs. It may be that the real dump needs to be done in the actual blas package, rather than just the inner loop.

brendan:~/Documents/mygo/src/github.com/gonum/internal/asm$ go version
go version go1.6 darwin/amd64
brendan:~/Documents/mygo/src/github.com/gonum/internal/asm$ go build -gcflags=-S -tags noasm ddot.go 
# command-line-arguments
"".DdotUnitary t=1 size=128 value=0 args=0x38 locals=0x0
    0x0000 00000 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    TEXT    "".DdotUnitary(SB), $0-56
    0x0000 00000 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    MOVQ    (TLS), CX
    0x0009 00009 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    CMPQ    SP, 16(CX)
    0x000d 00013 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    JLS 111
    0x000f 00015 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    NOP
    0x000f 00015 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    NOP
    0x000f 00015 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    MOVQ    "".y+32(FP), R9
    0x0014 00020 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    MOVQ    "".y+40(FP), DI
    0x0019 00025 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    FUNCDATA    $0, gclocals·71f75e7e2fe2878e818867fe3428bd87(SB)
    0x0019 00025 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    FUNCDATA    $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x0019 00025 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    XORPS   X3, X3
    0x001c 00028 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    MOVSD   X3, "".sum+56(FP)
    0x0022 00034 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   NOP
    0x0022 00034 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   MOVQ    "".x+8(FP), CX
    0x0027 00039 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   MOVQ    "".x+16(FP), SI
    0x002c 00044 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   MOVQ    "".x+24(FP), BX
    0x0031 00049 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   MOVQ    $0, AX
    0x0033 00051 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   CMPQ    AX, SI
    0x0036 00054 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   JGE $0, 103
    0x0038 00056 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   NOP
    0x0038 00056 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   MOVSD   (CX), X2
    0x003c 00060 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   CMPQ    AX, DI
    0x003f 00063 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   JCC $1, 104
    0x0041 00065 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   LEAQ    (R9)(AX*8), BX
    0x0045 00069 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   MOVSD   (BX), X0
    0x0049 00073 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   MULSD   X2, X0
    0x004d 00077 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   ADDSD   X3, X0
    0x0051 00081 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   MOVAPD  X0, X3
    0x0055 00085 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   MOVSD   X0, "".sum+56(FP)
    0x005b 00091 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   NOP
    0x005b 00091 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   ADDQ    $8, CX
    0x005f 00095 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   INCQ    AX
    0x0062 00098 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   CMPQ    AX, SI
    0x0065 00101 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   JLT $0, 56
    0x0067 00103 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   NOP
    0x0067 00103 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:13)   RET
    0x0068 00104 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   PCDATA  $0, $0
    0x0068 00104 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   CALL    runtime.panicindex(SB)
    0x006d 00109 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   UNDEF
    0x006f 00111 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   NOP
    0x006f 00111 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    CALL    runtime.morestack_noctxt(SB)
    0x0074 00116 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    JMP 0
    0x0000 65 48 8b 0c 25 00 00 00 00 48 3b 61 10 76 60 4c  eH..%....H;a.v`L
    0x0010 8b 4c 24 20 48 8b 7c 24 28 0f 57 db f2 0f 11 5c  .L$ H.|$(.W....\
    0x0020 24 38 48 8b 4c 24 08 48 8b 74 24 10 48 8b 5c 24  $8H.L$.H.t$.H.\$
    0x0030 18 31 c0 48 39 f0 7d 2f f2 0f 10 11 48 39 f8 73  .1.H9.}/....H9.s
    0x0040 27 49 8d 1c c1 f2 0f 10 03 f2 0f 59 c2 f2 0f 58  'I.........Y...X
    0x0050 c3 66 0f 28 d8 f2 0f 11 44 24 38 48 83 c1 08 48  .f.(....D$8H...H
    0x0060 ff c0 48 39 f0 7c d1 c3 e8 00 00 00 00 0f 0b e8  ..H9.|..........
    0x0070 00 00 00 00 eb 8a cc cc cc cc cc cc cc cc cc cc  ................
    rel 5+4 t=14 +0
    rel 105+4 t=6 runtime.panicindex+0
    rel 112+4 t=6 runtime.morestack_noctxt+0
"".DdotInc t=1 size=176 value=0 args=0x60 locals=0x0
    0x0000 00000 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   TEXT    "".DdotInc(SB), $0-96
    0x0000 00000 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    (TLS), CX
    0x0009 00009 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   CMPQ    SP, 16(CX)
    0x000d 00013 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   JLS 157
    0x0013 00019 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   NOP
    0x0013 00019 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   NOP
    0x0013 00019 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    "".n+56(FP), R13
    0x0018 00024 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    "".y+32(FP), R12
    0x001d 00029 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    "".y+40(FP), R11
    0x0022 00034 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    "".x+8(FP), R10
    0x0027 00039 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    "".x+16(FP), R9
    0x002c 00044 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    "".incX+64(FP), DI
    0x0031 00049 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    "".incY+72(FP), SI
    0x0036 00054 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    "".iy+88(FP), DX
    0x003b 00059 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    "".ix+80(FP), CX
    0x0040 00064 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   FUNCDATA    $0, gclocals·7f14b12e2041f9b568f9bbe12353a4a8(SB)
    0x0040 00064 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   FUNCDATA    $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x0040 00064 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   XORPS   X1, X1
    0x0043 00067 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVSD   X1, "".sum+96(FP)
    0x0049 00073 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   MOVQ    $0, AX
    0x004b 00075 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   CMPQ    R13, AX
    0x004e 00078 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   JLE $0, 142
    0x0050 00080 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVAPD  X1, X2
    0x0054 00084 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   CMPQ    DX, R11
    0x0057 00087 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   JCC $1, 150
    0x0059 00089 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   LEAQ    (R12)(DX*8), BX
    0x005d 00093 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVSD   (BX), X0
    0x0061 00097 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   CMPQ    CX, R9
    0x0064 00100 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   JCC $1, 143
    0x0066 00102 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   LEAQ    (R10)(CX*8), BX
    0x006a 00106 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVSD   (BX), X1
    0x006e 00110 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MULSD   X1, X0
    0x0072 00114 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   ADDSD   X2, X0
    0x0076 00118 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVAPD  X0, X1
    0x007a 00122 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVSD   X0, "".sum+96(FP)
    0x0080 00128 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   NOP
    0x0080 00128 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:19)   ADDQ    DI, CX
    0x0083 00131 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:19)   NOP
    0x0083 00131 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:20)   ADDQ    SI, DX
    0x0086 00134 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:20)   NOP
    0x0086 00134 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   INCQ    AX
    0x0089 00137 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   NOP
    0x0089 00137 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   CMPQ    R13, AX
    0x008c 00140 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   JGT $0, 80
    0x008e 00142 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:22)   RET
    0x008f 00143 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   PCDATA  $0, $0
    0x008f 00143 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   CALL    runtime.panicindex(SB)
    0x0094 00148 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   UNDEF
    0x0096 00150 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   PCDATA  $0, $0
    0x0096 00150 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   CALL    runtime.panicindex(SB)
    0x009b 00155 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   UNDEF
    0x009d 00157 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   NOP
    0x009d 00157 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   CALL    runtime.morestack_noctxt(SB)
    0x00a2 00162 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   JMP 0
    0x0000 65 48 8b 0c 25 00 00 00 00 48 3b 61 10 0f 86 8a  eH..%....H;a....
    0x0010 00 00 00 4c 8b 6c 24 38 4c 8b 64 24 20 4c 8b 5c  ...L.l$8L.d$ L.\
    0x0020 24 28 4c 8b 54 24 08 4c 8b 4c 24 10 48 8b 7c 24  $(L.T$.L.L$.H.|$
    0x0030 40 48 8b 74 24 48 48 8b 54 24 58 48 8b 4c 24 50  @H.t$HH.T$XH.L$P
    0x0040 0f 57 c9 f2 0f 11 4c 24 60 31 c0 49 39 c5 7e 3e  .W....L$`1.I9.~>
    0x0050 66 0f 28 d1 4c 39 da 73 3d 49 8d 1c d4 f2 0f 10  f.(.L9.s=I......
    0x0060 03 4c 39 c9 73 29 49 8d 1c ca f2 0f 10 0b f2 0f  .L9.s)I.........
    0x0070 59 c1 f2 0f 58 c2 66 0f 28 c8 f2 0f 11 44 24 60  Y...X.f.(....D$`
    0x0080 48 01 f9 48 01 f2 48 ff c0 49 39 c5 7f c2 c3 e8  H..H..H..I9.....
    0x0090 00 00 00 00 0f 0b e8 00 00 00 00 0f 0b e8 00 00  ................
    0x00a0 00 00 e9 59 ff ff ff cc cc cc cc cc cc cc cc cc  ...Y............
    rel 5+4 t=14 +0
    rel 144+4 t=6 runtime.panicindex+0
    rel 151+4 t=6 runtime.panicindex+0
    rel 158+4 t=6 runtime.morestack_noctxt+0
gclocals·33cdeccccebe80329f1fdbee7f5874cb t=8 dupok size=8 value=0
    0x0000 01 00 00 00 00 00 00 00                          ........
gclocals·71f75e7e2fe2878e818867fe3428bd87 t=8 dupok size=12 value=0
    0x0000 01 00 00 00 07 00 00 00 09 00 00 00              ............
gclocals·33cdeccccebe80329f1fdbee7f5874cb t=8 dupok size=8 value=0
    0x0000 01 00 00 00 00 00 00 00                          ........
gclocals·7f14b12e2041f9b568f9bbe12353a4a8 t=8 dupok size=12 value=0
    0x0000 01 00 00 00 0c 00 00 00 09 00 00 00              ............
"".DdotUnitary·f t=8 dupok size=8 value=0
    0x0000 00 00 00 00 00 00 00 00                          ........
    rel 0+8 t=1 "".DdotUnitary+0
"".DdotInc·f t=8 dupok size=8 value=0
    0x0000 00 00 00 00 00 00 00 00                          ........
    rel 0+8 t=1 "".DdotInc+0
runtime.gcbits.01 t=8 dupok size=1 value=0
    0x0000 01                                               .
go.string.hdr."[]float64" t=8 dupok size=16 value=0
    0x0000 00 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00  ................
    rel 0+8 t=1 go.string."[]float64"+0
go.string."[]float64" t=8 dupok size=16 value=0
    0x0000 5b 5d 66 6c 6f 61 74 36 34 00                    []float64.
type.[]float64 t=8 dupok size=72 value=0
    0x0000 18 00 00 00 00 00 00 00 08 00 00 00 00 00 00 00  ................
    0x0010 30 33 37 9c 00 08 08 17 00 00 00 00 00 00 00 00  037.............
    0x0020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    0x0030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    0x0040 00 00 00 00 00 00 00 00                          ........
    rel 24+8 t=1 runtime.algarray+272
    rel 32+8 t=1 runtime.gcbits.01+0
    rel 40+8 t=1 go.string.hdr."[]float64"+0
    rel 56+8 t=1 go.weak.type.*[]float64+0
    rel 64+8 t=1 type.float64+0
go.typelink.[]float64   []float64 t=8 dupok size=8 value=0
    0x0000 00 00 00 00 00 00 00 00                          ........
    rel 0+8 t=1 type.[]float64+0
brendan:~/Documents/mygo/src/github.com/gonum/internal/asm$ go version
go version devel +fb54e03 Thu Feb 25 07:10:07 2016 +0000 darwin/amd64
brendan:~/Documents/mygo/src/github.com/gonum/internal/asm$ go build -gcflags=-S -tags noasm ddot.go 
# command-line-arguments
"".DdotUnitary t=1 size=128 value=0 args=0x38 locals=0x0
    0x0000 00000 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    TEXT    "".DdotUnitary(SB), $0-56
    0x0000 00000 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    MOVQ    (TLS), CX
    0x0009 00009 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    CMPQ    SP, 16(CX)
    0x000d 00013 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    JLS 113
    0x000f 00015 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    NOP
    0x000f 00015 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    NOP
    0x000f 00015 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    FUNCDATA    $0, gclocals·71f75e7e2fe2878e818867fe3428bd87(SB)
    0x000f 00015 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    FUNCDATA    $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x000f 00015 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    XORPS   X0, X0
    0x0012 00018 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    MOVSD   X0, "".sum+56(FP)
    0x0018 00024 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    MOVQ    "".x+8(FP), AX
    0x001d 00029 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   MOVQ    $0, CX
    0x001f 00031 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   MOVQ    "".x+16(FP), DX
    0x0024 00036 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   CMPQ    CX, DX
    0x0027 00039 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   JGE $0, 105
    0x0029 00041 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   TESTB   AL, (AX)
    0x002b 00043 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   MOVSD   (AX), X0
    0x002f 00047 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    MOVSD   "".sum+56(FP), X1
    0x0035 00053 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   MOVQ    "".y+40(FP), BX
    0x003a 00058 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   CMPQ    CX, BX
    0x003d 00061 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   JCC $0, 106
    0x003f 00063 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   MOVQ    "".y+32(FP), BP
    0x0044 00068 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   MOVSD   (BP)(CX*8), X2
    0x004a 00074 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   MULSD   X2, X0
    0x004e 00078 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   ADDSD   X1, X0
    0x0052 00082 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   MOVSD   X0, "".sum+56(FP)
    0x0058 00088 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   ADDQ    $8, AX
    0x005c 00092 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   INCQ    CX
    0x005f 00095 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   MOVQ    "".x+16(FP), DX
    0x0064 00100 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   CMPQ    CX, DX
    0x0067 00103 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:10)   JLT $0, 41
    0x0069 00105 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:13)   RET
    0x006a 00106 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   PCDATA  $0, $0
    0x006a 00106 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   CALL    runtime.panicindex(SB)
    0x006f 00111 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   UNDEF
    0x0071 00113 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:11)   NOP
    0x0071 00113 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    CALL    runtime.morestack_noctxt(SB)
    0x0076 00118 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:9)    JMP 0
    0x0000 65 48 8b 0c 25 00 00 00 00 48 3b 61 10 76 62 0f  eH..%....H;a.vb.
    0x0010 57 c0 f2 0f 11 44 24 38 48 8b 44 24 08 31 c9 48  W....D$8H.D$.1.H
    0x0020 8b 54 24 10 48 39 d1 7d 40 84 00 f2 0f 10 00 f2  .T$.H9.}@.......
    0x0030 0f 10 4c 24 38 48 8b 5c 24 28 48 39 d9 73 2b 48  ..L$8H.\$(H9.s+H
    0x0040 8b 6c 24 20 f2 0f 10 54 cd 00 f2 0f 59 c2 f2 0f  .l$ ...T....Y...
    0x0050 58 c1 f2 0f 11 44 24 38 48 83 c0 08 48 ff c1 48  X....D$8H...H..H
    0x0060 8b 54 24 10 48 39 d1 7c c0 c3 e8 00 00 00 00 0f  .T$.H9.|........
    0x0070 0b e8 00 00 00 00 eb 88 cc cc cc cc cc cc cc cc  ................
    rel 5+4 t=14 +0
    rel 107+4 t=6 runtime.panicindex+0
    rel 114+4 t=6 runtime.morestack_noctxt+0
"".DdotInc t=1 size=160 value=0 args=0x60 locals=0x0
    0x0000 00000 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   TEXT    "".DdotInc(SB), $0-96
    0x0000 00000 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    (TLS), CX
    0x0009 00009 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   CMPQ    SP, 16(CX)
    0x000d 00013 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   JLS 148
    0x0013 00019 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   NOP
    0x0013 00019 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   NOP
    0x0013 00019 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   FUNCDATA    $0, gclocals·7f14b12e2041f9b568f9bbe12353a4a8(SB)
    0x0013 00019 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   FUNCDATA    $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x0013 00019 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   XORPS   X0, X0
    0x0016 00022 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVSD   X0, "".sum+96(FP)
    0x001c 00028 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    "".ix+80(FP), AX
    0x0021 00033 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVQ    "".iy+88(FP), CX
    0x0026 00038 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   MOVQ    $0, DX
    0x0028 00040 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   MOVQ    "".n+56(FP), BX
    0x002d 00045 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   CMPQ    DX, BX
    0x0030 00048 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   JGE $0, 140
    0x0032 00050 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   MOVSD   "".sum+96(FP), X0
    0x0038 00056 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVQ    "".y+40(FP), BP
    0x003d 00061 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   CMPQ    CX, BP
    0x0040 00064 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   JCC $0, 141
    0x0042 00066 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVQ    "".y+32(FP), SI
    0x0047 00071 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVSD   (SI)(CX*8), X1
    0x004c 00076 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVQ    "".x+16(FP), DI
    0x0051 00081 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   CMPQ    AX, DI
    0x0054 00084 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   JCC $0, 141
    0x0056 00086 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVQ    "".x+8(FP), R8
    0x005b 00091 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVSD   (R8)(AX*8), X2
    0x0061 00097 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MULSD   X2, X1
    0x0065 00101 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   ADDSD   X1, X0
    0x0069 00105 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   MOVSD   X0, "".sum+96(FP)
    0x006f 00111 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   INCQ    DX
    0x0072 00114 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:19)   MOVQ    "".incX+64(FP), R9
    0x0077 00119 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:19)   ADDQ    R9, AX
    0x007a 00122 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:20)   MOVQ    "".incY+72(FP), R10
    0x007f 00127 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:20)   ADDQ    R10, CX
    0x0082 00130 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   MOVQ    "".n+56(FP), BX
    0x0087 00135 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   CMPQ    DX, BX
    0x008a 00138 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:17)   JLT $0, 50
    0x008c 00140 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:22)   RET
    0x008d 00141 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   PCDATA  $0, $0
    0x008d 00141 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   CALL    runtime.panicindex(SB)
    0x0092 00146 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   UNDEF
    0x0094 00148 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:18)   NOP
    0x0094 00148 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   CALL    runtime.morestack_noctxt(SB)
    0x0099 00153 (/Users/brendan/Documents/mygo/src/github.com/gonum/internal/asm/ddot.go:16)   JMP 0
    0x0000 65 48 8b 0c 25 00 00 00 00 48 3b 61 10 0f 86 81  eH..%....H;a....
    0x0010 00 00 00 0f 57 c0 f2 0f 11 44 24 60 48 8b 44 24  ....W....D$`H.D$
    0x0020 50 48 8b 4c 24 58 31 d2 48 8b 5c 24 38 48 39 da  PH.L$X1.H.\$8H9.
    0x0030 7d 5a f2 0f 10 44 24 60 48 8b 6c 24 28 48 39 e9  }Z...D$`H.l$(H9.
    0x0040 73 4b 48 8b 74 24 20 f2 0f 10 0c ce 48 8b 7c 24  sKH.t$ .....H.|$
    0x0050 10 48 39 f8 73 37 4c 8b 44 24 08 f2 41 0f 10 14  .H9.s7L.D$..A...
    0x0060 c0 f2 0f 59 ca f2 0f 58 c1 f2 0f 11 44 24 60 48  ...Y...X....D$`H
    0x0070 ff c2 4c 8b 4c 24 40 4c 01 c8 4c 8b 54 24 48 4c  ..L.L$@L..L.T$HL
    0x0080 01 d1 48 8b 5c 24 38 48 39 da 7c a6 c3 e8 00 00  ..H.\$8H9.|.....
    0x0090 00 00 0f 0b e8 00 00 00 00 e9 62 ff ff ff cc cc  ..........b.....
    rel 5+4 t=14 +0
    rel 142+4 t=6 runtime.panicindex+0
    rel 149+4 t=6 runtime.morestack_noctxt+0
gclocals·33cdeccccebe80329f1fdbee7f5874cb t=8 dupok size=8 value=0
    0x0000 01 00 00 00 00 00 00 00                          ........
gclocals·71f75e7e2fe2878e818867fe3428bd87 t=8 dupok size=12 value=0
    0x0000 01 00 00 00 07 00 00 00 09 00 00 00              ............
gclocals·33cdeccccebe80329f1fdbee7f5874cb t=8 dupok size=8 value=0
    0x0000 01 00 00 00 00 00 00 00                          ........
gclocals·7f14b12e2041f9b568f9bbe12353a4a8 t=8 dupok size=12 value=0
    0x0000 01 00 00 00 0c 00 00 00 09 00 00 00              ............
"".DdotUnitary·f t=8 dupok size=8 value=0
    0x0000 00 00 00 00 00 00 00 00                          ........
    rel 0+8 t=1 "".DdotUnitary+0
"".DdotInc·f t=8 dupok size=8 value=0
    0x0000 00 00 00 00 00 00 00 00                          ........
    rel 0+8 t=1 "".DdotInc+0
runtime.gcbits.01 t=8 dupok size=1 value=0
    0x0000 01                                               .
go.string.hdr."[]float64" t=8 dupok size=16 value=0
    0x0000 00 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00  ................
    rel 0+8 t=1 go.string."[]float64"+0
go.string."[]float64" t=8 dupok size=10 value=0
    0x0000 5b 5d 66 6c 6f 61 74 36 34 00                    []float64.
type.[]float64 t=8 dupok size=72 value=0
    0x0000 18 00 00 00 00 00 00 00 08 00 00 00 00 00 00 00  ................
    0x0010 30 33 37 9c 00 08 08 17 00 00 00 00 00 00 00 00  037.............
    0x0020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    0x0030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    0x0040 00 00 00 00 00 00 00 00                          ........
    rel 24+8 t=1 runtime.algarray+0
    rel 32+8 t=1 runtime.gcbits.01+0
    rel 40+8 t=1 go.string.hdr."[]float64"+0
    rel 56+8 t=1 go.weak.type.*[]float64+0
    rel 64+8 t=1 type.float64+0
go.typelink.[]float64   []float64 t=8 dupok size=8 value=0
    0x0000 00 00 00 00 00 00 00 00                          ........
    rel 0+8 t=1 type.[]float64+0

randall77 commented 8 years ago

Looks like it's mostly a problem of not registerizing a few variables before the loop starts. I'll have to think about how to handle this one. There is also a nil check that is not eliminated; I've got a simple fix for that.

I'm surprised the times are that much slower. It is only a few extra loads from the stack frame per iteration.
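
To relate the listing above to source, here is a sketch of a strided dot product with the same shape as the compiled loop. The signature is an assumption inferred from the frame offsets in the dump (x+8, y+32, n+56, incX+64, incY+72, ix+80, iy+88, sum+96) and is not copied from the gonum source; the lower-case name `ddotInc` is hypothetical. In the listing, each iteration reloads n, incX, incY and the slice headers from the frame, and stores sum back to its result slot.

```go
package main

import "fmt"

// ddotInc is a reconstruction for illustration only (assumed signature).
// It computes the sum of x[ix+k*incX] * y[iy+k*incY] for k in [0, n).
func ddotInc(x, y []float64, n, incX, incY, ix, iy uintptr) (sum float64) {
	for i := 0; i < int(n); i++ {
		// Two bounds-checked loads, one multiply, one add:
		// the MOVSD/MULSD/ADDSD sequence in the dump.
		sum += y[iy] * x[ix]
		ix += incX
		iy += incY
	}
	return
}

func main() {
	x := []float64{1, 0, 2, 0, 3} // read with stride 2
	y := []float64{4, 5, 6}       // read with stride 1
	fmt.Println(ddotInc(x, y, 3, 2, 1, 0, 0)) // 1*4 + 2*5 + 3*6
}
```

The values that stay constant across iterations here (n, incX, incY, and the slice pointers and lengths) are the ones that would ideally be held in registers for the whole loop.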

gopherbot commented 8 years ago

CL https://golang.org/cl/19923 mentions this issue.

randall77 commented 8 years ago

Looks like almost all of the lost performance is caused by not SSA-ifying PARAMOUT (approximately, named return) values. You can test this by changing

func DdotUnitary(x, y []float64) (sum float64) {
    for i, v := range x {
        sum += y[i] * v
    }
    return
}

to

func DdotUnitary(x, y []float64) float64 {
    sum := 0.0
    for i, v := range x {
        sum += y[i] * v
    }
    return sum
}

I already have a CL partially coded up to fix this. I'll increase its priority.

For posterity, to reproduce:

go get github.com/gonum/blas
go get github.com/gonum/floats
cd $GOPATH/src/github.com/gonum/blas/native
go test -test.bench=Ddot -tags noasm
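
The named-result versus local-variable comparison above can also be tried in isolation, outside the gonum tree. The following is a minimal sketch; the names `dotNamed` and `dotLocal` are hypothetical, and the two bodies mirror the two DdotUnitary variants shown above. Both must return the same value; only the generated code differs.

```go
package main

import "fmt"

// dotNamed accumulates directly into a named result,
// the form that was not being SSA-ified.
func dotNamed(x, y []float64) (sum float64) {
	for i, v := range x {
		sum += y[i] * v
	}
	return
}

// dotLocal accumulates into an ordinary local and returns it explicitly.
func dotLocal(x, y []float64) float64 {
	sum := 0.0
	for i, v := range x {
		sum += y[i] * v
	}
	return sum
}

func main() {
	x := []float64{1, 2, 3, 4}
	y := []float64{5, 6, 7, 8}
	fmt.Println(dotNamed(x, y), dotLocal(x, y)) // prints "70 70"
}
```

Wrapping each function in a standard `func BenchmarkXxx(b *testing.B)` loop and running `go test -bench .` under both toolchains should show whether the gap between the two forms is still present.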

gopherbot commented 8 years ago

CL https://golang.org/cl/19988 mentions this issue.

btracey commented 8 years ago

Definitely much better than it was. There are still some regressions:

brendan:~$ benchstat goonesix.txt ssatip.txt 
name                   old time/op  new time/op  delta
DdotSmallBothUnitary   18.7ns ± 1%  23.5ns ± 1%  +25.67%  (p=0.016 n=4+5)
DdotSmallIncUni        23.4ns ± 1%  29.4ns ± 1%  +25.58%  (p=0.008 n=5+5)
DdotSmallUniInc        22.7ns ± 1%  28.1ns ± 1%  +23.79%  (p=0.008 n=5+5)
DdotSmallBothInc       22.5ns ± 0%  27.9ns ± 2%  +24.09%  (p=0.016 n=4+5)
DdotMediumBothUnitary   924ns ± 1%   933ns ± 1%   +1.00%  (p=0.024 n=5+5)
DdotMediumIncUni       1.28µs ± 2%  1.77µs ± 1%  +37.85%  (p=0.008 n=5+5)
DdotMediumUniInc       1.23µs ± 0%  1.53µs ± 0%  +24.46%  (p=0.008 n=5+5)
DdotMediumBothInc      1.33µs ± 1%  1.83µs ± 1%  +37.53%  (p=0.008 n=5+5)
DdotLargeBothUnitary   93.7µs ± 1%  92.6µs ± 1%     ~     (p=0.095 n=5+5)
DdotLargeIncUni         200µs ± 1%   244µs ± 1%  +22.01%  (p=0.008 n=5+5)
DdotLargeUniInc         135µs ± 1%   174µs ± 1%  +28.88%  (p=0.008 n=5+5)
DdotLargeBothInc        275µs ± 1%   302µs ± 1%   +9.75%  (p=0.008 n=5+5)
DdotHugeBothUnitary    11.7ms ± 1%  11.4ms ± 0%   -2.59%  (p=0.008 n=5+5)
DdotHugeIncUni         28.6ms ± 1%  31.4ms ± 1%   +9.78%  (p=0.008 n=5+5)
DdotHugeUniInc         19.8ms ± 1%  23.0ms ± 1%  +16.23%  (p=0.008 n=5+5)
DdotHugeBothInc        37.7ms ± 1%  37.6ms ± 3%     ~     (p=0.421 n=5+5)

gopherbot commented 8 years ago

CL https://golang.org/cl/20151 mentions this issue.

btracey commented 8 years ago

Comparison with go version devel +c63dbd8 Thu Mar 10 18:35:10 2016 +0000 darwin/amd64

brendan:~/Documents/mygo$ benchstat blasonesixddot.txt ssatipddot.txt 
name                     old time/op  new time/op  delta
DdotSmallBothUnitary-8   17.6ns ± 1%  15.6ns ± 2%  -11.28%  (p=0.008 n=5+5)
DdotSmallIncUni-8        21.9ns ± 1%  21.9ns ± 1%     ~     (p=0.952 n=5+5)
DdotSmallUniInc-8        21.2ns ± 1%  20.2ns ± 0%   -4.54%  (p=0.000 n=5+4)
DdotSmallBothInc-8       21.1ns ± 0%  20.8ns ± 1%   -1.42%  (p=0.016 n=5+5)
DdotMediumBothUnitary-8   851ns ± 1%   843ns ± 1%   -1.01%  (p=0.032 n=5+5)
DdotMediumIncUni-8       1.17µs ± 1%  0.95µs ± 0%  -18.32%  (p=0.008 n=5+5)
DdotMediumUniInc-8       1.12µs ± 0%  0.86µs ± 1%  -22.91%  (p=0.008 n=5+5)
DdotMediumBothInc-8      1.21µs ± 1%  0.99µs ± 2%  -18.72%  (p=0.008 n=5+5)
DdotLargeBothUnitary-8   85.9µs ± 1%  83.0µs ± 1%   -3.33%  (p=0.008 n=5+5)
DdotLargeIncUni-8         169µs ± 1%   154µs ± 1%   -8.97%  (p=0.008 n=5+5)
DdotLargeUniInc-8         121µs ± 1%   106µs ± 1%  -11.99%  (p=0.008 n=5+5)
DdotLargeBothInc-8        241µs ± 1%   230µs ± 1%   -4.26%  (p=0.008 n=5+5)
DdotHugeBothUnitary-8    10.6ms ± 1%  10.1ms ± 1%   -4.59%  (p=0.008 n=5+5)
DdotHugeIncUni-8         25.8ms ± 1%  25.6ms ± 3%     ~     (p=0.151 n=5+5)
DdotHugeUniInc-8         17.7ms ± 1%  16.7ms ± 1%   -5.81%  (p=0.016 n=5+4)
DdotHugeBothInc-8        33.0ms ± 0%  33.2ms ± 1%     ~     (p=0.151 n=5+5)