golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.83k stars 17.65k forks source link

cmd/compile: BLAS DdotSmall and Idamax regressions #14917

Closed btracey closed 8 years ago

btracey commented 8 years ago

Idamax and Ddot small are showing regressions between 1.6 and gotip.

brendan:~/Documents/mygo/src/github.com/gonum/blas/native$ go version
go version devel +7177cb9 Tue Mar 22 17:30:30 2016 +0000 darwin/amd64

Reproducing:

go get github.com/gonum/blas
go get github.com/gonum/floats
cd $GOPATH/src/github.com/gonum/blas/native
go test -bench DdotSmall -tags=noasm
go test -bench Ida -tags=noasm

While the Ddot regressions look small, there had been improvements as of the resolution of #14511 , so it's a significant regression since then. Ddot shows improvement between 1.6 and tip for the benchmarks with larger vector sizes (slices of size 100+ instead of 10).

DdotSmallBothUnitary-8    17.9ns ± 2%  18.9ns ± 1%   +6.04%  (p=0.000 n=10+10)
DdotSmallIncUni-8         22.1ns ± 1%  23.1ns ± 1%   +4.87%   (p=0.000 n=10+9)
DdotSmallUniInc-8         21.6ns ± 1%  21.6ns ± 1%     ~      (p=0.455 n=10+9)
DdotSmallBothInc-8        21.6ns ± 1%  22.1ns ± 1%   +2.73%    (p=0.000 n=8+8)
IdamaxSmallUnitaryInc-8   31.6ns ±10%  36.5ns ± 4%  +15.46%   (p=0.000 n=10+8)
IdamaxSmallPosInc-8       27.5ns ±19%  44.0ns ±11%  +59.75%  (p=0.000 n=10+10)
IdamaxMediumUnitaryInc-8  1.59µs ± 4%  1.92µs ± 5%  +20.81%  (p=0.000 n=10+10)
IdamaxMediumPosInc-8      1.83µs ± 1%  2.17µs ± 2%  +18.58%   (p=0.000 n=9+10)
IdamaxLargeUnitaryInc-8    145µs ± 2%   192µs ± 2%  +32.64%  (p=0.000 n=10+10)
IdamaxLargePosInc-8        195µs ± 2%   218µs ± 2%  +12.06%  (p=0.000 n=10+10)
IdamaxHugeUnitaryInc-8    14.8ms ± 1%  19.8ms ± 1%  +33.49%  (p=0.000 n=10+10)
IdamaxHugePosInc-8        26.9ms ± 3%  27.9ms ± 1%   +3.62%   (p=0.000 n=10+9)

/cc @randall77 @tzneal @dr2chase @josharian @brtzsnr

OneOfOne commented 8 years ago

I wonder if this is related to #14920, try to run your benchmark against d37d3bdcfc429168adac5bf046172fd9c07bfdc2 (the last good commit for me).

btracey commented 8 years ago

14920 is the same issue for Ddot. That commit fixes the problem.

The Idamax cases are a different issue, there is still significant regression even at d37d3bd

randall77 commented 8 years ago

I'm pretty sure this has nothing to do with #14920. At least, the fix for that issue certainly doesn't apply here. It looks like for ddot, anyway, the additional overhead (compared to c03ed49, about 20 days ago and just after the #14511 fix) looks like an unnecessary nil check and the associated load:

before:

    0x0014 00020 (ddot.go:5)    LEAQ    8(DX), BP
    0x0018 00024 (ddot.go:5)    INCQ    BX
    0x001b 00027 (ddot.go:5)    MOVQ    (DX), DX
    0x001e 00030 (ddot.go:6)    ADDQ    DX, CX
    0x0021 00033 (ddot.go:5)    MOVQ    BP, DX
    0x0024 00036 (ddot.go:5)    CMPQ    BX, AX
    0x0027 00039 (ddot.go:5)    JLT $0, 20

after:

    0x0014 00020 (ddot.go:5)    MOVQ    "".a+8(FP), BP
    0x0019 00025 (ddot.go:5)    TESTB   AL, (BP)
    0x001c 00028 (ddot.go:5)    MOVQ    (AX), SI
    0x001f 00031 (ddot.go:5)    ADDQ    $8, AX
    0x0023 00035 (ddot.go:5)    INCQ    BX
    0x0026 00038 (ddot.go:6)    ADDQ    DX, SI
    0x0029 00041 (ddot.go:6)    MOVQ    SI, DX
    0x002c 00044 (ddot.go:5)    CMPQ    BX, CX
    0x002f 00047 (ddot.go:5)    JLT $0, 20

The nil check in the "before" case can be removed because we have a nil check of &a[i] followed by a load from that pointer. The nil check is effectively merged into the load so you don't see it in the generated assembly.

The nil check in the "after" case isn't being removed because it is checking the original slice ptr &a[0] each time. The load is still from &a[i] so the merging doesn't happen and the nil check must be explicit.

In either case, it is kind of dumb to do nil checks on the pointer loaded from a slice. If the bounds check passes, the pointer is guaranteed to be non-nil. I'll see if I can whip up a fix. It might also be instructive to figure out what changed in the last 20 days about nil checks, I'm not sure what caused it.

I don't see any difference in the inner loop between d37d3bd and tip, so I'm not sure why @btracey you saw performance improve at d37d3bd.

randall77 commented 8 years ago

Looks like the culprit is https://go-review.googlesource.com/20307 @brtzsnr It rewrites the nil check to be on &a[0] instead of &a[i]. It doesn't seem to be helping any in this case. (Lifting the nil check out of the loop altogether is an admirable goal, and this rewrite is one step on the way. But without lifting it out of the loop, the rewrite is just hurting us.)

brtzsnr commented 8 years ago

I'll send a CL to remove that rewrite and I'll figure it out how to remove the NilChecks completely. Now that we have use counts should be easier.

gopherbot commented 8 years ago

CL https://golang.org/cl/21040 mentions this issue.

btracey commented 8 years ago

That change may have fixed Ddot, but it does not fix the Idamax problem. Comparing go1.6 with go version devel +7e88826 Mon Mar 28 14:10:21 2016 +0000 darwin/amd64

IdamaxSmallUnitaryInc-8   32.4ns ± 7%  39.1ns ± 6%  +20.80%  (p=0.008 n=5+5)
IdamaxSmallPosInc-8       28.7ns ±11%  41.1ns ±11%  +43.24%  (p=0.008 n=5+5)
IdamaxMediumUnitaryInc-8  1.58µs ± 2%  2.03µs ± 2%  +27.83%  (p=0.008 n=5+5)
IdamaxMediumPosInc-8      1.86µs ± 2%  2.38µs ±11%  +27.93%  (p=0.008 n=5+5)
IdamaxLargeUnitaryInc-8    150µs ± 2%   195µs ± 1%  +30.23%  (p=0.008 n=5+5)
IdamaxLargePosInc-8        202µs ± 1%   241µs ± 2%  +19.10%  (p=0.008 n=5+5)
IdamaxHugeUnitaryInc-8    15.1ms ± 1%  21.0ms ± 2%  +39.67%  (p=0.008 n=5+5)
IdamaxHugePosInc-8        27.9ms ± 3%  30.0ms ± 2%   +7.38%  (p=0.008 n=5+5)

Should I open a different issue?

randall77 commented 8 years ago

Sorry about closing this prematurely. But yes, please open a different issue. There are two different underlying problems so it would be better to have them in different issues.

btracey commented 8 years ago

No problem. Do you have a preferred title?

randall77 commented 8 years ago

cmd/compile: BLAS Idamax regression