math: Lgamma got slower from Go 1.6 to Go 1.7

mbulatova commented 8 years ago

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (`go version`)?

go1.6.3, go version devel +8086e7c

What operating system and processor architecture are you using (`go env`)?

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build584510834=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

What did you do?

Investigating slowdown of Lgamma benchmark. 1.6 vs 1.7 gives: name old time/op new time/op delta Lgamma-4 11.9ns ± 0% 14.2ns ± 0% +19.33% (p=0.000 n=29+29)

It might be compiler issue(code not changed). Looking into objdump of lgamma function for both versions I see different instruction order and registers used: In 1.7:

REPNE MOVSD_XMM 0x14a4b5(IP), X3
REPNE MOVSD_XMM 0x14a4bd(IP), X5
REPNE MOVSD_XMM 0x14a4c5(IP), X6
REPNE MOVSD_XMM 0x14a4cd(IP), X7
REPNE MOVSD_XMM 0x14a4d4(IP), X8
REPNE MOVSD_XMM 0x14a4db(IP), X9
REPNE MULSD X4, X9
REPNE ADDSD X9, X8
REPNE MULSD X4, X8
REPNE ADDSD X8, X7
REPNE MULSD X4, X7
REPNE ADDSD X7, X6
REPNE MULSD X4, X6
...

In 1.6:

REPNE MOVSD_XMM 0x1f94c0(IP), X0
REPNE MOVSD_XMM 0x1f94c0(IP), X1
REPNE MULSD X3, X1
REPNE ADDSD X1, X0
REPNE MULSD X3, X0
REPNE MOVSD_XMM 0x1f949c(IP), X1
REPNE ADDSD X1, X0
REPNE MULSD X3, X0
REPNE MOVSD_XMM 0x1f9484(IP), X1
REPNE ADDSD X1, X0
REPNE MULSD X3, X0
...

Tried force scheduler to generate instruction order like in 1.6 (for that expanded all expressions like p1 := _lgamT[0] + w_(_lgamT[3]+w_(_lgamT[6]+w_(_lgamT[9]+w__lgamT[12]))) with tmp variables). But result was still bad: name old time/op new time/op delta Lgamma-4 11.9ns ± 0% 13.1ns ± 0% +10.08% (p=0.000 n=29+19)

Also tried to generate random registers - no much effect: name old time/op new time/op delta Lgamma-4 11.9ns ± 0% 13.5ns ± 0% +13.45% (p=0.000 n=29+10)

randall77 commented 8 years ago

@TocarIP

rsc commented 8 years ago

/cc @cherrymui

josharian commented 7 years ago

Confirmed that there's been no change since 1.7, although the slowdown from 1.6 is less pronounced for me:

$ benchstat 16 17
name      old time/op  new time/op  delta
Lgamma-8  21.1ns ± 5%  22.7ns ±11%  +7.48%  (p=0.000 n=46+50)
$ benchstat 17 tip
name      old time/op  new time/op  delta
Lgamma-8  22.7ns ±11%  22.7ns ±10%   ~     (p=0.684 n=50+50)
$ benchstat 16 tip
name      old time/op  new time/op  delta
Lgamma-8  21.1ns ± 5%  22.7ns ±10%  +7.18%  (p=0.000 n=46+50)

cherrymui commented 7 years ago

For record, I tried CL https://go-review.googlesource.com/c/31656/ to make loads and computations interleaving, but it doesn't help on speed.

aregm commented 7 years ago

I have looked into the internals of the slow case and looks like there is a problem with register spill/fill caused by a call in the hot loop here: 46db60: movsd %xmm2,0x40(%rsp) <--------------- 46db66: movsd 0x10368a(%rip),%xmm3 # 5711f8 <$f64.3b90000000000000> 46db6d: 00 46db6e: ucomisd %xmm2,%xmm3 46db72: jbe 46dbc5 <math.Lgamma+0xe5> ... 6db8b: callq 471150 <math.Log> <---------------- and SinPi later ... 46dbc5: ucomisd %xmm0,%xmm1 46dbc9: jbe 46e46f <math.Lgamma+0x98f> ... 46e46f: xor %ecx,%ecx 46e471: xorps %xmm0,%xmm0 46e474: mov $0x1,%rax 46e47b: jmpq 46dc62 <math.Lgamma+0x182> ... 46dc62: mov %cl,0x16(%rsp) 46dc66: movsd %xmm0,0x30(%rsp) 46dc6c: mov %rax,0x38(%rsp) 46dc71: movsd 0x40(%rsp),%xmm2 <---------------

rsc commented 7 years ago

Hi @aregm, thanks very much for looking at this. I tried to reproduce this result and I can't. If I run

go get -u rsc.io/tmp/gammabench
cd $GOPATH/src/rsc.io/tmp/gammabench
go test -bench . -count 5

then I get:

BenchmarkLgamma16-8     100000000           13.4 ns/op
BenchmarkLgamma16-8     100000000           13.4 ns/op
BenchmarkLgamma16-8     100000000           13.4 ns/op
BenchmarkLgamma16-8     100000000           13.5 ns/op
BenchmarkLgamma16-8     100000000           13.4 ns/op
BenchmarkLgamma17-8     100000000           14.8 ns/op
BenchmarkLgamma17-8     100000000           14.8 ns/op
BenchmarkLgamma17-8     100000000           14.8 ns/op
BenchmarkLgamma17-8     100000000           14.8 ns/op
BenchmarkLgamma17-8     100000000           14.8 ns/op
BenchmarkLgammaZZZ-8    100000000           14.7 ns/op
BenchmarkLgammaZZZ-8    100000000           14.6 ns/op
BenchmarkLgammaZZZ-8    100000000           14.6 ns/op
BenchmarkLgammaZZZ-8    100000000           14.6 ns/op
BenchmarkLgammaZZZ-8    100000000           14.8 ns/op

That directory has the assembly equivalents of the Go 1.6, Go 1.7, and Go master (ZZZ) compiler output for the math.Lgamma function (lgamma16.s, lgamma17.s, lgammazzz.s). The idea was to be able to modify the assembly directly, to be able to experiment to test hypotheses about the slowdown without having to coerce the compiler into making the changes you want to try.

In the Go 1.6 assembly, the spill of the temporary holding lgamma happens before math.Log, instead of above the switch, as you identified. That is, Go 1.6 does:

    DIVSD   X4, X1
    ADDSD   X1, X0
    MOVAPD  X0, X5
    MOVSD   X0, lgamma+8(FP)
    MOVSD   $(1.0), X1
    CMPQ    BX, $4
    JGT _1284
    CMPQ    BX, $3
    JNE _1252
_1189:
    MOVAPD  X1, X2
    MOVSD   $(2.0), X1
    ADDSD   X3, X1
    MULSD   X2, X1
    MOVSD   X5, autotmp_0310-8(SP)
    MOVSD   X1, (SP)
    PCDATA  $0, $0
    CALL    math·Log(SB)
    MOVSD   8(SP), X2
    MOVSD   autotmp_0310-8(SP), X5
    ADDSD   X2, X5
    MOVSD   X5, lgamma+8(FP)
    NOP

while current master does (Go 1.7 is similar):

    DIVSD   X1, X2
    ADDSD   X3, X2
    MOVSD   X2, autotmp_419-16(SP)
    CMPQ    DX, $4
    JGT _1695
    CMPQ    DX, $3
    JNE _1658
    MOVSD   $(1.0), X1
_1604:
    ADDSD   X5, X4
    MULSD   X1, X4
    MOVSD   X4, (SP)
    PCDATA  $0, $0
    CALL    math·Log(SB)
    MOVSD autotmp_419-16(SP), X0
    ADDSD   8(SP), X0

It's true that autotmp_419 is being spilled unnecessarily in the case that the switch doesn't end up in a case that eventually calls Log. But the spll in master to autotmp_419 replaces the (possibly also unnecessary) spill to lgamma+0(FP) in Go 1.6, so there aren't extra spills compared to Go 1.6, at least not in the hot code.

Even so, I tried changing the lgammazzz.s assembly to move the spill:

    DIVSD   X1, X2
    ADDSD   X3, X2
//old   MOVSD   X2, autotmp_419-16(SP)
    CMPQ    DX, $4
    JGT _1695
    CMPQ    DX, $3
    JNE _1658
    MOVSD   $(1.0), X1
_1604:
    ADDSD   X5, X4
    MULSD   X1, X4
    MOVSD   X4, (SP)
    PCDATA  $0, $0
    MOVSD   X2, autotmp_419-16(SP) // new
    CALL    math·Log(SB)
    MOVSD autotmp_419-16(SP), X0
    ADDSD   8(SP), X0

That cut BenchmarkLgammaZZZ from 14.6ns to 14.5ns, so it does help by 0.1ns but it doesn't seem to be the major contributor for the 1.2ns gap between Go 1.6 and master.

aregm commented 7 years ago

There are two main causes for this degradation: first in bad case the register allocator is not keeping X2 register live enough and does redundant spills in the critical path, which is: pc101 pc2246 pc123 pc224 pc2231 pc378 pc432 pc448 pc1393

In the good case, it keeps X2 alive throughout the critical path and spills it just before ret. Second, the bad case has about 5% more code execution which affects the caches and contributes to the rest of degradation. I have used temp register to get back some of the degradations and it gives more than half back.

lgamma_bad.s.txt lgamma_good.s.txt

golang / go