Open mbulatova opened 8 years ago
@TocarIP
/cc @cherrymui
Confirmed that there's been no change since 1.7, although the slowdown from 1.6 is less pronounced for me:
$ benchstat 16 17
name old time/op new time/op delta
Lgamma-8 21.1ns ± 5% 22.7ns ±11% +7.48% (p=0.000 n=46+50)
$ benchstat 17 tip
name old time/op new time/op delta
Lgamma-8 22.7ns ±11% 22.7ns ±10% ~ (p=0.684 n=50+50)
$ benchstat 16 tip
name old time/op new time/op delta
Lgamma-8 21.1ns ± 5% 22.7ns ±10% +7.18% (p=0.000 n=46+50)
For record, I tried CL https://go-review.googlesource.com/c/31656/ to make loads and computations interleaving, but it doesn't help on speed.
I have looked into the internals of the slow case and looks like there is a problem with register spill/fill caused by a call in the hot loop here:
46db60: movsd %xmm2,0x40(%rsp) <---------------
46db66: movsd 0x10368a(%rip),%xmm3 # 5711f8 <$f64.3b90000000000000>
46db6d: 00
46db6e: ucomisd %xmm2,%xmm3
46db72: jbe 46dbc5 <math.Lgamma+0xe5>
...
6db8b: callq 471150 <math.Log> <---------------- and SinPi later
...
46dbc5: ucomisd %xmm0,%xmm1
46dbc9: jbe 46e46f <math.Lgamma+0x98f>
...
46e46f: xor %ecx,%ecx
46e471: xorps %xmm0,%xmm0
46e474: mov $0x1,%rax
46e47b: jmpq 46dc62 <math.Lgamma+0x182>
...
46dc62: mov %cl,0x16(%rsp)
46dc66: movsd %xmm0,0x30(%rsp)
46dc6c: mov %rax,0x38(%rsp)
46dc71: movsd 0x40(%rsp),%xmm2 <---------------
Hi @aregm, thanks very much for looking at this. I tried to reproduce this result and I can't. If I run
go get -u rsc.io/tmp/gammabench
cd $GOPATH/src/rsc.io/tmp/gammabench
go test -bench . -count 5
then I get:
BenchmarkLgamma16-8 100000000 13.4 ns/op
BenchmarkLgamma16-8 100000000 13.4 ns/op
BenchmarkLgamma16-8 100000000 13.4 ns/op
BenchmarkLgamma16-8 100000000 13.5 ns/op
BenchmarkLgamma16-8 100000000 13.4 ns/op
BenchmarkLgamma17-8 100000000 14.8 ns/op
BenchmarkLgamma17-8 100000000 14.8 ns/op
BenchmarkLgamma17-8 100000000 14.8 ns/op
BenchmarkLgamma17-8 100000000 14.8 ns/op
BenchmarkLgamma17-8 100000000 14.8 ns/op
BenchmarkLgammaZZZ-8 100000000 14.7 ns/op
BenchmarkLgammaZZZ-8 100000000 14.6 ns/op
BenchmarkLgammaZZZ-8 100000000 14.6 ns/op
BenchmarkLgammaZZZ-8 100000000 14.6 ns/op
BenchmarkLgammaZZZ-8 100000000 14.8 ns/op
That directory has the assembly equivalents of the Go 1.6, Go 1.7, and Go master (ZZZ) compiler output for the math.Lgamma function (lgamma16.s, lgamma17.s, lgammazzz.s). The idea was to be able to modify the assembly directly, to be able to experiment to test hypotheses about the slowdown without having to coerce the compiler into making the changes you want to try.
In the Go 1.6 assembly, the spill of the temporary holding lgamma happens before math.Log, instead of above the switch, as you identified. That is, Go 1.6 does:
DIVSD X4, X1
ADDSD X1, X0
MOVAPD X0, X5
MOVSD X0, lgamma+8(FP)
MOVSD $(1.0), X1
CMPQ BX, $4
JGT _1284
CMPQ BX, $3
JNE _1252
_1189:
MOVAPD X1, X2
MOVSD $(2.0), X1
ADDSD X3, X1
MULSD X2, X1
MOVSD X5, autotmp_0310-8(SP)
MOVSD X1, (SP)
PCDATA $0, $0
CALL math·Log(SB)
MOVSD 8(SP), X2
MOVSD autotmp_0310-8(SP), X5
ADDSD X2, X5
MOVSD X5, lgamma+8(FP)
NOP
while current master does (Go 1.7 is similar):
DIVSD X1, X2
ADDSD X3, X2
MOVSD X2, autotmp_419-16(SP)
CMPQ DX, $4
JGT _1695
CMPQ DX, $3
JNE _1658
MOVSD $(1.0), X1
_1604:
ADDSD X5, X4
MULSD X1, X4
MOVSD X4, (SP)
PCDATA $0, $0
CALL math·Log(SB)
MOVSD autotmp_419-16(SP), X0
ADDSD 8(SP), X0
It's true that autotmp_419 is being spilled unnecessarily in the case that the switch doesn't end up in a case that eventually calls Log. But the spll in master to autotmp_419 replaces the (possibly also unnecessary) spill to lgamma+0(FP) in Go 1.6, so there aren't extra spills compared to Go 1.6, at least not in the hot code.
Even so, I tried changing the lgammazzz.s assembly to move the spill:
DIVSD X1, X2
ADDSD X3, X2
//old MOVSD X2, autotmp_419-16(SP)
CMPQ DX, $4
JGT _1695
CMPQ DX, $3
JNE _1658
MOVSD $(1.0), X1
_1604:
ADDSD X5, X4
MULSD X1, X4
MOVSD X4, (SP)
PCDATA $0, $0
MOVSD X2, autotmp_419-16(SP) // new
CALL math·Log(SB)
MOVSD autotmp_419-16(SP), X0
ADDSD 8(SP), X0
That cut BenchmarkLgammaZZZ from 14.6ns to 14.5ns, so it does help by 0.1ns but it doesn't seem to be the major contributor for the 1.2ns gap between Go 1.6 and master.
There are two main causes for this degradation: first in bad case the register allocator is not keeping X2 register live enough and does redundant spills in the critical path, which is: pc101 pc2246 pc123 pc224 pc2231 pc378 pc432 pc448 pc1393
In the good case, it keeps X2 alive throughout the critical path and spills it just before ret. Second, the bad case has about 5% more code execution which affects the caches and contributes to the rest of degradation. I have used temp register to get back some of the degradations and it gives more than half back.
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (
go version
)?go1.6.3, go version devel +8086e7c
What operating system and processor architecture are you using (
go env
)?What did you do?
Investigating slowdown of Lgamma benchmark. 1.6 vs 1.7 gives: name old time/op new time/op delta Lgamma-4 11.9ns ± 0% 14.2ns ± 0% +19.33% (p=0.000 n=29+29)
It might be compiler issue(code not changed). Looking into objdump of lgamma function for both versions I see different instruction order and registers used: In 1.7:
In 1.6:
Tried force scheduler to generate instruction order like in 1.6 (for that expanded all expressions like
p1 := _lgamT[0] + w_(_lgamT[3]+w_(_lgamT[6]+w_(_lgamT[9]+w__lgamT[12])))
with tmp variables). But result was still bad: name old time/op new time/op delta Lgamma-4 11.9ns ± 0% 13.1ns ± 0% +10.08% (p=0.000 n=29+19)Also tried to generate random registers - no much effect: name old time/op new time/op delta Lgamma-4 11.9ns ± 0% 13.5ns ± 0% +13.45% (p=0.000 n=29+10)