Closed josharian closed 6 years ago
Looks like this might be a duplicate of #19714. Sound right, @navytux?
@josharian thanks for looking into this. Yes, this sounds right.
On a related note some time ago I was thinking: when there is a loop and slice accesses inside which generally needs bounds checking, for many kind of loops the compiler could transform the checks to be performed from inside loop to equivalent only one bounds check before entering the loop.
The limiting factors here are:
That means that in general if such transformation will be happenning, the loop with insides bounds check needs to be transformed into:
Unfortunately that means the code could become more fat. Fortunately with such transformation in place, if a user adds
if len(a) < ... {
panic(...)
}
or similar in function preambule with correct checks equivalent to 1, the compiler will know generating 3 is not needed at all. Similarly if the function gets inlined and in inlining context the compiler already has some information about variables it should be easy for compiler to check whether only fast loop needs to be remained becase pre-condition for that is already in top-level block of inlined function.
To me this suggest that having such transformations as one of the basis of BCE should be a good idea. Personally I'm also ok with some code size increase if that means bounds in all functions will be automatically checked before loops not inside. And for the slow loop version, since we know it will panic and is there only for correctness, the generated code could be somehow optimized more for size.
I beg you pardon if this all is too abstract or non practicall, Kirill
P.S. for the reference some time ago I've made a-la protobuf compiler which moves such overflow checks from inside loops out:
https://lab.nexedi.com/kirr/neo/blob/391212b7/go/neo/protogen.go
P.P.S. it would be great if in Go1.10 cycle some love would be devoted to BCE
It's possible but I am not sure it will make the cut. Bounds checks are highly predictable (which reduces the win), and when I took a stab at the analysis before it took too much time.
@dr2chase
.../go/src/encoding/hex$ go test -bench . -count=10 >a.txt
.../go/src/encoding/hex$ go test -gcflags=-B -bench . -count=10 >b.txt
.../go/src/encoding/hex$ benchstat a.txt b.txt
name old time/op new time/op delta
Encode/256-4 352ns ± 0% 311ns ± 1% -11.54% (p=0.000 n=10+9)
Encode/1024-4 1.38µs ± 1% 1.23µs ± 1% -11.03% (p=0.000 n=10+10)
Encode/4096-4 5.48µs ± 0% 4.87µs ± 1% -11.09% (p=0.000 n=9+9)
Encode/16384-4 21.9µs ± 0% 19.5µs ± 0% -10.97% (p=0.000 n=9+10)
That's one benchmark. How about json, http, and the geomean of the go1 benchmarks?
@dr2chase bounds checks of course feels on codes which work more with slices, like the hex encoding above (access to hextable), all gonum (dot product, matrix diagonalization, etc), search in sorted slice, permutations etc. When code works more with structures -B won't practically make a difference.
Said that here are the benchmarks you've asked.
(g.env) kirr@deco:~/src/tools/go/go/src/encoding/json$ benchstat json.txt jsonB.txt
name old time/op new time/op delta
CodeEncoder-4 5.45ms ± 1% 5.24ms ± 0% -4.01% (p=0.000 n=19+17)
CodeMarshal-4 6.63ms ± 0% 6.22ms ± 9% -6.25% (p=0.000 n=15+20)
CodeDecoder-4 25.7ms ± 1% 25.7ms ± 1% ~ (p=0.365 n=20+19)
DecoderStream-4 297ns ± 1% 297ns ± 1% ~ (p=0.194 n=20+20)
CodeUnmarshal-4 26.9ms ± 1% 26.3ms ± 8% ~ (p=0.499 n=15+20)
CodeUnmarshalReuse-4 25.5ms ± 2% 25.5ms ± 2% ~ (p=0.550 n=20+19)
UnmarshalString-4 254ns ± 0% 253ns ± 0% -0.28% (p=0.000 n=18+19)
UnmarshalFloat64-4 225ns ± 1% 225ns ± 0% ~ (p=0.186 n=19+18)
UnmarshalInt64-4 198ns ± 1% 196ns ± 0% -0.87% (p=0.000 n=18+16)
Issue10335-4 294ns ± 0% 290ns ± 1% -1.25% (p=0.000 n=14+18)
Unmapped-4 903ns ± 0% 882ns ± 1% -2.39% (p=0.000 n=19+18)
NumberIsValid-4 18.6ns ± 0% 17.9ns ± 0% -3.76% (p=0.000 n=13+15)
NumberIsValidRegexp-4 584ns ± 2% 586ns ± 2% ~ (p=0.302 n=20+20)
SkipValue-4 12.8ms ± 4% 12.8ms ± 3% ~ (p=0.879 n=20+19)
EncoderEncode-4 166ns ± 7% 162ns ± 8% -2.05% (p=0.016 n=20+20)
(g.env) kirr@deco:~/src/tools/go/go/src/net/http$ benchstat http.txt httpB.txt
name old time/op new time/op delta
CookieString-4 839ns ± 1% 822ns ± 1% -2.06% (p=0.000 n=19+19)
ReadSetCookies-4 3.23µs ± 0% 3.21µs ± 1% -0.57% (p=0.000 n=19+18)
ReadCookies-4 3.28µs ± 0% 3.25µs ± 1% -0.81% (p=0.000 n=18+20)
HeaderWriteSubset-4 479ns ± 1% 486ns ± 1% +1.59% (p=0.000 n=20+19)
ReadRequestChrome-4 2.95µs ± 1% 2.93µs ± 1% -0.55% (p=0.000 n=20+20)
ReadRequestCurl-4 1.53µs ± 0% 1.52µs ± 1% -0.42% (p=0.000 n=18+20)
ReadRequestApachebench-4 1.53µs ± 0% 1.53µs ± 0% -0.33% (p=0.000 n=18+18)
ReadRequestSiege-4 1.96µs ± 0% 1.97µs ± 1% +0.44% (p=0.000 n=20+19)
ReadRequestWrk-4 1.08µs ± 0% 1.08µs ± 1% ~ (p=0.851 n=19+20)
ServeMux-4 73.4µs ± 1% 73.2µs ± 1% -0.36% (p=0.004 n=19+20)
ClientServer-4 44.7µs ± 1% 44.4µs ± 1% -0.73% (p=0.000 n=18+18)
ClientServerParallel4-4 22.4µs ± 6% 22.2µs ± 6% ~ (p=0.114 n=17+17)
ClientServerParallel64-4 110µs ±60% 101µs ±64% ~ (p=0.336 n=20+19)
ClientServerParallelTLS4-4 140µs ±13% 141µs ±14% ~ (p=0.718 n=20+20)
ClientServerParallelTLS64-4 80.3µs ±98% 86.5µs ±85% ~ (p=0.941 n=10+17)
Server-4 58.7µs ± 0% 58.5µs ± 1% -0.31% (p=0.003 n=10+19)
Client-4 57.9µs ± 1% 57.5µs ± 1% -0.61% (p=0.000 n=10+20)
ServerFakeConnNoKeepAlive-4 15.8µs ± 1% 15.9µs ± 1% +1.14% (p=0.000 n=10+20)
ServerFakeConnWithKeepAlive-4 7.64µs ± 0% 7.64µs ± 1% ~ (p=0.750 n=10+18)
ServerFakeConnWithKeepAliveLite-4 4.93µs ± 0% 4.85µs ± 1% -1.61% (p=0.000 n=9+20)
ServerHandlerTypeLen-4 6.24µs ± 0% 6.23µs ± 1% ~ (p=0.167 n=9+19)
ServerHandlerNoLen-4 5.77µs ± 1% 5.77µs ± 1% ~ (p=1.000 n=10+18)
ServerHandlerNoType-4 5.84µs ± 0% 5.90µs ± 3% +1.08% (p=0.001 n=10+18)
ServerHandlerNoHeader-4 4.52µs ± 0% 4.52µs ± 4% -0.03% (p=0.043 n=10+19)
ServerHijack-4 19.3µs ± 1% 18.9µs ± 2% -2.15% (p=0.000 n=10+18)
CloseNotifier-4 63.3µs ± 1% 63.7µs ± 3% ~ (p=0.914 n=10+20)
ResponseStatusLine-4 30.0ns ± 8% 29.9ns ± 6% ~ (p=0.467 n=10+20)
(g.env) kirr@deco:~/src/tools/go/go/test/bench/go1$ benchstat -geomean go1.txt go1B.txt
name old time/op new time/op delta
BinaryTree17-4 2.25s ± 1% 2.25s ± 2% ~ (p=0.964 n=38+39)
Fannkuch11-4 2.91s ± 1% 1.95s ± 1% -32.86% (p=0.000 n=36+39)
FmtFprintfEmpty-4 34.8ns ± 1% 35.4ns ± 3% +1.74% (p=0.000 n=35+40)
FmtFprintfString-4 59.3ns ± 5% 58.9ns ± 1% ~ (p=0.797 n=40+38)
FmtFprintfInt-4 61.3ns ± 1% 61.9ns ± 1% +1.02% (p=0.000 n=34+39)
FmtFprintfIntInt-4 95.2ns ± 1% 95.5ns ± 2% ~ (p=0.073 n=33+38)
FmtFprintfPrefixedInt-4 114ns ± 5% 113ns ± 2% ~ (p=0.131 n=39+40)
FmtFprintfFloat-4 192ns ± 1% 193ns ± 1% +0.18% (p=0.008 n=39+38)
FmtManyArgs-4 412ns ± 2% 411ns ± 0% ~ (p=0.805 n=40+36)
GobDecode-4 5.41ms ± 1% 5.44ms ± 3% +0.49% (p=0.008 n=39+35)
GobEncode-4 4.17ms ± 1% 4.23ms ± 2% +1.35% (p=0.000 n=40+40)
Gzip-4 202ms ± 0% 203ms ± 1% ~ (p=0.057 n=35+36)
Gunzip-4 31.5ms ± 1% 31.5ms ± 1% ~ (p=0.700 n=37+33)
HTTPClientServer-4 39.9µs ± 2% 40.5µs ± 2% +1.60% (p=0.000 n=37+38)
JSONEncode-4 10.6ms ± 1% 10.6ms ± 2% +0.58% (p=0.000 n=38+39)
JSONDecode-4 42.9ms ± 1% 43.0ms ± 2% ~ (p=0.248 n=40+40)
Mandelbrot200-4 3.82ms ± 0% 3.84ms ± 1% +0.34% (p=0.002 n=7+19)
[Geo mean] 147µs 144µs -1.94%
in short: json - helps some parts, http - not sure, go1 - significantly helps Fannkuch11 (-33%) not sure about the rest as it all could be noise related to source positons of jumps changed.
Once again not all codes are using slices heavily but for those who do bounds checking takes visible part of runtime and not only 10% as e.g. now famous Fannkuch11 shows above.
Thanks, Kirill
I benchmarked the compiler with and without -B, including some anomalous cases in which bvec.AndNot is hot code, and got no impact.
A key question for fannkuch (and the rest of these) is whether BCE could actually eliminate the relevant checks; -B is more powerful than the compiler ever can be.
Josh thanks for your feedback and for trying the compiler with -B. About fannkuch I've tried to look inside to see details about bounds checking. For this I've compiled fannkuch_test.go
without and with -B, did go tool objdump -S
on both, removed addresses of instruction so that the diff is less noisy, and investigated the difference. Let me show it below with my annotations and comments:
@@ -2,259 +2,172 @@ TEXT %22%22.fannkuch(SB) gofile..$GOROOT/test/bench/go1/fannkuch_test.go
func fannkuch(n int) int {
64488b0c2500000000 MOVQ FS:0, CX [5:9]R_TLS_LE
483b6110 CMPQ 0x10(CX), SP
- 0f8623030000 JBE 0x2508
- 4881ec80000000 SUBQ $0x80, SP
- 48896c2478 MOVQ BP, 0x78(SP)
- 488d6c2478 LEAQ 0x78(SP), BP
- 488b842488000000 MOVQ 0x88(SP), AX
+ 0f86d5010000 JBE 0x235a
+ 4883ec48 SUBQ $0x48, SP
+ 48896c2440 MOVQ BP, 0x40(SP)
+ 488d6c2440 LEAQ 0x40(SP), BP
+ 488b442450 MOVQ 0x50(SP), AX
if n < 1 {
4883f801 CMPQ $0x1, AX
- 0f8ca8020000 JL 0x24b0
+ 0f8ca5010000 JL 0x2347
488d0d00000000 LEAQ 0(IP), CX [3:7]R_PCREL:type.int
perm := make([]int, n)
48890c24 MOVQ CX, 0(SP)
4889442408 MOVQ AX, 0x8(SP)
4889442410 MOVQ AX, 0x10(SP)
- e800000000 CALL 0x2222 [1:5]R_CALL:runtime.makeslice
- 488b442420 MOVQ 0x20(SP), AX
+ e800000000 CALL 0x21bc [1:5]R_CALL:runtime.makeslice
+ 488b442418 MOVQ 0x18(SP), AX
4889442438 MOVQ AX, 0x38(SP)
- 488b4c2418 MOVQ 0x18(SP), CX
- 48894c2468 MOVQ CX, 0x68(SP)
- 488d1500000000 LEAQ 0(IP), DX [3:7]R_PCREL:type.int
+ 488d0d00000000 LEAQ 0(IP), CX [3:7]R_PCREL:type.int
perm1 := make([]int, n)
- 48891424 MOVQ DX, 0(SP)
- 488b9c2488000000 MOVQ 0x88(SP), BX
- 48895c2408 MOVQ BX, 0x8(SP)
- 48895c2410 MOVQ BX, 0x10(SP)
- e800000000 CALL 0x2258 [1:5]R_CALL:runtime.makeslice
- 488b442418 MOVQ 0x18(SP), AX
- 4889442460 MOVQ AX, 0x60(SP)
- 488b4c2420 MOVQ 0x20(SP), CX
- 48894c2430 MOVQ CX, 0x30(SP)
- 488d1500000000 LEAQ 0(IP), DX [3:7]R_PCREL:type.int
- count := make([]int, n)
- 48891424 MOVQ DX, 0(SP)
- 488b942488000000 MOVQ 0x88(SP), DX
+ 48890c24 MOVQ CX, 0(SP)
+ 488b542450 MOVQ 0x50(SP), DX
4889542408 MOVQ DX, 0x8(SP)
4889542410 MOVQ DX, 0x10(SP)
- e800000000 CALL 0x228e [1:5]R_CALL:runtime.makeslice
- 488b442420 MOVQ 0x20(SP), AX
- 488b4c2418 MOVQ 0x18(SP), CX
- 488b942488000000 MOVQ 0x88(SP), DX
- 488b5c2460 MOVQ 0x60(SP), BX
- 488b742430 MOVQ 0x30(SP), SI
- 31ff XORL DI, DI
+ e800000000 CALL 0x21e5 [1:5]R_CALL:runtime.makeslice
+ 488b442418 MOVQ 0x18(SP), AX
+ 4889442430 MOVQ AX, 0x30(SP)
+ 488d0d00000000 LEAQ 0(IP), CX [3:7]R_PCREL:type.int
+ count := make([]int, n)
+ 48890c24 MOVQ CX, 0(SP)
+ 488b4c2450 MOVQ 0x50(SP), CX
+ 48894c2408 MOVQ CX, 0x8(SP)
+ 48894c2410 MOVQ CX, 0x10(SP)
+ e800000000 CALL 0x220e [1:5]R_CALL:runtime.makeslice
+ 488b442418 MOVQ 0x18(SP), AX
+ 488b4c2450 MOVQ 0x50(SP), CX
+ 488b542430 MOVQ 0x30(SP), DX
+ 31db XORL BX, BX
for i := 0; i < n; i++ {
- eb07 JMP 0x22b5
+ eb07 JMP 0x2228
perm1[i] = i // initial (trivial) permutation
- 48893cfb MOVQ DI, 0(BX)(DI*8)
+ 48891cda MOVQ BX, 0(DX)(BX*8)
for i := 0; i < n; i++ {
- 48ffc7 INCQ DI
- 4839d7 CMPQ DX, DI
- 7d0a JGE 0x22c4
- perm1[i] = i // initial (trivial) permutation
- 4839f7 CMPQ SI, DI
- 72ef JB 0x22ae
- e93d020000 JMP 0x2501 // panicindex
- 48894c2470 MOVQ CX, 0x70(SP)
+ 48ffc3 INCQ BX
+ 4839cb CMPQ CX, BX
+ 7cf4 JL 0x2221
BCE could remove ^^^ panicindex because:
perm1
is initialized with make([]int, n)
and its len is never changed for i := 0; i < n; i++ {
perm1[i] = i // initial (trivial) permutation
}
so BCE knows n <= len(perm1) (actually ==), 0 <= i < n -> i < len(perm1) -> no BC needed.
n1 := n - 1
- 488d7aff LEAQ -0x1(DX), DI
- 48897c2440 MOVQ DI, 0x40(SP)
- 4989d0 MOVQ DX, R8
- 4531c9 XORL R9, R9
- 4531d2 XORL R10, R10
+ 488d59ff LEAQ -0x1(CX), BX
+ 4889ce MOVQ CX, SI
+ 31ff XORL DI, DI
+ 4531c0 XORL R8, R8
for {
- e986010000 JMP 0x2466
+ e9cf000000 JMP 0x230d
count[r-1] = r
- 488954d1f8 MOVQ DX, -0x8(CX)(DX*8)
- 4c89da MOVQ R11, DX
+ 48894cc8f8 MOVQ CX, -0x8(AX)(CX*8)
+ 48ffc9 DECQ CX
for ; r != 1; r-- {
- 4883fa01 CMPQ $0x1, DX
- 740e JE 0x22fc
- count[r-1] = r
- 4c8d5aff LEAQ -0x1(DX), R11
- 4939c3 CMPQ AX, R11
- 72e9 JB 0x22e0
- e9fe010000 JMP 0x24fa // panicindex
+ 4883f901 CMPQ $0x1, CX
+ 75f2 JNE 0x223e
For count[r-1] = r
BCE could remove ^^^ panicindex because:
count
is initialized with make([]int, n)
and its len is never changedr
is initialized as n
, goes down to 1 in below loop, goes up in tail loop inside main loop for ; r < n; r++
, so it is never > n. => r ∈ [1, n] always. for ; r != 1; r-- {
count[r-1] = r
}
so BCE knows
1 <= r <= n
0 <= r-1 <= n-1 < n
so count[r-1]
does not need BC.
if perm1[0] != 0 && perm1[n1] != n1 {
- 4885f6 TESTQ SI, SI
- 0f86ee010000 JBE 0x24f3 // panicindex
- 4c8b1b MOVQ 0(BX), R11
- 4d85db TESTQ R11, R11
- 0f8490010000 JE 0x24a1
- 4839f7 CMPQ SI, DI
- 0f83d9010000 JAE 0x24f3 // panicindex
- 4e8b5cc3f8 MOVQ -0x8(BX)(R8*8), R11
- 4c39df CMPQ R11, DI
- 0f846a010000 JE 0x2492
- 4c8b5c2468 MOVQ 0x68(SP), R11
- 4c8b642438 MOVQ 0x38(SP), R12
- 41bd01000000 MOVL $0x1, R13
+ 4c8b0a MOVQ 0(DX), R9
+ 4d85c9 TESTQ R9, R9
+ 0f84e5000000 JE 0x233d
+ 4c8b4cf2f8 MOVQ -0x8(DX)(SI*8), R9
+ 4c39cb CMPQ R9, BX
+ 0f84cd000000 JE 0x2333
+ 4c8b4c2438 MOVQ 0x38(SP), R9
+ 41ba01000000 MOVL $0x1, R10
len(perm1)
is known to be n
if n < 1 {
return 0
}
-> this way perm1[0] does not need BC.
n1
is initialized as n - 1
and is never changedlen(perm1)
is once again known to be n
-> this way perm1[n1]
does not need BC
for i := 1; i < n; i++ { // perm = perm1
- eb07 JMP 0x2341
+ eb0b JMP 0x227e
perm[i] = perm1[i]
- 4f8934eb MOVQ R14, 0(R11)(R13*8)
+ 4e8b1cd2 MOVQ 0(DX)(R10*8), R11
+ 4f891cd1 MOVQ R11, 0(R9)(R10*8)
for i := 1; i < n; i++ { // perm = perm1
- 49ffc5 INCQ R13
- 4d39c5 CMPQ R8, R13
- 7d17 JGE 0x235d
- perm[i] = perm1[i]
- 4939f5 CMPQ SI, R13
- 0f839d010000 JAE 0x24ec // panicindex
- 4e8b34eb MOVQ 0(BX)(R13*8), R14
- 4d39e5 CMPQ R12, R13
- 72e2 JB 0x233a
- e98f010000 JMP 0x24ec // panicindex
- 4c894c2450 MOVQ R9, 0x50(SP)
+ 49ffc2 INCQ R10
+ 4939f2 CMPQ SI, R10
+ 7cf0 JL 0x2273
perm
and perm1
were initialized as make([]int, n)
and their len is never changed-> both perm[i]
and perm1[i]
do not need BC
k := perm1[0] // cache perm[0] in k
- 4c8b2b MOVQ 0(BX), R13
- 4531f6 XORL R14, R14
+ 4c8b12 MOVQ 0(DX), R10
+ 4531db XORL R11, R11
for { // k!=0 ==> k>0
- eb6d JMP 0x23d7
+ eb32 JMP 0x22bd
perm[i], perm[j] = perm[j], perm[i]
- 4b8b0cfb MOVQ 0(R11)(R15*8), CX
- 49890cfb MOVQ CX, 0(R11)(DI*8)
- 4f890cfb MOVQ R9, 0(R11)(R15*8)
+ 4f8b34e1 MOVQ 0(R9)(R12*8), R14
+ 4f8b3ce9 MOVQ 0(R9)(R13*8), R15
+ 4f8934e9 MOVQ R14, 0(R9)(R13*8)
+ 4f893ce1 MOVQ R15, 0(R9)(R12*8)
for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
- 488d4f01 LEAQ 0x1(DI), CX
- 49ffcf DECQ R15
- 488b7c2440 MOVQ 0x40(SP), DI
- 4c8b4c2450 MOVQ 0x50(SP), R9
- 48894c2448 MOVQ CX, 0x48(SP)
- 488b4c2470 MOVQ 0x70(SP), CX
- 488b7c2448 MOVQ 0x48(SP), DI
- 4c39ff CMPQ R15, DI
- 7d17 JGE 0x23b2
- perm[i], perm[j] = perm[j], perm[i]
- 4c39e7 CMPQ R12, DI
- 0f8341010000 JAE 0x24e5 // panicindex
- 4d8b0cfb MOVQ 0(R11)(DI*8), R9
- 4d39e7 CMPQ R12, R15
- 72bd JB 0x236a
- e933010000 JMP 0x24e5 // panicindex
k is initialized as perm1[0]
- no BC needed here since compiler already knows len(perm1) = n >= 1
initially perm1 is initialized as perm1[i] = i
i ∈ [0, n)
from ^^^ the compiler knows initially ∀ x ∈ [0, n) perm1[x] ∈ [0, n)
later elements of perm1 are only exchanged with each other, so the compiler can prove ∀ x ∈ [0, n) perm1[x] ∈ [0, n) holds always
thus k itself can be proven to be in [0, n) on first iteration (see also below about all next iterations)
the inner loop is
for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
perm[i], perm[j] = perm[j], perm[i]
}
-> both i and j are in valid range and no BC is needed for perm[i]
and perm[j]
- j := perm[k]
+ 49ffc5 INCQ R13
+ 49ffcc DECQ R12
4d39e5 CMPQ R12, R13
- 0f8323010000 JAE 0x24de // panicindex
- 4b8b3ceb MOVQ 0(R11)(R13*8), DI
+ 7ce5 JL 0x228b
+ j := perm[k]
+ 4f8b24d1 MOVQ 0(R9)(R10*8), R12
// Now exchange k (caching perm[0]) and perm[k]... with care!
j := perm[k]
perm[k] = k
k = j
perm[k]
does not need BCperm
is initialized as = perm1
- this way compiler can know ∀ x ∈ [0, n) perm[x]
∈ [0, n) initiallyj := perm[k]
j
is thus ∈ [0, n)perm[k] = k
preserves ∀ x ∈ [0, n) perm[x]
∈ [0, n) property.k = j
compiler can see k continues to k ∈ [0, n) in next iterations perm[k] = k
- 4f892ceb MOVQ R13, 0(R11)(R13*8)
+ 4f8914d1 MOVQ R10, 0(R9)(R10*8)
flips++
- 4d8d6e01 LEAQ 0x1(R14), R13
+ 4d8d5301 LEAQ 0x1(R11), R10
if k == 0 {
- 4885ff TESTQ DI, DI
- 7425 JE 0x23f1
- 4d89ee MOVQ R13, R14
- 4989fd MOVQ DI, R13
- 488b7c2440 MOVQ 0x40(SP), DI
+ 4d85e4 TESTQ R12, R12
+ 7412 JE 0x22c9
+ 4d89d3 MOVQ R10, R11
+ 4d89e2 MOVQ R12, R10
for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
- 4d8d7dff LEAQ -0x1(R13), R15
- 4889442458 MOVQ AX, 0x58(SP)
- b801000000 MOVL $0x1, AX
- 4889442448 MOVQ AX, 0x48(SP)
- 488b442458 MOVQ 0x58(SP), AX
- eba0 JMP 0x2391
+ 4d8d62ff LEAQ -0x1(R10), R12
+ 41bd01000000 MOVL $0x1, R13
+ ebd8 JMP 0x22a1
if flipsMax < flips {
- 4d39ea CMPQ R13, R10
- 7c56 JL 0x244c
- e992000000 JMP 0x248d
+ 4d39d0 CMPQ R10, R8
+ 7c2a JL 0x22f8
+ eb5e JMP 0x232e
perm1[i] = perm1[i+1]
- 4e893cd3 MOVQ R15, 0(BX)(R10*8)
- 4d89f2 MOVQ R14, R10
+ 4e8b64da08 MOVQ 0x8(DX)(R11*8), R12
+ 4e8924da MOVQ R12, 0(DX)(R11*8)
+ 49ffc3 INCQ R11
for i := 0; i < r; i++ {
- 4939d2 CMPQ DX, R10
- 7d1c JGE 0x2423
- perm1[i] = perm1[i+1]
- 4d8d7201 LEAQ 0x1(R10), R14
- 4939f6 CMPQ SI, R14
- 0f83c3000000 JAE 0x24d7 // panicindex
- 4e8b7cd308 MOVQ 0x8(BX)(R10*8), R15
- 4939f2 CMPQ SI, R10
- 72dd JB 0x23fb
- e9b4000000 JMP 0x24d7 // panicindex
+ 4939cb CMPQ CX, R11
+ 7cef JL 0x22d0
for ; r < n; r++
for i := 0; i < r; i++
perm1[i]
and perm1[i+1]
does not need BCperm1[i] = perm1[i+1]
the compiler can see ∀ x ∈ [0, n) perm1[x] ∈ [0, n)
property is preserved perm1[r] = perm0
- 4839f2 CMPQ SI, DX
- 0f83a4000000 JAE 0x24d0
- 48893cd3 MOVQ DI, 0(BX)(DX*8)
+ 4c8904ca MOVQ R8, 0(DX)(CX*8)
count[r]--
- 4839c2 CMPQ AX, DX
- 0f8390000000 JAE 0x24c9 // panicindex
- 488b3cd1 MOVQ 0(CX)(DX*8), DI
- 48ffcf DECQ DI
- 48893cd1 MOVQ DI, 0(CX)(DX*8)
+ 4c8b04c8 MOVQ 0(AX)(CX*8), R8
+ 49ffc8 DECQ R8
+ 4c8904c8 MOVQ R8, 0(AX)(CX*8)
r
is initially set to n
for ; r != 1; r-- {
count[r-1] = r
}
thus after this we know (at least on first mainloop iteration, see below about next) r = 1
furtner
the outer loop, once again, is for ; r < n; r++
thus inside r ∈ [1, n)
thus perm1[r]
and count[r]
do not need BC
the outer loop only increases r which was initially known to be = 1, so after it is known to be >= 1
this way when main loop runs next time in
for ; r != 1; r-- {
count[r-1] = r
}
the compiler can prove, since r >= 1
the loop will terminate with r==1, not loop forever
if count[r] > 0 {
- 4885ff TESTQ DI, DI
- 7f10 JG 0x2459
+ 4d85c0 TESTQ R8, R8
+ 7f10 JG 0x2305
for ; r < n; r++ {
- 48ffc2 INCQ DX
- 4c39c2 CMPQ R8, DX
- 7d0b JGE 0x245c
+ 48ffc1 INCQ CX
+ 4839f1 CMPQ SI, CX
+ 7d0b JGE 0x2308
perm0 := perm1[0]
- 488b3b MOVQ 0(BX), DI
- 4531d2 XORL R10, R10
+ 4c8b02 MOVQ 0(DX), R8
+ 4531db XORL R11, R11
for i := 0; i < r; i++ {
- eba9 JMP 0x2402
+ ebd7 JMP 0x22dc
for ; r < n; r++ {
- 4c39c2 CMPQ R8, DX
+ 4839f1 CMPQ SI, CX
if r == n {
- 741a JE 0x2478
- 488b7c2440 MOVQ 0x40(SP), DI
- 4d89ea MOVQ R13, R10
+ 7415 JE 0x231f
+ 4d89d0 MOVQ R10, R8
if didpr < 30 {
- 4983f91e CMPQ $0x1e, R9
- 0f8d78feffff JGE 0x22e8
+ 4883ff1e CMPQ $0x1e, DI
+ 0f8d2fffffff JGE 0x2246
didpr++
- 49ffc1 INCQ R9
+ 48ffc7 INCQ DI
if didpr < 30 {
- e970feffff JMP 0x22e8
+ e927ffffff JMP 0x2246
return flipsMax
- 4c89ac2490000000 MOVQ R13, 0x90(SP)
- 488b6c2478 MOVQ 0x78(SP), BP
- 4881c480000000 ADDQ $0x80, SP
+ 4c89542458 MOVQ R10, 0x58(SP)
+ 488b6c2440 MOVQ 0x40(SP), BP
+ 4883c448 ADDQ $0x48, SP
c3 RET
- 4d89d5 MOVQ R10, R13
+ 4d89c2 MOVQ R8, R10
if flipsMax < flips {
- ebba JMP 0x244c
- 4c8b5c2468 MOVQ 0x68(SP), R11
- 4c8b642438 MOVQ 0x38(SP), R12
- 4d89d5 MOVQ R10, R13
+ ebc5 JMP 0x22f8
+ 4c8b4c2438 MOVQ 0x38(SP), R9
+ 4d89c2 MOVQ R8, R10
if perm1[0] != 0 && perm1[n1] != n1 {
- ebab JMP 0x244c
- 4c8b5c2468 MOVQ 0x68(SP), R11
- 4c8b642438 MOVQ 0x38(SP), R12
- 4d89d5 MOVQ R10, R13
- eb9c JMP 0x244c
+ ebbb JMP 0x22f8
+ 4c8b4c2438 MOVQ 0x38(SP), R9
+ 4d89c2 MOVQ R8, R10
+ ebb1 JMP 0x22f8
return 0
- 48c784249000000000000000 MOVQ $0x0, 0x90(SP)
- 488b6c2478 MOVQ 0x78(SP), BP
- 4881c480000000 ADDQ $0x80, SP
- c3 RET
- count[r]--
- 0x24c9 e800000000 CALL 0x24ce [1:5]R_CALL:runtime.panicindex
- 0x24ce 0f0b UD2
- perm1[r] = perm0
- 0x24d0 e800000000 CALL 0x24d5 [1:5]R_CALL:runtime.panicindex
- 0x24d5 0f0b UD2
- perm1[i] = perm1[i+1]
- 0x24d7 e800000000 CALL 0x24dc [1:5]R_CALL:runtime.panicindex
- 0x24dc 0f0b UD2
- j := perm[k]
- 0x24de e800000000 CALL 0x24e3 [1:5]R_CALL:runtime.panicindex
- 0x24e3 0f0b UD2
- perm[i], perm[j] = perm[j], perm[i]
- 0x24e5 e800000000 CALL 0x24ea [1:5]R_CALL:runtime.panicindex
- 0x24ea 0f0b UD2
- perm[i] = perm1[i]
- 0x24ec e800000000 CALL 0x24f1 [1:5]R_CALL:runtime.panicindex
- 0x24f1 0f0b UD2
- if perm1[0] != 0 && perm1[n1] != n1 {
- 0x24f3 e800000000 CALL 0x24f8 [1:5]R_CALL:runtime.panicindex
- 0x24f8 0f0b UD2
- count[r-1] = r
- 0x24fa e800000000 CALL 0x24ff [1:5]R_CALL:runtime.panicindex
- 0x24ff 0f0b UD2
- perm1[i] = i // initial (trivial) permutation
- 0x2501 e800000000 CALL 0x2506 [1:5]R_CALL:runtime.panicindex
- 0x2506 0f0b UD2
+ 48c744245800000000 MOVQ $0x0, 0x58(SP)
+ 488b6c2440 MOVQ 0x40(SP), BP
+ 4883c448 ADDQ $0x48, SP
+ c3 RET
func fannkuch(n int) int {
- e800000000 CALL 0x250d [1:5]R_CALL:runtime.morestack_noctxt
- e9c0fcffff JMP %22%22.fannkuch(SB)
+ e800000000 CALL 0x235f [1:5]R_CALL:runtime.morestack_noctxt
+ e90efeffff JMP %22%22.fannkuch(SB)
^^^ are all CALL runtime.panicindex
destinations and they were all covered by my comments.
So maybe I missed something (appologize then) but above I was able to prove all runtime bounds check are unneccessary in fannkuch. I don't see any reason the compiler cannot be taught to do the same. Thus for fannkuch it is theoretically possible to reach -B level with just BCE.
About compiler itself I've tried to see myself too. For this via passing GO_GCFLAGS=-B
env var to make.bash I've bootstrapped second compiler (goB
) which does not check bounds at compile runtime, checked it is really the case (e.g. go tool objdump pkg/tool/linux_amd64/compile |grep panicindex |wc -l
), and compared go and goB via compilebench:
$ benchstat go.txt goB.txt
name old time/op new time/op delta
Template 171ms ± 3% 169ms ± 2% -1.12% (p=0.001 n=19+18)
Unicode 79.1ms ± 3% 78.3ms ± 4% ~ (p=0.089 n=19+20)
GoTypes 559ms ± 2% 550ms ± 2% -1.53% (p=0.000 n=20+20)
Compiler 2.63s ± 2% 2.59s ± 1% -1.52% (p=0.000 n=20+20)
SSA 5.26s ± 1% 5.19s ± 1% -1.40% (p=0.000 n=19+19)
Flate 111ms ± 1% 109ms ± 1% -1.74% (p=0.000 n=20+18)
GoParser 139ms ± 2% 136ms ± 1% -1.75% (p=0.000 n=20+19)
Reflect 346ms ± 2% 341ms ± 1% -1.61% (p=0.000 n=19+19)
Tar 97.5ms ± 3% 96.3ms ± 3% -1.18% (p=0.002 n=19+19)
XML 192ms ± 1% 189ms ± 1% -1.23% (p=0.000 n=20+19)
It is hard for me to say it had no impact.
Josh, what was your way of testing? Same as mine, or maybe I'm missing something?
Thanks beforehand, Kirill
The optimization proposed here is very aggressive, and the proposed gains are smaller than what we hope to obtain from either getting better inlining into the compiler or passing arguments in registers. I'd like to target 1.11 instead of 1.10.
Josh, what was your way of testing? Same as mine, or maybe I'm missing something?
Effectively same as yours. My CPU's branch predictor just seems to be generally better than yours.
I'd like to target 1.11 instead of 1.10.
Agreed. @bradfitz any chance you could create 1.11 milestones for us?
@josharian thanks for feedback. For the reference my CPU is this:
http://cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-6600U+%40+2.60GHz&id=2608
it is the highest end one could get from 15W TDP mobile series in 6th Intel generation.
For reference, mine: https://cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-6920HQ+%40+2.90GHz&id=2699
Josh, thanks for feedback. I've found only below two "monster" heavy 5.5 kg weight 18.4" notebooks with your processor:
https://market.yandex.ru/product/14232631?show-uid=955252503320279700516001&nid=54544&glfilter=6069383%3A13313955 https://market.yandex.ru/product/14232101?show-uid=955252503320279700516002&nid=54544&glfilter=6069383%3A13313955
:)
Though maybe in your country the choice is more wide...
Mine's in one of the new MacBook Pros. To be clear, I'm not suggesting that you get a new processor! :) I'm still working on getting my udoo x86 up, which hopefully will have different branch prediction properties that I can use during development.
Josh thanks for feedback and good luck with getting udoo up and running.
Change https://golang.org/cl/100278 mentions this issue: cmd/compile: in prove, infer unsigned relations while branching
Change https://golang.org/cl/100277 mentions this issue: cmd/compile: in prove, add transitive closure of relations
The original issue tracked in this bug (BCE in bvec.AndNot
) is fixed by CL100277 and CL100278. After the first lands, len(a) == len(b)
can be used to trigger BCE; after the second, len(a) < len(b)
will work too.
@navytux would it possible to move your wonderful comment about fannkuch in a separate issue? It would be easier to find and track.
@rasky, thanks for working on patches improving BCE. Sure, fannkuch case can be moved outside for easier observability and tracking -> please see https://github.com/golang/go/issues/24660.
The original issue tracked in this bug (BCE in bvec.AndNot) is fixed by CL100277 and CL100278.
Awesome.
Will it require additional work or hints to be added to bvec.AndNot, or will the existing code suffice? Just want to know whether I need to send any follow-up CLs.
Yes, you need to assert that all the involved vector have equal or compatible size (eg: >= of the one you're iterating on). In the OP you say "By construction, though, len(dst.b) == len(src1.b) == len(src2.b)" but unless that is visible to the compiler, you need explicitly assert it.
Cool. Once those CLs are in I'll find the right assertion to add.
Change https://golang.org/cl/110560 mentions this issue: cmd/compile: optimize bvec routines
From cmd/compile/internal/gc/bv.go:
This can be very hot code for certain kind of input code. The core loop here contains two bounds checks, on
dst.b[i]
andsrc2.b[i]
. By construction, though, len(dst.b) == len(src1.b) == len(src2.b).The compiler should already have enough information to hoist these bounds checks out of the loop. I'd be happy enough to provide explicit hints, though, e.g. like
_ = dst.b[len(src1.b)-1]
at the top of the function. I'm not too particular about what they look like--I just want it to be possible. It doesn't appear to be right now; at least, I can't figure it out.cc @brtzsnr @dr2chase