cmd/compile: no way to hint to eliminate bounds checks in bvec.AndNot

josharian commented 7 years ago

From cmd/compile/internal/gc/bv.go:

func (dst bvec) AndNot(src1, src2 bvec) {
    for i, x := range src1.b {
        dst.b[i] = x &^ src2.b[i]
    }
}

This can be very hot code for certain kind of input code. The core loop here contains two bounds checks, on dst.b[i] and src2.b[i]. By construction, though, len(dst.b) == len(src1.b) == len(src2.b).

The compiler should already have enough information to hoist these bounds checks out of the loop. I'd be happy enough to provide explicit hints, though, e.g. like _ = dst.b[len(src1.b)-1] at the top of the function. I'm not too particular about what they look like--I just want it to be possible. It doesn't appear to be right now; at least, I can't figure it out.

cc @brtzsnr @dr2chase

josharian commented 7 years ago

Looks like this might be a duplicate of #19714. Sound right, @navytux?

navytux commented 7 years ago

@josharian thanks for looking into this. Yes, this sounds right.

On a related note some time ago I was thinking: when there is a loop and slice accesses inside which generally needs bounds checking, for many kind of loops the compiler could transform the checks to be performed from inside loop to equivalent only one bounds check before entering the loop.

The limiting factors here are:

when something panics we want to know the exact line of failure
if there are side effects of the loop they must remain committed to memory for successfull operations executed before panic.

That means that in general if such transformation will be happenning, the loop with insides bounds check needs to be transformed into:

equivalent bounds precheck, which if successfull, would mean there won't be any bounds panic inside loop;
fast version of the loop with no bounds checking performed at runtime. (this gets executed if 1 succeeds)
slow version of the loop with runtime bounds checking still there. We know it will fail, but we want the side effects (if any) to be committed and to get correct line in the traceback.

Unfortunately that means the code could become more fat. Fortunately with such transformation in place, if a user adds

if len(a) < ... {
    panic(...)
}

or similar in function preambule with correct checks equivalent to 1, the compiler will know generating 3 is not needed at all. Similarly if the function gets inlined and in inlining context the compiler already has some information about variables it should be easy for compiler to check whether only fast loop needs to be remained becase pre-condition for that is already in top-level block of inlined function.

To me this suggest that having such transformations as one of the basis of BCE should be a good idea. Personally I'm also ok with some code size increase if that means bounds in all functions will be automatically checked before loops not inside. And for the slow loop version, since we know it will panic and is there only for correctness, the generated code could be somehow optimized more for size.

I beg you pardon if this all is too abstract or non practicall, Kirill

P.S. for the reference some time ago I've made a-la protobuf compiler which moves such overflow checks from inside loops out:

https://lab.nexedi.com/kirr/neo/blob/391212b7/go/neo/protogen.go

P.P.S. it would be great if in Go1.10 cycle some love would be devoted to BCE

dr2chase commented 7 years ago

It's possible but I am not sure it will make the cut. Bounds checks are highly predictable (which reduces the win), and when I took a stab at the analysis before it took too much time.

navytux commented 7 years ago

@dr2chase

.../go/src/encoding/hex$ go test -bench . -count=10 >a.txt
.../go/src/encoding/hex$ go test -gcflags=-B -bench . -count=10 >b.txt
.../go/src/encoding/hex$ benchstat a.txt b.txt 
name            old time/op  new time/op  delta
Encode/256-4     352ns ± 0%   311ns ± 1%  -11.54%  (p=0.000 n=10+9)
Encode/1024-4   1.38µs ± 1%  1.23µs ± 1%  -11.03%  (p=0.000 n=10+10)
Encode/4096-4   5.48µs ± 0%  4.87µs ± 1%  -11.09%  (p=0.000 n=9+9)
Encode/16384-4  21.9µs ± 0%  19.5µs ± 0%  -10.97%  (p=0.000 n=9+10)

dr2chase commented 7 years ago

That's one benchmark. How about json, http, and the geomean of the go1 benchmarks?

navytux commented 7 years ago

@dr2chase bounds checks of course feels on codes which work more with slices, like the hex encoding above (access to hextable), all gonum (dot product, matrix diagonalization, etc), search in sorted slice, permutations etc. When code works more with structures -B won't practically make a difference.

Said that here are the benchmarks you've asked.

(g.env) kirr@deco:~/src/tools/go/go/src/encoding/json$ benchstat json.txt jsonB.txt 
name                   old time/op    new time/op    delta
CodeEncoder-4            5.45ms ± 1%    5.24ms ± 0%  -4.01%  (p=0.000 n=19+17)
CodeMarshal-4            6.63ms ± 0%    6.22ms ± 9%  -6.25%  (p=0.000 n=15+20)
CodeDecoder-4            25.7ms ± 1%    25.7ms ± 1%    ~     (p=0.365 n=20+19)
DecoderStream-4           297ns ± 1%     297ns ± 1%    ~     (p=0.194 n=20+20)
CodeUnmarshal-4          26.9ms ± 1%    26.3ms ± 8%    ~     (p=0.499 n=15+20)
CodeUnmarshalReuse-4     25.5ms ± 2%    25.5ms ± 2%    ~     (p=0.550 n=20+19)
UnmarshalString-4         254ns ± 0%     253ns ± 0%  -0.28%  (p=0.000 n=18+19)
UnmarshalFloat64-4        225ns ± 1%     225ns ± 0%    ~     (p=0.186 n=19+18)
UnmarshalInt64-4          198ns ± 1%     196ns ± 0%  -0.87%  (p=0.000 n=18+16)
Issue10335-4              294ns ± 0%     290ns ± 1%  -1.25%  (p=0.000 n=14+18)
Unmapped-4                903ns ± 0%     882ns ± 1%  -2.39%  (p=0.000 n=19+18)
NumberIsValid-4          18.6ns ± 0%    17.9ns ± 0%  -3.76%  (p=0.000 n=13+15)
NumberIsValidRegexp-4     584ns ± 2%     586ns ± 2%    ~     (p=0.302 n=20+20)
SkipValue-4              12.8ms ± 4%    12.8ms ± 3%    ~     (p=0.879 n=20+19)
EncoderEncode-4           166ns ± 7%     162ns ± 8%  -2.05%  (p=0.016 n=20+20)

(g.env) kirr@deco:~/src/tools/go/go/src/net/http$ benchstat http.txt httpB.txt 
name                               old time/op    new time/op    delta
CookieString-4                        839ns ± 1%     822ns ± 1%  -2.06%  (p=0.000 n=19+19)
ReadSetCookies-4                     3.23µs ± 0%    3.21µs ± 1%  -0.57%  (p=0.000 n=19+18)
ReadCookies-4                        3.28µs ± 0%    3.25µs ± 1%  -0.81%  (p=0.000 n=18+20)
HeaderWriteSubset-4                   479ns ± 1%     486ns ± 1%  +1.59%  (p=0.000 n=20+19)
ReadRequestChrome-4                  2.95µs ± 1%    2.93µs ± 1%  -0.55%  (p=0.000 n=20+20)
ReadRequestCurl-4                    1.53µs ± 0%    1.52µs ± 1%  -0.42%  (p=0.000 n=18+20)
ReadRequestApachebench-4             1.53µs ± 0%    1.53µs ± 0%  -0.33%  (p=0.000 n=18+18)
ReadRequestSiege-4                   1.96µs ± 0%    1.97µs ± 1%  +0.44%  (p=0.000 n=20+19)
ReadRequestWrk-4                     1.08µs ± 0%    1.08µs ± 1%    ~     (p=0.851 n=19+20)
ServeMux-4                           73.4µs ± 1%    73.2µs ± 1%  -0.36%  (p=0.004 n=19+20)
ClientServer-4                       44.7µs ± 1%    44.4µs ± 1%  -0.73%  (p=0.000 n=18+18)
ClientServerParallel4-4              22.4µs ± 6%    22.2µs ± 6%    ~     (p=0.114 n=17+17)
ClientServerParallel64-4              110µs ±60%     101µs ±64%    ~     (p=0.336 n=20+19)
ClientServerParallelTLS4-4            140µs ±13%     141µs ±14%    ~     (p=0.718 n=20+20)
ClientServerParallelTLS64-4          80.3µs ±98%    86.5µs ±85%    ~     (p=0.941 n=10+17)
Server-4                             58.7µs ± 0%    58.5µs ± 1%  -0.31%  (p=0.003 n=10+19)
Client-4                             57.9µs ± 1%    57.5µs ± 1%  -0.61%  (p=0.000 n=10+20)
ServerFakeConnNoKeepAlive-4          15.8µs ± 1%    15.9µs ± 1%  +1.14%  (p=0.000 n=10+20)
ServerFakeConnWithKeepAlive-4        7.64µs ± 0%    7.64µs ± 1%    ~     (p=0.750 n=10+18)
ServerFakeConnWithKeepAliveLite-4    4.93µs ± 0%    4.85µs ± 1%  -1.61%  (p=0.000 n=9+20)
ServerHandlerTypeLen-4               6.24µs ± 0%    6.23µs ± 1%    ~     (p=0.167 n=9+19)
ServerHandlerNoLen-4                 5.77µs ± 1%    5.77µs ± 1%    ~     (p=1.000 n=10+18)
ServerHandlerNoType-4                5.84µs ± 0%    5.90µs ± 3%  +1.08%  (p=0.001 n=10+18)
ServerHandlerNoHeader-4              4.52µs ± 0%    4.52µs ± 4%  -0.03%  (p=0.043 n=10+19)
ServerHijack-4                       19.3µs ± 1%    18.9µs ± 2%  -2.15%  (p=0.000 n=10+18)
CloseNotifier-4                      63.3µs ± 1%    63.7µs ± 3%    ~     (p=0.914 n=10+20)
ResponseStatusLine-4                 30.0ns ± 8%    29.9ns ± 6%    ~     (p=0.467 n=10+20)

(g.env) kirr@deco:~/src/tools/go/go/test/bench/go1$ benchstat -geomean go1.txt go1B.txt 
name                     old time/op    new time/op    delta
BinaryTree17-4              2.25s ± 1%     2.25s ± 2%     ~     (p=0.964 n=38+39)
Fannkuch11-4                2.91s ± 1%     1.95s ± 1%  -32.86%  (p=0.000 n=36+39)
FmtFprintfEmpty-4          34.8ns ± 1%    35.4ns ± 3%   +1.74%  (p=0.000 n=35+40)
FmtFprintfString-4         59.3ns ± 5%    58.9ns ± 1%     ~     (p=0.797 n=40+38)
FmtFprintfInt-4            61.3ns ± 1%    61.9ns ± 1%   +1.02%  (p=0.000 n=34+39)
FmtFprintfIntInt-4         95.2ns ± 1%    95.5ns ± 2%     ~     (p=0.073 n=33+38)
FmtFprintfPrefixedInt-4     114ns ± 5%     113ns ± 2%     ~     (p=0.131 n=39+40)
FmtFprintfFloat-4           192ns ± 1%     193ns ± 1%   +0.18%  (p=0.008 n=39+38)
FmtManyArgs-4               412ns ± 2%     411ns ± 0%     ~     (p=0.805 n=40+36)
GobDecode-4                5.41ms ± 1%    5.44ms ± 3%   +0.49%  (p=0.008 n=39+35)
GobEncode-4                4.17ms ± 1%    4.23ms ± 2%   +1.35%  (p=0.000 n=40+40)
Gzip-4                      202ms ± 0%     203ms ± 1%     ~     (p=0.057 n=35+36)
Gunzip-4                   31.5ms ± 1%    31.5ms ± 1%     ~     (p=0.700 n=37+33)
HTTPClientServer-4         39.9µs ± 2%    40.5µs ± 2%   +1.60%  (p=0.000 n=37+38)
JSONEncode-4               10.6ms ± 1%    10.6ms ± 2%   +0.58%  (p=0.000 n=38+39)
JSONDecode-4               42.9ms ± 1%    43.0ms ± 2%     ~     (p=0.248 n=40+40)
Mandelbrot200-4            3.82ms ± 0%    3.84ms ± 1%   +0.34%  (p=0.002 n=7+19)
[Geo mean]                  147µs          144µs        -1.94%

in short: json - helps some parts, http - not sure, go1 - significantly helps Fannkuch11 (-33%) not sure about the rest as it all could be noise related to source positons of jumps changed.

Once again not all codes are using slices heavily but for those who do bounds checking takes visible part of runtime and not only 10% as e.g. now famous Fannkuch11 shows above.

Thanks, Kirill

josharian commented 7 years ago

I benchmarked the compiler with and without -B, including some anomalous cases in which bvec.AndNot is hot code, and got no impact.

A key question for fannkuch (and the rest of these) is whether BCE could actually eliminate the relevant checks; -B is more powerful than the compiler ever can be.

navytux commented 7 years ago

Josh thanks for your feedback and for trying the compiler with -B. About fannkuch I've tried to look inside to see details about bounds checking. For this I've compiled fannkuch_test.go without and with -B, did go tool objdump -S on both, removed addresses of instruction so that the diff is less noisy, and investigated the difference. Let me show it below with my annotations and comments:

@@ -2,259 +2,172 @@ TEXT %22%22.fannkuch(SB) gofile..$GOROOT/test/bench/go1/fannkuch_test.go
 func fannkuch(n int) int {
                        64488b0c2500000000      MOVQ FS:0, CX           [5:9]R_TLS_LE
                        483b6110                CMPQ 0x10(CX), SP
-                       0f8623030000            JBE 0x2508              
-                       4881ec80000000          SUBQ $0x80, SP          
-                       48896c2478              MOVQ BP, 0x78(SP)       
-                       488d6c2478              LEAQ 0x78(SP), BP       
-                       488b842488000000        MOVQ 0x88(SP), AX       
+                       0f86d5010000            JBE 0x235a              
+                       4883ec48                SUBQ $0x48, SP          
+                       48896c2440              MOVQ BP, 0x40(SP)       
+                       488d6c2440              LEAQ 0x40(SP), BP       
+                       488b442450              MOVQ 0x50(SP), AX       
        if n < 1 {
                        4883f801                CMPQ $0x1, AX
-                       0f8ca8020000            JL 0x24b0               
+                       0f8ca5010000            JL 0x2347               
                        488d0d00000000          LEAQ 0(IP), CX          [3:7]R_PCREL:type.int
        perm := make([]int, n)
                        48890c24                MOVQ CX, 0(SP)
                        4889442408              MOVQ AX, 0x8(SP)
                        4889442410              MOVQ AX, 0x10(SP)
-                       e800000000              CALL 0x2222             [1:5]R_CALL:runtime.makeslice
-                       488b442420              MOVQ 0x20(SP), AX       
+                       e800000000              CALL 0x21bc             [1:5]R_CALL:runtime.makeslice
+                       488b442418              MOVQ 0x18(SP), AX       
                        4889442438              MOVQ AX, 0x38(SP)
-                       488b4c2418              MOVQ 0x18(SP), CX       
-                       48894c2468              MOVQ CX, 0x68(SP)       
-                       488d1500000000          LEAQ 0(IP), DX          [3:7]R_PCREL:type.int
+                       488d0d00000000          LEAQ 0(IP), CX          [3:7]R_PCREL:type.int
        perm1 := make([]int, n)
-                       48891424                MOVQ DX, 0(SP)          
-                       488b9c2488000000        MOVQ 0x88(SP), BX       
-                       48895c2408              MOVQ BX, 0x8(SP)        
-                       48895c2410              MOVQ BX, 0x10(SP)       
-                       e800000000              CALL 0x2258             [1:5]R_CALL:runtime.makeslice
-                       488b442418              MOVQ 0x18(SP), AX       
-                       4889442460              MOVQ AX, 0x60(SP)       
-                       488b4c2420              MOVQ 0x20(SP), CX       
-                       48894c2430              MOVQ CX, 0x30(SP)       
-                       488d1500000000          LEAQ 0(IP), DX          [3:7]R_PCREL:type.int
-       count := make([]int, n)
-                       48891424                MOVQ DX, 0(SP)          
-                       488b942488000000        MOVQ 0x88(SP), DX       
+                       48890c24                MOVQ CX, 0(SP)          
+                       488b542450              MOVQ 0x50(SP), DX       
                        4889542408              MOVQ DX, 0x8(SP)
                        4889542410              MOVQ DX, 0x10(SP)
-                       e800000000              CALL 0x228e             [1:5]R_CALL:runtime.makeslice
-                       488b442420              MOVQ 0x20(SP), AX       
-                       488b4c2418              MOVQ 0x18(SP), CX       
-                       488b942488000000        MOVQ 0x88(SP), DX       
-                       488b5c2460              MOVQ 0x60(SP), BX       
-                       488b742430              MOVQ 0x30(SP), SI       
-                       31ff                    XORL DI, DI             
+                       e800000000              CALL 0x21e5             [1:5]R_CALL:runtime.makeslice
+                       488b442418              MOVQ 0x18(SP), AX       
+                       4889442430              MOVQ AX, 0x30(SP)       
+                       488d0d00000000          LEAQ 0(IP), CX          [3:7]R_PCREL:type.int
+       count := make([]int, n)
+                       48890c24                MOVQ CX, 0(SP)          
+                       488b4c2450              MOVQ 0x50(SP), CX       
+                       48894c2408              MOVQ CX, 0x8(SP)        
+                       48894c2410              MOVQ CX, 0x10(SP)       
+                       e800000000              CALL 0x220e             [1:5]R_CALL:runtime.makeslice
+                       488b442418              MOVQ 0x18(SP), AX       
+                       488b4c2450              MOVQ 0x50(SP), CX       
+                       488b542430              MOVQ 0x30(SP), DX       
+                       31db                    XORL BX, BX             
        for i := 0; i < n; i++ {
-                       eb07                    JMP 0x22b5              
+                       eb07                    JMP 0x2228              
                perm1[i] = i // initial (trivial) permutation
-                       48893cfb                MOVQ DI, 0(BX)(DI*8)    
+                       48891cda                MOVQ BX, 0(DX)(BX*8)    
        for i := 0; i < n; i++ {
-                       48ffc7                  INCQ DI                 
-                       4839d7                  CMPQ DX, DI             
-                       7d0a                    JGE 0x22c4              
-               perm1[i] = i // initial (trivial) permutation
-                       4839f7                  CMPQ SI, DI             
-                       72ef                    JB 0x22ae               
-                       e93d020000              JMP 0x2501                      // panicindex
-                       48894c2470              MOVQ CX, 0x70(SP)       
+                       48ffc3                  INCQ BX                 
+                       4839cb                  CMPQ CX, BX             
+                       7cf4                    JL 0x2221

BCE could remove ^^^ panicindex because:

perm1 is initialized with make([]int, n) and its len is never changed
then it is initialized in a loop

        for i := 0; i < n; i++ {
                perm1[i] = i // initial (trivial) permutation
        }

so BCE knows n <= len(perm1) (actually ==), 0 <= i < n -> i < len(perm1) -> no BC needed.

        n1 := n - 1
-                       488d7aff                LEAQ -0x1(DX), DI       
-                       48897c2440              MOVQ DI, 0x40(SP)       
-                       4989d0                  MOVQ DX, R8             
-                       4531c9                  XORL R9, R9             
-                       4531d2                  XORL R10, R10           
+                       488d59ff                LEAQ -0x1(CX), BX       
+                       4889ce                  MOVQ CX, SI             
+                       31ff                    XORL DI, DI             
+                       4531c0                  XORL R8, R8             
        for {
-                       e986010000              JMP 0x2466              
+                       e9cf000000              JMP 0x230d              
                        count[r-1] = r
-                       488954d1f8              MOVQ DX, -0x8(CX)(DX*8) 
-                       4c89da                  MOVQ R11, DX            
+                       48894cc8f8              MOVQ CX, -0x8(AX)(CX*8) 
+                       48ffc9                  DECQ CX                 
                for ; r != 1; r-- {
-                       4883fa01                CMPQ $0x1, DX           
-                       740e                    JE 0x22fc               
-                       count[r-1] = r
-                       4c8d5aff                LEAQ -0x1(DX), R11      
-                       4939c3                  CMPQ AX, R11            
-                       72e9                    JB 0x22e0               
-                       e9fe010000              JMP 0x24fa                      // panicindex
+                       4883f901                CMPQ $0x1, CX           
+                       75f2                    JNE 0x223e

For count[r-1] = r BCE could remove ^^^ panicindex because:

count is initialized with make([]int, n) and its len is never changed
r is initialized as n, goes down to 1 in below loop, goes up in tail loop inside main loop for ; r < n; r++, so it is never > n. => r ∈ [1, n] always.
then there is loop

        for ; r != 1; r-- {
                count[r-1] = r
        }

so BCE knows

1 <= r <= n
-> 0 <= r-1 <= n-1 < n

so count[r-1] does not need BC.

                if perm1[0] != 0 && perm1[n1] != n1 {
-                       4885f6                  TESTQ SI, SI                    
-                       0f86ee010000            JBE 0x24f3                      // panicindex
-                       4c8b1b                  MOVQ 0(BX), R11                 
-                       4d85db                  TESTQ R11, R11                  
-                       0f8490010000            JE 0x24a1                       
-                       4839f7                  CMPQ SI, DI                     
-                       0f83d9010000            JAE 0x24f3                      // panicindex
-                       4e8b5cc3f8              MOVQ -0x8(BX)(R8*8), R11        
-                       4c39df                  CMPQ R11, DI                    
-                       0f846a010000            JE 0x2492                       
-                       4c8b5c2468              MOVQ 0x68(SP), R11              
-                       4c8b642438              MOVQ 0x38(SP), R12              
-                       41bd01000000            MOVL $0x1, R13                  
+                       4c8b0a                  MOVQ 0(DX), R9          
+                       4d85c9                  TESTQ R9, R9            
+                       0f84e5000000            JE 0x233d               
+                       4c8b4cf2f8              MOVQ -0x8(DX)(SI*8), R9 
+                       4c39cb                  CMPQ R9, BX             
+                       0f84cd000000            JE 0x2333               
+                       4c8b4c2438              MOVQ 0x38(SP), R9       
+                       41ba01000000            MOVL $0x1, R10

len(perm1) is known to be n
n is known to be >= 1 because in the beginning of fannkuch there is

        if n < 1 {
                return 0
        }

-> this way perm1[0] does not need BC.

n1 is initialized as n - 1 and is never changed
thus it is known 0 <= n1 <= n -1 < n
len(perm1) is once again known to be n

-> this way perm1[n1] does not need BC

                        for i := 1; i < n; i++ { // perm = perm1
-                       eb07                    JMP 0x2341              
+                       eb0b                    JMP 0x227e              
                                perm[i] = perm1[i]
-                       4f8934eb                MOVQ R14, 0(R11)(R13*8) 
+                       4e8b1cd2                MOVQ 0(DX)(R10*8), R11  
+                       4f891cd1                MOVQ R11, 0(R9)(R10*8)  
                        for i := 1; i < n; i++ { // perm = perm1
-                       49ffc5                  INCQ R13                
-                       4d39c5                  CMPQ R8, R13            
-                       7d17                    JGE 0x235d              
-                               perm[i] = perm1[i]
-                       4939f5                  CMPQ SI, R13            
-                       0f839d010000            JAE 0x24ec                      // panicindex
-                       4e8b34eb                MOVQ 0(BX)(R13*8), R14  
-                       4d39e5                  CMPQ R12, R13           
-                       72e2                    JB 0x233a               
-                       e98f010000              JMP 0x24ec                      // panicindex
-                       4c894c2450              MOVQ R9, 0x50(SP)       
+                       49ffc2                  INCQ R10                
+                       4939f2                  CMPQ SI, R10            
+                       7cf0                    JL 0x2273

i ∈ [1, n)
both perm and perm1 were initialized as make([]int, n) and their len is never changed

-> both perm[i] and perm1[i] do not need BC

                        k := perm1[0] // cache perm[0] in k
-                       4c8b2b                  MOVQ 0(BX), R13         
-                       4531f6                  XORL R14, R14           
+                       4c8b12                  MOVQ 0(DX), R10         
+                       4531db                  XORL R11, R11           
                        for {         // k!=0 ==> k>0
-                       eb6d                    JMP 0x23d7              
+                       eb32                    JMP 0x22bd              
                                        perm[i], perm[j] = perm[j], perm[i]
-                       4b8b0cfb                MOVQ 0(R11)(R15*8), CX  
-                       49890cfb                MOVQ CX, 0(R11)(DI*8)   
-                       4f890cfb                MOVQ R9, 0(R11)(R15*8)  
+                       4f8b34e1                MOVQ 0(R9)(R12*8), R14  
+                       4f8b3ce9                MOVQ 0(R9)(R13*8), R15  
+                       4f8934e9                MOVQ R14, 0(R9)(R13*8)  
+                       4f893ce1                MOVQ R15, 0(R9)(R12*8)  
                                for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
-                       488d4f01                LEAQ 0x1(DI), CX        
-                       49ffcf                  DECQ R15                
-                       488b7c2440              MOVQ 0x40(SP), DI       
-                       4c8b4c2450              MOVQ 0x50(SP), R9       
-                       48894c2448              MOVQ CX, 0x48(SP)       
-                       488b4c2470              MOVQ 0x70(SP), CX       
-                       488b7c2448              MOVQ 0x48(SP), DI       
-                       4c39ff                  CMPQ R15, DI            
-                       7d17                    JGE 0x23b2              
-                                       perm[i], perm[j] = perm[j], perm[i]
-                       4c39e7                  CMPQ R12, DI            
-                       0f8341010000            JAE 0x24e5                      // panicindex
-                       4d8b0cfb                MOVQ 0(R11)(DI*8), R9   
-                       4d39e7                  CMPQ R12, R15           
-                       72bd                    JB 0x236a               
-                       e933010000              JMP 0x24e5                      // panicindex

k is initialized as perm1[0] - no BC needed here since compiler already knows len(perm1) = n >= 1
initially perm1 is initialized as perm1[i] = i i ∈ [0, n)
from ^^^ the compiler knows initially ∀ x ∈ [0, n) perm1[x] ∈ [0, n)
later elements of perm1 are only exchanged with each other, so the compiler can prove ∀ x ∈ [0, n) perm1[x] ∈ [0, n) holds always
thus k itself can be proven to be in [0, n) on first iteration (see also below about all next iterations)
the inner loop is

        for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
                perm[i], perm[j] = perm[j], perm[i]
        }

j starts from k-1 < n-1 and is only decreasing
i starts from 1 and is only increasing
the loop is till i < j
thus inside loop:
- 1 < j < n - 1
- 1 <= i < n - 1

-> both i and j are in valid range and no BC is needed for perm[i] and perm[j]

-                               j := perm[k]
+                       49ffc5                  INCQ R13                
+                       49ffcc                  DECQ R12                
                        4d39e5                  CMPQ R12, R13
-                       0f8323010000            JAE 0x24de                      // panicindex
-                       4b8b3ceb                MOVQ 0(R11)(R13*8), DI  
+                       7ce5                    JL 0x228b               
+                               j := perm[k]
+                       4f8b24d1                MOVQ 0(R9)(R10*8), R12

k exchange is

        // Now exchange k (caching perm[0]) and perm[k]... with care!
        j := perm[k]
        perm[k] = k
        k = j

we know initially (first iter - see above) k ∈ [0, n)
thus perm[k] does not need BC
before loop perm is initialized as = perm1 - this way compiler can know ∀ x ∈ [0, n) perm[x] ∈ [0, n) initially
in j := perm[k] j is thus ∈ [0, n)
perm[k] = k preserves ∀ x ∈ [0, n) perm[x] ∈ [0, n) property.
from k = j compiler can see k continues to k ∈ [0, n) in next iterations

                                perm[k] = k
-                       4f892ceb                MOVQ R13, 0(R11)(R13*8) 
+                       4f8914d1                MOVQ R10, 0(R9)(R10*8)  
                                flips++
-                       4d8d6e01                LEAQ 0x1(R14), R13      
+                       4d8d5301                LEAQ 0x1(R11), R10      
                                if k == 0 {
-                       4885ff                  TESTQ DI, DI            
-                       7425                    JE 0x23f1               
-                       4d89ee                  MOVQ R13, R14           
-                       4989fd                  MOVQ DI, R13            
-                       488b7c2440              MOVQ 0x40(SP), DI       
+                       4d85e4                  TESTQ R12, R12          
+                       7412                    JE 0x22c9               
+                       4d89d3                  MOVQ R10, R11           
+                       4d89e2                  MOVQ R12, R10           
                                for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
-                       4d8d7dff                LEAQ -0x1(R13), R15     
-                       4889442458              MOVQ AX, 0x58(SP)       
-                       b801000000              MOVL $0x1, AX           
-                       4889442448              MOVQ AX, 0x48(SP)       
-                       488b442458              MOVQ 0x58(SP), AX       
-                       eba0                    JMP 0x2391              
+                       4d8d62ff                LEAQ -0x1(R10), R12     
+                       41bd01000000            MOVL $0x1, R13          
+                       ebd8                    JMP 0x22a1              
                        if flipsMax < flips {
-                       4d39ea                  CMPQ R13, R10           
-                       7c56                    JL 0x244c               
-                       e992000000              JMP 0x248d              
+                       4d39d0                  CMPQ R10, R8            
+                       7c2a                    JL 0x22f8               
+                       eb5e                    JMP 0x232e              
                                perm1[i] = perm1[i+1]
-                       4e893cd3                MOVQ R15, 0(BX)(R10*8)  
-                       4d89f2                  MOVQ R14, R10           
+                       4e8b64da08              MOVQ 0x8(DX)(R11*8), R12        
+                       4e8924da                MOVQ R12, 0(DX)(R11*8)          
+                       49ffc3                  INCQ R11                        
                        for i := 0; i < r; i++ {
-                       4939d2                  CMPQ DX, R10            
-                       7d1c                    JGE 0x2423              
-                               perm1[i] = perm1[i+1]
-                       4d8d7201                LEAQ 0x1(R10), R14              
-                       4939f6                  CMPQ SI, R14                    
-                       0f83c3000000            JAE 0x24d7                      // panicindex
-                       4e8b7cd308              MOVQ 0x8(BX)(R10*8), R15        
-                       4939f2                  CMPQ SI, R10                    
-                       72dd                    JB 0x23fb                       
-                       e9b4000000              JMP 0x24d7                      // panicindex
+                       4939cb                  CMPQ CX, R11            
+                       7cef                    JL 0x22d0

outer loop is for ; r < n; r++
inner loop is for i := 0; i < r; i++
inside loop r is thus < n, i.e. <= n-1
inside inner loop i is thus < r, i.e. < n-1, i.e. ∈ [0, n-1)
thus both perm1[i] and perm1[i+1] does not need BC
once again on perm1[i] = perm1[i+1] the compiler can see ∀ x ∈ [0, n) perm1[x] ∈ [0, n) property is preserved

                        perm1[r] = perm0
-                       4839f2                  CMPQ SI, DX             
-                       0f83a4000000            JAE 0x24d0              
-                       48893cd3                MOVQ DI, 0(BX)(DX*8)    
+                       4c8904ca                MOVQ R8, 0(DX)(CX*8)    
                        count[r]--
-                       4839c2                  CMPQ AX, DX             
-                       0f8390000000            JAE 0x24c9                      // panicindex
-                       488b3cd1                MOVQ 0(CX)(DX*8), DI    
-                       48ffcf                  DECQ DI                 
-                       48893cd1                MOVQ DI, 0(CX)(DX*8)    
+                       4c8b04c8                MOVQ 0(AX)(CX*8), R8    
+                       49ffc8                  DECQ R8                 
+                       4c8904c8                MOVQ R8, 0(AX)(CX*8)

outside main loop r is initially set to n
the main loop contains

        for ; r != 1; r-- {
                count[r-1] = r
        }

thus after this we know (at least on first mainloop iteration, see below about next) r = 1 furtner
the outer loop, once again, is for ; r < n; r++
thus inside r ∈ [1, n)
thus perm1[r] and count[r] do not need BC
the outer loop only increases r which was initially known to be = 1, so after it is known to be >= 1
this way when main loop runs next time in

        for ; r != 1; r-- {
                count[r-1] = r
        }

the compiler can prove, since r >= 1 the loop will terminate with r==1, not loop forever

                        if count[r] > 0 {
-                       4885ff                  TESTQ DI, DI            
-                       7f10                    JG 0x2459               
+                       4d85c0                  TESTQ R8, R8            
+                       7f10                    JG 0x2305               
                for ; r < n; r++ {
-                       48ffc2                  INCQ DX                 
-                       4c39c2                  CMPQ R8, DX             
-                       7d0b                    JGE 0x245c              
+                       48ffc1                  INCQ CX                 
+                       4839f1                  CMPQ SI, CX             
+                       7d0b                    JGE 0x2308              
                        perm0 := perm1[0]
-                       488b3b                  MOVQ 0(BX), DI          
-                       4531d2                  XORL R10, R10           
+                       4c8b02                  MOVQ 0(DX), R8          
+                       4531db                  XORL R11, R11           
                        for i := 0; i < r; i++ {
-                       eba9                    JMP 0x2402              
+                       ebd7                    JMP 0x22dc              
                for ; r < n; r++ {
-                       4c39c2                  CMPQ R8, DX             
+                       4839f1                  CMPQ SI, CX             
                if r == n {
-                       741a                    JE 0x2478               
-                       488b7c2440              MOVQ 0x40(SP), DI       
-                       4d89ea                  MOVQ R13, R10           
+                       7415                    JE 0x231f               
+                       4d89d0                  MOVQ R10, R8            
                if didpr < 30 {
-                       4983f91e                CMPQ $0x1e, R9          
-                       0f8d78feffff            JGE 0x22e8              
+                       4883ff1e                CMPQ $0x1e, DI          
+                       0f8d2fffffff            JGE 0x2246              
                        didpr++
-                       49ffc1                  INCQ R9                 
+                       48ffc7                  INCQ DI                 
                if didpr < 30 {
-                       e970feffff              JMP 0x22e8              
+                       e927ffffff              JMP 0x2246              
                        return flipsMax
-                       4c89ac2490000000        MOVQ R13, 0x90(SP)      
-                       488b6c2478              MOVQ 0x78(SP), BP       
-                       4881c480000000          ADDQ $0x80, SP          
+                       4c89542458              MOVQ R10, 0x58(SP)      
+                       488b6c2440              MOVQ 0x40(SP), BP       
+                       4883c448                ADDQ $0x48, SP          
                        c3                      RET                     
-                       4d89d5                  MOVQ R10, R13           
+                       4d89c2                  MOVQ R8, R10            
                        if flipsMax < flips {
-                       ebba                    JMP 0x244c              
-                       4c8b5c2468              MOVQ 0x68(SP), R11      
-                       4c8b642438              MOVQ 0x38(SP), R12      
-                       4d89d5                  MOVQ R10, R13           
+                       ebc5                    JMP 0x22f8              
+                       4c8b4c2438              MOVQ 0x38(SP), R9       
+                       4d89c2                  MOVQ R8, R10            
                if perm1[0] != 0 && perm1[n1] != n1 {
-                       ebab                    JMP 0x244c              
-                       4c8b5c2468              MOVQ 0x68(SP), R11      
-                       4c8b642438              MOVQ 0x38(SP), R12      
-                       4d89d5                  MOVQ R10, R13           
-                       eb9c                    JMP 0x244c              
+                       ebbb                    JMP 0x22f8              
+                       4c8b4c2438              MOVQ 0x38(SP), R9       
+                       4d89c2                  MOVQ R8, R10            
+                       ebb1                    JMP 0x22f8              
                return 0
-                       48c784249000000000000000        MOVQ $0x0, 0x90(SP)     
-                       488b6c2478                      MOVQ 0x78(SP), BP       
-                       4881c480000000                  ADDQ $0x80, SP          
-                       c3                              RET                     
-                       count[r]--
-  0x24c9               e800000000              CALL 0x24ce             [1:5]R_CALL:runtime.panicindex
-  0x24ce               0f0b                    UD2                     
-                       perm1[r] = perm0
-  0x24d0               e800000000              CALL 0x24d5             [1:5]R_CALL:runtime.panicindex
-  0x24d5               0f0b                    UD2                     
-                               perm1[i] = perm1[i+1]
-  0x24d7               e800000000              CALL 0x24dc             [1:5]R_CALL:runtime.panicindex
-  0x24dc               0f0b                    UD2                     
-                               j := perm[k]
-  0x24de               e800000000              CALL 0x24e3             [1:5]R_CALL:runtime.panicindex
-  0x24e3               0f0b                    UD2                     
-                                       perm[i], perm[j] = perm[j], perm[i]
-  0x24e5               e800000000              CALL 0x24ea             [1:5]R_CALL:runtime.panicindex
-  0x24ea               0f0b                    UD2                     
-                               perm[i] = perm1[i]
-  0x24ec               e800000000              CALL 0x24f1             [1:5]R_CALL:runtime.panicindex
-  0x24f1               0f0b                    UD2                     
-               if perm1[0] != 0 && perm1[n1] != n1 {
-  0x24f3               e800000000              CALL 0x24f8             [1:5]R_CALL:runtime.panicindex
-  0x24f8               0f0b                    UD2                     
-                       count[r-1] = r
-  0x24fa               e800000000              CALL 0x24ff             [1:5]R_CALL:runtime.panicindex
-  0x24ff               0f0b                    UD2                     
-               perm1[i] = i // initial (trivial) permutation
-  0x2501               e800000000              CALL 0x2506             [1:5]R_CALL:runtime.panicindex
-  0x2506               0f0b                    UD2                     
+                       48c744245800000000      MOVQ $0x0, 0x58(SP)     
+                       488b6c2440              MOVQ 0x40(SP), BP       
+                       4883c448                ADDQ $0x48, SP          
+                       c3                      RET                     
 func fannkuch(n int) int {
-                       e800000000              CALL 0x250d             [1:5]R_CALL:runtime.morestack_noctxt
-                       e9c0fcffff              JMP %22%22.fannkuch(SB) 
+                       e800000000              CALL 0x235f             [1:5]R_CALL:runtime.morestack_noctxt
+                       e90efeffff              JMP %22%22.fannkuch(SB)

^^^ are all CALL runtime.panicindex destinations and they were all covered by my comments.

So maybe I missed something (appologize then) but above I was able to prove all runtime bounds check are unneccessary in fannkuch. I don't see any reason the compiler cannot be taught to do the same. Thus for fannkuch it is theoretically possible to reach -B level with just BCE.

navytux commented 7 years ago

About compiler itself I've tried to see myself too. For this via passing GO_GCFLAGS=-B env var to make.bash I've bootstrapped second compiler (goB) which does not check bounds at compile runtime, checked it is really the case (e.g. go tool objdump pkg/tool/linux_amd64/compile |grep panicindex |wc -l), and compared go and goB via compilebench:

$ benchstat go.txt goB.txt 
name      old time/op     new time/op     delta
Template      171ms ± 3%      169ms ± 2%  -1.12%  (p=0.001 n=19+18)
Unicode      79.1ms ± 3%     78.3ms ± 4%    ~     (p=0.089 n=19+20)
GoTypes       559ms ± 2%      550ms ± 2%  -1.53%  (p=0.000 n=20+20)
Compiler      2.63s ± 2%      2.59s ± 1%  -1.52%  (p=0.000 n=20+20)
SSA           5.26s ± 1%      5.19s ± 1%  -1.40%  (p=0.000 n=19+19)
Flate         111ms ± 1%      109ms ± 1%  -1.74%  (p=0.000 n=20+18)
GoParser      139ms ± 2%      136ms ± 1%  -1.75%  (p=0.000 n=20+19)
Reflect       346ms ± 2%      341ms ± 1%  -1.61%  (p=0.000 n=19+19)
Tar          97.5ms ± 3%     96.3ms ± 3%  -1.18%  (p=0.002 n=19+19)
XML           192ms ± 1%      189ms ± 1%  -1.23%  (p=0.000 n=20+19)

It is hard for me to say it had no impact.

Josh, what was your way of testing? Same as mine, or maybe I'm missing something?

Thanks beforehand, Kirill

dr2chase commented 7 years ago

The optimization proposed here is very aggressive, and the proposed gains are smaller than what we hope to obtain from either getting better inlining into the compiler or passing arguments in registers. I'd like to target 1.11 instead of 1.10.

josharian commented 7 years ago

Josh, what was your way of testing? Same as mine, or maybe I'm missing something?

Effectively same as yours. My CPU's branch predictor just seems to be generally better than yours.

I'd like to target 1.11 instead of 1.10.

Agreed. @bradfitz any chance you could create 1.11 milestones for us?

navytux commented 7 years ago

@josharian thanks for feedback. For the reference my CPU is this:

http://cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-6600U+%40+2.60GHz&id=2608

it is the highest end one could get from 15W TDP mobile series in 6th Intel generation.

josharian commented 7 years ago

For reference, mine: https://cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-6920HQ+%40+2.90GHz&id=2699

navytux commented 7 years ago

Josh, thanks for feedback. I've found only below two "monster" heavy 5.5 kg weight 18.4" notebooks with your processor:

https://market.yandex.ru/product/14232631?show-uid=955252503320279700516001&nid=54544&glfilter=6069383%3A13313955 https://market.yandex.ru/product/14232101?show-uid=955252503320279700516002&nid=54544&glfilter=6069383%3A13313955

:)

Though maybe in your country the choice is more wide...

josharian commented 7 years ago

Mine's in one of the new MacBook Pros. To be clear, I'm not suggesting that you get a new processor! :) I'm still working on getting my udoo x86 up, which hopefully will have different branch prediction properties that I can use during development.

navytux commented 7 years ago

Josh thanks for feedback and good luck with getting udoo up and running.

gopherbot commented 6 years ago

Change https://golang.org/cl/100278 mentions this issue: cmd/compile: in prove, infer unsigned relations while branching

gopherbot commented 6 years ago

Change https://golang.org/cl/100277 mentions this issue: cmd/compile: in prove, add transitive closure of relations

rasky commented 6 years ago

The original issue tracked in this bug (BCE in bvec.AndNot) is fixed by CL100277 and CL100278. After the first lands, len(a) == len(b) can be used to trigger BCE; after the second, len(a) < len(b) will work too.

@navytux would it possible to move your wonderful comment about fannkuch in a separate issue? It would be easier to find and track.

navytux commented 6 years ago

@rasky, thanks for working on patches improving BCE. Sure, fannkuch case can be moved outside for easier observability and tracking -> please see https://github.com/golang/go/issues/24660.

josharian commented 6 years ago

The original issue tracked in this bug (BCE in bvec.AndNot) is fixed by CL100277 and CL100278.

Awesome.

Will it require additional work or hints to be added to bvec.AndNot, or will the existing code suffice? Just want to know whether I need to send any follow-up CLs.

rasky commented 6 years ago

Yes, you need to assert that all the involved vector have equal or compatible size (eg: >= of the one you're iterating on). In the OP you say "By construction, though, len(dst.b) == len(src1.b) == len(src2.b)" but unless that is visible to the compiler, you need explicitly assert it.

josharian commented 6 years ago

Cool. Once those CLs are in I'll find the right assertion to add.

gopherbot commented 6 years ago

Change https://golang.org/cl/110560 mentions this issue: cmd/compile: optimize bvec routines

golang / go

cmd/compile: no way to hint to eliminate bounds checks in bvec.AndNot #20393