cmd/compile: new backend; multiple efficiency issues...

Please answer these questions before submitting your issue. Thanks!

Version: go version go1.7beta2 windows/amd64
Environment: set GOARCH=amd64 set GOBIN= set GOEXE=.exe set GOHOSTARCH=amd64 set GOHOSTOS=windows set GOOS=windows set GOPATH=F:\Go\ set GORACE= set GOROOT=F:\Go set GOTOOLDIR=F:\Go\pkg\tool\windows_amd64 set CC=gcc set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\Super\AppData\Local\Temp\go-build631535858=/tmp/go-build -gno-record-gcc-switches set CXX=g++ set CGO_ENABLED=1

However, these issues are likely also related to other operating systems and architectures, as in at least x386.

Runnable Program: Go Playground link:

// SoE project main.go
package main

import (
"fmt"
"math"
)

func primesOdds(top uint32) func() uint32 {
topndx := int((top - 3) / 2)
topsqrtndx := (int(math.Sqrt(float64(top))) - 3) / 2
cmpsts := make([]uint32, (topndx/32)+1)
for i := 0; i <= topsqrtndx; i++ {
   if cmpsts[i>>5]&(1<<(uint32(i)&0x1F)) == 0 {
       p := i + i + 3
       for j := (p*p - 3) >> 1; j <= topndx; j += p {
           cmpsts[j>>5] |= uint32(1) << (uint32(j) & 0x1F)
       }
   }
}
i := -1
return func() uint32 {
   oi := i
   if i <= topndx {
       i++
   }
   for i <= topndx && cmpsts[i>>5]&(uint32(1)<<(uint32(i)&0x1F)) != 0 {
       i++
   }
   if oi < 0 {
       return 2
   } else {
       return (uint32(oi) << 1) + 3
   }
}
}

func main() {
iter := primesOdds(1000000)
count := 0
for v := iter(); v <= 1000000; v = iter() {
   count++
}
fmt.Printf("%v\r\n", count)
}

What I saw:

The issue is the assembly code as viewed when "go tool compile -S > main.asm" is run, with a portion of that file as follows:

0x00e9 00233 (main.go:15)   LEAQ    (CX)(CX*1), R8  ;; **ISSUE 1** - change
0x00ed 00237 (main.go:15)   LEAQ    3(CX)(CX*1), R9 ;; prime 'p' calculation; good
0x00f2 00242 (main.go:16)   IMULQ   R9, R9  ;; 'sqr' calculation - good
0x00f6 00246 (main.go:16)   ADDQ    $-3, R9 ;; 'sqr' - 3 is good
0x00fa 00250 (main.go:16)   SARQ    $1, R9  ;; including shortcut divide by 2 - good
0x00fd 00253 (main.go:16)   CMPQ    R9, SI  ;; advance range compare
0x0100 00256 (main.go:16)   JGT $0, 319     ;; check - good
0x0102 00258 (main.go:17)   MOVQ    R9, R11 ;; **ISSUE 2** - change
0x0105 00261 (main.go:17)   SARQ    $5, R9  ;; calculate word address - good
0x0109 00265 (main.go:17)   CMPQ    R9, DX  ;; *** only here if array bounds check
0x010c 00268 (main.go:17)   JCC PANIC       ;; *** only here if array bounds check
0x0112 00274 (main.go:17)   MOVL    (AX)(R9*4), R12 ;; **ISSUE 3** - not this way
0x0116 00278 (main.go:17)   MOVQ    R11, R13    ;; **ISSUE 4** - not this way
0x0119 00281 (main.go:17)   ANDQ    $31, R11        ;; **ISSUE 5** - unnecessary
0x011d 00285 (main.go:17)   MOVQ    R11, CX     ;; part of **ISSUE 2** - change
0x0120 00288 (main.go:17)   MOVL    R10, R14        ;; **ISSUE 6** - unnecessary
0x0123 00291 (main.go:17)   SHLL    CX, R10     ;; 1 << ('j'&0x1F) - good
0x0126 00294 (main.go:17)   ORL R10, R12                ;; part of **ISSUE 3**
0x0129 00297 (main.go:17)   MOVL    R12, (AX)(R9*4) ;; part of **ISSUE 3**
0x012d 00301 (main.go:16)   LEAQ    3(R8)(R13*1), R9    ;; part of **ISSUE 1**
0x0132 00306 (main.go:13)   MOVQ    "".i+64(SP), CX ;; **ISSUE 7**; unnecessary
0x0137 00311 (main.go:17)   MOVL    R14, R10        ;; part of **ISSUE 6**
0x013a 00314 (main.go:16)   CMPQ    R9, SI  ;; end of loop
0x013d 00317 (main.go:16)   JLE $0, 258     ;; check - good
0x013f 00319 (main.go:13)   LEAQ    1(CX), R9       ;; part of **ISSUE #7** - not this way

Expected: 78498 is expected and that is what is output - not the issue:

The issue is the assembly code as viewed when "go tool compile -S > main.asm" is run, with a portion of that file as follows:
```
OuterLoop:
LEAQ    3(CX)(CX*1), R8 ;; prime 'p' calculation; good, **left in R8 as per ISSUE 1**
MOVQ    R8, R9  ;; **ISSUE 1** fixed
IMULQ   R9, R9  ;; 'sqr' calculation - good
ADDQ    $-3, R9 ;; 'sqr' - 3 is good
SARQ    $1, R9  ;; including shortcut divide by 2 - good - left in R9
CMPQ    R9, SI  ;; advance range compare
JGT PastInner       ;; check - good
InnerLoop:
MOVQ    R9, R11 ;; **ISSUE 2** fixed
MOVQ    R9,CX   ;; **ISSUE 4** fixed
SARQ    $5, R11 ;; calculate word address - good
MOVQ    $1,R10  ;; **ISSUE 6** fixed
CMPQ    R11, DX ;; *** only here if array bounds check
JCC PANIC       ;; *** only here if array bounds check
SHLL    CX, R10 ;; 1 << ('j'&0x1F) - good
ADDQ    R8, R9  ;; part of **ISSUE  1** fixed
ORL R10, (AX)(R11*4)    ;; **ISSUE 3** fixed
CMPQ    R9, SI  ;; end of loop
JLE InnerLoop   ;; check - good
PastInner:
MOVQ    "".i+64(SP), CX ;; **ISSUE 7** fixed; 'i' may well already be in another register
LEAQ    1(CX), R9       ;; now more available registers, if other register, just ADD $1
```
ISSUE 1: Preserves "2 * 'i'", that requires a full 'p' calculation inside the loop using an LEAQ instruction at 272, instead of preserving the full 'p' ('i' + 'i' + 3), that would then eliminate needing to recalculate 'p' inside the loop and would allow for a simple add instruction at line 272, which is a little faster.

ISSUE 2: Preserves the original in a new register before clobbering the original register in order to save latency (ignoring that the CPU will likely use Out of Order Execution - OOE, anyway), where a simple reordering of instructions would do the same and not require the advanced contents be calculated/moved back to the original register at the end of the loop. This is a common pattern.

ISSUE 3: Ignores that the "cmpsts[j>>5] |= ..." can be encoded with a single instruction "ORL R..., (BASEREG)(INDEXREG*4)" to save some complexity and time.

ISSUE 4: the same as ISSUE 2, where a simple instruction order change can mean that no register use swap needs to be made and alleviates the need for more complex LEA use.

ISSUE 5: When a uint32 value is shifted by uint32 bits, the compiler correctly eliminates a logical "and" by 0x1F (31) as the CPU limits the shift to this anyway; the issue is that if shifted by a uint, it doesn't eliminate it as it should (workaround is to use uint32 for shifts). We should check to see if a int32 shifted by 31 bits also gets eliminated as it should; in fact any masking above 31 (above 63 for 64-bit registers) is unnecessary.

ISSUE 6 is #16092, where a register is preserved instead of using a simple MOV immediate. This is pervasive further outside these bounds: as if the compiler has a - "avoid immediate MOV at all costs".

ISSUE 7: This instruction is completely unnecessary in restoring a value to the CX register when it never gets used in the loop and gets clobbered for each loop. Correctly, the CX register should be reloaded if necessary outside the loop. This is issue #14761.

The general observation is that the compiler tends to overuse LEA instructions, which instructions are very effective when necessary, but cost a little bit in speed as used instead of other simpler instructions: they are slightly slower than those simpler instructions, which doesn't seem to be taken into account.

Summary: The golang compiler is quite a way from being optimum, and won't come close to "Cee" (C/C++) efficiency until is comes much closer than this. The changes here aren't that complex, and some simple rule changes/additions should suffice. While version 1.7 is better than 1.6, it still is nowhere near as efficient as it needs to be. Rule changes/additions as suggested above can make tight inner loops run up to twice as fast or more on some architectures and some situations, although not that much here for the amd64 as a high end CPU will re-order instructions to minimize the execution time.

Issue 2: yes, it looks like there are cases where we should perform the op on the copy instead of the original. I'll have to think about how to detect when that is beneficial. Issue 3: Even before doing things like OR AX, (BX), we also would want OR (BX), AX. The problem is on amd64 there are a ton of possible instructions you could potentially want, including all the ops times all the addressing modes. So although I think it would be easy to handle this particular case, doing it generally would be a large maintenance burden. Issue 5 : should be an easy rule fix. Issue 7: I think this is #14761.

@randall77:

Issue 2: yes, it looks like there are cases where we should perform the op on the copy instead of the original. I'll have to think about how to detect when that is beneficial.

I think Issue 1 and 4 are all various symptoms of this: to me it looks like it would be better to keep the original and modify the copy more times than not.

Issue 3: Even before doing things like OR AX, (BX), we also would want OR (BX), AX. The problem is on amd64 there are a ton of possible instructions you could potentially want, including all the ops times all the addressing modes. So although I think it would be easy to handle this particular case, doing it generally would be a large maintenance burden.

x86 doesn't have so many modes, but I think it has the read/modify/write instructions?

Regarding addressing complexity, all the addressing modes are already used by the LEA instruction, although the execution time assessments don't seem quite right yet; it you solve that one, then the read/modify/write is solved except for latency.

Your concern about OR (BX),AX is misplaced I think, as latency is just dependent on whether the memory is in the cache or not, as for all memory accesses; the OR AX,(BX) is more of a concern only if the destination is in the dependency chain, then it is quite long, but almost always shorter or about the same time as separate MOV (BX),AX/XOR DX,AX/MOV AX,(BX) instructions. Note that 1) there is a latency of the first MOV, but the manipulations of the AX register can hide it, and 2) the use of the instructions is usually is this order where there is no dependency on the final (BX) destination.

I'm just saying - GCC can do it in the specific case where the modify/assignment operators are used, why can't we? GCC does not use them when the modify/assignment operators are not used although Clang/LLVM does. We'll never be able to effectively do extreme loop unrolling of some kinds without it (modulos type of things).

Issue 7: I think this is #14761.

Yes, I think you're right.

x86 doesn't have so many modes, but I think it has the read/modify/write instructions?

The SSA backend encodes each addressing mode in a separate opcode. So for integer loads, we already already have: MOVBload, MOVBQSXload, MOVWload, MOVWQSXload, MOVLload, MOVLQSXload, MOVQload, MOVOload, MOVBloadidx1, MOVWloadidx1, MOVWloadidx2, MOVLloadidx1, MOVLloadidx4, MOVQloadidx1, MOVQloadidx8. Similarly for stores. And that's not all the loads, for instance we haven't implemented MOVWloadix4 yet. That's just the MOV opcode. Multiply that by ADD,SUB,AND,OR,XOR,MUL,DIV,NOT,NEG, and probably others. It gets pretty unwieldy pretty quickly.

I'd like to solve this problem at some point. Maybe we bite the bullet and just do all those opcodes (and corresponding rewrite rules). Maybe we autogenerate somehow. It's low priority at the moment because I don't think it buys a whole lot (it's the same microops, just encoded more efficiently) and it's only x86. It's not needed for all the other architectures, and those are higher priority at the moment.

Regarding addressing complexity, all the addressing modes are already used by the LEA instruction, although the execution time assessments don't seem quite right yet; it you solve that one, then the read/modify/write is solved except for latency.

Your concern about OR (BX),AX is misplaced I think, as latency is just dependent on whether the memory is in the cache or not, as for all memory accesses; the OR AX,(BX) is more of a concern only if the destination is in the dependency chain, then it is quite long, but almost always shorter or about the same time as separate MOV (BX),AX/XOR DX,AX/MOV AX,(BX) instructions. Note that 1) there is a latency of the first MOV, but the manipulations of the AX register can hide it, and 2) the use of the instructions is usually is this order where there is no dependency on the final (BX) destination.

@randall77:

The SSA backend encodes each addressing mode in a separate opcode. So for integer loads, we already already have: MOVBload, MOVBQSXload, MOVWload, MOVWQSXload, MOVLload, MOVLQSXload, MOVQload, MOVOload, MOVBloadidx1, MOVWloadidx1, MOVWloadidx2, MOVLloadidx1, MOVLloadidx4, MOVQloadidx1, MOVQloadidx8. Similarly for stores. And that's not all the loads, for instance we haven't implemented MOVWloadix4 yet. That's just the MOV opcode. Multiply that by ADD,SUB,AND,OR,XOR,MUL,DIV,NOT,NEG, and probably others. It gets pretty unwieldy pretty quickly.

I'd like to solve this problem at some point. Maybe we bite the bullet and just do all those opcodes (and corresponding rewrite rules). Maybe we autogenerate somehow. It's low priority at the moment because I don't think it buys a whole lot (it's the same microops, just encoded more efficiently) and it's only x86. It's not needed for all the other architectures, and those are higher priority at the moment.

You are right that in the general case, using the read/modify/write opcodes doesn't make much difference to timing as generally the latency of the first move will be masked by the computations of the modify value so the difference is likely only ten to fifteen percent for a tight loop as here. However, if one is optimizing using loop unrolling techniques (where the modify value is either immediate or a pre-calculated register), it can make a very large difference as in currently using a load followed by the very short modify followed by the store takes 3 clock cycles where it would take only 1 clock cycle for the read/modify/write instruction (both with no latency/dependency on the memory contents of the store/write).

Often, for maximum speed, the base address will be a "unsafe.Pointer", which has the same read/load followed by modify followed by write/store template as currently used here (in fact it has another issue as it produces the simple base address for the MOV's by a LEA instruction when not necessary, which makes the timing even worse; I'll raise an Issue about that once I document it properly).

Regarding addressing complexity, all the addressing modes are already used by the LEA instruction, although the execution time assessments don't seem quite right yet; it you solve that one, then the read/modify/write is solved except for latency.

As I said, it seems you are already handling all of the addressing modes for LEA, although not the load/store variations; couldn't you just apply those rules to all the variations of the other opcodes?

And autogeneration of the code seems to be the way to do it, as it would seem that the rules would be the same for each different load or store for each of the different opcodes.

I can't comment on your priorities, and the current implementation of read then modify then write is "adequate", but we'll never compete with Cee "extreme" optimizations without this.

Just as a further goal to shoot for, Rust/LLVM produces the following assembly code for the same algorithm, although in Rust I had to use an unsafe mutable pointer reference to eliminate the array bounds check; the following code is about as good as it gets without hand optimizing:

    .p2align    4, 0x90     ;; alignment for better efficiency on loop branch destination
.LBB22_73:
    movq    %rcx, %rax  ;; note that the loop variable is kept in rcx for this case: left shift ready
    shrq    $5, %rax            ;; moved for the right shift word calculation here
    movl    $1, %edx            ;; this should be fixed as here as per issue 16092 using immediate load
    shll    %cl, %edx       ;; only works because the and is not necessary when shifting 32-bit reg
    orl %edx, (%r14,%rax,4) ;; note use of combined read/modify/write instruction!
    addq    %rbx, %rcx  ;; note only use of a reg for the prime value added for next loop
.LBB22_68:
    cmpq    %r13, %rcx  ;; loop limit kept in register
    jb  .LBB22_73       ;; note that the array base is kept in a reg (r14), doesn't need load in loop

On my Sky Lake Intel i5-6500, this takes about 3.2 CPU cycles per loop. The CPU is quite likely executing these instructions Out Of Order (OOE) in order to minimize latencies when registers are updated and immediately used in the next instruction.

Note that the above code recognizes that the loop variable can sit in the rcx register and thus be used for the variable left shift when shifting 32-bit registers (also works when shifting 64-bit registers) as the mask of the lower five bits (six bits for 64-bit) is not necessary and has been eliminated. golang version 1.7 eliminates the mask, but does not then recognize that it can also then use rcx as the loop variable eliminating one register move and the use of one extra register. A total of only six registers are used in the above inner loop, meaning it can be just as efficient in x86 code.

Note that is seems that stable golang version 1.7 for at least x86_64 target is very good at eliminating the bounds check in this case, recognizing that topndx is related to the `len(cmpsts), that the bounds check is therefore not necessary, thus eliminating it.

I'll leave it to @randall77 to decide about all the specific recommendations here and whether we want to tackle any of them for 1.9.

A quick update on the overall execution speed, though--things have at least been getting steadily better. For 1.7.5, 1.8.1, and tip at b53acd89db5847b9ddcba076df89bef8788dd348:

name \ time/op  p17          p18          ptip
PrimesOdds-8    3.17ns ± 1%  2.56ns ± 1%  2.23ns ± 1%

This uses the same code as before, but converted straightforwardly into a benchmark.

package p

import (
    "math"
    "testing"
)

func primesOdds(top uint32) func() uint32 {
    topndx := int((top - 3) / 2)
    topsqrtndx := (int(math.Sqrt(float64(top))) - 3) / 2
    cmpsts := make([]uint32, (topndx/32)+1)
    for i := 0; i <= topsqrtndx; i++ {
        if cmpsts[i>>5]&(1<<(uint32(i)&0x1F)) == 0 {
            p := i + i + 3
            for j := (p*p - 3) >> 1; j <= topndx; j += p {
                cmpsts[j>>5] |= uint32(1) << (uint32(j) & 0x1F)
            }
        }
    }
    i := -1
    return func() uint32 {
        oi := i
        if i <= topndx {
            i++
        }
        for i <= topndx && cmpsts[i>>5]&(uint32(1)<<(uint32(i)&0x1F)) != 0 {
            i++
        }
        if oi < 0 {
            return 2
        } else {
            return (uint32(oi) << 1) + 3
        }
    }
}

func BenchmarkPrimesOdds(b *testing.B) {
    iter := primesOdds(1000000)
    for i := 0; i < b.N; i++ {
        iter()
    }
}

Josh, would you consider adding this benchmark to the Go1 suite?

On Wed, 17 May 2017, 04:54 Josh Bleecher Snyder notifications@github.com wrote:

I'll leave it to @randall77 https://github.com/randall77 to decide about all the specific recommendations here and whether we want to tackle any of them for 1.9.

A quick update on the overall execution speed, though--things have at least been getting steadily better. For 1.7.5, 1.8.1, and tip at b53acd8 https://github.com/golang/go/commit/b53acd89db5847b9ddcba076df89bef8788dd348 :

name \ time/op p17 p18 ptip PrimesOdds-8 3.17ns ± 1% 2.56ns ± 1% 2.23ns ± 1%

This uses the same code as before, but converted straightforwardly into a benchmark.

package p import ( "math" "testing"

) func primesOdds(top uint32) func() uint32 { topndx := int((top - 3) / 2) topsqrtndx := (int(math.Sqrt(float64(top))) - 3) / 2 cmpsts := make([]uint32, (topndx/32)+1) for i := 0; i <= topsqrtndx; i++ { if cmpsts[i>>5]&(1<<(uint32(i)&0x1F)) == 0 { p := i + i + 3 for j := (p*p - 3) >> 1; j <= topndx; j += p { cmpsts[j>>5] |= uint32(1) << (uint32(j) & 0x1F) } } } i := -1 return func() uint32 { oi := i if i <= topndx { i++ } for i <= topndx && cmpsts[i>>5]&(uint32(1)<<(uint32(i)&0x1F)) != 0 { i++ } if oi < 0 { return 2 } else { return (uint32(oi) << 1) + 3 } } }

func BenchmarkPrimesOdds(b *testing.B) { iter := primesOdds(1000000) for i := 0; i < b.N; i++ { iter() } }

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/golang/go/issues/16192#issuecomment-301879960, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAcA1BqRjhOpMncUcq0sXFDCNg2cVasks5r6fDUgaJpZM4I-32j .

I'd love to see the Go1 suite expand. (And maybe shrink, too--I think fmt is over-represented.) I'm sure we could find a bunch of candidates (#16122 also comes to mind--recency effects). But we should do a considered, bulk addition. Would you mind filing a new issue, where we could collect a set of proposed additions and then consider them en masse?

@josharian, maybe an issue label would do here?

I think @bradfitz and @spf13 were trying to trim down labels. Let's discuss all this in a new issue so as to leave this one on-topic. I'll file one now.

My problem with the mini benchmark as per @josharian is that the timing includes the determination of the results and not just the culling of the composites as was the intention; thus, the improved results may only indicate better iteration rather than an improved tight composite culling loop that was intended to be tested.

The benchmark timing results should only include the

    iter := primesOdds(1000000)

line and not the

    for i := 0; i < b.N; i++ {
        iter()
    }

lines.

Sorry for misinterpreting. To be clear, is this the correct translation?


var sink func() uint32

func BenchmarkPrimesOdds(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = primesOdds(1000000)
    }
}

For that, with tip at c20e54533ea49ca68640d9a59c9ed935b27da8e5, I see a regression from 1.8 to 1.9:

name \ time/op  17          18          tip
PrimesOdds-8    978µs ± 1%  884µs ± 1%  990µs ± 2%

I see: (where tip == 1caa0629)

bradfitz@gdev:~/src/issue_16192$ benchstat 1.6 tip
name          old time/op  new time/op  delta
PrimesOdds-4  4.07ns ± 1%  4.42ns ± 1%  +8.46%  (p=0.000 n=10+9)

bradfitz@gdev:~/src/issue_16192$ benchstat 1.6 1.7
name          old time/op  new time/op  delta
PrimesOdds-4  4.07ns ± 1%  4.40ns ± 1%  +8.18%  (p=0.000 n=10+10)

bradfitz@gdev:~/src/issue_16192$ benchstat 1.7 tip
name          old time/op  new time/op  delta
PrimesOdds-4  4.40ns ± 1%  4.42ns ± 1%   ~     (p=0.227 n=10+9)

golang / go

cmd/compile: new backend; multiple efficiency issues... #16192