
cmd/compile: suboptimal assembly for a + b + 1 #31900

Open josharian opened 5 years ago

josharian commented 5 years ago
package p

func f(a, b uint64) uint64 {
    return a + b + 1
}

On amd64, this assembles to:

"".f STEXT nosplit size=24 args=0x18 locals=0x0
    0x0000 00000 (x.go:4)   MOVQ    "".a+8(SP), AX
    0x0005 00005 (x.go:4)   MOVQ    "".b+16(SP), CX
    0x000a 00010 (x.go:4)   LEAQ    (CX)(AX*1), AX
    0x000e 00014 (x.go:4)   LEAQ    1(AX), AX
    0x0012 00018 (x.go:4)   MOVQ    AX, "".~r2+24(SP)
    0x0017 00023 (x.go:4)   RET

I believe the two LEAQs should instead be ADDQ CX, AX followed by INCQ AX. This would be six bytes of instructions instead of eight.
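For reference, the suggested lowering would look roughly like this (a sketch; the byte counts assume the standard REX.W encodings):

    MOVQ    "".a+8(SP), AX
    MOVQ    "".b+16(SP), CX
    ADDQ    CX, AX             // 3 bytes (REX.W + 01 /r)
    INCQ    AX                 // 3 bytes (REX.W + FF /0)
    MOVQ    AX, "".~r2+24(SP)
    RET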

I need to double-check, but I believe that this happens because we optimize a + b + const to LEAQ const(a)(b*1), dst, but then break apart the three-part LEAQ.

We can't fix this when lowering to instructions, because LEAQ doesn't clobber flags while ADD and INC do. So we need to somehow catch this in the rewrite rules. I wonder whether we should reconsider the previous strategy of breaking apart "slow" LEAQs at the last minute (introduced for https://github.com/golang/go/issues/21735).

cc @martisch @randall77

ulikunitz commented 5 years ago

Splitting the 3-operand LEA instruction into two 2-operand LEA instructions is only faster in tight loops that fit into the µOp cache. If there is no tight loop, the cost of decoding two instructions instead of one is higher than the one cycle the two 2-operand LEAs save during execution.

For the example above, the 3-operand LEA will be faster than two LEAs or ADDs. The situation changes if the function is inlined into a loop that fits into the µOp cache.

Intel formulated the following rule in the Intel® 64 and IA-32 Architectures Optimization Reference Manual:

Assembly/Compiler Coding Rule 33. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better. If code density and bandwidth out of the trace cache are the critical factor, then use the LEA instruction.

martisch commented 5 years ago

I'm fine with reverting the LEA splitting to keep it simple and to reduce binary size.

However, last time I benchmarked, decoding two 2-operand LEAs was better than the latency of a 3-operand LEA. Maybe that's changed and they now require more than one or two cycles for decoding combined; otherwise I do not understand why they would be slower in the example above. The Intel rule also only seems to suggest to me that the 3-operand LEA is better for trace cache utilization, not that two LEAs are only better if served from the trace cache. I need to have a look again.

@ulikunitz, can you post your CPU type and the benchmark and measurements that show the 3-operand LEA being faster than two 2-operand LEAs in the example above? Thanks.

Generally, I think we should prefer ADD $1 over INC: the latter's only upside is size, while the former seems never slower (some microarchitectures take an extra cycle for INC's flags update), fuses better, and doesn't partially update the flags.
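For concreteness, the size/flags trade-off between the two choices (a sketch; standard encodings assumed):

    INCQ AX        // 3 bytes; leaves CF untouched (partial flags update)
    ADDQ $1, AX    // 4 bytes (REX.W + 83 /0 ib); updates all arithmetic flags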

ulikunitz commented 5 years ago

I have microbenchmarked three variants of the function on multiple platforms.

Variant 1: ADDQ + INCQ
Variant 2: LEAQ + LEAQ
Variant 3: LEAQ (3 operands)

On Intel, variant 3 is always the fastest. The 3-operand LEAQ was only slower on an old Athlon X2 5600+.

I used the following commands to produce the output:

$ sed -n '/family/,+3p;/processor.*1/q' /proc/cpuinfo
$ go test -bench . -count 10 | tee bench.txt
$ benchstat bench.txt

Skylake (Xeon 2GHz)

cpu family  : 6
model       : 85
model name  : Intel(R) Xeon(R) CPU @ 2.00GHz
stepping    : 3

name     time/op
Add/1-4  2.38ns ± 1%
Add/2-4  2.38ns ± 0%
Add/3-4  2.35ns ± 1%

Haswell

cpu family  : 6
model       : 69
model name  : Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
stepping    : 1

name     time/op
Add/1-4  2.46ns ± 6%
Add/2-4  2.41ns ± 5%
Add/3-4  2.39ns ± 5%

Nehalem - before Sandy Bridge, where LEAQ implementation changed

cpu family  : 6
model       : 26
model name  : Intel(R) Core(TM) i7 CPU         950  @ 3.07GHz
stepping    : 5

name     time/op
Add/1-8  2.42ns ± 1%
Add/2-8  2.45ns ± 2%
Add/3-8  2.42ns ± 0%

Athlon X2

cpu family  : 15
model       : 67
model name  : AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
stepping    : 3

name     time/op
Add/1-2  3.94ns ± 0%
Add/2-2  3.98ns ± 2%
Add/3-2  4.38ns ± 3%

The assembler code for reference:

#include "textflag.h"

// Variant 1: ADDQ + INCQ
// func add1_1(x uint64, y uint64) uint64
TEXT ·add1_1(SB), NOSPLIT, $0-24
    MOVQ x+0(FP), AX
    MOVQ y+8(FP), CX
    ADDQ AX, CX
    INCQ CX
    MOVQ CX, ret+16(FP)
    RET

// Variant 2: two 2-operand LEAQs
// func add1_2(x uint64, y uint64) uint64
TEXT ·add1_2(SB), NOSPLIT, $0-24
    MOVQ x+0(FP), AX
    MOVQ y+8(FP), CX
    LEAQ (AX)(CX*1), CX
    LEAQ 1(CX), CX
    MOVQ CX, ret+16(FP)
    RET

// Variant 3: single 3-operand LEAQ
// func add1_3(x uint64, y uint64) uint64
TEXT ·add1_3(SB), NOSPLIT, $0-24
    MOVQ x+0(FP), AX
    MOVQ y+8(FP), CX
    LEAQ 1(AX)(CX*1), CX
    MOVQ CX, ret+16(FP)
    RET
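
For completeness, these TEXT symbols need matching Go prototypes in the package; a minimal sketch (the package name is an assumption):

package add1

// Prototypes for the assembly implementations above;
// the bodies live in the accompanying .s file.
func add1_1(x, y uint64) uint64
func add1_2(x, y uint64) uint64
func add1_3(x, y uint64) uint64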
martisch commented 5 years ago

Thank you very much for the data. A difference of 0.03ns usually doesn't mean one variant is generally slower; it looks to be within benchmarking variance (going by the ±6%), which we won't be able to narrow down completely even with higher counts. A difference of 0.2 or 0.3ns, however, likely does mean it is slower by a clock cycle. If these are really parallel benchmarks, I usually run with -cpu=1 to reduce load and interference, and disable frequency scaling and turbo boost too.
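
A minimal sketch of that setup on Linux, assuming the intel_pstate driver (paths and tools vary by system):

# run with GOMAXPROCS=1 to reduce load and interference
go test -run=none -bench=. -cpu=1 -count=10

# disable turbo boost on intel_pstate systems
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# pin the frequency governor to performance to avoid frequency scaling
sudo cpupower frequency-set -g performance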

I will run the two-LEA vs. 3-operand-LEA variants on my benchmarking machine once I am near it, too.

martisch commented 5 years ago

Currently I only have my laptop at hand (i7-3520M, 2.9GHz, Ivy Bridge) to make a quick benchmark.

On it, disabling slow LEA splitting (go tip 83f205fa8829781b9a4ef67ab47ae5fc96ecb6b5 with the check at ssa.go#L608 set to false, then go install cmd/compile) makes the benchmarks 1 clock cycle (around 0.3ns) slower; I regard 0.1ns as within run-to-run variance here. The check in question: https://github.com/golang/go/blob/2e4edf46977994c9d26df9327f0e41c1b60f3435/src/cmd/compile/internal/amd64/ssa.go#L608

old = go tip 83f205fa8829781b9a4ef67ab47ae5fc96ecb6b5
new = go tip 83f205fa8829781b9a4ef67ab47ae5fc96ecb6b5 with slow LEA splitting disabled

go test -cpu=1 -count=10 -bench=.*
benchstat ~/lea2.bench ~/lea3.bench 

name              old time/op  new time/op  delta
LEA22_1_noinline  4.24ns ± 0%  4.52ns ± 0%   +6.70%  (p=0.000 n=8+10)
LEA22_4_noinline  4.31ns ± 2%  4.61ns ± 2%   +6.91%  (p=0.000 n=10+10)
LEA22_1_inline    0.58ns ± 2%  0.87ns ± 4%  +50.05%  (p=0.000 n=10+10)
LEA22_4_inline    0.59ns ± 3%  0.97ns ± 3%  +64.47%  (p=0.000 n=10+9)
go test -cpu=1 -count=100 -bench="LEA22_1" > ~/lea3.bench
benchstat ~/lea2.bench ~/lea3.bench 
name              old time/op  new time/op  delta
LEA22_1_noinline  4.33ns ± 4%  4.60ns ± 3%   +6.41%  (p=0.000 n=100+100)
LEA22_1_inline    0.58ns ± 5%  0.86ns ± 4%  +49.52%  (p=0.000 n=98+98)

I used e.g. go test -c followed by go tool objdump -s BenchmarkLEA22_4_inline to check that the expected LEA instructions were emitted.
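
For example (a sketch; the test binary name depends on the package):

go test -c
go tool objdump -s BenchmarkLEA22_4_inline lea.test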

Benchmarks

package lea // assumed package name; the benchmarks belong in a _test.go file

import "testing"

var global int

func BenchmarkLEA22_1_noinline(b *testing.B) {
    var sink int
    for i := 0; i < b.N; i++ {
        sink = lea22_1_noinline(sink, sink)
    }
    global = sink
}

func BenchmarkLEA22_4_noinline(b *testing.B) {
    var sink int
    for i := 0; i < b.N; i++ {
        sink = lea22_4_noinline(sink, sink)
    }
    global = sink
}

func BenchmarkLEA22_1_inline(b *testing.B) {
    var sink int
    for i := 0; i < b.N; i++ {
        sink = lea22_1_inline(sink, sink)
    }
    global = sink
}

func BenchmarkLEA22_4_inline(b *testing.B) {
    var sink int
    for i := 0; i < b.N; i++ {
        sink = lea22_4_inline(sink, sink)
    }
    global = sink
}

Functions

//go:noinline
func lea22_1_noinline(a, b int) int {
    return 1 + a + b
}

func lea22_1_inline(a, b int) int {
    return 1 + a + b
}

//go:noinline
func lea22_4_noinline(a, b int) int {
    return 1 + (a + 4*b)
}

func lea22_4_inline(a, b int) int {
    return 1 + (a + 4*b)
}
ulikunitz commented 5 years ago

Martin,

could you provide the code on github.com? I would like to check it on the machines I have access to.

Kind regards,

Ulrich

gopherbot commented 5 years ago

Change https://golang.org/cl/176622 mentions this issue: cmd/compile: benchmark for slow lea

martisch commented 5 years ago

Uploaded as https://go-review.googlesource.com/c/go/+/176622; it also contains the change to generate slow LEAs in src/cmd/compile/internal/amd64/ssa.go.

For the original issue posted: if it is better for binary size or port utilization, we could emit two ADDs. That's also what gccgo 7.2.0 emits.

To make this work, it would be nice to have a last rule-based optimization pass after the normal pass for these kinds of low-level transformations that should not interfere with other rules but need to be done before emitting instructions. This would also make the amd64/ssa.go code simpler. Replacing MOV with XOR and some other optimizations could fit this category as well; see #27034 where I had commented in that direction.
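
To illustrate, such a pass could carry rules in the style of the existing AMD64.rules file; this is only a sketch, and the flagsLiveAfter predicate is invented for illustration, not something the rules language provides today:

// Hypothetical late-lowering rule (sketch): split a three-part LEAQ into
// two flag-clobbering adds when no flags are live at this point.
(LEAQ1 [c] x y) && c != 0 && !flagsLiveAfter(v) -> (ADDQconst [c] (ADDQ x y))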

martisch commented 5 years ago

Update: Running the benchmark from above on a Haswell Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz also results in a measurable slowdown of about 1 clock cycle when turning off slow LEA splitting:

go test -run=none -bench=.* -cpu=1 -count=20

name              old time/op  new time/op  delta
LEA22_1_noinline  3.85ns ± 3%  4.39ns ± 2%  +13.84%  (p=0.000 n=20+20)
LEA22_4_noinline  3.90ns ± 1%  4.27ns ± 2%   +9.40%  (p=0.000 n=19+20)
LEA22_1_inline    0.55ns ± 1%  0.82ns ± 3%  +50.39%  (p=0.000 n=20+20)
LEA22_4_inline    0.55ns ± 3%  0.82ns ± 2%  +49.77%  (p=0.000 n=18+19)