josharian opened this issue 5 years ago
Splitting the 3-operand LEA instruction into two 2-operand LEA instructions is only faster in tight loops that fit into the µOp cache. If there is no tight loop the cost of decoding two instructions instead of one is higher than the one cycle the 2 two-operand LEAs save during execution.
For the example above the 3-operand LEA will be faster than two LEAs or ADDs. The situation changes if the function is inlined into a loop that fits the µOp cache.
Intel formulated the following rule in the Intel® 64 and IA-32 Architectures Optimization Reference Manual:
Assembly/Compiler Coding Rule 33. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better. If code density and bandwidth out of the trace cache are the critical factor, then use the LEA instruction.
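For concreteness, these are the shapes being compared for x + y + 1 (illustrative register choice; the fully worked variants are listed further down in this thread):

LEAQ 1(AX)(CX*1), CX   // one 3-operand ("slow") LEA

LEAQ (AX)(CX*1), CX    // split into two 2-operand ("fast") LEAs
LEAQ 1(CX), CX

ADDQ AX, CX            // ADD sequence per the Intel rule (clobbers flags, unlike LEA)
ADDQ $1, CX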
I'm fine with reverting the LEA splitting to keep things simple and to reduce binary size.
However, the last time I benchmarked this, decoding two 2-operand LEAs was better than paying the latency of a 3-operand LEA. Maybe that has changed and they now require more than one or two cycles of decoding combined; otherwise I do not understand why they would be slower in the example above. The Intel rule also only seems to suggest to me that the 3-operand LEA is better for trace cache utilization, not that two LEAs are only better when served from the trace cache. I need to have another look.
@ulikunitz can you post your CPU type and the benchmark and measurements that show the 3-operand LEA being faster than two 2-operand LEAs in the example above, for comparison? Thanks.
Generally I think we should prefer ADD $1 over INC: the latter's only upside is size, while the former seems never slower (some archs take an extra cycle for the flag update of INC), fuses better, and doesn't partially update flags.
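For reference, the two forms side by side (encodings and flag behaviour per the Intel manuals; the fusion point is the claim above):

INCQ AX       // 3 bytes; leaves CF unchanged, i.e. a partial flag update
ADDQ $1, AX   // 4 bytes; writes all arithmetic flags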
I have microbenchmarked the three variants of the functions on multiple platforms.
Variant 1: ADDQ + INCQ
Variant 2: LEAQ + LEAQ
Variant 3: LEAQ (3 operands)
On Intel variant 3 is always the fastest. The 3-operand LEAQ was only slower on an old Athlon X2 5600.
I used the following commands to produce the output.
$ sed -n '/family/,+3p;/processor.*1/q' /proc/cpuinfo
$ go test -bench . -count 10 | tee bench.txt
$ benchstat bench.txt
Skylake (Xeon 2GHz)
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) CPU @ 2.00GHz
stepping : 3
name time/op
Add/1-4 2.38ns ± 1%
Add/2-4 2.38ns ± 0%
Add/3-4 2.35ns ± 1%
Haswell
cpu family : 6
model : 69
model name : Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
stepping : 1
name time/op
Add/1-4 2.46ns ± 6%
Add/2-4 2.41ns ± 5%
Add/3-4 2.39ns ± 5%
Nehalem - before Sandy Bridge, where LEAQ implementation changed
cpu family : 6
model : 26
model name : Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
stepping : 5
name time/op
Add/1-8 2.42ns ± 1%
Add/2-8 2.45ns ± 2%
Add/3-8 2.42ns ± 0%
Athlon X2
cpu family : 15
model : 67
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
stepping : 3
name time/op
Add/1-2 3.94ns ± 0%
Add/2-2 3.98ns ± 2%
Add/3-2 4.38ns ± 3%
The assembler code for reference:
#include "textflag.h"
// func add1_1(x uint64, y uint64) uint64
TEXT ·add1_1(SB), NOSPLIT, $0-24
MOVQ x+0(FP), AX
MOVQ y+8(FP), CX
ADDQ AX, CX
INCQ CX
MOVQ CX, ret+16(FP)
RET
// func add1_2(x uint64, y uint64) uint64
TEXT ·add1_2(SB), NOSPLIT, $0-24
MOVQ x+0(FP), AX
MOVQ y+8(FP), CX
LEAQ (AX)(CX*1), CX
LEAQ 1(CX), CX
MOVQ CX, ret+16(FP)
RET
// func add1_3(x uint64, y uint64) uint64
TEXT ·add1_3(SB), NOSPLIT, $0-24
MOVQ x+0(FP), AX
MOVQ y+8(FP), CX
LEAQ 1(AX)(CX*1), CX
MOVQ CX, ret+16(FP)
RET
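The benchmark driver itself is not shown above; a minimal sketch that would produce the Add/1, Add/2 and Add/3 names in the results could look like the following (package layout and names are assumptions, not the code actually used):

// add_amd64.go: Go declarations for the assembly functions above
// (same package as the .s file).
package add

func add1_1(x, y uint64) uint64
func add1_2(x, y uint64) uint64
func add1_3(x, y uint64) uint64

// add_test.go: one sub-benchmark per variant.
package add

import "testing"

var sink uint64

func BenchmarkAdd(b *testing.B) {
	variants := []struct {
		name string
		fn   func(x, y uint64) uint64
	}{
		{"1", add1_1},
		{"2", add1_2},
		{"3", add1_3},
	}
	for _, v := range variants {
		b.Run(v.name, func(b *testing.B) {
			var s uint64
			for i := 0; i < b.N; i++ {
				s = v.fn(s, s)
			}
			sink = s // keep the result live so the loop is not optimized away
		})
	}
}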
Thank you very much for the data. A difference of 0.03ns doesn't usually mean one variant is generally slower; it looks to be within benchmarking variance (going by the ±6%), which we will not be able to narrow down completely even with higher counts. A difference of 0.2 or 0.3ns, on the other hand, likely does mean it is slower by a clock cycle. If these are really parallel benchmarks: I usually run with -cpu=1 for benchmarks like these to reduce load and interference, and disable frequency scaling and turbo boost too.
I will also run the two-LEA vs. 3-operand-LEA variants on my benchmarking computer once I am near it.
Currently I only have my laptop at hand (i7-3520M 2.9GHz, Ivy Bridge) for a quick benchmark.
On it, disabling slow LEA splitting (go tip 83f205fa8829781b9a4ef67ab47ae5fc96ecb6b5, with the condition at https://github.com/golang/go/blob/2e4edf46977994c9d26df9327f0e41c1b60f3435/src/cmd/compile/internal/amd64/ssa.go#L608 set to false, then go install cmd/compile) makes the benchmarks 1 clock cycle (around 0.3ns) slower. I regard 0.1ns as within run-to-run variance here.
old = go tip 83f205fa8829781b9a4ef67ab47ae5fc96ecb6b5
new = go tip 83f205fa8829781b9a4ef67ab47ae5fc96ecb6b5 with slow LEA splitting disabled
go test -cpu=1 -count=10 -bench=.*
benchstat ~/lea2.bench ~/lea3.bench
name old time/op new time/op delta
LEA22_1_noinline 4.24ns ± 0% 4.52ns ± 0% +6.70% (p=0.000 n=8+10)
LEA22_4_noinline 4.31ns ± 2% 4.61ns ± 2% +6.91% (p=0.000 n=10+10)
LEA22_1_inline 0.58ns ± 2% 0.87ns ± 4% +50.05% (p=0.000 n=10+10)
LEA22_4_inline 0.59ns ± 3% 0.97ns ± 3% +64.47% (p=0.000 n=10+9)
go test -cpu=1 -count=100 -bench="LEA22_1" > ~/lea3.bench
benchstat ~/lea2.bench ~/lea3.bench
name old time/op new time/op delta
LEA22_1_noinline 4.33ns ± 4% 4.60ns ± 3% +6.41% (p=0.000 n=100+100)
LEA22_1_inline 0.58ns ± 5% 0.86ns ± 4% +49.52% (p=0.000 n=98+98)
I used e.g. go test -c
go tool objdump -s BenchmarkLEA22_4_inline
to check that the expected LEA instructions were emitted.
Benchmarks
var global int
func BenchmarkLEA22_1_noinline(b *testing.B) {
var sink int
for i := 0; i < b.N; i++ {
sink = lea22_1_noinline(sink, sink)
}
global = sink
}
func BenchmarkLEA22_4_noinline(b *testing.B) {
var sink int
for i := 0; i < b.N; i++ {
sink = lea22_4_noinline(sink, sink)
}
global = sink
}
func BenchmarkLEA22_1_inline(b *testing.B) {
var sink int
for i := 0; i < b.N; i++ {
sink = lea22_1_inline(sink, sink)
}
global = sink
}
func BenchmarkLEA22_4_inline(b *testing.B) {
var sink int
for i := 0; i < b.N; i++ {
sink = lea22_4_inline(sink, sink)
}
global = sink
}
Functions
//go:noinline
func lea22_1_noinline(a, b int) int {
return 1 + a + b
}
func lea22_1_inline(a, b int) int {
return 1 + a + b
}
//go:noinline
func lea22_4_noinline(a, b int) int {
return 1 + (a + 4*b)
}
func lea22_4_inline(a, b int) int {
return 1 + (a + 4*b)
}
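For reference (exact register allocation aside), the inlined body of lea22_1 compiles to the two shapes being compared:

LEAQ (AX)(CX*1), AX   // with slow-LEA splitting (current tip behaviour)
LEAQ 1(AX), AX

LEAQ 1(AX)(CX*1), AX  // with splitting disabled (single 3-operand LEA)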
Martin,
could you provide the code on github.com? I would like to check it on the machines I have access to.
Kind regards,
Ulrich
Change https://golang.org/cl/176622 mentions this issue: cmd/compile: benchmark for slow lea
Uploaded as https://go-review.googlesource.com/c/go/+/176622; it also contains the change to generate slow LEAs in src/cmd/compile/internal/amd64/ssa.go.
For the original issue posted: if it is better binary-size-wise or for port utilization, we could emit two ADDs. That is also what gccgo 7.2.0 emits.
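A sketch of that two-ADD form for 1 + a + b in Go assembly (register choice assumed):

ADDQ CX, AX   // a + b
ADDQ $1, AX   // + 1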
To make this work it would be nice to have a last rule-based optimization pass, after the normal pass, for these kinds of low-level transformations that should not interfere with other rules but need to be done before emitting instructions. This would also make the amd64/ssa.go code simpler. Replacing MOV with XOR and some other optimizations could fit this category as well; see #27034 where I had commented in that direction.
Update:
Running the benchmark from above on a Haswell Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz also shows a measurable slowdown of about 1 clock cycle when slow LEA splitting is turned off:
go test -run=none -bench=.* -cpu=1 -count=20
name old time/op new time/op delta
LEA22_1_noinline 3.85ns ± 3% 4.39ns ± 2% +13.84% (p=0.000 n=20+20)
LEA22_4_noinline 3.90ns ± 1% 4.27ns ± 2% +9.40% (p=0.000 n=19+20)
LEA22_1_inline 0.55ns ± 1% 0.82ns ± 3% +50.39% (p=0.000 n=20+20)
LEA22_4_inline 0.55ns ± 3% 0.82ns ± 2% +49.77% (p=0.000 n=18+19)
On amd64, this assembles to:
I believe the two LEAQs should instead be

ADDQ CX, AX
INC AX

This would be six bytes of instructions instead of 8. I need to double-check, but I believe that this happens because we optimize a + b + const to LEAQ const(a)(b*1), dst, but then break apart the three-part LEAQ. We can't fix this when lowering to instructions, because LEAQ doesn't clobber flags and ADD and INC do. So we need to somehow catch this in the rewrite rules. I wonder whether we should reconsider the previous strategy of breaking apart "slow" LEAQs at the last minute (for https://github.com/golang/go/issues/21735).
cc @martisch @randall77
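To illustrate the flags constraint in the last paragraph, here is a hypothetical code shape (not taken from actual compiler output) where a LEAQ sits between a compare and the branch that consumes its flags:

CMPQ AX, BX            // sets the flags the branch below depends on
LEAQ 1(CX)(DX*1), SI   // LEA does not write flags, so it is legal here
JEQ  done              // still sees the CMPQ result
// Rewriting the LEAQ into ADDQ/INCQ at this late stage would clobber the
// flags before the jump, which is why the split cannot simply be done
// when lowering to instructions.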