josharian opened this issue 8 years ago
Just for kicks, I tried that; it showed some mixed results for go test -v -bench=. -run=NONE runtime:
benchmark old ns/op new ns/op delta
BenchmarkGrowSliceBytes-8 33.3 32.4 -2.70%
BenchmarkGrowSliceInts-8 58.0 59.4 +2.41%
BenchmarkAppend-8 18.6 17.4 -6.45%
BenchmarkAppendGrowByte-8 1903243 1920778 +0.92%
BenchmarkAppendGrowString-8 278920100 143081182 -48.70% # << this is really interesting
BenchmarkAppend1Byte-8 3.13 3.11 -0.64%
BenchmarkAppend4Bytes-8 3.08 3.18 +3.25%
BenchmarkAppend7Bytes-8 3.12 3.19 +2.24%
BenchmarkAppend8Bytes-8 2.79 3.04 +8.96%
BenchmarkAppend15Bytes-8 4.03 4.27 +5.96%
BenchmarkAppend16Bytes-8 4.28 4.18 -2.34%
BenchmarkAppend32Bytes-8 3.16 3.38 +6.96%
BenchmarkAppendStr1Byte-8 3.10 3.62 +16.77%
BenchmarkAppendStr4Bytes-8 3.05 3.23 +5.90%
BenchmarkAppendStr8Bytes-8 3.06 2.83 -7.52%
BenchmarkAppendStr16Bytes-8 4.18 4.00 -4.31%
BenchmarkAppendStr32Bytes-8 3.45 4.38 +26.96%
BenchmarkAppendSpecialCase-8 18.3 19.3 +5.46%
BenchmarkCopy1Byte-8 3.86 4.47 +15.80%
BenchmarkCopy2Byte-8 3.93 4.20 +6.87%
Want me to prep a CL?
By "that", do you mean (1), (2), or (3)?
We have to be careful not to run afoul of the issue in #14855. I don't think this optimization would cause that problem, but it probably deserves a closer look.
I tried (2), modifying writebarrierptr itself.
I'll mess with it later and test against the case in #14855.
CL https://golang.org/cl/21027 mentions this issue.
Thinking out loud, one interesting aspect of modifying the calling code (3 above) is that it might allow other optimization passes to remove the code entirely by proving that *dst == src. I guess that depends on whether *dst is actually a load or is an ssa.Value.
The benchmark results from CL 21027 suggest that append may be the main source of idempotent pointer updates. Josh, do you happen to know what was one level up from the *dst==src write barriers? If it's mostly append, I suspect we can optimize just append to avoid the unnecessary pointer updates altogether.
One potential downside of modifying writebarrierptr this way is that it can cause extra coherence traffic by first pulling the pointer's cache line into shared and then upgrading it to modified. Though I doubt that's what's causing the slowdowns shown in the CL commit message, since those benchmarks are sequential.
I don't think these changes run afoul of the races in #14855. In that case we were doing the pointer write, but not doing the write barrier. It should be safe if we don't do either.
If we go the route of modifying the calling code, it would probably be better to make it:
if writeBarrier.enabled {
	if *dst != src {
		writebarrierptr(dst, src)
	}
} else {
	*dst = src
}
so we don't penalize the code with an extra (and potentially poorly predictable) branch when write barriers are off.
@aclements I'm trying something like that, I'll let you know what happens after the benchmarks finish.
I did a bit of experimenting. All of these results are for 4c9a470 (two days ago; forgot to sync).
First, I reproduced @josharian's result using the following patch:
diff --git a/src/runtime/mbarrier.go b/src/runtime/mbarrier.go
index 523d890..15244f8 100644
--- a/src/runtime/mbarrier.go
+++ b/src/runtime/mbarrier.go
@@ -14,6 +14,7 @@
package runtime
import (
+ "runtime/internal/atomic"
"runtime/internal/sys"
"unsafe"
)
@@ -123,10 +124,19 @@ func writebarrierptr_nostore1(dst *uintptr, src uintptr) {
releasem(mp)
}
+var wb, wbe, wbz uint64
+
// NOTE: Really dst *unsafe.Pointer, src unsafe.Pointer,
// but if we do that, Go inserts a write barrier on *dst = src.
//go:nosplit
func writebarrierptr(dst *uintptr, src uintptr) {
+ if src == 0 {
+ atomic.Xadd64(&wbz, +1)
+ } else if *dst == src {
+ atomic.Xadd64(&wbe, +1)
+ }
+ atomic.Xadd64(&wb, +1)
+
*dst = src
if writeBarrier.cgo {
cgoCheckWriteBarrier(dst, src)
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index d386797..b146fa4 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -208,6 +208,7 @@ func main() {
// os_beforeExit is called from os.Exit(0).
//go:linkname os_beforeExit os.runtime_beforeExit
func os_beforeExit() {
+ println("WB", wb, wbe, wbz)
if raceenabled {
racefini()
}
I get this result:
$ go build -a |& awk '/^WB/{n+=$2; eq+=$3; z+=$4} END{print n " " eq " " z}'
14525087 3412554 1422002
That is, of 14M write barriers, 10% are writing nil and 23.5% have *dst == src (but are not writing nil). I separately checked that almost all of the 10% nil writes also have *dst == src.
We should probably skip the write barrier earlier if src is nil. Currently we go through quite a bit of work before checking for a nil pointer.
Focusing on the ones where *dst == src, but src != nil, I profiled where they're coming from using this patch:
diff --git a/src/runtime/mbarrier.go b/src/runtime/mbarrier.go
index 523d890..e8ce821 100644
--- a/src/runtime/mbarrier.go
+++ b/src/runtime/mbarrier.go
@@ -127,6 +127,13 @@ func writebarrierptr_nostore1(dst *uintptr, src uintptr) {
// but if we do that, Go inserts a write barrier on *dst = src.
//go:nosplit
func writebarrierptr(dst *uintptr, src uintptr) {
+ if src != 0 && *dst == src {
+ cpc := getcallerpc((unsafe.Pointer)(&dst)) - 1
+ f := findfunc(cpc)
+ file, line := funcline(f, cpc)
+ print("WBE ", file, ":", line, " ", funcname(f), "\n")
+ }
+
*dst = src
if writeBarrier.cgo {
cgoCheckWriteBarrier(dst, src)
The result is
$ go build -a std |& awk '/^WBE/{print $2}' logx | sort | uniq -c | sort -nr | head -n 10
16519 /home/austin/go.dev/src/cmd/compile/internal/gc/typecheck.go:176
15369 /home/austin/go.dev/src/cmd/compile/internal/gc/dcl.go:58
15155 /home/austin/go.dev/src/cmd/compile/internal/ssa/sparseset.go:45
12066 /home/austin/go.dev/src/cmd/compile/internal/gc/dcl.go:753
8865 /home/austin/go.dev/src/cmd/compile/internal/ssa/value.go:167
8493 /home/austin/go.dev/src/cmd/compile/internal/ssa/func.go:94
8206 /home/austin/go.dev/src/cmd/compile/internal/ssa/stackalloc.go:266
7828 /home/austin/go.dev/src/runtime/mheap.go:603
7288 /home/austin/go.dev/src/runtime/mheap.go:196
6871 /home/austin/go.dev/src/runtime/hashmap.go:793
Of these, all but dcl.go:58, dcl.go:753, mheap.go:603, and hashmap.go:793 are appends. @randall77, can we optimize the generated append code so that the path that grows the slice in place doesn't write back the pointer if we're writing to the same slice we're growing? I imagine this would be something like the slicing optimization.
We can safely skip the write barrier on mheap.go:603. I'm a little surprised that it does equal pointer writes so often; there may also be an algorithmic improvement here.
We could put a conditional around the one in hashmap.go. I think that happens when we're iterating over a map and the iterator stays in the same bucket. I don't totally understand this code; there may be better ways to improve it.
I accidentally uploaded a 2nd CL, https://go-review.googlesource.com/#/c/21138/
➜ benchcmp /tmp/master.txt /tmp/wbfat.txt | grep -v 0\. # just show changes > +-0.x%
benchmark old ns/op new ns/op delta
BenchmarkGrowSliceBytes-8 35.4 32.1 -9.32%
BenchmarkAppend1Byte-8 3.16 3.11 -1.58%
BenchmarkAppend8Bytes-8 2.79 2.69 -3.58%
BenchmarkAppendStr8Bytes-8 3.38 3.23 -4.44%
BenchmarkAppendStr32Bytes-8 3.23 3.18 -1.55%
BenchmarkAppendSpecialCase-8 17.6 18.5 +5.11%
BenchmarkCopy4Byte-8 3.89 3.78 -2.83%
BenchmarkCopy128Byte-8 5.47 5.62 +2.74%
BenchmarkCopy2String-8 3.52 3.62 +2.84%
BenchmarkCopy12String-8 3.68 3.52 -4.35%
BenchmarkCopy32String-8 3.47 3.35 -3.46%
BenchmarkCopy128String-8 5.47 5.33 -2.56%
BenchmarkCopyFat32-8 1.36 1.31 -3.68%
BenchmarkStackGrowthDeep-8 12598 13692 +8.68%
benchmark old MB/s new MB/s speedup
benchmark old allocs new allocs delta
BenchmarkAppendGrowByte-8 51 52 +1.96%
BenchmarkAppendGrowString-8 184 188 +2.17%
@randall77, this is a regression in the SSA back end, similar to the reslicing one. The old back end avoided updating the base pointer in the append fast path, and therefore avoided the write barrier.
$ cat /tmp/x.go
package p
func f(x *[]int) []int {
*x = append(*x, 1)
return *x
}
$ go tool compile -ssa=0 -S /tmp/x.go | grep :4
// fast path - no write barrier
0x0028 00040 (/tmp/x.go:4) MOVQ "".x+88(FP), BX
0x002d 00045 (/tmp/x.go:4) NOP
0x002d 00045 (/tmp/x.go:4) MOVQ (BX), DX
0x0030 00048 (/tmp/x.go:4) MOVQ 8(BX), AX
0x0034 00052 (/tmp/x.go:4) MOVQ 16(BX), CX
0x0038 00056 (/tmp/x.go:4) MOVQ AX, BP
0x003b 00059 (/tmp/x.go:4) INCQ BP
0x003e 00062 (/tmp/x.go:4) CMPQ BP, CX
0x0041 00065 (/tmp/x.go:4) JHI $1, 128
0x0043 00067 (/tmp/x.go:4) MOVQ BP, 8(BX)
0x0047 00071 (/tmp/x.go:4) LEAQ (DX)(AX*8), BX
0x004b 00075 (/tmp/x.go:4) MOVQ $1, (BX)
// slow path code at bottom of function
0x0080 00128 (/tmp/x.go:4) LEAQ type.[]int(SB), BX
0x0087 00135 (/tmp/x.go:4) MOVQ BX, (SP)
0x008b 00139 (/tmp/x.go:4) MOVQ DX, 8(SP)
0x0090 00144 (/tmp/x.go:4) MOVQ AX, 16(SP)
0x0095 00149 (/tmp/x.go:4) MOVQ CX, 24(SP)
0x009a 00154 (/tmp/x.go:4) MOVQ BP, 32(SP)
0x009f 00159 (/tmp/x.go:4) PCDATA $0, $0
0x009f 00159 (/tmp/x.go:4) CALL runtime.growslice(SB)
0x00a4 00164 (/tmp/x.go:4) MOVQ 40(SP), DX
0x00a9 00169 (/tmp/x.go:4) MOVQ 48(SP), AX
0x00ae 00174 (/tmp/x.go:4) MOVQ 56(SP), CX
0x00b3 00179 (/tmp/x.go:4) MOVQ "".x+88(FP), BX
0x00b8 00184 (/tmp/x.go:4) CMPQ BX, $0
0x00bc 00188 (/tmp/x.go:4) JEQ $1, 260
0x00be 00190 (/tmp/x.go:4) MOVQ AX, BP
0x00c1 00193 (/tmp/x.go:4) MOVQ AX, "".autotmp_0001+64(SP)
0x00c6 00198 (/tmp/x.go:4) INCQ BP
0x00c9 00201 (/tmp/x.go:4) MOVQ BP, 8(BX)
0x00cd 00205 (/tmp/x.go:4) MOVQ CX, 16(BX)
0x00d1 00209 (/tmp/x.go:4) MOVQ DX, "".autotmp_0000+72(SP)
0x00d6 00214 (/tmp/x.go:4) CMPB runtime.writeBarrier(SB), $0
0x00dd 00221 (/tmp/x.go:4) JNE $0, 231
0x00df 00223 (/tmp/x.go:4) MOVQ DX, (BX)
0x00e2 00226 (/tmp/x.go:4) JMP 71
0x00e7 00231 (/tmp/x.go:4) MOVQ BX, (SP)
0x00eb 00235 (/tmp/x.go:4) MOVQ DX, 8(SP)
0x00f0 00240 (/tmp/x.go:4) PCDATA $0, $1
0x00f0 00240 (/tmp/x.go:4) CALL runtime.writebarrierptr(SB)
0x00f5 00245 (/tmp/x.go:4) MOVQ "".autotmp_0000+72(SP), DX
0x00fa 00250 (/tmp/x.go:4) MOVQ "".autotmp_0001+64(SP), AX
0x00ff 00255 (/tmp/x.go:4) JMP 71
0x0104 00260 (/tmp/x.go:4) MOVL AX, (BX)
0x0106 00262 (/tmp/x.go:4) JMP 190
0x0108 00264 (/tmp/x.go:4) NOP
$
$ go tool compile -ssa=1 -S /tmp/x.go | grep :4
// fast path - unnecessary update of base, including test of write barrier flag + possible call
0x0017 00023 (/tmp/x.go:4) MOVQ "".x+88(FP), CX
0x001c 00028 (/tmp/x.go:4) MOVQ 16(CX), DX
0x0020 00032 (/tmp/x.go:4) MOVQ 8(CX), BX
0x0024 00036 (/tmp/x.go:4) MOVQ BX, "".autotmp_0000+72(SP)
0x0029 00041 (/tmp/x.go:4) MOVQ (CX), BP
0x002c 00044 (/tmp/x.go:4) LEAQ 1(BX), SI
0x0030 00048 (/tmp/x.go:4) MOVQ SI, "".autotmp_0001+64(SP)
0x0035 00053 (/tmp/x.go:4) CMPQ SI, DX
0x0038 00056 (/tmp/x.go:4) JGT $0, 140
0x003a 00058 (/tmp/x.go:4) MOVQ $1, (BP)(BX*8)
0x0043 00067 (/tmp/x.go:4) MOVQ SI, 8(CX)
0x0047 00071 (/tmp/x.go:4) MOVQ DX, 16(CX)
0x004b 00075 (/tmp/x.go:4) MOVL runtime.writeBarrier(SB), AX
0x0051 00081 (/tmp/x.go:4) TESTB AL, AL
0x0053 00083 (/tmp/x.go:4) JNE $0, 119
0x0055 00085 (/tmp/x.go:4) MOVQ BP, (CX)
// slow path for write barrier call in append "fast" path
0x0077 00119 (/tmp/x.go:4) MOVQ CX, (SP)
0x007b 00123 (/tmp/x.go:4) MOVQ BP, 8(SP)
0x0080 00128 (/tmp/x.go:4) PCDATA $0, $0
0x0080 00128 (/tmp/x.go:4) CALL runtime.writebarrierptr(SB)
// slow path for append calling growslice
0x008c 00140 (/tmp/x.go:4) LEAQ type.[]int(SB), AX
0x0093 00147 (/tmp/x.go:4) MOVQ AX, (SP)
0x0097 00151 (/tmp/x.go:4) MOVQ BP, 8(SP)
0x009c 00156 (/tmp/x.go:4) MOVQ BX, 16(SP)
0x00a1 00161 (/tmp/x.go:4) MOVQ DX, 24(SP)
0x00a6 00166 (/tmp/x.go:4) MOVQ SI, 32(SP)
0x00ab 00171 (/tmp/x.go:4) PCDATA $0, $0
0x00ab 00171 (/tmp/x.go:4) CALL runtime.growslice(SB)
0x00b0 00176 (/tmp/x.go:4) MOVQ 40(SP), BP
0x00b5 00181 (/tmp/x.go:4) MOVQ 56(SP), DX
0x00bf 00191 (/tmp/x.go:4) MOVQ "".autotmp_0000+72(SP), BX
0x00c4 00196 (/tmp/x.go:4) MOVQ "".autotmp_0001+64(SP), SI
0x00c9 00201 (/tmp/x.go:4) JMP 58
0x00ce 00206 (/tmp/x.go:4) NOP
$
For the hashtable iterator case, we can just explicitly check whether we're writing something new back. I'll open a separate bug for the append writeback.
I remeasured this after CLs 21812, 21813, and 21814, which address the append and the map iteration cases.
New results, using Austin's approach above:
29354 1173 2934
So 4% of writes have *dst == src, and 10% have src == nil.
The new top lines for *dst == src (line numbers at commit de7ee57c7ead59899d5b412a839c995de0e813b5) are:
37.57% 420 runtime/proc.go:259 runtime.gopark
14.13% 158 runtime/mheap.go:603 runtime.(*mheap).allocSpanLocked
5.28% 59 runtime/mheap.go:588 runtime.(*mheap).allocSpanLocked
4.92% 55 runtime/mgc.go:1725 runtime.gcCopySpans
4.92% 55 runtime/mgc.go:1726 runtime.gcCopySpans
The gopark calls are unnecessary, since all the strings that reach it are constants, but there's no way for the compiler to know that.
I've filed #15226 for allocSpanLocked. gcCopySpans might deserve a look as well, but we're close to scraping bottom (yay!).
I will investigate whether it is better to exit writebarrierptr early if src == nil or to check it in the generated code, and send a CL for one of those to close this issue.
CL https://golang.org/cl/21813 mentions this issue.
CL https://golang.org/cl/21814 mentions this issue.
Mailed CLs 21820 and 21821 that implement the two "do less work if src==nil" strategies described above. The numbers for adding a check to the wrapper code are better, although it does increase binary size by ~0.1%. It's not obvious to me which of the two (if either) should go in. Feedback or better benchmark numbers welcomed.
CL https://golang.org/cl/21821 mentions this issue.
CL https://golang.org/cl/21820 mentions this issue.
So besides src==nil and *dst==src, there's another condition that would be interesting to get stats on: dst points to the stack. It might illuminate code (I'm looking at you, convT2I and friends) where the writes are always within the stack. We might still need the write barrier because of stack barriers, but maybe some rearrangement/optimization would be possible.
I had a sudden suspicion that I should look again at this. With the move of the write barrier to assembly, it has gotten a bit harder to instrument. Leaving in the vestigial wbe variable so I could re-use Austin's awk command, here is my instrumentation for amd64:
diff --git a/src/runtime/asm_amd64.s b/src/runtime/asm_amd64.s
index 2376fe0aae..f07c0bcbe1 100644
--- a/src/runtime/asm_amd64.s
+++ b/src/runtime/asm_amd64.s
@@ -2384,6 +2384,15 @@ TEXT runtime·gcWriteBarrier(SB),NOSPLIT,$120
// faster than having the caller spill these.
MOVQ R14, 104(SP)
MOVQ R13, 112(SP)
+
+ MOVQ $1, R13
+ CMPQ AX, (DI)
+ JNE 3(PC)
+ LOCK; XADDQ R13, runtime·wbe(SB)
+
+ MOVQ $1, R13
+ LOCK; XADDQ R13, runtime·wb(SB)
+
// TODO: Consider passing g.m.p in as an argument so they can be shared
// across a sequence of write barriers.
get_tls(R13)
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f20e77eee5..0e0c1d446c 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -224,9 +224,12 @@ func main() {
}
}
+var wb, wbe, wbz uint64
+
// os_beforeExit is called from os.Exit(0).
//go:linkname os_beforeExit os.runtime_beforeExit
func os_beforeExit() {
+ println("WB", wb, wbe, wbz)
if raceenabled {
racefini()
}
The result for make.bash piped through Austin's awk command is: 70074409 16189273 0.
That's >23% of write barriers with *dst == src, which is suspiciously similar to the amount prior to the append optimization. I suspect that optimization got lost when the write barrier insertion was moved into the SSA backend. I've filed #24209 to look at that again.
> I suspect that optimization got lost when the write barrier insertion was moved into the SSA backend.
No, it is not lost. Using @rsc's test case:
$ cat x.go
package p
func f(x *[]int) []int {
*x = append(*x, 1)
return *x
}
$ go tool compile -S x.go | grep :4
// fast path
0x0026 00038 (x.go:4) MOVQ 8(AX), CX
0x002a 00042 (x.go:4) MOVQ 16(AX), DX
0x002e 00046 (x.go:4) MOVQ (AX), BX
0x0031 00049 (x.go:4) LEAQ 1(CX), SI
0x0035 00053 (x.go:4) CMPQ SI, DX
0x0038 00056 (x.go:4) JGT 110
0x003a 00058 (x.go:4) LEAQ 1(CX), DX
0x003e 00062 (x.go:4) MOVQ DX, 8(AX)
0x0042 00066 (x.go:4) MOVQ $1, (BX)(CX*8)
// slow path
0x006e 00110 (x.go:4) LEAQ type.int(SB), AX
0x0075 00117 (x.go:4) MOVQ AX, (SP)
0x0079 00121 (x.go:4) MOVQ BX, 8(SP)
0x007e 00126 (x.go:4) MOVQ CX, 16(SP)
0x0083 00131 (x.go:4) MOVQ DX, 24(SP)
0x0088 00136 (x.go:4) MOVQ SI, 32(SP)
0x008d 00141 (x.go:4) PCDATA $0, $0
0x008d 00141 (x.go:4) CALL runtime.growslice(SB)
0x0092 00146 (x.go:4) MOVQ 40(SP), AX
0x0097 00151 (x.go:4) MOVQ 48(SP), CX
0x009c 00156 (x.go:4) MOVQ 56(SP), DX
0x00a1 00161 (x.go:4) MOVQ "".x+80(SP), DI
0x00a6 00166 (x.go:4) MOVQ DX, 16(DI)
0x00aa 00170 (x.go:4) CMPL runtime.writeBarrier(SB), $0
0x00b1 00177 (x.go:4) JNE 193
0x00b3 00179 (x.go:4) MOVQ AX, (DI)
0x00b6 00182 (x.go:4) MOVQ AX, BX
0x00b9 00185 (x.go:4) MOVQ DI, AX
0x00bc 00188 (x.go:4) JMP 58
0x00c1 00193 (x.go:4) CALL runtime.gcWriteBarrier(SB)
0x00c6 00198 (x.go:4) JMP 182
0x00c8 00200 (x.go:4) NOP
I think we also have a test that makes sure that we never write the pointer field in the non-growing case.
Hmm. Thanks. I wonder where all the *dst == src writebarrier calls are coming from now.
Maybe we need the check for zero... maybe all the *dst == src cases are ones where both are nil?
Any which way, this seems worth digging into a bit.
Change https://golang.org/cl/99078 mentions this issue: runtime: convert g.waitreason from string to uint8
Change https://golang.org/cl/111255 mentions this issue: runtime: convert g.waitreason from string to uint8
Change https://golang.org/cl/111256 mentions this issue: runtime: convert g.waitreason from string to uint8
I did a quick and dirty instrumentation of writebarrierptr and found that when executing cmd/compile to build std, about 25% of calls on average had *dst == src. In that case, there's no need to do the actual assignment or gray any objects (I think). I don't know how (a)typical that 25% number is. We should investigate whether checking for this and short-circuiting is a net performance gain, in general. There are multiple places in the stack this could occur:
(1) writebarrierptr (and friends)
Add an early-out equality check to the beginning of the function.
(2) Wrapper routines
Add similar checks to the runtime routines that call writebarrierptr, like the wbfat.go routines and writebarrierstring and friends. This differs from (1) in that it skips the writebarrierptr function call instead of having it return immediately.
(3) Generated code
We currently generate wb-calling code that unconditionally calls writebarrierptr whenever the barrier is enabled; we could instead generate code that first checks whether *dst != src.
cc @aclements @randall77
I don't plan to work on this soon, as I need to undistract myself.