josharian opened this issue 8 years ago
Just for kicks, I tried that; it showed some mixed results for go test -v -bench=. -run=NONE runtime:
benchmark old ns/op new ns/op delta
BenchmarkGrowSliceBytes-8 33.3 32.4 -2.70%
BenchmarkGrowSliceInts-8 58.0 59.4 +2.41%
BenchmarkAppend-8 18.6 17.4 -6.45%
BenchmarkAppendGrowByte-8 1903243 1920778 +0.92%
BenchmarkAppendGrowString-8 278920100 143081182 -48.70% # << this is really interesting
BenchmarkAppend1Byte-8 3.13 3.11 -0.64%
BenchmarkAppend4Bytes-8 3.08 3.18 +3.25%
BenchmarkAppend7Bytes-8 3.12 3.19 +2.24%
BenchmarkAppend8Bytes-8 2.79 3.04 +8.96%
BenchmarkAppend15Bytes-8 4.03 4.27 +5.96%
BenchmarkAppend16Bytes-8 4.28 4.18 -2.34%
BenchmarkAppend32Bytes-8 3.16 3.38 +6.96%
BenchmarkAppendStr1Byte-8 3.10 3.62 +16.77%
BenchmarkAppendStr4Bytes-8 3.05 3.23 +5.90%
BenchmarkAppendStr8Bytes-8 3.06 2.83 -7.52%
BenchmarkAppendStr16Bytes-8 4.18 4.00 -4.31%
BenchmarkAppendStr32Bytes-8 3.45 4.38 +26.96%
BenchmarkAppendSpecialCase-8 18.3 19.3 +5.46%
BenchmarkCopy1Byte-8 3.86 4.47 +15.80%
BenchmarkCopy2Byte-8 3.93 4.20 +6.87%
Want me to prep a CL?
By "that", do you mean (1), (2), or (3)?
We have to be careful not to run afoul of the issue in #14855. I don't think this optimization would cause that problem, but it probably deserves a closer look.
I tried (2), modifying writebarrierptr itself.
I'll mess with it later and test against the case in #14855.
CL https://golang.org/cl/21027 mentions this issue.
Thinking out loud, one interesting aspect of modifying the calling code (3 above) is that it might allow other optimization passes to remove the code entirely by proving that *dst == src. I guess that depends on whether *dst is actually a load or is an ssa.Value.
The benchmark results from CL 21027 suggest that append may be the main source of idempotent pointer updates. Josh, do you happen to know what was one level up from the *dst==src write barriers? If it's mostly append, I suspect we can optimize just append to avoid the unnecessary pointer updates altogether.
One potential downside of modifying writebarrierptr this way is that it can cause extra coherence traffic by first pulling the pointer's cache line into shared and then upgrading it to modified. Though I doubt that's what's causing the slowdowns shown in the CL commit message, since those benchmarks are sequential.
I don't think these changes run afoul of the races in #14855. In that case we were doing the pointer write, but not doing the write barrier. It should be safe if we don't do either.
If we go the route of modifying the calling code, it would probably be better to make it:
if writeBarrier.enabled {
	if *dst != src {
		writebarrierptr(dst, src)
	}
} else {
	*dst = src
}
so we don't penalize the code with an extra (and potentially poorly predictable) branch when write barriers are off.
@aclements I'm trying something like that, I'll let you know what happens after the benchmarks finish.
I did a bit of experimenting. All of these results are for 4c9a470 (two days ago; forgot to sync).
First, I reproduced @josharian's result using the following patch:
diff --git a/src/runtime/mbarrier.go b/src/runtime/mbarrier.go
index 523d890..15244f8 100644
--- a/src/runtime/mbarrier.go
+++ b/src/runtime/mbarrier.go
@@ -14,6 +14,7 @@
package runtime
import (
+ "runtime/internal/atomic"
"runtime/internal/sys"
"unsafe"
)
@@ -123,10 +124,19 @@ func writebarrierptr_nostore1(dst *uintptr, src uintptr) {
releasem(mp)
}
+var wb, wbe, wbz uint64
+
// NOTE: Really dst *unsafe.Pointer, src unsafe.Pointer,
// but if we do that, Go inserts a write barrier on *dst = src.
//go:nosplit
func writebarrierptr(dst *uintptr, src uintptr) {
+ if src == 0 {
+ atomic.Xadd64(&wbz, +1)
+ } else if *dst == src {
+ atomic.Xadd64(&wbe, +1)
+ }
+ atomic.Xadd64(&wb, +1)
+
*dst = src
if writeBarrier.cgo {
cgoCheckWriteBarrier(dst, src)
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index d386797..b146fa4 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -208,6 +208,7 @@ func main() {
// os_beforeExit is called from os.Exit(0).
//go:linkname os_beforeExit os.runtime_beforeExit
func os_beforeExit() {
+ println("WB", wb, wbe, wbz)
if raceenabled {
racefini()
}
I get this result:
$ go build -a |& awk '/^WB/{n+=$2; eq+=$3; z+=$4} END{print n " " eq " " z}'
14525087 3412554 1422002
That is, of 14M write barriers, 10% are writing nil and 23.5% have *dst == src (but are not writing nil). I separately checked that almost all of the 10% nil writes also have *dst == src.
We should probably skip the write barrier earlier if src is nil. Currently we go through quite a bit of work before checking for a nil pointer.
Focusing on the ones where *dst == src, but src != nil, I profiled where they're coming from using this patch:
diff --git a/src/runtime/mbarrier.go b/src/runtime/mbarrier.go
index 523d890..e8ce821 100644
--- a/src/runtime/mbarrier.go
+++ b/src/runtime/mbarrier.go
@@ -127,6 +127,13 @@ func writebarrierptr_nostore1(dst *uintptr, src uintptr) {
// but if we do that, Go inserts a write barrier on *dst = src.
//go:nosplit
func writebarrierptr(dst *uintptr, src uintptr) {
+ if src != 0 && *dst == src {
+ cpc := getcallerpc((unsafe.Pointer)(&dst)) - 1
+ f := findfunc(cpc)
+ file, line := funcline(f, cpc)
+ print("WBE ", file, ":", line, " ", funcname(f), "\n")
+ }
+
*dst = src
if writeBarrier.cgo {
cgoCheckWriteBarrier(dst, src)
The result is
$ go build -a std |& awk '/^WBE/{print $2}' logx | sort | uniq -c | sort -nr | head -n 10
16519 /home/austin/go.dev/src/cmd/compile/internal/gc/typecheck.go:176
15369 /home/austin/go.dev/src/cmd/compile/internal/gc/dcl.go:58
15155 /home/austin/go.dev/src/cmd/compile/internal/ssa/sparseset.go:45
12066 /home/austin/go.dev/src/cmd/compile/internal/gc/dcl.go:753
8865 /home/austin/go.dev/src/cmd/compile/internal/ssa/value.go:167
8493 /home/austin/go.dev/src/cmd/compile/internal/ssa/func.go:94
8206 /home/austin/go.dev/src/cmd/compile/internal/ssa/stackalloc.go:266
7828 /home/austin/go.dev/src/runtime/mheap.go:603
7288 /home/austin/go.dev/src/runtime/mheap.go:196
6871 /home/austin/go.dev/src/runtime/hashmap.go:793
Of these, all but dcl.go:58, dcl.go:753, mheap.go:603, and hashmap.go:793 are appends. @randall77, can we optimize the generated append code so that the path that grows the slice in place doesn't write back the pointer if we're writing to the same slice we're growing? I imagine this would be something like the slicing optimization.
We can safely skip the write barrier on mheap.go:603. I'm a little surprised that it does equal pointer writes so often; there may also be an algorithmic improvement here.
We could put a conditional around the one in hashmap.go. I think that happens when we're iterating over a map and the iterator stays in the same bucket. I don't totally understand this code; there may be better ways to improve it.
I accidentally uploaded a 2nd CL, https://go-review.googlesource.com/#/c/21138/
➜ benchcmp /tmp/master.txt /tmp/wbfat.txt | grep -v 0\. # just show changes > +-0.x%
benchmark old ns/op new ns/op delta
BenchmarkGrowSliceBytes-8 35.4 32.1 -9.32%
BenchmarkAppend1Byte-8 3.16 3.11 -1.58%
BenchmarkAppend8Bytes-8 2.79 2.69 -3.58%
BenchmarkAppendStr8Bytes-8 3.38 3.23 -4.44%
BenchmarkAppendStr32Bytes-8 3.23 3.18 -1.55%
BenchmarkAppendSpecialCase-8 17.6 18.5 +5.11%
BenchmarkCopy4Byte-8 3.89 3.78 -2.83%
BenchmarkCopy128Byte-8 5.47 5.62 +2.74%
BenchmarkCopy2String-8 3.52 3.62 +2.84%
BenchmarkCopy12String-8 3.68 3.52 -4.35%
BenchmarkCopy32String-8 3.47 3.35 -3.46%
BenchmarkCopy128String-8 5.47 5.33 -2.56%
BenchmarkCopyFat32-8 1.36 1.31 -3.68%
BenchmarkStackGrowthDeep-8 12598 13692 +8.68%
benchmark old MB/s new MB/s speedup
benchmark old allocs new allocs delta
BenchmarkAppendGrowByte-8 51 52 +1.96%
BenchmarkAppendGrowString-8 184 188 +2.17%
@randall77, this is a regression in the SSA back end, similar to the reslicing one. The old back end avoided updating the base pointer in the append fast path, and therefore avoided the write barrier.
$ cat /tmp/x.go
package p
func f(x *[]int) []int {
*x = append(*x, 1)
return *x
}
$ go tool compile -ssa=0 -S /tmp/x.go | grep :4
// fast path - no write barrier
0x0028 00040 (/tmp/x.go:4) MOVQ "".x+88(FP), BX
0x002d 00045 (/tmp/x.go:4) NOP
0x002d 00045 (/tmp/x.go:4) MOVQ (BX), DX
0x0030 00048 (/tmp/x.go:4) MOVQ 8(BX), AX
0x0034 00052 (/tmp/x.go:4) MOVQ 16(BX), CX
0x0038 00056 (/tmp/x.go:4) MOVQ AX, BP
0x003b 00059 (/tmp/x.go:4) INCQ BP
0x003e 00062 (/tmp/x.go:4) CMPQ BP, CX
0x0041 00065 (/tmp/x.go:4) JHI $1, 128
0x0043 00067 (/tmp/x.go:4) MOVQ BP, 8(BX)
0x0047 00071 (/tmp/x.go:4) LEAQ (DX)(AX*8), BX
0x004b 00075 (/tmp/x.go:4) MOVQ $1, (BX)
// slow path code at bottom of function
0x0080 00128 (/tmp/x.go:4) LEAQ type.[]int(SB), BX
0x0087 00135 (/tmp/x.go:4) MOVQ BX, (SP)
0x008b 00139 (/tmp/x.go:4) MOVQ DX, 8(SP)
0x0090 00144 (/tmp/x.go:4) MOVQ AX, 16(SP)
0x0095 00149 (/tmp/x.go:4) MOVQ CX, 24(SP)
0x009a 00154 (/tmp/x.go:4) MOVQ BP, 32(SP)
0x009f 00159 (/tmp/x.go:4) PCDATA $0, $0
0x009f 00159 (/tmp/x.go:4) CALL runtime.growslice(SB)
0x00a4 00164 (/tmp/x.go:4) MOVQ 40(SP), DX
0x00a9 00169 (/tmp/x.go:4) MOVQ 48(SP), AX
0x00ae 00174 (/tmp/x.go:4) MOVQ 56(SP), CX
0x00b3 00179 (/tmp/x.go:4) MOVQ "".x+88(FP), BX
0x00b8 00184 (/tmp/x.go:4) CMPQ BX, $0
0x00bc 00188 (/tmp/x.go:4) JEQ $1, 260
0x00be 00190 (/tmp/x.go:4) MOVQ AX, BP
0x00c1 00193 (/tmp/x.go:4) MOVQ AX, "".autotmp_0001+64(SP)
0x00c6 00198 (/tmp/x.go:4) INCQ BP
0x00c9 00201 (/tmp/x.go:4) MOVQ BP, 8(BX)
0x00cd 00205 (/tmp/x.go:4) MOVQ CX, 16(BX)
0x00d1 00209 (/tmp/x.go:4) MOVQ DX, "".autotmp_0000+72(SP)
0x00d6 00214 (/tmp/x.go:4) CMPB runtime.writeBarrier(SB), $0
0x00dd 00221 (/tmp/x.go:4) JNE $0, 231
0x00df 00223 (/tmp/x.go:4) MOVQ DX, (BX)
0x00e2 00226 (/tmp/x.go:4) JMP 71
0x00e7 00231 (/tmp/x.go:4) MOVQ BX, (SP)
0x00eb 00235 (/tmp/x.go:4) MOVQ DX, 8(SP)
0x00f0 00240 (/tmp/x.go:4) PCDATA $0, $1
0x00f0 00240 (/tmp/x.go:4) CALL runtime.writebarrierptr(SB)
0x00f5 00245 (/tmp/x.go:4) MOVQ "".autotmp_0000+72(SP), DX
0x00fa 00250 (/tmp/x.go:4) MOVQ "".autotmp_0001+64(SP), AX
0x00ff 00255 (/tmp/x.go:4) JMP 71
0x0104 00260 (/tmp/x.go:4) MOVL AX, (BX)
0x0106 00262 (/tmp/x.go:4) JMP 190
0x0108 00264 (/tmp/x.go:4) NOP
$
$ go tool compile -ssa=1 -S /tmp/x.go | grep :4
// fast path - unnecessary update of base, including test of write barrier flag + possible call
0x0017 00023 (/tmp/x.go:4) MOVQ "".x+88(FP), CX
0x001c 00028 (/tmp/x.go:4) MOVQ 16(CX), DX
0x0020 00032 (/tmp/x.go:4) MOVQ 8(CX), BX
0x0024 00036 (/tmp/x.go:4) MOVQ BX, "".autotmp_0000+72(SP)
0x0029 00041 (/tmp/x.go:4) MOVQ (CX), BP
0x002c 00044 (/tmp/x.go:4) LEAQ 1(BX), SI
0x0030 00048 (/tmp/x.go:4) MOVQ SI, "".autotmp_0001+64(SP)
0x0035 00053 (/tmp/x.go:4) CMPQ SI, DX
0x0038 00056 (/tmp/x.go:4) JGT $0, 140
0x003a 00058 (/tmp/x.go:4) MOVQ $1, (BP)(BX*8)
0x0043 00067 (/tmp/x.go:4) MOVQ SI, 8(CX)
0x0047 00071 (/tmp/x.go:4) MOVQ DX, 16(CX)
0x004b 00075 (/tmp/x.go:4) MOVL runtime.writeBarrier(SB), AX
0x0051 00081 (/tmp/x.go:4) TESTB AL, AL
0x0053 00083 (/tmp/x.go:4) JNE $0, 119
0x0055 00085 (/tmp/x.go:4) MOVQ BP, (CX)
// slow path for write barrier call in append "fast" path
0x0077 00119 (/tmp/x.go:4) MOVQ CX, (SP)
0x007b 00123 (/tmp/x.go:4) MOVQ BP, 8(SP)
0x0080 00128 (/tmp/x.go:4) PCDATA $0, $0
0x0080 00128 (/tmp/x.go:4) CALL runtime.writebarrierptr(SB)
// slow path for append calling growslice
0x008c 00140 (/tmp/x.go:4) LEAQ type.[]int(SB), AX
0x0093 00147 (/tmp/x.go:4) MOVQ AX, (SP)
0x0097 00151 (/tmp/x.go:4) MOVQ BP, 8(SP)
0x009c 00156 (/tmp/x.go:4) MOVQ BX, 16(SP)
0x00a1 00161 (/tmp/x.go:4) MOVQ DX, 24(SP)
0x00a6 00166 (/tmp/x.go:4) MOVQ SI, 32(SP)
0x00ab 00171 (/tmp/x.go:4) PCDATA $0, $0
0x00ab 00171 (/tmp/x.go:4) CALL runtime.growslice(SB)
0x00b0 00176 (/tmp/x.go:4) MOVQ 40(SP), BP
0x00b5 00181 (/tmp/x.go:4) MOVQ 56(SP), DX
0x00bf 00191 (/tmp/x.go:4) MOVQ "".autotmp_0000+72(SP), BX
0x00c4 00196 (/tmp/x.go:4) MOVQ "".autotmp_0001+64(SP), SI
0x00c9 00201 (/tmp/x.go:4) JMP 58
0x00ce 00206 (/tmp/x.go:4) NOP
$
For the hashtable iterator case, we can just explicitly check whether we're writing something new back. I'll open a separate bug for the append writeback.
I remeasured this after CLs 21812, 21813, and 21814, which address the append and the map iteration cases.
New results, using Austin's approach above:
29354 1173 2934
So 4% of writes have *dst == src, and 10% have src == nil.
The new top lines for *dst == src (line numbers at commit de7ee57c7ead59899d5b412a839c995de0e813b5) are:
37.57% 420 runtime/proc.go:259 runtime.gopark
14.13% 158 runtime/mheap.go:603 runtime.(*mheap).allocSpanLocked
5.28% 59 runtime/mheap.go:588 runtime.(*mheap).allocSpanLocked
4.92% 55 runtime/mgc.go:1725 runtime.gcCopySpans
4.92% 55 runtime/mgc.go:1726 runtime.gcCopySpans
The gopark calls are unnecessary, since all the strings that reach it are constants, but there's no way for the compiler to know that.
I've filed #15226 for allocSpanLocked. gcCopySpans might deserve a look as well, but we're close to scraping bottom (yay!).
I will investigate whether it is better to exit writebarrierptr early if src == nil or to check it in the generated code, and send a CL for one of those to close this issue.
CL https://golang.org/cl/21813 mentions this issue.
CL https://golang.org/cl/21814 mentions this issue.
Mailed CLs 21820 and 21821 that implement the two "do less work if src==nil" strategies described above. The numbers for adding a check to the wrapper code are better, although it does increase binary size by ~0.1%. It's not obvious to me which of the two (if either) should go in. Feedback or better benchmark numbers welcomed.
CL https://golang.org/cl/21821 mentions this issue.
CL https://golang.org/cl/21820 mentions this issue.
So besides src==nil and *dst==src, there's another condition that would be interesting to get stats on: dst points to the stack. It might illuminate code (I'm looking at you, convT2I and friends) where the writes are always within the stack. We might still need the write barrier because of stack barriers, but maybe some rearrangement/optimization would be possible.
I had a sudden suspicion that I should look again at this. With the move of the write barrier to assembly, it has gotten a bit harder to instrument. Leaving in the vestigial wbe variable so I could re-use Austin's awk command, here is my instrumentation for amd64:
diff --git a/src/runtime/asm_amd64.s b/src/runtime/asm_amd64.s
index 2376fe0aae..f07c0bcbe1 100644
--- a/src/runtime/asm_amd64.s
+++ b/src/runtime/asm_amd64.s
@@ -2384,6 +2384,15 @@ TEXT runtime·gcWriteBarrier(SB),NOSPLIT,$120
// faster than having the caller spill these.
MOVQ R14, 104(SP)
MOVQ R13, 112(SP)
+
+ MOVQ $1, R13
+ CMPQ AX, (DI)
+ JNE 3(PC)
+ LOCK; XADDQ R13, runtime·wbe(SB)
+
+ MOVQ $1, R13
+ LOCK; XADDQ R13, runtime·wb(SB)
+
// TODO: Consider passing g.m.p in as an argument so they can be shared
// across a sequence of write barriers.
get_tls(R13)
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f20e77eee5..0e0c1d446c 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -224,9 +224,12 @@ func main() {
}
}
+var wb, wbe, wbz uint64
+
// os_beforeExit is called from os.Exit(0).
//go:linkname os_beforeExit os.runtime_beforeExit
func os_beforeExit() {
+ println("WB", wb, wbe, wbz)
if raceenabled {
racefini()
}
The result for make.bash piped through Austin's awk command is: 70074409 16189273 0.
That's >23% of write barriers with *dst == src, which is suspiciously similar to the amount prior to the append optimization. I suspect that optimization got lost when the write barrier insertion was moved into the SSA backend. I've filed #24209 to look at that again.
> I suspect that optimization got lost when the write barrier insertion was moved into the SSA backend.
No, it is not lost. Using @rsc's test case:
$ cat x.go
package p
func f(x *[]int) []int {
*x = append(*x, 1)
return *x
}
$ go tool compile -S x.go | grep :4
// fast path
0x0026 00038 (x.go:4) MOVQ 8(AX), CX
0x002a 00042 (x.go:4) MOVQ 16(AX), DX
0x002e 00046 (x.go:4) MOVQ (AX), BX
0x0031 00049 (x.go:4) LEAQ 1(CX), SI
0x0035 00053 (x.go:4) CMPQ SI, DX
0x0038 00056 (x.go:4) JGT 110
0x003a 00058 (x.go:4) LEAQ 1(CX), DX
0x003e 00062 (x.go:4) MOVQ DX, 8(AX)
0x0042 00066 (x.go:4) MOVQ $1, (BX)(CX*8)
// slow path
0x006e 00110 (x.go:4) LEAQ type.int(SB), AX
0x0075 00117 (x.go:4) MOVQ AX, (SP)
0x0079 00121 (x.go:4) MOVQ BX, 8(SP)
0x007e 00126 (x.go:4) MOVQ CX, 16(SP)
0x0083 00131 (x.go:4) MOVQ DX, 24(SP)
0x0088 00136 (x.go:4) MOVQ SI, 32(SP)
0x008d 00141 (x.go:4) PCDATA $0, $0
0x008d 00141 (x.go:4) CALL runtime.growslice(SB)
0x0092 00146 (x.go:4) MOVQ 40(SP), AX
0x0097 00151 (x.go:4) MOVQ 48(SP), CX
0x009c 00156 (x.go:4) MOVQ 56(SP), DX
0x00a1 00161 (x.go:4) MOVQ "".x+80(SP), DI
0x00a6 00166 (x.go:4) MOVQ DX, 16(DI)
0x00aa 00170 (x.go:4) CMPL runtime.writeBarrier(SB), $0
0x00b1 00177 (x.go:4) JNE 193
0x00b3 00179 (x.go:4) MOVQ AX, (DI)
0x00b6 00182 (x.go:4) MOVQ AX, BX
0x00b9 00185 (x.go:4) MOVQ DI, AX
0x00bc 00188 (x.go:4) JMP 58
0x00c1 00193 (x.go:4) CALL runtime.gcWriteBarrier(SB)
0x00c6 00198 (x.go:4) JMP 182
0x00c8 00200 (x.go:4) NOP
I think we also have a test that makes sure that we never write the pointer field in the non-growing case.
Hmm. Thanks. I wonder where all the *dst == src writebarrier calls are coming from now.
Maybe we need the check for zero... maybe all the *dst == src cases are ones where both are nil?
Any which way, this seems worth digging into a bit.
Change https://golang.org/cl/99078 mentions this issue: runtime: convert g.waitreason from string to uint8
Change https://golang.org/cl/111255 mentions this issue: runtime: convert g.waitreason from string to uint8
Change https://golang.org/cl/111256 mentions this issue: runtime: convert g.waitreason from string to uint8
I did a quick and dirty instrumentation of writebarrierptr and found that when executing cmd/compile to build std, about 25% of calls on average had *dst == src. In that case, there's no need to do the actual assignment or gray any objects (I think). I don't know how (a)typical that 25% number is. We should investigate whether checking for this and short-circuiting is a net performance gain, in general. There are multiple places in the stack this could occur:
(1) writebarrierptr (and friends)
Add an early-out equality check to the beginning of the function.
(2) Wrapper routines
Add similar checks to the runtime routines that call writebarrierptr, like the wbfat.go routines and writebarrierstring and friends. This differs from (1) in that it skips the writebarrierptr function call instead of having it return immediately.
(3) Generated code
We currently generate wb-calling code that unconditionally calls writebarrierptr whenever the barrier is enabled; we could instead generate code that first checks whether *dst != src.
cc @aclements @randall77
I don't plan to work on this soon, as I need to undistract myself.