Open josharian opened 7 years ago
CL https://golang.org/cl/37300 mentions this issue.
I've been doing some work on this front. Branch is here: https://github.com/philhofer/go/tree/store-forward
On that branch, that particular bit of code compiles to:
b54 29841 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1183) JCS $1, 29828
v235 29842 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1158) MOVQ R8, "".i-64(SP)
v224 29843 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1165) MOVL CX, ""..autotmp_2885-84(SP)
v365 29844 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1165) MOVQ SI, "".h.bitp-24(SP)
v270 29845 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1158) MOVQ DI, "".n-72(SP)
v286 29846 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) MOVQ R11, (SP)
v288 29847 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) MOVQ BX, 8(SP)
v291 29848 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) MOVQ R8, 16(SP)
v292 29849 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) CALL "".heapBitsForObject(SB)
v294 29850 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) MOVQ 24(SP), AX
v146 29851 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) TESTQ AX, AX
b52 29852 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) JNE $0, 29862
v335 29853 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1183) MOVQ "".arena_start-48(SP), AX
v326 29854 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1165) MOVL ""..autotmp_2885-84(SP), CX
v395 29855 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1183) MOVQ "".arena_used-56(SP), DX
v390 29856 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1179) MOVQ "".b(SP), BX
v328 29857 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1165) MOVQ "".h.bitp-24(SP), SI
v55 29858 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1158) MOVQ "".n-72(SP), DI
v208 29859 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1158) MOVQ "".i-64(SP), R8
v222 29860 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1165) MOVL CX, R9
b58 29861 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1183) JMP 29828
v304 29862 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) MOVQ AX, (SP)
v408 29863 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) MOVQ "".b(SP), AX
v306 29864 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) MOVQ AX, 8(SP)
v424 29865 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) MOVQ "".i-64(SP), CX
v308 29866 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) MOVQ CX, 16(SP)
v321 29867 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) MOVQ 32(SP), DX
v205 29868 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) MOVQ DX, 24(SP)
v392 29869 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) MOVL 40(SP), DX
v311 29870 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) MOVL DX, 32(SP)
v298 29871 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) MOVQ 48(SP), DX
v314 29872 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) MOVQ DX, 40(SP)
v237 29873 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) MOVQ "".gcw+8(SP), DX
v317 29874 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) MOVQ DX, 48(SP)
v300 29875 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) MOVQ 56(SP), BX
v319 29876 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) MOVQ BX, 56(SP)
v320 29877 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186) CALL "".greyobject(SB)
b57 29878 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185) JMP 29853
Alias analysis, combined with the tighten pass and a load-sinking pass, is enough to push each load down to the point where its value is first needed. In this particular example the shuffling within the destination basic block doesn't help much, but it appears to help most Go programs: the geomean improvement in go1bench is about 6% right now.
CL https://golang.org/cl/38448 mentions this issue.
runtime.scanobject contains this bit of code:
This compiles to (excerpt):
The loads at instructions 0x025e, 0x0267, 0x026c, and 0x0271 are too early. They're loading return values from heapBitsForObject, but if the call to greyobject isn't necessary (if the jump at 0x0279 isn't taken), their values are unnecessary and will be overwritten. This is easy enough to see in the original code as well.
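For readers without the runtime source at hand, the shape of the problem can be reproduced in a small standalone program. This is a sketch only: `lookup`, `consume`, and `scan` are made-up stand-ins, not the runtime functions. `lookup` returns several results, and only the first one is needed to decide whether the others are used at all.

```go
package main

import "fmt"

// lookup stands in for a call that returns several results.
// A zero first result means "nothing found".
//
//go:noinline
func lookup(p uintptr) (base uintptr, bits uint32, span uintptr, idx uintptr) {
	if p%2 == 0 {
		return 0, 0, 0, 0
	}
	return p &^ 7, uint32(p), p + 1, p >> 3
}

// consume stands in for the second call, which takes all of lookup's results.
//
//go:noinline
func consume(base uintptr, bits uint32, span, idx uintptr) uintptr {
	return base + uintptr(bits) + span + idx
}

// scan only needs bits, span, and idx when base is nonzero, so loading
// them before the base != 0 test is wasted work on the not-found path.
func scan(ptrs []uintptr) uintptr {
	var sum uintptr
	for _, p := range ptrs {
		if base, bits, span, idx := lookup(p); base != 0 {
			sum += consume(base, bits, span, idx)
		}
	}
	return sum
}

func main() {
	fmt.Println(scan([]uintptr{2, 9, 4})) // prints 28
}
```

Whether the early loads actually show up depends on the toolchain. The listings in this thread come from a compiler that returns results on the stack; newer Go releases pass results in registers on amd64, so the generated code there will look different.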
The loads are scheduled where they are due to memory ordering. But maybe there's something we can do to improve the situation, perhaps using the knowledge that stack slots and function return values are always disjoint in memory?
cc @randall77 @dr2chase @cherrymui