We got a nice improvement in compilation time from improvements to imports; we get to spend some of that on other things.
There is in fact a dependence on value order, and not depending on value order would induce a much larger compile-time cost.
The value to the end user is that debugger breakpoints more often correspond to the correct position in the source code, and that the stepping/nexting experience within a debugger is much less erratic and confusing. This also means that printing values in the debugger more often yields the "correct" result (i.e., conforming to the user's expectations).
As a side effect, it becomes less necessary to get this same effect by changing the line number of "unimportant" instructions (to avoid debugger-visible line-number churn), which should make profiling slightly more accurate.
Late in the game I thought of what could be a better approach to this that should be both faster and less sensitive to value order, but it was very late, and the alternate method is completely untested.
So, regarding that late-in-the-game method: how about we make investigating it a high priority for 1.12? The method is to quit tracking at the value level until late scheduling, instead recording (for each block) the sequence of statements encountered within it, and then pinning statement boundaries to values during/after scheduling (sketched below). It should cost less time because it's only done once, and it removes the value-order dependence, which is a good thing.
It won't reduce allocations much beyond what we do now, because that's already been optimized down to O(1) allocation per function (I think) and it won't help binary size in the least because that information takes space no matter what. We are, however, talking about compressing debugging information in the near-ish future, so there's that hope.
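For concreteness, here's a toy of that block-level scheme as I understand it. Everything below is invented for illustration (the compiler's ssa package looks nothing like this): during the main passes we record only the ordered statement lines seen per block, and a single post-scheduling walk pins each boundary to a value.

```go
package main

import "fmt"

type value struct {
	line   int
	isStmt bool
}

type block struct {
	stmts  []int    // statement lines recorded during SSA construction
	values []*value // final, post-scheduling value order
}

// pinStmts marks, for each recorded statement, the first scheduled value
// on that line. The scan position j never rewinds, so this is a single
// O(len(values)) pass, and intermediate value order never matters.
func pinStmts(b *block) {
	j := 0
	for _, line := range b.stmts {
		for ; j < len(b.values); j++ {
			if b.values[j].line == line {
				b.values[j].isStmt = true
				j++
				break
			}
		}
	}
}

func main() {
	b := &block{
		stmts:  []int{10, 11},
		values: []*value{{line: 10}, {line: 10}, {line: 11}, {line: 11}},
	}
	pinStmts(b)
	for i, v := range b.values {
		fmt.Printf("v%d line %d stmt=%v\n", i, v.line, v.isStmt)
	}
}
```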
Thanks, @dr2chase. I want to read a bit more before I reply to your comments.
One quick question, though. Was the intent that this CL only change DWARF info? I ask because I notice that the hello world text bytes are up 1.22%, and I would have expected DWARF changes to affect only overall binary size (which is up 4%). A quick late-night test with -ldflags=-w shows a binary size increase, too. The obvious follow-on questions (if I understand correctly): Where does the extra 1.22% come from? And has the effectiveness of any (important) optimizations been impaired?
I am not sure where that extra text came from, and I don't think I touched any of the optimizations. I'll look.
It is possible that some of the scheduler heuristics changed slightly because I rolled back some of the line-number-removals previously done to clean up debugger experience, and line number is one of the tie-breakers.
I see differences, and they go in both directions. I think there is some other order-dependent bug in rewrite, triggered by my change to the value ordering in fuse. Here, "+" is the post-CL version, i.e., the shorter one.
This is from strconv.roundShortest:
- v148 = LEAQ <*decimal> {.autotmp_16} v2 : R9
- v173 = MOVBloadidx1 <byte> v119 v148 v115 : R10 (u[byte])
+ v173 = MOVBloadidx1 <byte> {.autotmp_16} v119 v2 v115 : R9 (u[byte])
I did a quick experiment with the order of visiting blocks and values in rewrite.go, and it is definitely order-sensitive. Sadly, I proved this only by making the text larger. I suspect if we visited blocks in reverse post order and also scheduled their contents, that we would get the best outcome, though any new values allocated will probably invalidate the schedule.
This is amusing. The cause of the larger text is runtime.pclntab.
But rewrite IS order dependent.
However, changing fuse to preserve value order, no matter what the respective block.Values capacities are, actually tickles this bug in the proper direction to produce smaller code, not larger (and hence probably slightly more efficient code).
runtime.pclntab is larger because I rolled back some changes intended to improve debugging the old way, where "unimportant" instructions are assigned the line number of their neighbors to avoid debugger-step churn.
I probably should have done this in two CLs; I'll let that be my lesson. Your grumbling about a need for tools to understand rewrite application is also looking prescient. (In the table below, the columns are the size delta, the symbol, and the old,new sizes in bytes.)
-238 runtime.epclntab 2928,2690
-80 encoding/asn1.parseField 14784,14704
-48 encoding/json.(*decodeState).object 8320,8272
-48 vendor/golang_org/x/crypto/cryptobyte.(*String).ReadASN1Integer 1968,1920
-32 encoding/json.(*decodeState).literalStore 10176,10144
-32 fmt.(*pp).printArg 2688,2656
-32 reflect.DeepEqual 928,896
-32 regexp/syntax.(*parser).repeat 1552,1520
-32 runtime.mapaccess1 576,544
-32 runtime.mapaccess1_fast32 448,416
-32 runtime.updatememstats 672,640
-32 text/template.(*state).idealConstant 1104,1072
-32 text/template.(*state).walkRange 2128,2096
-32 vendor/golang_org/x/crypto/cryptobyte.(*String).ReadOptionalASN1Integer 2016,1984
-16 crypto/tls.(*serverHelloMsg).marshal 3280,3264
-16 crypto/x509.(*Certificate).checkNameConstraints 2512,2496
-16 encoding/asn1.MarshalWithParams 720,704
-16 encoding/asn1.UnmarshalWithParams 720,704
-16 encoding/gob.(*Decoder).Decode 576,560
-16 encoding/gob.(*Decoder).recvType 544,528
-16 encoding/gob.(*Encoder).Encode 288,272
-16 encoding/gob.(*Encoder).sendActualType 1696,1680
-16 encoding/gob.compileEnc 1808,1792
-16 fmt.(*pp).catchPanic 1264,1248
-16 fmt.(*pp).fmtBytes 2416,2400
-16 fmt.(*pp).printValue 10256,10240
-16 fmt.(*ss).scanOne 5024,5008
-16 fmt.intFromArg 512,496
-16 internal/poll.(*FD).Accept 704,688
-16 math/big.lehmerSimulate 384,368
-16 net/http.(*Request).write 3408,3392
-16 net/http.(*http2ClientConn).encodeHeaders 2736,2720
-16 reflect.Value.Elem 480,464
-16 reflect.Value.Field 352,336
-16 regexp.mergeRuneSets 2608,2592
-16 runtime.(*mcache).releaseAll 176,160
-16 runtime.(*mspan).sweep 2400,2384
-16 runtime.adjustpointers 576,560
-16 runtime.advanceEvacuationMark 240,224
-16 runtime.clearpools 368,352
-16 runtime.freeStackSpans 432,416
-16 runtime.freedefer.func1 368,352
-16 runtime.mallocinit 608,592
-16 runtime.mapaccess1_fast64 448,432
-16 runtime.mapaccess1_faststr 1088,1072
-16 runtime.mapaccess2 576,560
-16 runtime.mapaccess2_fast32 464,448
-16 runtime.mapaccess2_fast64 464,448
-16 runtime.mapaccess2_faststr 1152,1136
-16 runtime.moduledataverify1 1696,1680
-16 runtime.newdefer.func1 480,464
-16 runtime.purgecachedstats 192,176
-16 runtime.typelinksinit 1760,1744
-16 runtime.wbBufFlush1 640,624
-16 runtime/pprof.parseProcSelfMaps 1248,1232
-16 strconv.roundShortest 960,944
-16 text/template.(*state).evalArg 3872,3856
-16 text/template.(*state).evalCommand 2400,2384
-16 text/template.(*state).evalEmptyInterface 2048,2032
-16 text/template.(*state).evalField 3792,3776
-16 text/template.addValueFuncs 1360,1344
-16 text/template.evalArgs 544,528
-16 text/template.length 704,688
-16 vendor/golang_org/x/crypto/chacha20poly1305/internal/chacha20.XORKeyStream 656,640
-16 vendor/golang_org/x/net/http2/hpack.newStaticTable 864,848
-16 vendor/golang_org/x/net/route.ParseRIB 1232,1216
-4 runtime.findfunctab 17053,17049
4 runtime.erodata 387,391
4 runtime.etypes 387,391
16 crypto/elliptic.p224DoubleJacobian 1104,1120
16 crypto/tls.(*Conn).readRecord 5808,5824
16 crypto/tls.(*clientHelloMsg).unmarshal 4432,4448
16 encoding/gob.(*Encoder).encodeInterface 2624,2640
16 net/http.(*http2Framer).WriteGoAway 864,880
16 regexp.(*machine).onepass 1936,1952
16 syscall.ParseDirent 1168,1184
16 text/template.(*state).evalCall 3824,3840
16 text/template.(*state).walkTemplate 992,1008
64 internal/poll.(*FD).ReadDirent 368,432
288 go.func.* 182310,182598
58414 runtime.pclntab 1671824,1730238
Sorry for all the delays from me. Here's my set of reactions having skimmed the CL and your investigatory results (thanks!).
I think we should investigate the alternative approach, possibly even for 1.11 depending on how it works out. The dependence on Value order seems fraught with peril, and the alternative approach seems a lot cleaner.
We might even (for 1.12?) want to identify statements much earlier, instead of during AST to SSA translation. Many statements present in AST to SSA translation are part of lowering or ordering code, and aren't actually present in the source. It seems to me we should mark statements right after parsing, since that's the moment at which we have the closest correspondence to the input source. It'd also sidestep the AVARLIVE etc handling.
Nice find on the fuse change and rewrite rules! I'd like to find some way to programmatically detect order-dependence in the rewrite rules, so that we can start systematically fixing them. I don't have any simple ideas, though. (One complicated idea is to have a debug mode in which we do the following. Before rewriting, make a deep copy of a function, its blocks, and its values. Randomize block and value order. Rewrite. Make another deep copy, randomize again, and rewrite. Then use topological ordering to correlate blocks, and some kind of isomorphism to correlate values. Check whether the results are identical. If not, complain. Repeat.) Or alternatively switch to a more sophisticated rewrite rule generation engine.
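To make the shape of that debug mode concrete, here's a toy version over integer slices; all names are invented, and a real implementation would need deep copies of ssa.Func plus block/value correlation, not sorting.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// rewriteFirstEven is a stand-in "rewrite rule" with a deliberate order
// dependence: it rewrites whichever even element it visits first.
func rewriteFirstEven(xs []int) []int {
	out := append([]int(nil), xs...)
	for i, x := range out {
		if x%2 == 0 {
			out[i] = -x
			break
		}
	}
	return out
}

// normalized sorts a result so the comparison ignores incidental order.
func normalized(xs []int) string {
	ys := append([]int(nil), xs...)
	sort.Ints(ys)
	return fmt.Sprint(ys)
}

// checkOrderIndependence shuffles the input twice, rewrites both copies,
// and reports whether the normalized results agree.
func checkOrderIndependence(xs []int) bool {
	a := append([]int(nil), xs...)
	b := append([]int(nil), xs...)
	rand.Shuffle(len(a), func(i, j int) { a[i], a[j] = a[j], a[i] })
	rand.Shuffle(len(b), func(i, j int) { b[i], b[j] = b[j], b[i] })
	return normalized(rewriteFirstEven(a)) == normalized(rewriteFirstEven(b))
}

func main() {
	// Two distinct even elements: which one is rewritten depends on
	// visit order, so this check usually reports false.
	fmt.Println(checkOrderIndependence([]int{1, 2, 3, 4}))
}
```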
I think we should send a CL soon to put back the line-number-as-last-resort scheduling heuristic, to shrink pclntab back, if feasible.
Why is pclntab part of TEXT? I'd expect it to be RODATA.
I'm a bit concerned that the rewrite rules might now be (mildly) quadratic. Note that for each Value changed, we potentially do O(n) work as part of nextGoodStatementIndex. This could conceivably blow up for code that has many, many Values in a single Block. Unlikely, but still concerning.
I am mildly depressed by the number of sparse data structures we have. I wonder whether some judicious embedding/composition might help. (Or generics. :P) For later.
It'd be nice to refactor the handling of APCDATA, AFUNCDATA, and ANOP into a function, which could then have nice docs attached to it explaining what is happening and why.
In liveValues, can you avoid generating the liveOrderStmts slice if you're not going to use it? That might help with allocs.
In fuse, we should clear the tail of c.Values in the in-place case, to allow them to be GC'd. Also, if you make the new slice with the correct length instead of just the correct cap in the reallocate case, then you could copy in instead of appending, in which case the copying could be shared.
Can you move numberlines after "early phielim", and maybe "early copyelim"? That'd give it much less work to do, since those passes tend to significantly reduce the number of Values present.
Are we sure that numberlines isn't quadratic, for similar reasons as the rewrite rules above?
And with that list, I'm probably going to mostly disappear again for another couple of days. My apologies. :(
I could start on the alternative approach, since that is hoped for in 1.12, but I think the order dependence is more annoying than fraught. I'm pretty sure the quadratic risk has been reduced after dealing with Austin's remarks; not 100% certain, but the result is now used for skip-ahead, which I think fixes it.
pclntab is "s", whatever that is, and it might be the fault of the size command that it is considered text. I don't want to undo the line-numbering change just for the sake of shrinking that table; that change is believed to make the numbering more accurate (for profiling purposes) relative to the earlier compromise between debugging experience and profiling accuracy.
I'm unsure about the goodness of detecting statements earlier; this worked better than I expected, enough that I would want examples (bug reports) showing failures. As far as I know, most of the existing failures are caused by peculiar combinations of rewriting and ordering, not by screwups in the initial ordering. It's my belief that changing to the block-oriented approach would avoid all these bad-luck bugs.
I'll look into the proposed cleanups, with an eye towards tiny, tractable CLs, since this is late in the game.
I recommended an alternative and simpler statement tracking scheme in CL 93663: https://go-review.googlesource.com/c/go/+/93663#message-5b114391669f7db29632eb5b66dbec412c169e6a
Change https://golang.org/cl/113717 mentions this issue: cmd/compile: refactor and cleanup of common code/workaround
I checked for quadratic behavior of nextGoodStatementIndex, and I think it is mostly okay. There are three uses, and in two of them the result is used to skip ahead (and the cost of skipping ahead is proportional to the distance skipped). In the third it is not:
if v.Pos.IsStmt() == src.PosIsStmt {
	// v currently carries a statement boundary; if a later value in
	// the block is a better home for it, move the mark there.
	if k := nextGoodStatementIndex(v, j, b); k != j {
		v.Pos = v.Pos.WithNotStmt()
		b.Values[k].Pos = b.Values[k].Pos.WithIsStmt()
	}
}
Quadratic behavior is limited to runs of values marked PosIsStmt for the same line with a "not Good" opcode. Such runs are not supposed to occur, though rewriting might create them -- but their size would be limited to the expansion of an instruction that was marked as a statement. I'd like to do better, but I think this risk is well-fenced.
I'm not following this:

> In fuse, we should clear the tail of c.Values in the in-place case, to allow them to be GC'd.
c.Values gets b.Values followed by c.Values. For the in-place case, c.Values is expanded, then the old values are moved to higher indices with copy(c.Values[bl:], c.Values). This creates no new non-zero entries, since the "old" c.Values is overwritten by some combination of values from b and c, and the newly addressed area (between indices cl and bl+cl) is filled with the last bl elements of the old c.Values.
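For readers following along, here is a toy of that in-place merge with invented types (note that append here may reallocate, whereas fuse distinguishes the true in-place case, where c.Values already has capacity, from the reallocate case):

```go
package main

import "fmt"

type Value struct{ id int }

// fuseValues arranges for block b's values to precede block c's existing
// values in c's slice, mirroring the copy described above: grow, shift
// the old entries up with an overlap-safe copy, then fill in b's values.
func fuseValues(b, c []*Value) []*Value {
	bl, cl := len(b), len(c)
	c = append(c, make([]*Value, bl)...) // grow to bl+cl entries
	copy(c[bl:], c[:cl])                 // shift old c values to higher indices
	copy(c[:bl], b)                      // b's values take the front
	return c
}

func main() {
	b := []*Value{{1}, {2}}
	c := []*Value{{3}, {4}, {5}}
	for _, v := range fuseValues(b, c) {
		fmt.Print(v.id, " ") // 1 2 3 4 5
	}
	fmt.Println()
}
```

Every slot of the expanded slice ends up overwritten, which is why no tail-clearing is needed on this path.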
Change https://golang.org/cl/113719 mentions this issue: cmd/compile: common up code in fuse for joining blocks
> I'm not following this

And with good reason. :) Please ignore.
Thanks for checking on the quadratic concern.
What is the status of this issue? Is there more to do for 1.11? For 1.12?
Nothing more to do in 1.11; for 1.12 we might look into one or another different and perhaps more efficient way of assigning statement boundaries.
There is a blog post from the CockroachDB devs about the big runtime.pclntab: https://science.raphael.poss.name/go-executable-size-visualization-with-d3.html. Probably pclntab could be compressed somehow?
As I read the blog post, it was compressed, but for performance reasons we quit compressing it. It helps to be able to quantify the cost of binary size; otherwise it will come up short in any cost comparison versus runtime performance, latency, etc.
There has been a push recently to further reduce start-up times; I don't think unilaterally deciding to (naively) compress pclntab again is a good idea. (And we're not going to introduce a flag.)
A better alternative is to do some serious engineering work and look for alternative encodings of that data.
I experimented a few years ago with using a smaller encoding for pcvalue pairs. From my notes to myself at that time:
> pc/value pairs in the go1.2 symtab are encoded using varint encoding. But most values are very small, so try using nibble-at-a-time varint encoding instead of byte-at-a-time varint encoding. You lose 1/4 of the bits to continuation markers instead of 1/8, but with small values it doesn't matter.
>
> This only works because there are value pairs. The precise encoding is: first four bits are value, second four bits are pc. The first bit of each nibble is a continuation bit. If both continuation bits are set, continue as before. If neither continuation bit is set, we're done. If one continuation bit is set but not the other, switch to byte-at-a-time varint encoding for the outstanding value.
This actually does prove to be a more space-efficient encoding, just not enough of an improvement to warrant changing the symtab, since there are many consumers of it.
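As an illustration only, here is one reading of that scheme in Go; the 3-bit chunk widths, bit positions, and use of unsigned values are my assumptions, not the actual go1.2 symtab encoding.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodePair packs (value, pc) a nibble pair at a time: the high nibble
// carries 3 bits of value, the low nibble 3 bits of pc, and the top bit
// of each nibble is that field's continuation bit. When only one field
// has bits outstanding, it falls back to byte-at-a-time varint encoding.
func encodePair(buf []byte, value, pc uint64) []byte {
	for {
		b := byte(value&7)<<4 | byte(pc&7)
		value >>= 3
		pc >>= 3
		if value != 0 {
			b |= 0x80 // more value chunks follow
		}
		if pc != 0 {
			b |= 0x08 // more pc chunks follow
		}
		buf = append(buf, b)
		switch {
		case value != 0 && pc != 0:
			// Both continuation bits set: emit another nibble pair.
		case value != 0:
			return binary.AppendUvarint(buf, value)
		case pc != 0:
			return binary.AppendUvarint(buf, pc)
		default:
			return buf
		}
	}
}

func main() {
	// A small pair fits in one byte; two plain uvarints would take two.
	fmt.Printf("% x\n", encodePair(nil, 3, 5)) // 35
	// A larger value spills into byte-at-a-time varint encoding.
	fmt.Printf("% x\n", encodePair(nil, 200, 1)) // 81 19
}
```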
There are also other streaming integer compression techniques, like varint-GB, Stream VByte, and Roaring Bitmaps. They are all optimized for different purposes, so it is reasonably likely that any of them would need to be adapted a bit for Go's particular data structures and needs. The point is merely that there is room for experimentation before reaching for the compression sledgehammer.
Another thing that I've pondered is supporting some kind of lookback in the encoding. Inlined code is very expensive in the pclntab, because the encoding required for a cross-file, cross-line jump is long. But for inlined code, we are often hopping back and forth between a few locations. Having a way to refer not to the previous value but to a value a few entries back could make those hops much cheaper to encode.
And there's more to the pclntab than just pcvalue entries.
The next step here is probably for someone to spend some quality time investigating all the design decisions in pclntab and coming up with an improved design and quantifying the impact it would have (as @dr2chase says).
Yet another option is to compress subsections of pclntab using a compression technique that is designed for speed and low overhead, such that it can get decompressed on the fly, every single time it is needed, without appreciable performance loss. I am far from an expert in this area, but heatshrink and blosc come to mind.
It has always bothered me that we have essentially a separate pcln table for every function. I suspect lots of functions are small, in which case maybe we should batch the tables across a larger region, say a 4K page. From a random grep of an objdump of the go tool, we're currently at about 500 bytes/function.
One of the things I noticed while examining it is that there is a lot of repetition in symbol names.
As a snippet from strings pclntab | sort | uniq:
google.golang.org/grpc/internal/transport.(*http2Client).IncrMsgRecv
google.golang.org/grpc/internal/transport.(*http2Client).IncrMsgSent
google.golang.org/grpc/internal/transport.(*http2Client).keepalive
google.golang.org/grpc/internal/transport.(*http2Client).keepalive.stkobj
google.golang.org/grpc/internal/transport.(*http2Client).NewStream
google.golang.org/grpc/internal/transport.(*http2Client).newStream
google.golang.org/grpc/internal/transport.(*http2Client).NewStream
google.golang.org/grpc/internal/transport.(*http2Client).NewStream.func1
google.golang.org/grpc/internal/transport.(*http2Client).newStream.func1
google.golang.org/grpc/internal/transport.(*http2Client).NewStream.func2
google.golang.org/grpc/internal/transport.(*http2Client).newStream.func2
google.golang.org/grpc/internal/transport.(*http2Client).NewStream.func2
google.golang.org/grpc/internal/transport.(*http2Client).NewStream.func3
google.golang.org/grpc/internal/transport.(*http2Client).newStream.func3
One idea here would be to encode all names as a lookup to 3 different names, e.g. <pkgnameptr><typenameptr><methodnameptr>. This would of course reduce the size of pclntab; however, the strings wouldn't be easily fetchable from memory anymore, so there would need to be some sort of lookup table for the combined strings. But we can probably assume that most of the names won't be expanded at runtime.
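A toy of what that three-part lookup might look like; the nameOff type, the table layout, and the expand helper are all hypothetical.

```go
package main

import "fmt"

// nameOff indexes into a shared string table.
type nameOff int32

// funcName stores a symbol name as three references, so shared package
// and receiver prefixes appear only once in the binary.
type funcName struct {
	pkg, typ, method nameOff
}

var nameTable = []string{
	"google.golang.org/grpc/internal/transport", // 0
	"(*http2Client)",                            // 1
	"NewStream",                                 // 2
	"NewStream.func2",                           // 3
}

// expand reassembles the full name on demand; callers that never print
// a name never pay for the concatenation.
func (f funcName) expand() string {
	return fmt.Sprintf("%s.%s.%s",
		nameTable[f.pkg], nameTable[f.typ], nameTable[f.method])
}

func main() {
	fmt.Println(funcName{0, 1, 2}.expand())
	fmt.Println(funcName{0, 1, 3}.expand())
}
```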
There's also inefficiency in encoding anonymous structs:
type..eq.struct { _ golang.org/x/sys/cpu.CacheLinePad; HasAES bool; HasADX bool; HasAVX bool; HasAVX2 bool; HasBMI1 bool; HasBMI2 bool; HasERMS bool; HasFMA bool; HasOSXSAVE bool; HasPCLMULQDQ bool; HasPOPCNT bool; HasRDRAND bool; HasRDSEED bool; HasSSE2 bool; HasSSE3 bool; HasSSSE3 bool; HasSSE41 bool; HasSSE42 bool; _ golang.org/x/sys/cpu.CacheLinePad }
type..eq.struct { reflect.ityp *reflect.rtype; reflect.typ *reflect.rtype; reflect.hash uint32; _ [4]uint8; reflect.fun [100000]unsafe.Pointer }
type..eq.struct { runtime.cycle uint32; runtime.flushed bool }
type..eq.struct { runtime.enabled bool; runtime.pad [3]uint8; runtime.needed bool; runtime.cgo bool; runtime.alignme uint64 }
type..eq.struct { runtime.full runtime.lfstack; runtime.empty runtime.lfstack; runtime.pad0 internal/cpu.CacheLinePad; runtime.wbufSpans struct { runtime.lock runtime.mutex; runtime.free runtime.mSpanList; runtime.busy runtime.mSpanList }; _ uint32; runtime.bytesMarked uint64; runtime.markrootNext uint32; runtime.markrootJobs uint32; runtime.nproc uint32; runtime.tstart int64; runtime.nwait uint32; runtime.ndone uint32; runtime.nFlushCacheRoots int; runtime.nDataRoots int; runtime.nBSSRoots int; runtime.nSpanRoots int; runtime.nStackRoots int; runtime.startSema uint32; runtime.markDoneSema uint32; runtime.bgMarkReady runtime.note; runtime.bgMarkDone uint32; runtime.mode runtime.gcMode; runtime.userForced bool; runtime.totaltime int64; runtime.initialHeapLive uint64; runtime.assistQueue struct { runtime.lock runtime.mutex; runtime.q runtime.gQueue }; runtime.sweepWaiters struct { runtime.lock runtime.mutex; runtime.list runtime.gList }; runtime.cycles uint32; runtime.stwprocs int32; runtime.maxprocs int32;
Here one idea would be to use the variable name for the typename symbol rather than the type description itself. Of course, the anonymous struct types are easy to fix manually as well.
Volunteer (and often reluctant) toolspeed cop here. (Sorry, @dr2chase.)
CL 102435 has a non-trivial impact on compilation speed, memory usage, and binary size.
Reading the CL description, I can't quite tell what the value to the user is that offsets that impact.
I'd like to revisit whether the speed, memory, and binary size impacts can be mitigated at all before Go 1.11 is released, and I personally would like to understand better what benefits the CL brings.
Also, as an aside, I am slightly concerned about this bit from the CL description:

> Value order should not matter, and cannot be relied upon.

(And @randall77 has an old, outstanding CL to randomize Value order to enforce that.) It is also unclear whether the fuse.go changes need to be permanent or not.
I have very little dedicated laptop time in the immediate future due to personal life stuff, but I wanted to flag this right away. I'll look carefully at the CL and the impact as soon as I can.
cc @dr2chase @aclements @bradfitz