golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

cmd/compile: revisit statement boundaries CL performance and binary size impact #25426

Open josharian opened 6 years ago

josharian commented 6 years ago

Volunteer (and often reluctant) toolspeed cop here. (Sorry, @dr2chase.)

CL 102435 has a non-trivial impact on compilation speed, memory usage, and binary size:

name        old time/op       new time/op       delta
Template          176ms ± 2%        181ms ± 3%  +2.61%  (p=0.000 n=45+50)
Unicode          87.5ms ± 5%       87.9ms ± 4%    ~     (p=0.147 n=48+49)
GoTypes           557ms ± 4%        569ms ± 2%  +2.18%  (p=0.000 n=42+44)
Compiler          2.65s ± 3%        2.70s ± 3%  +1.82%  (p=0.000 n=49+49)
SSA               7.16s ± 2%        7.37s ± 2%  +3.00%  (p=0.000 n=48+47)
Flate             118ms ± 2%        123ms ± 3%  +4.05%  (p=0.000 n=48+49)
GoParser          138ms ± 3%        143ms ± 2%  +3.28%  (p=0.000 n=49+47)
Reflect           360ms ± 3%        367ms ± 3%  +1.76%  (p=0.000 n=48+48)
Tar               157ms ± 4%        160ms ± 3%  +2.31%  (p=0.000 n=50+49)
XML               201ms ± 4%        207ms ± 3%  +2.79%  (p=0.000 n=48+49)
[Geo mean]        353ms             362ms       +2.42%

name        old user-time/op  new user-time/op  delta
Template          215ms ± 3%        219ms ± 3%  +1.67%  (p=0.000 n=48+49)
Unicode           110ms ± 5%        110ms ± 3%    ~     (p=0.051 n=48+46)
GoTypes           741ms ± 4%        749ms ± 3%  +1.05%  (p=0.000 n=47+46)
Compiler          3.60s ± 4%        3.63s ± 2%  +0.84%  (p=0.002 n=44+49)
SSA               10.3s ± 4%        10.5s ± 2%  +2.13%  (p=0.000 n=44+46)
Flate             138ms ± 3%        143ms ± 3%  +3.28%  (p=0.000 n=48+46)
GoParser          159ms ± 3%        175ms ± 4%  +9.82%  (p=0.000 n=50+47)
Reflect           464ms ± 2%        466ms ± 3%  +0.47%  (p=0.020 n=47+49)
Tar               195ms ± 4%        198ms ± 3%  +1.40%  (p=0.000 n=50+46)
XML               241ms ± 9%        258ms ± 3%  +7.04%  (p=0.000 n=50+48)
[Geo mean]        446ms             458ms       +2.79%

name        old alloc/op      new alloc/op      delta
Template         35.1MB ± 0%       36.8MB ± 0%   +4.91%  (p=0.008 n=5+5)
Unicode          29.3MB ± 0%       29.8MB ± 0%   +1.59%  (p=0.008 n=5+5)
GoTypes           115MB ± 0%        121MB ± 0%   +5.15%  (p=0.008 n=5+5)
Compiler          521MB ± 0%        560MB ± 0%   +7.48%  (p=0.008 n=5+5)
SSA              1.71GB ± 0%       1.91GB ± 0%  +11.69%  (p=0.008 n=5+5)
Flate            24.2MB ± 0%       25.4MB ± 0%   +4.91%  (p=0.008 n=5+5)
GoParser         28.1MB ± 0%       29.5MB ± 0%   +4.87%  (p=0.008 n=5+5)
Reflect          78.7MB ± 0%       82.4MB ± 0%   +4.65%  (p=0.008 n=5+5)
Tar              34.5MB ± 0%       36.1MB ± 0%   +4.62%  (p=0.008 n=5+5)
XML              43.3MB ± 0%       45.5MB ± 0%   +5.27%  (p=0.008 n=5+5)
[Geo mean]       78.1MB            82.4MB        +5.48%

name        old allocs/op     new allocs/op     delta
Template           328k ± 0%         336k ± 0%   +2.59%  (p=0.008 n=5+5)
Unicode            336k ± 0%         338k ± 0%   +0.37%  (p=0.008 n=5+5)
GoTypes           1.14M ± 0%        1.17M ± 0%   +2.29%  (p=0.008 n=5+5)
Compiler          4.77M ± 0%        4.88M ± 0%   +2.23%  (p=0.008 n=5+5)
SSA               13.7M ± 0%        14.0M ± 0%   +2.49%  (p=0.008 n=5+5)
Flate              220k ± 0%         226k ± 0%   +2.71%  (p=0.008 n=5+5)
GoParser           273k ± 0%         280k ± 0%   +2.29%  (p=0.008 n=5+5)
Reflect            940k ± 0%         971k ± 0%   +3.32%  (p=0.008 n=5+5)
Tar                321k ± 0%         330k ± 0%   +2.65%  (p=0.008 n=5+5)
XML                383k ± 0%         390k ± 0%   +1.94%  (p=0.008 n=5+5)
[Geo mean]         751k              768k        +2.28%

name        old text-bytes    new text-bytes    delta
HelloSize         672kB ± 0%        680kB ± 0%  +1.22%  (p=0.000 n=50+50)

name        old data-bytes    new data-bytes    delta
HelloSize         134kB ± 0%        134kB ± 0%    ~     (all equal)

name        old exe-bytes     new exe-bytes     delta
HelloSize        1.43MB ± 0%       1.49MB ± 0%  +4.00%  (p=0.000 n=50+50)

Reading the CL description, I can't quite tell what the value is to the user that offsets that impact.

I'd like to revisit whether the speed, memory impact, and binary size impacts can be mitigated at all before Go 1.11 is released, and I personally would like to understand better what benefits the CL brings.

Also, as an aside, I am slightly concerned about this bit from the CL description:

The code in fuse.go that glued together the value slices
of two blocks produced a result that depended on the
former capacities (not lengths) of the two slices.  This
causes differences in the 386 bootstrap, and also can
sometimes put values into an order that does a worse job
of preserving statement boundaries when values are removed.

Value order should not matter, and cannot be relied upon. (And @randall77 has an old, outstanding CL to randomize Value order to enforce that.) And it is unclear whether the fuse.go changes need to be permanent or not.

I have very little dedicated laptop time in the immediate future due to personal life stuff, but I wanted to flag this right away. I'll look carefully at the CL and the impact as soon as I can.

cc @dr2chase @aclements @bradfitz

dr2chase commented 6 years ago

We got a nice improvement in compilation time from improvements to imports; we get to spend some of that on other things.

There is in fact a dependence on value order, and not depending on value order would induce a much larger compile-time cost.

The value to the end user is that debugger breakpoints more often correspond to the correct position in the source code, and that the stepping/nexting experience within a debugger is much less erratic and confusing. This also means that printing values in the debugger more often yields the "correct" result (i.e., conforming to the user's expectations).

As a side effect, it becomes less necessary to get this same effect by changing the line number of "unimportant" instructions (to avoid debugger-visible line-number churn), which should make profiling slightly more accurate.

Late in the game I thought of what could be a better approach to this that should be both faster and less sensitive to value order, but it was very late, and the alternate method is completely untested.

dr2chase commented 6 years ago

So, re the late-in-the-game method, how about we make investigating that a high priority for 1.12? The method is to quit tracking at the value level until late scheduling, and instead record (for each block) the sequence of statements encountered within it, and then pin statement boundaries to values during/after scheduling. It should cost less time because it's only done once, and it removes the value-order dependence, which is a good thing.
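
For concreteness, a minimal sketch of that block-oriented idea, using simplified stand-in types (pos, value, and block are placeholders for the compiler's src.XPos, ssa.Value, and ssa.Block); this is an untested outline of the approach, not the actual implementation:

// pos stands in for src.XPos; a statement boundary is a position we want to
// remain breakpoint-able.
type pos struct {
	file string
	line int
}

type value struct {
	pos    pos
	isStmt bool
}

type block struct {
	values []*value // final scheduled order
	stmts  []pos    // statement positions recorded for this block, in source order
}

// pinStatements runs after scheduling: walk the scheduled values once and pin
// each recorded statement boundary to the first value carrying that position;
// everything else is explicitly marked as not-a-statement.
func pinStatements(b *block) {
	next := 0
	for _, v := range b.values {
		if next < len(b.stmts) && v.pos == b.stmts[next] {
			v.isStmt = true
			next++
		} else {
			v.isStmt = false
		}
	}
}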

It won't reduce allocations much beyond what we do now, because that's already been optimized down to O(1) allocation per function (I think) and it won't help binary size in the least because that information takes space no matter what. We are, however, talking about compressing debugging information in the near-ish future, so there's that hope.

josharian commented 6 years ago

Thanks, @dr2chase. I want to read a bit more before I reply to your comments.

One quick question, though. Was the intent that this CL only change DWARF info? I ask because I notice that the hello world text bytes are up 1.22%, and I would have expected that DWARF changes would only impact overall binary size (which is up 4%). And a quick late-night test with -ldflags=-w shows a binary size increase. The obvious follow-on questions (if I understand correctly) are: Where does the extra 1.22% come from? And has the effectiveness of any (important) optimizations been impaired?

dr2chase commented 6 years ago

I am not sure where that extra text came from, and I don't think I touched any of the optimizations. I'll look.

It is possible that some of the scheduler heuristics changed slightly because I rolled back some of the line-number-removals previously done to clean up debugger experience, and line number is one of the tie-breakers.

dr2chase commented 6 years ago

I see differences; they go in both directions. I think there is some other order-dependent bug in rewrite, triggered by my change to the value ordering in fuse. Here, "+" is the post-CL version, i.e., shorter. This is from strconv.roundShortest:

-    v148 = LEAQ <*decimal> {.autotmp_16} v2 : R9
-    v173 = MOVBloadidx1 <byte> v119 v148 v115 : R10 (u[byte])
+    v173 = MOVBloadidx1 <byte> {.autotmp_16} v119 v2 v115 : R9 (u[byte])

dr2chase commented 6 years ago

I did a quick experiment with the order of visiting blocks and values in rewrite.go, and it is definitely order-sensitive. Sadly, I proved this only by making the text larger. I suspect that if we visited blocks in reverse post order and also scheduled their contents, we would get the best outcome, though any new values allocated would probably invalidate the schedule.

dr2chase commented 6 years ago

This is amusing. The cause of the larger text is runtime.pclntab. But rewrite IS order dependent. However, changing fuse to preserve value order no matter what the respective block.Value capacities are actually tickles this bug in the proper direction to produce smaller code, not larger (and hence probably slightly more efficient).

runtime.pclntab is larger because I rolled back some changes intended to improve debugging the old way, where "unimportant" instructions are assigned the line number of their neighbors to avoid debugger-step-churn.

I probably should have done this in two CLs; I'll let that be my lesson. Your grumbling about a need for tools to understand rewrite application is also looking prescient.

-238    runtime.epclntab        2928,2690
-80     encoding/asn1.parseField        14784,14704
-48     encoding/json.(*decodeState).object     8320,8272
-48     vendor/golang_org/x/crypto/cryptobyte.(*String).ReadASN1Integer 1968,1920
-32     encoding/json.(*decodeState).literalStore       10176,10144
-32     fmt.(*pp).printArg      2688,2656
-32     reflect.DeepEqual       928,896
-32     regexp/syntax.(*parser).repeat  1552,1520
-32     runtime.mapaccess1      576,544
-32     runtime.mapaccess1_fast32       448,416
-32     runtime.updatememstats  672,640
-32     text/template.(*state).idealConstant    1104,1072
-32     text/template.(*state).walkRange        2128,2096
-32     vendor/golang_org/x/crypto/cryptobyte.(*String).ReadOptionalASN1Integer 2016,1984
-16     crypto/tls.(*serverHelloMsg).marshal    3280,3264
-16     crypto/x509.(*Certificate).checkNameConstraints 2512,2496
-16     encoding/asn1.MarshalWithParams 720,704
-16     encoding/asn1.UnmarshalWithParams       720,704
-16     encoding/gob.(*Decoder).Decode  576,560
-16     encoding/gob.(*Decoder).recvType        544,528
-16     encoding/gob.(*Encoder).Encode  288,272
-16     encoding/gob.(*Encoder).sendActualType  1696,1680
-16     encoding/gob.compileEnc 1808,1792
-16     fmt.(*pp).catchPanic    1264,1248
-16     fmt.(*pp).fmtBytes      2416,2400
-16     fmt.(*pp).printValue    10256,10240
-16     fmt.(*ss).scanOne       5024,5008
-16     fmt.intFromArg  512,496
-16     internal/poll.(*FD).Accept      704,688
-16     math/big.lehmerSimulate 384,368
-16     net/http.(*Request).write       3408,3392
-16     net/http.(*http2ClientConn).encodeHeaders       2736,2720
-16     reflect.Value.Elem      480,464
-16     reflect.Value.Field     352,336
-16     regexp.mergeRuneSets    2608,2592
-16     runtime.(*mcache).releaseAll    176,160
-16     runtime.(*mspan).sweep  2400,2384
-16     runtime.adjustpointers  576,560
-16     runtime.advanceEvacuationMark   240,224
-16     runtime.clearpools      368,352
-16     runtime.freeStackSpans  432,416
-16     runtime.freedefer.func1 368,352
-16     runtime.mallocinit      608,592
-16     runtime.mapaccess1_fast64       448,432
-16     runtime.mapaccess1_faststr      1088,1072
-16     runtime.mapaccess2      576,560
-16     runtime.mapaccess2_fast32       464,448
-16     runtime.mapaccess2_fast64       464,448
-16     runtime.mapaccess2_faststr      1152,1136
-16     runtime.moduledataverify1       1696,1680
-16     runtime.newdefer.func1  480,464
-16     runtime.purgecachedstats        192,176
-16     runtime.typelinksinit   1760,1744
-16     runtime.wbBufFlush1     640,624
-16     runtime/pprof.parseProcSelfMaps 1248,1232
-16     strconv.roundShortest   960,944
-16     text/template.(*state).evalArg  3872,3856
-16     text/template.(*state).evalCommand      2400,2384
-16     text/template.(*state).evalEmptyInterface       2048,2032
-16     text/template.(*state).evalField        3792,3776
-16     text/template.addValueFuncs     1360,1344
-16     text/template.evalArgs  544,528
-16     text/template.length    704,688
-16     vendor/golang_org/x/crypto/chacha20poly1305/internal/chacha20.XORKeyStream      656,640
-16     vendor/golang_org/x/net/http2/hpack.newStaticTable      864,848
-16     vendor/golang_org/x/net/route.ParseRIB  1232,1216
-4      runtime.findfunctab     17053,17049
4       runtime.erodata 387,391
4       runtime.etypes  387,391
16      crypto/elliptic.p224DoubleJacobian      1104,1120
16      crypto/tls.(*Conn).readRecord   5808,5824
16      crypto/tls.(*clientHelloMsg).unmarshal  4432,4448
16      encoding/gob.(*Encoder).encodeInterface 2624,2640
16      net/http.(*http2Framer).WriteGoAway     864,880
16      regexp.(*machine).onepass       1936,1952
16      syscall.ParseDirent     1168,1184
16      text/template.(*state).evalCall 3824,3840
16      text/template.(*state).walkTemplate     992,1008
64      internal/poll.(*FD).ReadDirent  368,432
288     go.func.*       182310,182598
58414   runtime.pclntab 1671824,1730238

josharian commented 6 years ago

Sorry for all the delays from me. Here's my set of reactions having skimmed the CL and your investigatory results (thanks!).

And with that list, I'm probably going to mostly disappear again for another couple of days. My apologies. :(

dr2chase commented 6 years ago

I could start on the alternative approach, since that is hoped for in 1.12, but I think the order dependence is more annoying than fraught. I'm pretty sure the quadratic risk has been reduced after dealing with Austin's remarks; not 100% certain, but its result is now used for skip-ahead, which I think fixes it.

pclntab is "s", whatever that is, and it might be the fault of the size command that it is considered text. I don't want to undo the line-numbering change just for the sake of shrinking that table; that change is believed to make the numbering more accurate (for profiling purposes) relative to the earlier compromise between debugging experience and profiling accuracy.

I'm unsure about the goodness of detecting statements earlier; this worked better than I expected, enough that I would want examples (bug reports) showing failure. As far as I know, most of the existing failures are caused by peculiar combinations of rewriting and ordering, and not by screwups in initial ordering. It's my belief that changing to the block-oriented approach would avoid all these bad-luck bugs.

I'll look into the proposed cleanups, with an eye towards tiny, tractable CLs, since this is late in the game.

mdempsky commented 6 years ago

I recommended an alternative and simpler statement tracking scheme in CL 93663: https://go-review.googlesource.com/c/go/+/93663#message-5b114391669f7db29632eb5b66dbec412c169e6a

gopherbot commented 6 years ago

Change https://golang.org/cl/113717 mentions this issue: cmd/compile: refactor and cleanup of common code/workaround

dr2chase commented 6 years ago

I checked for quadratic behavior of nextGoodStatementIndex; I think it is mostly okay. There are three uses, and in two of them the result is used to skip ahead (and the cost of skipping ahead is proportional to the distance skipped).

In the third it is not:

if v.Pos.IsStmt() == src.PosIsStmt {
    if k := nextGoodStatementIndex(v, j, b); k != j {
        v.Pos = v.Pos.WithNotStmt()
        b.Values[k].Pos = b.Values[k].Pos.WithIsStmt()
    }
}

Quadratic behavior is limited to the number of statements in a row that are marked PosIsStmt for the same line with a "not Good" opcode. Such runs are not supposed to occur, though rewriting might create them -- but their size would be limited to the expansion of an instruction that was marked as a statement. I'd like to do better, but I think this risk is well-fenced.
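
For reference, a toy illustration (not the compiler's code) of why the two skip-ahead uses stay linear: the caller jumps past everything the helper already scanned, so no value is examined by more than one call.

// markBoundaries scans isGood the way the skip-ahead callers of
// nextGoodStatementIndex do: the inner scan and the outer loop share the
// index, so total work is O(len(isGood)) rather than quadratic.
func markBoundaries(isGood []bool) (marks []int) {
	for j := 0; j < len(isGood); {
		k := j
		for k < len(isGood) && !isGood[k] {
			k++ // forward scan for the next "good" slot
		}
		if k == len(isGood) {
			break
		}
		marks = append(marks, k)
		j = k + 1 // skip ahead: the scanned range is never revisited
	}
	return marks
}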

dr2chase commented 6 years ago

I'm not following this:

In fuse, should clear the tail of c.Values in the in-place case, to allow them to be GC'd.

c.Values gets b.Values followed by c.Values. For the in-place case, c.Values is expanded, then the old values are moved to higher indices with copy(c.Values[bl:], c.Values). This creates no new non-zero entries, since the "old" c.Values is overwritten by some combination of values from b and c, and the newly addressed area (between indices cl and bl+cl) is filled with the last bl elements of the old c.Values.
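
For anyone following along, a standalone sketch of that in-place case (bl and cl mirror the discussion, and the generic slice stands in for []*Value; this is not the actual fuse.go code):

// glueInPlace puts b's values in front of c's, reusing c's backing array when
// it has room. Every slot in the used region [0, bl+cl) ends up overwritten by
// a value from b or c, which is why no explicit clearing is needed for GC here.
func glueInPlace[V any](bVals, cVals []V) []V {
	bl, cl := len(bVals), len(cVals)
	cVals = append(cVals, make([]V, bl)...) // grow to final length bl+cl
	copy(cVals[bl:], cVals[:cl])            // shift old c values to the tail
	copy(cVals[:bl], bVals)                 // b's values take the front
	return cVals
}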

gopherbot commented 6 years ago

Change https://golang.org/cl/113719 mentions this issue: cmd/compile: common up code in fuse for joining blocks

josharian commented 6 years ago

I'm not following this

And with good reason. :) Please ignore.

Thanks for checking on the quadratic concern.

ianlancetaylor commented 6 years ago

What is the status of this issue? Is there more to do for 1.11? For 1.12?

dr2chase commented 6 years ago

Not more to do in 1.11, for 1.12 we might look into one or another different and perhaps more efficient ways of assigning statement boundaries.

valyala commented 5 years ago

There is a blog post about big runtime.pclntab from CockroachDB devs - https://science.raphael.poss.name/go-executable-size-visualization-with-d3.html . Probably, pclntab could be compressed somehow?

dr2chase commented 5 years ago

As I read the blog post, it was compressed, but for performance reasons we quit compressing it. It helps to be able to quantify the cost of binary size, otherwise it will come up short in any cost comparison versus runtime performance, latency, etc.

josharian commented 5 years ago

There has been a push recently to further reduce start-up times; I don't think unilaterally deciding to (naively) compress pclntab again is a good idea. (And we're not going to introduce a flag.)

A better alternative is to do some serious engineering work and look for alternative encodings of that data.

I experimented a few years ago with using a smaller encoding for pcvalue pairs. From my notes to myself at that time:

pc/value pairs in the go1.2 symtab are encoded using varint encoding. But most values are very small, so try using nibble-at-a-time varint encoding instead of byte-at-a-time varint encoding. You lose 1/4 of the bits to continuation markers instead of 1/8, but with small values it doesn't matter.

This only works because the values come in pairs. The precise encoding is: the first four bits are value, the second four bits are pc. The first bit of each nibble is a continuation bit. If both continuation bits are set, continue as before. If neither continuation bit is set, we're done. If one continuation bit is set but not the other, switch to byte-at-a-time varint encoding for the outstanding value.
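
A minimal encoder sketch of that scheme (the function name and the exact bit positions within the byte are my assumptions; this is not the actual symtab code):

import "encoding/binary"

// appendPair packs val and pc a nibble at a time: bit 7 and bits 6-4 hold the
// continuation bit and low 3 data bits of val, bit 3 and bits 2-0 the same for
// pc. When only one of the two still has bits left, its remainder is emitted
// as an ordinary byte-at-a-time uvarint.
func appendPair(buf []byte, val, pc uint64) []byte {
	for {
		vMore, pMore := val>>3 != 0, pc>>3 != 0
		b := byte(val&7)<<4 | byte(pc&7)
		if vMore {
			b |= 0x80
		}
		if pMore {
			b |= 0x08
		}
		buf = append(buf, b)
		val >>= 3
		pc >>= 3
		if vMore && pMore {
			continue // both continuation bits set: emit another packed byte
		}
		if vMore {
			return binary.AppendUvarint(buf, val)
		}
		if pMore {
			return binary.AppendUvarint(buf, pc)
		}
		return buf // neither continuation bit set: pair is done
	}
}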

This actually does prove to be a more space-efficient encoding, just not enough of an improvement to warrant changing the symtab, since there are many consumers of it.

There are also other streaming integer compression techniques, like varint-GB, Stream VByte, and Roaring Bitmaps. They are all optimized for different purposes, so it is reasonably likely that any of them would need to be adapted a bit for Go's particular data structure and needs. The point is merely that there is room for experimentation before reaching for the compression sledgehammer.

Another thing that I've pondered is supporting some kind of lookback in the encoding. Inlined code is very expensive in the pclntab, because the encoding required for a cross-file, cross-line jump is long. But for inlined code, we are often hopping back and forth between a few locations. Having a way to refer not just to the previous value but to a value several entries back could help. (This is similar to keyframes and lookback in video encoding.) This has a runtime cost; again, investigation needed.
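
A sketch of what such a lookback reference could look like on the decode side (the entry layout and field names are hypothetical, not a proposed format):

// entry either extends the running delta chain or points back at one of the
// recently emitted values, which is cheap when inlined code keeps hopping
// between the same few file/line positions.
type entry struct {
	backref int   // 0: apply delta to the previous value; k>0: reuse the value k entries back
	delta   int64 // only meaningful when backref == 0
}

func decodeValues(entries []entry) []int64 {
	out := make([]int64, 0, len(entries))
	var cur int64
	for _, e := range entries {
		if e.backref > 0 {
			cur = out[len(out)-e.backref] // keyframe-style lookback
		} else {
			cur += e.delta
		}
		out = append(out, cur)
	}
	return out
}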

And there's more to the pclntab than just pcvalue entries.

The next step here is probably for someone to spend some quality time investigating all the design decisions in pclntab and coming up with an improved design and quantifying the impact it would have (as @dr2chase says).

josharian commented 5 years ago

Yet another option is to compress subsections of pclntab using a compression technique that is designed for speed and low overhead, such that it can get decompressed on the fly, every single time it is needed, without appreciable performance loss. I am far from an expert in this area, but heatshrink and blosc come to mind.
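
As a shape of what that would look like, a sketch using compress/flate purely as a stand-in for a faster codec such as heatshrink or blosc (whether the decode cost really stays negligible is exactly the open question):

import (
	"bytes"
	"compress/flate"
	"io"
)

// readSubsection decompresses one pclntab subsection on every access and
// never caches the result, trading CPU on each lookup for resident size.
func readSubsection(compressed []byte) ([]byte, error) {
	r := flate.NewReader(bytes.NewReader(compressed))
	defer r.Close()
	return io.ReadAll(r)
}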

randall77 commented 5 years ago

It has always bothered me that we have essentially a separate pcln table for every function. I suspect lots of functions are small, in which case maybe we should batch the tables across a larger region, say a 4K page. From a random grep of an objdump of the go tool, we're currently at about 500 bytes/function.
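
A sketch of what batching by page could mean concretely (the types, field names, and lookup are hypothetical, not the runtime's actual funcdata layout):

import "sort"

// pagePcln holds the pc/value streams for every function whose entry PC falls
// in one 4K region, so fixed per-function overhead is paid once per page.
type pagePcln struct {
	entryPC []uintptr // sorted entry PCs of the functions in this page
	dataOff []uint32  // dataOff[i] is where function i's stream starts in data
	data    []byte    // concatenated pc/value streams for the whole page
}

// find returns the stream for the function containing pc, or nil.
func (p *pagePcln) find(pc uintptr) []byte {
	i := sort.Search(len(p.entryPC), func(i int) bool { return p.entryPC[i] > pc }) - 1
	if i < 0 {
		return nil
	}
	end := len(p.data)
	if i+1 < len(p.dataOff) {
		end = int(p.dataOff[i+1])
	}
	return p.data[p.dataOff[i]:end]
}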

egonelbre commented 5 years ago

One of the things I noticed while examining the pclntab is that there is a lot of repetition in symbol names.

As a snippet from strings pclntab | sort | uniq:

google.golang.org/grpc/internal/transport.(*http2Client).IncrMsgRecv
google.golang.org/grpc/internal/transport.(*http2Client).IncrMsgSent
google.golang.org/grpc/internal/transport.(*http2Client).keepalive
google.golang.org/grpc/internal/transport.(*http2Client).keepalive.stkobj
google.golang.org/grpc/internal/transport.(*http2Client).NewStream
google.golang.org/grpc/internal/transport.(*http2Client).newStream
google.golang.org/grpc/internal/transport.(*http2Client).NewStream
google.golang.org/grpc/internal/transport.(*http2Client).NewStream.func1
google.golang.org/grpc/internal/transport.(*http2Client).newStream.func1
google.golang.org/grpc/internal/transport.(*http2Client).NewStream.func2
google.golang.org/grpc/internal/transport.(*http2Client).newStream.func2
google.golang.org/grpc/internal/transport.(*http2Client).NewStream.func2
google.golang.org/grpc/internal/transport.(*http2Client).NewStream.func3
google.golang.org/grpc/internal/transport.(*http2Client).newStream.func3

One idea here would be to encode each name as a lookup to 3 different names, e.g. <pkgnameptr><typenameptr><methodnameptr>. This would of course reduce the size of pclntab; however, the strings wouldn't be easily fetchable from memory anymore, so there would need to be some sort of lookup table for the combined strings. But we can probably assume that most of the names won't be expanded at runtime.
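
A toy version of that three-part scheme (the component table, indices instead of raw pointers, and the on-demand join are all assumptions made for the sketch):

// nameRef splits a symbol name into three indices into a shared component
// table, so the long package path and receiver strings are stored only once.
type nameRef struct {
	pkg, typ, method int
}

var components = []string{
	"google.golang.org/grpc/internal/transport",
	"(*http2Client)",
	"NewStream",
	"NewStream.func1",
}

// fullName reassembles the symbol name on demand; the combined string no
// longer exists contiguously in memory, which is the lookup cost noted above.
func fullName(r nameRef) string {
	return components[r.pkg] + "." + components[r.typ] + "." + components[r.method]
}

// Example: fullName(nameRef{0, 1, 3}) yields
// "google.golang.org/grpc/internal/transport.(*http2Client).NewStream.func1".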

There's also inefficiency in encoding anonymous structs:

type..eq.struct { _ golang.org/x/sys/cpu.CacheLinePad; HasAES bool; HasADX bool; HasAVX bool; HasAVX2 bool; HasBMI1 bool; HasBMI2 bool; HasERMS bool; HasFMA bool; HasOSXSAVE bool; HasPCLMULQDQ bool; HasPOPCNT bool; HasRDRAND bool; HasRDSEED bool; HasSSE2 bool; HasSSE3 bool; HasSSSE3 bool; HasSSE41 bool; HasSSE42 bool; _ golang.org/x/sys/cpu.CacheLinePad }
type..eq.struct { reflect.ityp *reflect.rtype; reflect.typ *reflect.rtype; reflect.hash uint32; _ [4]uint8; reflect.fun [100000]unsafe.Pointer }
type..eq.struct { runtime.cycle uint32; runtime.flushed bool }
type..eq.struct { runtime.enabled bool; runtime.pad [3]uint8; runtime.needed bool; runtime.cgo bool; runtime.alignme uint64 }
type..eq.struct { runtime.full runtime.lfstack; runtime.empty runtime.lfstack; runtime.pad0 internal/cpu.CacheLinePad; runtime.wbufSpans struct { runtime.lock runtime.mutex; runtime.free runtime.mSpanList; runtime.busy runtime.mSpanList }; _ uint32; runtime.bytesMarked uint64; runtime.markrootNext uint32; runtime.markrootJobs uint32; runtime.nproc uint32; runtime.tstart int64; runtime.nwait uint32; runtime.ndone uint32; runtime.nFlushCacheRoots int; runtime.nDataRoots int; runtime.nBSSRoots int; runtime.nSpanRoots int; runtime.nStackRoots int; runtime.startSema uint32; runtime.markDoneSema uint32; runtime.bgMarkReady runtime.note; runtime.bgMarkDone uint32; runtime.mode runtime.gcMode; runtime.userForced bool; runtime.totaltime int64; runtime.initialHeapLive uint64; runtime.assistQueue struct { runtime.lock runtime.mutex; runtime.q runtime.gQueue }; runtime.sweepWaiters struct { runtime.lock runtime.mutex; runtime.list runtime.gList }; runtime.cycles uint32; runtime.stwprocs int32; runtime.maxprocs int32; 

Here one idea would be to use the variable name for the typename symbol rather than the type description itself. Of course, the anonymous struct types are easy to fix manually as well.