golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

cmd/compile: slow to compile large map literals containing dynamic elements #19751

Open myleshorton opened 7 years ago

myleshorton commented 7 years ago

What version of Go are you using (go version)?

go version go1.8 darwin/amd64

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/afisk/gopath"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/_q/9g3l6s7x3v92ws74nng9g2g00000gn/T/go-build418677857=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

Used text/template to generate a very large Go file to embed data, and then just ran go build:

$ git clone https://github.com/getlantern/httpseverywhere.git
$ cd httpseverywhere
$ git checkout too-large-for-go
$ go build

What did you expect to see?

A successfully built binary

What did you see instead?

$ go build
# github.com/getlantern/httpseverywhere
panic: runtime error: makeslice: len out of range

goroutine 1 [running]:
cmd/compile/internal/gc.newliveness(0xc4272221b0, 0xc42011c238, 0xc49bc7e000, 0x4a651, 0x5ac00, 0xc4da1da000, 0x832c, 0x9800, 0xc48ef5f898)
    /usr/local/go/src/cmd/compile/internal/gc/plive.go:717 +0x145
cmd/compile/internal/gc.liveness(0xc4272221b0, 0xc42011c238, 0xc4c21321c0, 0xc4c2132230)
    /usr/local/go/src/cmd/compile/internal/gc/plive.go:1740 +0x14f
cmd/compile/internal/gc.genssa(0xc45d30b440, 0xc42011c238, 0xc4c21321c0, 0xc4c2132230)
    /usr/local/go/src/cmd/compile/internal/gc/ssa.go:4500 +0x77c
cmd/compile/internal/gc.compile(0xc4272221b0)
    /usr/local/go/src/cmd/compile/internal/gc/pgen.go:443 +0x707
cmd/compile/internal/gc.funccompile(0xc4272221b0)
    /usr/local/go/src/cmd/compile/internal/gc/dcl.go:1292 +0xdc
cmd/compile/internal/gc.fninit(0xc420115c00, 0x3b, 0x70)
    /usr/local/go/src/cmd/compile/internal/gc/init.go:164 +0xb8d
cmd/compile/internal/gc.Main()
    /usr/local/go/src/cmd/compile/internal/gc/main.go:471 +0x23b1
main.main()
    /usr/local/go/src/cmd/compile/main.go:50 +0xfe
myleshorton commented 7 years ago

@josharian here's a new issue originally described at https://github.com/golang/go/issues/14082

josharian commented 7 years ago

Thanks, @myleshorton.

The problematic file is a giant composite literal, a map whose elements contain maps. The compiler converts this into static initialization code in which each element in turn is initialized, and then placed into the top-level map. It all ends up in one giant function. Almost all the time is spent in phi insertion, at least on tip.

The file is 300k lines long. 1k lines take 0.2s to compile. 10k lines take 2s. 50k lines take 30s. That's as far as I got.

I didn't see any memory blow-up in liveness, and most memory is allocated in the usual places (nodes, stackalloc), although tip's liveness implementation is significantly different than 1.8's, so maybe that particular problem is fixed.

Here's one idea. We could break up giant composite literal initialization into multiple functions. Each function by itself would be much faster to compile, and as long as they don't get too granular, I doubt we'd lose much optimization opportunity or slow things down due to call overhead.

The main problem here is probably implementation--coordinating compilation and calling of all the extra functions. Maybe a binary splitting approach until each chunk's size is manageable. This has the additional benefit of (theoretically!) allowing concurrent compilation of giant init functions.

This suggestion pains me a little, since I'm trying to figure out how to eliminate the convolution caused by the "drop everything and compile this function" thing we currently do all over the compiler, but maybe if I find a good general solution to the problem, this can fit into it.

Thoughts, @randall77 or @mdempsky?

josharian commented 7 years ago

I took a stab at this but haven't gotten anything working yet. One sticking point is that the dynamic entries may refer to other local variables. Another sticking point is that setting up a new function manually is a pain.

In the meantime, @myleshorton, would you mind running an experiment? (I'd do it myself, but I don't see your template/generator code checked in anywhere obvious. Or just point me to it.)

Instead of generating code like:

var targetsDict = map[string]*Targets{ 
    "": &Targets{
        Plain: map[string]bool{ 
            "s.lair.io": true,
            "xserver.*.jp": true,
            "youtube.co.ke": true,
        },
    },
    "0": &Targets{
        Plain: map[string]bool{ 
            "bit.0.md": true,
        },
    },
    "0-60": &Targets{
        Plain: map[string]bool{ 
            "0-60.in": true,
        },
    },
    // ...

Try generating:

var targetsDict = map[string]*Targets{}

func init() {
    targetsDict[""] = &Targets{
        Plain: map[string]bool{ 
            "s.lair.io": true,
            "xserver.*.jp": true,
            "youtube.co.ke": true,
        },
    }
    targetsDict["0"] = &Targets{
        Plain: map[string]bool{ 
            "bit.0.md": true,
        },
    }
    // ...
}

func init() {
    targetsDict["0-60"] = &Targets{
        Plain: map[string]bool{ 
            "0-60.in": true,
        },
    }
    // ...
}

where each of the init functions contains (say) 100 or 1000 of the entries. Each of the functions should be called init. That should in theory result in the same behavior when compiled, but be much faster to compile. I'd love to hear whether that theory holds water, since it is more or less what I'd like to get the compiler to do automatically. And if it works, you'll be unblocked. :)

josharian commented 7 years ago

I wrote a quick parser and did some manual editing and hacked together a version of targets.go that puts every single targetsDict assignment in its own init function. (This was easier for a quick hack than grouping them into 100s or 1000s.)

This reduced the compilation time to 33 seconds on my machine. (It used to take 22 minutes to finally panic in plive.) So this approach has some promise.

randall77 commented 7 years ago

I'd like to understand what isn't linear about the compilation, that makes breaking functions up worthwhile. I'd be much happier if we could find and fix that nonlinearity instead. @josharian You said all the time was in phi insertion. Any particular pattern that leads to nonlinearity?

josharian commented 7 years ago

@randall77 I'm already looking into it. :)

gopherbot commented 7 years ago

CL https://golang.org/cl/39294 mentions this issue.

gopherbot commented 7 years ago

CL https://golang.org/cl/39351 mentions this issue.

josharian commented 7 years ago

OK, after the previous commits are in, the compiler panics much faster. Yay!

The panic occurs because there are lots of basic blocks, and we try to bulk allocate a slice with length longer than an int32 can hold. Eep. I'll send a CL soon to at least turn that into a Fatal instead of a panic.

The basic blocks in this case all come from branches from checking whether runtime.writeBarrier is set.

I see lots of possible fixes, but the one I find most appealing is asserting (and ensuring!) that runtime.writeBarrier is off during init, and then assuming as much in the SSA writebarrier pass.

(Other fixes including more gracefully handling lots of blocks in liveness or following my earlier suggestion of breaking up giant init functions.)

@randall77 @cherrymui opinions about this?

gopherbot commented 7 years ago

CL https://golang.org/cl/39354 mentions this issue.

randall77 commented 7 years ago

We can't assume write barriers are off during init, unfortunately. An init function can have arbitrarily long-running code in it.

What part of the compiler is doing the bulk allocating? The actual block ID allocator will panic if we try to allocate an ID that's too big. Or is something else trying to preallocate and gets too aggressive?

Sounds like we should put some more effort into optimizing phi insertion, at least.

josharian commented 7 years ago

We can't assume write barriers are off during init, unfortunately. An init function can have arbitrarily long-running code in it.

Could we do it during autogenerated inits? IIRC those all complete before user-provided inits run, or at least, they could probably be made to. But I guess maybe that's overkill for this problem.

What part of the compiler is doing the bulk allocating? The actual block ID allocator will panic if we try to allocate an ID that's too big. Or is something else trying to preallocate and gets too aggressive?

It's the call to bvbulkalloc in newliveness. We're calling it with nbit=48645 and count=2074940, so nword=1521 and nword*count=3155983740, leading to the call make([]uint32, 3155983740), which is way too big.

I think it might take some significant refactoring to avoid this.

It might be better and easier to try to split up large init functions, as proposed above. Note that there is some history of doing this manually: https://codereview.appspot.com/72590045. I'd say we could just keep asking users to split them up manually, since any init that big must be autogenerated, but splitting them up in the compiler also has the benefit of allowing them to be compiled concurrently.

Sounds like we should put some more effort into optimizing phi insertion, at least.

That's definitely true in general, but the CLs that I've submitted so far mostly solve the phi problem, at least for this particular bit of code.

gopherbot commented 7 years ago

CL https://golang.org/cl/39358 mentions this issue.

josharian commented 7 years ago

Another idea: have static initialization track and reuse autotmps, much as order.go does. Will investigate.

cherrymui commented 7 years ago

The basic blocks in this case all come from branches from checking whether runtime.writeBarrier is set.

Yes, the write barrier pass creates many blocks. I filed issue #19838 for grouping write barriers more aggressively. But it actually doesn't help here: there are calls (newobject, mapassign_faststr, makemap) between each write barrier.

ghost commented 7 years ago

When I tried to compile this program with gccgo (Ubuntu 6.0.1-0ubuntu1) 6.0.0 20160414 (experimental) [trunk revision 234994], it ran for about 45 minutes, then started allocating my system's memory at such a rate that I was forced to abort.

josharian commented 7 years ago

have static initialization track and reuse autotmps

Challenge with this: We flip back and forth between sinit and walk, so it's hard to maintain state cleanly. Probably still worth doing, but it's going to require some significant refactoring.

josharian commented 7 years ago

have static initialization track and reuse autotmps

I hacked this in with some globals. There's lots of autotmp reuse, but (of course) the exact variable in use disappears by the time we reach liveness anyway, so it doesn't help with the liveness issue. And it doesn't help as much as you'd hope with the speed of getting to the panic. So for now, I think that is a dead end.

josharian commented 7 years ago

Yes, the write barrier pass creates many blocks. I filed issue #19838 for grouping write barriers more aggressively. But it actually doesn't help here: there are calls (newobject, mapassign_faststr, makemap) between each write barrier.

On a second look, I think it would. Though there are calls between each writebarrier-enabled check, calls don't create a new SSA block. Basically all the blocks in this function come from write barriers, and the size of the problematic alloc is directly proportional to the number of blocks.

That's not to say that if we fix this there wouldn't also be problems with the size of the generated stackmap. But it's possible those could be fruitfully addressed (for all functions, not just this one) by doing liveness compaction as we go, rather than all at the end.

Sample SSA output for a random chunk from the middle of this function, at the point that it arrives at liveness:

  b1789: <- b1787 b1788
    v815 = Phi <mem> v1968012 v1968006
    v1977408 = LEAQ <*uint8> {type.map[string]*"".Targets} v3 : AX
    v817 = MOVQstore <mem> v2 v1977408 v815
    v1977356 = LoadReg <map[string]*Targets> v2901053 : CX
    v818 = MOVQstore <mem> [8] v2 v1977356 v817
    v1977322 = LEAQ <*uint8> {go.string."0am"} v3 : DX
    v2374727 = MOVQstore <mem> [16] v2 v1977322 v818
    v819 = MOVQstoreconst <mem> [val=3,off=24] v2 v2374727
    v820 = CALLstatic <mem> {runtime.mapassign_faststr} [40] v819
    v821 = MOVQload <**Targets> [32] v2 v820 : AX
    v822 = LoweredNilCheck <void> v821 v820
    v1968089 = MOVLload <uint32> {runtime.writeBarrier} v3 v820 : CX
    v2705012 = TESTL <flags> v1968089 v1968089
    NE v2705012 -> b1784 b1785 (unlikely)
  b1785: <- b1789
    v2702048 = LoadReg <*Targets> v1977553 : CX
    v1968053 = MOVQstore <mem> v821 v2702048 v820
    Plain -> b1786
  b1786: <- b1784 b1785
    v823 = Phi <mem> v1968054 v1968053
    v2702059 = LEAQ <*uint8> {type."".Targets} v3 : AX
    v826 = MOVQstore <mem> v2 v2702059 v823
    v827 = CALLstatic <mem> {runtime.newobject} [16] v826
    v828 = MOVQload <*Targets> [8] v2 v827 : AX
    v2702051 = StoreReg <*Targets> v828 : .autotmp_327651[*Targets]
    v2900939 = LEAQ <*uint8> {type.map[string]bool} v3 : CX
    v832 = MOVQstore <mem> v2 v2900939 v827
    v833 = MOVQstoreconst <mem> [val=1,off=8] v2 v832
    v834 = MOVQstoreconst <mem> [val=0,off=16] v2 v833
    v835 = MOVQstoreconst <mem> [val=0,off=24] v2 v834
    v836 = CALLstatic <mem> {runtime.makemap} [40] v835
    v837 = MOVQload <map[string]bool> [32] v2 v836 : AX
    v1977227 = StoreReg <map[string]bool> v837 : .autotmp_327673[map[string]bool]
    v1977261 = LEAQ <*uint8> {type.map[string]bool} v3 : CX
    v840 = MOVQstore <mem> v2 v1977261 v836
    v841 = MOVQstore <mem> [8] v2 v837 v840
    v2702040 = LEAQ <*uint8> {go.string."by.0atz.com"} v3 : DX
    v2374725 = MOVQstore <mem> [16] v2 v2702040 v841
    v842 = MOVQstoreconst <mem> [val=11,off=24] v2 v2374725
    v843 = CALLstatic <mem> {runtime.mapassign_faststr} [40] v842
    v844 = MOVQload <*bool> [32] v2 v843 : AX
    v846 = MOVBstoreconst <mem> [val=1,off=0] v844 v843
    v2900937 = LoadReg <*Targets> v2702051 : AX
    v847 = LoweredNilCheck <void> v2900937 v846
    v1968131 = MOVLload <uint32> {runtime.writeBarrier} v3 v846 : CX
    v848 = ADDQconst <*map[string]bool> [64] v2900937 : DX
    v2705010 = TESTL <flags> v1968131 v1968131
    NE v2705010 -> b1781 b1782 (unlikely)
  b1782: <- b1786
    v2702032 = LoadReg <map[string]bool> v1977227 : CX
    v1968096 = MOVQstore <mem> [64] v2900937 v2702032 v846
    Plain -> b1783
  b1783: <- b1781 b1782
    v849 = Phi <mem> v1968097 v1968096
    v2702043 = LEAQ <*uint8> {type.map[string]*"".Targets} v3 : AX
    v851 = MOVQstore <mem> v2 v2702043 v849
    v2900935 = LoadReg <map[string]*Targets> v2901053 : CX
    v852 = MOVQstore <mem> [8] v2 v2900935 v851
    v1977175 = LEAQ <*uint8> {go.string."0atz"} v3 : DX
    v2375035 = MOVQstore <mem> [16] v2 v1977175 v852
    v853 = MOVQstoreconst <mem> [val=4,off=24] v2 v2375035
    v854 = CALLstatic <mem> {runtime.mapassign_faststr} [40] v853
    v855 = MOVQload <**Targets> [32] v2 v854 : AX
    v856 = LoweredNilCheck <void> v855 v854
    v1968174 = MOVLload <uint32> {runtime.writeBarrier} v3 v854 : CX
    v2705008 = TESTL <flags> v1968174 v1968174
    NE v2705008 -> b1778 b1779 (unlikely)
  b1779: <- b1783
    v2702024 = LoadReg <*Targets> v2702051 : CX
    v1968142 = MOVQstore <mem> v855 v2702024 v854
    Plain -> b1780
  b1780: <- b1778 b1779
    v857 = Phi <mem> v1968150 v1968142
    v2702035 = LEAQ <*uint8> {type."".Targets} v3 : AX
    v860 = MOVQstore <mem> v2 v2702035 v857
    v861 = CALLstatic <mem> {runtime.newobject} [16] v860
    v862 = MOVQload <*Targets> [8] v2 v861 : AX
    v2702016 = StoreReg <*Targets> v862 : .autotmp_327651[*Targets]
    v2900933 = LEAQ <*uint8> {type.map[string]bool} v3 : CX
    v866 = MOVQstore <mem> v2 v2900933 v861
    v867 = MOVQstoreconst <mem> [val=1,off=8] v2 v866
    v868 = MOVQstoreconst <mem> [val=0,off=16] v2 v867
    v869 = MOVQstoreconst <mem> [val=0,off=24] v2 v868
    v870 = CALLstatic <mem> {runtime.makemap} [40] v869
    v871 = MOVQload <map[string]bool> [32] v2 v870 : AX
    v2900931 = StoreReg <map[string]bool> v871 : .autotmp_327674[map[string]bool]
    v1977132 = LEAQ <*uint8> {type.map[string]bool} v3 : CX
    v874 = MOVQstore <mem> v2 v1977132 v870
    v875 = MOVQstore <mem> [8] v2 v871 v874
    v1977080 = LEAQ <*uint8> {go.string."x.0b00000000.me"} v3 : DX
    v2374719 = MOVQstore <mem> [16] v2 v1977080 v875
    v876 = MOVQstoreconst <mem> [val=15,off=24] v2 v2374719
    v877 = CALLstatic <mem> {runtime.mapassign_faststr} [40] v876
    v878 = MOVQload <*bool> [32] v2 v877 : AX
    v880 = MOVBstoreconst <mem> [val=1,off=0] v878 v877
    v2702027 = LoadReg <*Targets> v2702016 : AX
    v881 = LoweredNilCheck <void> v2702027 v880
    v1968218 = MOVLload <uint32> {runtime.writeBarrier} v3 v880 : CX
    v882 = ADDQconst <*map[string]bool> [64] v2702027 : DX
    v2705006 = TESTL <flags> v1968218 v1968218
    NE v2705006 -> b1775 b1776 (unlikely)
  b1776: <- b1780
    v1977046 = LoadReg <map[string]bool> v2900931 : CX
    v1968182 = MOVQstore <mem> [64] v2702027 v1977046 v880
    Plain -> b1777
  b1777: <- b1775 b1776
    v883 = Phi <mem> v1968183 v1968182
    v2702008 = LEAQ <*uint8> {type.map[string]*"".Targets} v3 : AX
    v885 = MOVQstore <mem> v2 v2702008 v883
    v2702019 = LoadReg <map[string]*Targets> v2901053 : CX
    v886 = MOVQstore <mem> [8] v2 v2702019 v885
    v2900929 = LEAQ <*uint8> {go.string."0b00000000"} v3 : DX
    v2374717 = MOVQstore <mem> [16] v2 v2900929 v886
    v887 = MOVQstoreconst <mem> [val=10,off=24] v2 v2374717
    v888 = CALLstatic <mem> {runtime.mapassign_faststr} [40] v887
    v889 = MOVQload <**Targets> [32] v2 v888 : AX
    v890 = LoweredNilCheck <void> v889 v888
    v1968260 = MOVLload <uint32> {runtime.writeBarrier} v3 v888 : CX
    v2705004 = TESTL <flags> v1968260 v1968260
    NE v2705004 -> b1772 b1773 (unlikely)
gopherbot commented 7 years ago

CL https://golang.org/cl/41192 mentions this issue.

josharian commented 7 years ago

A possible alternative to splitting these init functions up (which I have tried and failed at multiple times) is to avoid generating some of these problematic write barrier calls by eliminating the makemap calls: emit some static empty map buckets to fill in instead.

gopherbot commented 7 years ago

CL https://golang.org/cl/42178 mentions this issue.

josharian commented 7 years ago

Looks like CL 42178 is unlikely to make 1.9. Moving to 1.10.