myleshorton opened 7 years ago
@josharian here's a new issue originally described at https://github.com/golang/go/issues/14082
Thanks, @myleshorton.
The problematic file is a giant composite literal, a map whose elements contain maps. The compiler converts this into static initialization code in which each element in turn is initialized, and then placed into the top-level map. It all ends up in one giant function. Almost all the time is spent in phi insertion, at least on tip.
The file is 300k lines long. 1k lines take 0.2s to compile. 10k lines take 2s. 50k lines take 30s. That's as far as I got.
I didn't see any memory blow-up in liveness, and most memory is allocated in the usual places (nodes, stackalloc), although tip's liveness implementation is significantly different than 1.8's, so maybe that particular problem is fixed.
Here's one idea. We could break up giant composite literal initialization into multiple functions. Each function by itself would be much faster to compile, and as long as they don't get too granular, I doubt we'd lose much optimization opportunity or slow things down due to call overhead. The main problem here is probably implementation--coordinating compilation and calling of all the extra functions. Maybe a binary splitting approach until each chunk's size is manageable. This has the additional benefit of (theoretically!) allowing concurrent compilation of giant init functions. This suggestion pains me a little, since I'm trying to figure out how to eliminate the convolution caused by the "drop everything and compile this function" thing we currently do all over the compiler, but maybe if I find a good general solution to the problem, this can fit into it.
Thoughts, @randall77 or @mdempsky?
I took a stab at this but haven't gotten anything working yet. One sticking point is that the dynamic entries may refer to other local variables. Another sticking point is that setting up a new function manually is a pain.
In the meantime, @myleshorton, would you mind running an experiment? (I'd do it myself, but I don't see your template/generator code checked in anywhere obvious. Or just point me to it.)
Instead of generating code like:
var targetsDict = map[string]*Targets{
	"": &Targets{
		Plain: map[string]bool{
			"s.lair.io":     true,
			"xserver.*.jp":  true,
			"youtube.co.ke": true,
		},
	},
	"0": &Targets{
		Plain: map[string]bool{
			"bit.0.md": true,
		},
	},
	"0-60": &Targets{
		Plain: map[string]bool{
			"0-60.in": true,
		},
	},
	// ...
}
Try generating:
var targetsDict = map[string]*Targets{}

func init() {
	targetsDict[""] = &Targets{
		Plain: map[string]bool{
			"s.lair.io":     true,
			"xserver.*.jp":  true,
			"youtube.co.ke": true,
		},
	}
	targetsDict["0"] = &Targets{
		Plain: map[string]bool{
			"bit.0.md": true,
		},
	}
	// ...
}

func init() {
	targetsDict["0-60"] = &Targets{
		Plain: map[string]bool{
			"0-60.in": true,
		},
	}
	// ...
}
where each of the init functions contains (say) 100 or 1000 of the entries. Each of the functions should be called init. That should in theory result in the same behavior when compiled, but be much faster to compile. I'd love to hear whether that theory holds water, since it is more or less what I'd like to get the compiler to do automatically. And if it works, you'll be unblocked. :)
I wrote a quick parser and did some manual editing and hacked together a version of targets.go that puts every single targetsDict assignment in its own init function. (This was easier for a quick hack than grouping them into 100s or 1000s.)
This reduced the compilation time to 33 seconds on my machine. (It used to take 22 minutes to finally panic in plive.) So this approach has some promise.
I'd like to understand what isn't linear about the compilation, that makes breaking functions up worthwhile. I'd be much happier if we could find and fix that nonlinearity instead. @josharian You said all the time was in phi insertion. Any particular pattern that leads to nonlinearity?
@randall77 I'm already looking into it. :)
CL https://golang.org/cl/39294 mentions this issue.
CL https://golang.org/cl/39351 mentions this issue.
OK, after the previous commits are in, the compiler panics much faster. Yay!
The panic occurs because there are lots of basic blocks, and we try to bulk allocate a slice with length longer than an int32 can hold. Eep. I'll send a CL soon to at least turn that into a Fatal instead of a panic.
The basic blocks in this case all come from branches from checking whether runtime.writeBarrier is set.
I see lots of possible fixes, but the one I find most appealing is asserting (and ensuring!) that runtime.writeBarrier is off during init, and then assuming as much in the SSA writebarrier pass.
(Other fixes include more gracefully handling lots of blocks in liveness, or following my earlier suggestion of breaking up giant init functions.)
@randall77 @cherrymui opinions about this?
CL https://golang.org/cl/39354 mentions this issue.
We can't assume write barriers are off during init, unfortunately. An init function can have arbitrarily long-running code in it.
What part of the compiler is doing the bulk allocating? The actual block ID allocator will panic if we try to allocate an ID that's too big. Or is something else trying to preallocate and gets too aggressive?
Sounds like we should put some more effort into optimizing phi insertion, at least.
We can't assume write barriers are off during init, unfortunately. An init function can have arbitrarily long-running code in it.
Could we do it during autogenerated inits? IIRC those all complete before user-provided inits run, or at least, they could probably be made to. But I guess maybe that's overkill for this problem.
What part of the compiler is doing the bulk allocating? The actual block ID allocator will panic if we try to allocate an ID that's too big. Or is something else trying to preallocate and gets too aggressive?
It's the call to bvbulkalloc in newliveness. We're calling it with nbit=48645 and count=2074940, so nword=1521 and nword*count=3155983740, leading to the call make([]uint32, 3155983740), which is way too big.
I think it might take some significant refactoring to avoid this.
It might be better and easier to try to split up large init functions, as proposed above. Note that there is some history of doing this manually: https://codereview.appspot.com/72590045. I'd say we could just keep asking users to split them up manually, since any init that big must be autogenerated, but splitting them up also has the benefit of allowing them to be compiled concurrently.
Sounds like we should put some more effort into optimizing phi insertion, at least.
That's definitely true in general, but the CLs that I've submitted so far mostly solve the phi problem, at least for this particular bit of code.
CL https://golang.org/cl/39358 mentions this issue.
Another idea: have static initialization track and reuse autotmps, much as order.go does. Will investigate.
The basic blocks in this case all come from branches from checking whether runtime.writeBarrier is set.
Yes, the write barrier pass creates many blocks. I filed issue #19838 for grouping write barriers more aggressively. But it actually doesn't help here: there are calls (newobject, mapassign_faststr, makemap) between each write barrier.
When I tried to compile this program with gccgo (Ubuntu 6.0.1-0ubuntu1) 6.0.0 20160414 (experimental) [trunk revision 234994], it ran for about 45 minutes, then started allocating my system's memory at such a rate that I was forced to abort.
have static initialization track and reuse autotmps
Challenge with this: We flip back and forth between sinit and walk, so it's hard to maintain state cleanly. Probably still worth doing, but it's going to require some significant refactoring.
have static initialization track and reuse autotmps
I hacked this in with some globals. There's lots of autotmp reuse, but (of course) the exact variable in use disappears by the time we reach liveness anyway, so it doesn't help with the liveness issue. And it doesn't help as much as you'd hope with the speed of getting to the panic. So for now, I think that is a dead end.
Yes, the write barrier pass creates many blocks. I filed issue #19838 for grouping write barriers more aggressively. But it actually doesn't help here: there are calls (newobject, mapassign_faststr, makemap) between each write barrier.
On a second look, I think it would. Though there are calls between each writebarrier-enabled check, calls don't create a new SSA block. Basically all the blocks in this function come from write barriers, and the size of the problematic alloc is directly proportional to the number of blocks.
That's not to say that if we fix this there wouldn't also be problems with the size of the generated stackmap. But it's possible those could be fruitfully addressed (for all functions, not just this one) by doing liveness compaction as we go, rather than all at the end.
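For anyone following along, each guarded pointer store has roughly the shape sketched below, which is why every store contributes a branch (and thus extra blocks) to the function. This is an illustrative sketch only, not actual compiler or runtime code; `writeBarrierEnabled`, `slowPathStores`, and `storePointer` are made-up names.

```go
package main

import "fmt"

// writeBarrierEnabled stands in for the runtime.writeBarrier flag that
// the compiled code checks before every pointer store.
var writeBarrierEnabled bool

// slowPathStores is a made-up counter standing in for the write barrier
// slow path, just so the branch is observable.
var slowPathStores int

// storePointer sketches the two-way branch emitted for a single pointer
// store *p = v. Each such branch adds a slow-path block and a join block,
// so the block count grows linearly with the number of guarded stores.
func storePointer(p **int, v *int) {
	if writeBarrierEnabled {
		slowPathStores++ // stand-in for the barrier slow path
		*p = v
	} else {
		*p = v // plain store, fast path
	}
}

func main() {
	x := 42
	var p *int
	storePointer(&p, &x)
	fmt.Println(*p, slowPathStores) // 42 0
}
```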
Sample SSA output for a random chunk from the middle of this function, at the point that it arrives at liveness:
b1789: <- b1787 b1788
v815 = Phi <mem> v1968012 v1968006
v1977408 = LEAQ <*uint8> {type.map[string]*"".Targets} v3 : AX
v817 = MOVQstore <mem> v2 v1977408 v815
v1977356 = LoadReg <map[string]*Targets> v2901053 : CX
v818 = MOVQstore <mem> [8] v2 v1977356 v817
v1977322 = LEAQ <*uint8> {go.string."0am"} v3 : DX
v2374727 = MOVQstore <mem> [16] v2 v1977322 v818
v819 = MOVQstoreconst <mem> [val=3,off=24] v2 v2374727
v820 = CALLstatic <mem> {runtime.mapassign_faststr} [40] v819
v821 = MOVQload <**Targets> [32] v2 v820 : AX
v822 = LoweredNilCheck <void> v821 v820
v1968089 = MOVLload <uint32> {runtime.writeBarrier} v3 v820 : CX
v2705012 = TESTL <flags> v1968089 v1968089
NE v2705012 -> b1784 b1785 (unlikely)
b1785: <- b1789
v2702048 = LoadReg <*Targets> v1977553 : CX
v1968053 = MOVQstore <mem> v821 v2702048 v820
Plain -> b1786
b1786: <- b1784 b1785
v823 = Phi <mem> v1968054 v1968053
v2702059 = LEAQ <*uint8> {type."".Targets} v3 : AX
v826 = MOVQstore <mem> v2 v2702059 v823
v827 = CALLstatic <mem> {runtime.newobject} [16] v826
v828 = MOVQload <*Targets> [8] v2 v827 : AX
v2702051 = StoreReg <*Targets> v828 : .autotmp_327651[*Targets]
v2900939 = LEAQ <*uint8> {type.map[string]bool} v3 : CX
v832 = MOVQstore <mem> v2 v2900939 v827
v833 = MOVQstoreconst <mem> [val=1,off=8] v2 v832
v834 = MOVQstoreconst <mem> [val=0,off=16] v2 v833
v835 = MOVQstoreconst <mem> [val=0,off=24] v2 v834
v836 = CALLstatic <mem> {runtime.makemap} [40] v835
v837 = MOVQload <map[string]bool> [32] v2 v836 : AX
v1977227 = StoreReg <map[string]bool> v837 : .autotmp_327673[map[string]bool]
v1977261 = LEAQ <*uint8> {type.map[string]bool} v3 : CX
v840 = MOVQstore <mem> v2 v1977261 v836
v841 = MOVQstore <mem> [8] v2 v837 v840
v2702040 = LEAQ <*uint8> {go.string."by.0atz.com"} v3 : DX
v2374725 = MOVQstore <mem> [16] v2 v2702040 v841
v842 = MOVQstoreconst <mem> [val=11,off=24] v2 v2374725
v843 = CALLstatic <mem> {runtime.mapassign_faststr} [40] v842
v844 = MOVQload <*bool> [32] v2 v843 : AX
v846 = MOVBstoreconst <mem> [val=1,off=0] v844 v843
v2900937 = LoadReg <*Targets> v2702051 : AX
v847 = LoweredNilCheck <void> v2900937 v846
v1968131 = MOVLload <uint32> {runtime.writeBarrier} v3 v846 : CX
v848 = ADDQconst <*map[string]bool> [64] v2900937 : DX
v2705010 = TESTL <flags> v1968131 v1968131
NE v2705010 -> b1781 b1782 (unlikely)
b1782: <- b1786
v2702032 = LoadReg <map[string]bool> v1977227 : CX
v1968096 = MOVQstore <mem> [64] v2900937 v2702032 v846
Plain -> b1783
b1783: <- b1781 b1782
v849 = Phi <mem> v1968097 v1968096
v2702043 = LEAQ <*uint8> {type.map[string]*"".Targets} v3 : AX
v851 = MOVQstore <mem> v2 v2702043 v849
v2900935 = LoadReg <map[string]*Targets> v2901053 : CX
v852 = MOVQstore <mem> [8] v2 v2900935 v851
v1977175 = LEAQ <*uint8> {go.string."0atz"} v3 : DX
v2375035 = MOVQstore <mem> [16] v2 v1977175 v852
v853 = MOVQstoreconst <mem> [val=4,off=24] v2 v2375035
v854 = CALLstatic <mem> {runtime.mapassign_faststr} [40] v853
v855 = MOVQload <**Targets> [32] v2 v854 : AX
v856 = LoweredNilCheck <void> v855 v854
v1968174 = MOVLload <uint32> {runtime.writeBarrier} v3 v854 : CX
v2705008 = TESTL <flags> v1968174 v1968174
NE v2705008 -> b1778 b1779 (unlikely)
b1779: <- b1783
v2702024 = LoadReg <*Targets> v2702051 : CX
v1968142 = MOVQstore <mem> v855 v2702024 v854
Plain -> b1780
b1780: <- b1778 b1779
v857 = Phi <mem> v1968150 v1968142
v2702035 = LEAQ <*uint8> {type."".Targets} v3 : AX
v860 = MOVQstore <mem> v2 v2702035 v857
v861 = CALLstatic <mem> {runtime.newobject} [16] v860
v862 = MOVQload <*Targets> [8] v2 v861 : AX
v2702016 = StoreReg <*Targets> v862 : .autotmp_327651[*Targets]
v2900933 = LEAQ <*uint8> {type.map[string]bool} v3 : CX
v866 = MOVQstore <mem> v2 v2900933 v861
v867 = MOVQstoreconst <mem> [val=1,off=8] v2 v866
v868 = MOVQstoreconst <mem> [val=0,off=16] v2 v867
v869 = MOVQstoreconst <mem> [val=0,off=24] v2 v868
v870 = CALLstatic <mem> {runtime.makemap} [40] v869
v871 = MOVQload <map[string]bool> [32] v2 v870 : AX
v2900931 = StoreReg <map[string]bool> v871 : .autotmp_327674[map[string]bool]
v1977132 = LEAQ <*uint8> {type.map[string]bool} v3 : CX
v874 = MOVQstore <mem> v2 v1977132 v870
v875 = MOVQstore <mem> [8] v2 v871 v874
v1977080 = LEAQ <*uint8> {go.string."x.0b00000000.me"} v3 : DX
v2374719 = MOVQstore <mem> [16] v2 v1977080 v875
v876 = MOVQstoreconst <mem> [val=15,off=24] v2 v2374719
v877 = CALLstatic <mem> {runtime.mapassign_faststr} [40] v876
v878 = MOVQload <*bool> [32] v2 v877 : AX
v880 = MOVBstoreconst <mem> [val=1,off=0] v878 v877
v2702027 = LoadReg <*Targets> v2702016 : AX
v881 = LoweredNilCheck <void> v2702027 v880
v1968218 = MOVLload <uint32> {runtime.writeBarrier} v3 v880 : CX
v882 = ADDQconst <*map[string]bool> [64] v2702027 : DX
v2705006 = TESTL <flags> v1968218 v1968218
NE v2705006 -> b1775 b1776 (unlikely)
b1776: <- b1780
v1977046 = LoadReg <map[string]bool> v2900931 : CX
v1968182 = MOVQstore <mem> [64] v2702027 v1977046 v880
Plain -> b1777
b1777: <- b1775 b1776
v883 = Phi <mem> v1968183 v1968182
v2702008 = LEAQ <*uint8> {type.map[string]*"".Targets} v3 : AX
v885 = MOVQstore <mem> v2 v2702008 v883
v2702019 = LoadReg <map[string]*Targets> v2901053 : CX
v886 = MOVQstore <mem> [8] v2 v2702019 v885
v2900929 = LEAQ <*uint8> {go.string."0b00000000"} v3 : DX
v2374717 = MOVQstore <mem> [16] v2 v2900929 v886
v887 = MOVQstoreconst <mem> [val=10,off=24] v2 v2374717
v888 = CALLstatic <mem> {runtime.mapassign_faststr} [40] v887
v889 = MOVQload <**Targets> [32] v2 v888 : AX
v890 = LoweredNilCheck <void> v889 v888
v1968260 = MOVLload <uint32> {runtime.writeBarrier} v3 v888 : CX
v2705004 = TESTL <flags> v1968260 v1968260
NE v2705004 -> b1772 b1773 (unlikely)
CL https://golang.org/cl/41192 mentions this issue.
A possible alternative to splitting these init functions up (which I have tried and failed at multiple times) is to prevent generating some of these problematic write barrier calls by avoiding the makemap calls entirely: emit some static empty map buckets for the maps to fill in.
CL https://golang.org/cl/42178 mentions this issue.
Looks like CL 42178 is unlikely to make 1.9. Moving to 1.10.
What version of Go are you using (go version)?

go version go1.8 darwin/amd64

What operating system and processor architecture are you using (go env)?

What did you do?

Used text/template to generate a very large go file to embed data, and then just ran go build.

What did you expect to see?

A successfully built binary.

What did you see instead?