dvyukov opened this issue 9 years ago
@dvyukov are you planning to migrate that CL to gerrit and mail? Does it speed up make.bash as well?
Is anybody planning to review it to +2 state?
I promise to review it thoroughly (although I'll be in and out until late March). And rsc asked for it to be mailed. If you don't want to bother, do you mind if I take it over?
The existing CL looks possibly incomplete. assignPriorities contains (lines 648-649):

    if a.priority < a1.priority+1 {
    }
Take it over.
That if should be:

    if a.priority < a1.priority+a.weight {
        a.priority = a1.priority + a.weight
    }
Also, weight must be more sensible, that is, represent the real cost of an action. Something along the lines of: link > compile, run test > link. Fake, instant actions should have weight << everything else. The goal here is to avoid a serial tail of actions that run one after another, delaying the whole thing.
Ideally, 'run test' weights would represent actual execution time, so that the runtime test starts first rather than starting late and delaying all.bash. This may be possible in a 'go test' continuous mode that monitors filesystem changes. But that is, of course, not part of this change.
And compiling anything involving cgo should be even heavier weight than compiling any other packages.
If we reorder tests, do we need to preserve the order of showing the results? I think we tried hard to show a sorted result for "go test -short std".
And compiling anything involving cgo should be even heavier weight than compiling any other packages.
good idea
If we reorder tests, do we need to preserve the order of showing the results?
Absolutely, but that is already ensured by chaining printing actions with dependencies; otherwise they would be printed out of order even today with p>1. So you don't need to do anything special here.
I still plan to take over this CL. I am just waiting for the compiler performance to stabilize a bit, so that I can get more realistic measurements of the relative speeds of different kinds of actions.
This is pretty high priority for our all-compile builder. I increased it from 2 cores to 16 cores and we got almost no additional speed out of the builder.
I'll put this back in my queue (next step: attaching some more realistic weights), but I'm unlikely to make much progress this week.
In the meantime, if you pull in https://go-review.googlesource.com/#/c/8918 as-is, does it help? It didn't show much improvement for me locally.
@bradfitz any numbers on that CL on a big machine?
@dvyukov @minux I spent a bit of time looking at tool exec times for 6g, 6l, and asm. asm is very fast, linking is very slow, but 6g has a huge amount of variability--two orders of magnitude. The variability goes down a tiny bit if you look at the number of files to compile, but not much. The variability is down to one order of magnitude if you look at the number of bytes to be compiled, but I don't think that we want to stat every file before getting started. Any other ideas for things to look at? As is, I think the best we can do for weights is very rough estimates, like fake=1, asm=10, gc=100, link=1000, cgo=1000. Any other ideas?
We already do stat every file before getting started (to compute staleness). So maybe that information can be saved somewhere?
Number of files looks like a good first approximation. Does this thing help with 8-16 cores?
Probably also open files to read build tags.
Sorry, I haven't had any free time to test this.
No prob. I'm in the middle of running some tests now, actually, and it looks better than before. Expect a minimal CL and more comments here later today.
My earlier optimism was the result of a bug.
I have 8 logical (4 real) CPUs on my laptop. I spent the afternoon running three build tests in a loop: go build -a std, go test -short std, and all.bash. I had everything else shut down, measured wall time, and ran the results through a t-test. At 50 iterations (20 for all.bash, which is SLOW), the CL showed no statistically significant improvement, and the means were within 1%.
I then hacked in the best available predictor: storing previous action times on disk. Even with almost perfect weights, I got pretty much the same results.
Furthermore, this re-ordering hurts the experience of running all.bash, as there are long stalls between test outputs, which then come in spurts.
Maybe this helps with lots more cores than I have--Brad, I guess I would like your help confirming that after all. As it is, though, I don't feel comfortable mailing the CL, since I have no way to demonstrate its value.
What has changed? Dmitry's initial post reported up to a 2x improvement on go test -a std. Is it because the compiler is now written in Go and is slower?
Dmitry's initial post said 2x speed-up on a 32 core machine. Maybe my machine (which is probably typical) doesn't have enough cores to benefit.
Where is the CL? I can test on a 16-way machine.
CL 8918. It is a straight port of Dmitry's original CL.
My initial observations: time taken for go test -a std is not a good performance measure for this change, because there are some packages (e.g. math/big, runtime) that take a very long time to test, so the serial portion of the runtime is significantly skewed by those packages.
Comparing the time of go install -a std cmd on the 16-way machine showed no significant impact on run time (~28s, CPU usage ~700%). Comparing the time of go install -a std cmd golang.org/x/tools/cmd/godoc on the same machine still doesn't show much impact (~34s, CPU usage ~600%).
Critical path scheduling will only show a benefit where there are many floating tasks with positive slack. Perhaps the dependency graph has changed a lot after the Go rewrite of the toolchain? Or perhaps there are unneeded dependency edges in the graph?
I've done some testing using time go build -a -p 4 std.
Before:
real 0m11.527s
real 0m11.478s
real 0m11.353s
real 0m11.544s
real 0m11.586s
After:
real 0m10.443s
real 0m10.496s
real 0m10.487s
real 0m10.348s
real 0m10.451s
There is a clear 10% win. I've also modified cmd/go to print the number of executing actions every 500ms to confirm the win. Before:
EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 2
EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 2
After:
EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 1
EXECUTING 1
EXECUTING 4
There is a clear difference toward the end of the build -- the old code leads to lower parallelism because earlier actions were scheduled badly.
Also, the win depends highly on the structure of the work graph. It is possible to come up with a dependency structure that the old code handles arbitrarily inefficiently, so some users may see a higher speedup than we see on the std lib. Moreover, scheduling currently depends on package names (or how does it order them now?), which is just weird: you can increase or reduce build speed by renaming packages.
Also, better scheduling will have more impact when we make actions finer-grained, in particular by splitting cgo compilation and maybe asm compilation.
So I am sure that this is a positive change and we should do it. The additional complexity is minimal and very localized.
There is clear 10% win.
I don't know why I can't replicate that speed-up. Maybe you should take the CL back over, with my apologies. It doesn't make sense for me to continue to work on it when I cannot observe its impact (and thus can't tell whether my changes hurt or help).
better scheduling will have more impact when we make actions finer-grained.
Agreed. It is very coarse right now. And having thought about it more, simply saving the execution times of past identical actions might not be a bad long-term strategy for weight estimation.
In particular split cgo compilation and maybe asm compilation.
asm compilation is very fast. At least in the stdlib, there's very little cgo, so I don't think this will impact all.bash much, but it could make a big difference for heavy cgo users.
@josharian have you tried running exactly "time go build -a -p 4 std"?
What does the following debug output show on your machine?
diff --git a/src/cmd/go/build.go b/src/cmd/go/build.go
index fda126b..ceca4c5 100644
--- a/src/cmd/go/build.go
+++ b/src/cmd/go/build.go
@@ -18,6 +18,7 @@ import (
"os"
"os/exec"
"path"
"path/filepath"
"regexp"
"runtime"
+ "sync/atomic"
@@ -1139,6 +1140,12 @@ func (b *builder) do(root *action) {
if buildN {
par = 1
}
+ var executing uint32
+ go func() {
+ for range time.NewTicker(500 * time.Millisecond).C {
+ fmt.Printf("EXECUTING %v\n", atomic.LoadUint32(&executing))
+ }
+ }()
for i := 0; i < par; i++ {
wg.Add(1)
go func() {
@@ -1154,7 +1161,9 @@ func (b *builder) do(root *action) {
b.exec.Lock()
a := b.ready.pop()
b.exec.Unlock()
+ atomic.AddUint32(&executing, 1)
handle(a)
+ atomic.AddUint32(&executing, ^uint32(0))
case <-interrupted:
setExitStatus(1)
return
Too late for Go 1.5.
I'm spending what time I have on SSA. Hopefully someone else wants to pick this back up.
My WIP CL is 8918.
I fooled around with josharian's CL during a slow evening; I got nowhere, but I'll report my findings, just to add another data point. This is on a 4-core Intel machine.
First of all, I applied dvyukov's patch (the one that prints the number of currently executing jobs every 500ms), and ran go build -a -p 4 std.
Before:
EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 3
EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 1
After:
EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 1
EXECUTING 4
The parallelism in the last part of the job is indeed higher.
Unfortunately, like minux and josharian, I wasn't able to observe any significant speedup. For time go build -a -p 4 std there's no statistical difference between tip and the CL. For make.bash I observed a modest 1.5% speedup. I didn't try all.bash.
Thanks for the null results. They're helpful. Deprioritizing.