golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

cmd/go: critical path scheduling #8893

Open dvyukov opened 9 years ago

dvyukov commented 9 years ago
Currently the go tool schedules work in topological order, which leads to suboptimal
parallelization. Towards the end of a build, very few work items may be available,
leaving processors idle.

Critical path scheduling gives much better results in this respect:
http://en.wikipedia.org/wiki/Critical_path_method

Here is a prototype CL:
https://golang.org/cl/13235046/diff/3001/src/cmd/go/build.go
I've observed up to 2x 'go test -a std' speedup on a 32-way machine.

We can also assign meaningful weights to work nodes: e.g. test > link > compile.
It would also be super useful to somehow assign weights to test nodes proportional to
test run time (e.g. runtime tests take significantly longer than most packages). This
will have a significant effect even on low-core machines (as runtime tests will start
first).
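
For illustration, here is a minimal sketch of the technique on a toy action graph. The action type, fields, and scheduler loop below are illustrative, not cmd/go's actual ones (nor necessarily how the prototype CL implements it): each action's priority is the length of the longest chain of work still depending on it, and workers always pick the ready action with the highest priority instead of the next one in topological order.

```go
package main

import (
	"container/heap"
	"fmt"
)

// action is an illustrative stand-in for cmd/go's build action node.
type action struct {
	name       string
	deps       []*action // must finish before this action can run
	dependents []*action // actions waiting on this one
	priority   int       // length of the longest chain of dependents, plus itself
	pending    int       // number of unfinished deps
}

// assignPriorities computes critical-path lengths: an action's priority is
// one unit for itself plus the longest chain of dependents above it.
func assignPriorities(a *action, memo map[*action]int) int {
	if p, ok := memo[a]; ok {
		return p
	}
	best := 0
	for _, d := range a.dependents {
		if p := assignPriorities(d, memo); p > best {
			best = p
		}
	}
	a.priority = best + 1
	memo[a] = a.priority
	return a.priority
}

// readyQueue is a max-heap of ready actions, highest priority first.
type readyQueue []*action

func (q readyQueue) Len() int            { return len(q) }
func (q readyQueue) Less(i, j int) bool  { return q[i].priority > q[j].priority }
func (q readyQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *readyQueue) Push(x interface{}) { *q = append(*q, x.(*action)) }
func (q *readyQueue) Pop() interface{} {
	old := *q
	a := old[len(old)-1]
	*q = old[:len(old)-1]
	return a
}

func main() {
	// Toy graph: compile a -> link -> run test, plus an unrelated compile b.
	compileA := &action{name: "compile a"}
	compileB := &action{name: "compile b"}
	link := &action{name: "link a.test", deps: []*action{compileA}}
	test := &action{name: "run a.test", deps: []*action{link}}
	all := []*action{compileA, compileB, link, test}

	for _, a := range all {
		a.pending = len(a.deps)
		for _, d := range a.deps {
			d.dependents = append(d.dependents, a)
		}
	}
	memo := map[*action]int{}
	for _, a := range all {
		assignPriorities(a, memo)
	}

	ready := &readyQueue{}
	for _, a := range all {
		if a.pending == 0 {
			heap.Push(ready, a)
		}
	}
	// Serial "execution" for clarity: compile a (priority 3) is picked before
	// compile b (priority 1), so the long chain starts as early as possible.
	for ready.Len() > 0 {
		a := heap.Pop(ready).(*action)
		fmt.Println("run:", a.name, "priority:", a.priority)
		for _, d := range a.dependents {
			d.pending--
			if d.pending == 0 {
				heap.Push(ready, d)
			}
		}
	}
}
```

With critical-path priorities the long compile/link/test chain starts before the independent compile, so its length is hidden behind the other work instead of forming a serial tail.
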
rsc commented 9 years ago

Comment 1:

Please send CL after Go 1.4 is out.

Labels changed: added release-go1.5, removed release-none.

Status changed to Accepted.

josharian commented 9 years ago

@dvyukov are you planning to migrate that CL to gerrit and mail? Does it speed up make.bash as well?

dvyukov commented 9 years ago

Is anybody planning to review it to +2 state?

josharian commented 9 years ago

I promise to review it thoroughly (although I'll be in and out until late March). And rsc asked for it to be mailed. If you don't want to bother, do you mind if I take it over?

The existing CL looks possibly incomplete. assignPriorities contains (lines 648-649):

if a.priority < a1.priority+1 {
}
dvyukov commented 9 years ago

Take it over.

That if should be:

if a.priority < a1.priority+a.weight { a.priority = a1.priority+a.weight }

Also, weights must be more sensible, that is, represent the real cost of an action: something along the lines of run test > link > compile. Fake, instant actions should have weight << everything else. The goal here is to avoid a serial tail of actions that run one after another, delaying the whole thing.
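
Reusing the illustrative action type from the earlier sketch (now extended with an integer weight field), the corrected propagation might look roughly like the following. The function name mirrors the CL's assignPriorities, but the loop structure and weight constants are invented for illustration, not the CL's actual code:

```go
// Illustrative weights following the ordering above: run test > link > compile,
// with fake bookkeeping actions far cheaper than everything else.
const (
	weightFake    = 1
	weightCompile = 100
	weightLink    = 1000
	weightRunTest = 2000
)

// assignPriorities visits every action only after all of its dependents
// (reverse topological order) and propagates weighted critical-path lengths,
// using the corrected update from above.
func assignPriorities(actions []*action) {
	for _, a := range actions {
		a.priority = a.weight // at minimum, its own cost
		for _, a1 := range a.dependents {
			if a.priority < a1.priority+a.weight {
				a.priority = a1.priority + a.weight
			}
		}
	}
}
```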

Ideally the 'run test' weights would represent actual execution time, so that the runtime test starts first rather than starting late and delaying all.bash. This may be possible in a 'go test' continuous mode that monitors filesystem changes. But that's of course not a part of this change.

minux commented 9 years ago

And compiling anything involving cgo should be even heavier weight than compiling any other packages.

If we reorder tests, do we need to preserve the order of showing the results? I think we tried hard to show a sorted result for "go test -short std".

dvyukov commented 9 years ago

And compiling anything involving cgo should be even heavier weight than compiling any other packages.

good idea

If we reorder tests, do we need to preserve the order of showing the results?

Absolutely, but that is already ensured by chaining printing actions with dependencies; otherwise they would be printed out of order even today with -p > 1. So you don't need to do anything special here.
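
For illustration, that chaining could look like the sketch below (again reusing the illustrative action type from the first sketch): each test's print action depends both on its own test and on the previous print action, so results appear in the requested package order no matter which test finishes first.

```go
// buildPrintChain creates one print action per test and chains them so that
// printing happens in the original package order, even when the tests
// themselves finish out of order. Illustrative only.
func buildPrintChain(tests []*action) *action {
	var prev *action
	for _, t := range tests {
		p := &action{name: "print " + t.name, deps: []*action{t}}
		if prev != nil {
			p.deps = append(p.deps, prev) // serialize printing
		}
		prev = p
	}
	return prev // the final print action; the build root can depend on it
}
```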

josharian commented 9 years ago

I still plan to take over this CL. I am just waiting for the compiler performance to stabilize a bit, so that I can get more realistic measurements of the relative speeds of different kinds of actions.

bradfitz commented 9 years ago

This is pretty high priority for our all-compile builder. I increased it from 2 cores to 16 cores and we got almost no additional speed out of the builder.

josharian commented 9 years ago

I'll put this back in my queue (next step is attaching some more realistic weights), but I'm unlikely to make much progress this week.

In the meantime, if you pull in https://go-review.googlesource.com/#/c/8918 as-is, does it help? It didn't show much improvement for me locally.

josharian commented 9 years ago

@bradfitz any numbers on that CL on a big machine?

@dvyukov @minux I spent a bit of time looking at tool exec times for 6g, 6l, and asm. asm is very fast, linking is very slow, but 6g has a huge amount of variability--two orders of magnitude. The variability goes down a tiny bit if you look at the number of files to compile, but not much. The variability is down to one order of magnitude if you look at the number of bytes to be compiled, but I don't think that we want to stat every file before getting started. Any other ideas for things to look at? As is, I think the best we can do for weights is very rough estimates, like fake=1, asm=10, gc=100, link=1000, cgo=1000. Any other ideas?
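
As a rough sketch, those estimates could be expressed as a per-kind weight function; the kind names and values below are the rough guesses from this comment, not anything cmd/go actually uses:

```go
// actionWeight returns a very rough relative cost for an action kind.
func actionWeight(kind string) int {
	switch kind {
	case "asm":
		return 10
	case "gc": // compile Go source
		return 100
	case "link", "cgo":
		return 1000
	default: // fake/bookkeeping actions
		return 1
	}
}
```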

mwhudson commented 9 years ago

We already do stat every file before getting started (to compute staleness). So maybe that information can be saved somewhere?
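
If that stat information were kept around, a compile action's weight could be estimated from total source bytes, which per the numbers above tracks compile time better than the file count. A hypothetical sketch (here it simply stats again rather than reusing cached results):

```go
package main

import (
	"fmt"
	"os"
)

// compileWeight estimates a relative compile weight from the total size of
// the source files. In cmd/go this size information could come from the
// stat calls already made for staleness checks.
func compileWeight(files []string) int64 {
	var total int64
	for _, f := range files {
		if fi, err := os.Stat(f); err == nil {
			total += fi.Size()
		}
	}
	return total
}

func main() {
	fmt.Println("weight:", compileWeight(os.Args[1:]))
}
```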

dvyukov commented 9 years ago

The number of files looks like a good first approximation. Does this help with 8-16 cores?

dvyukov commented 9 years ago

We probably also open the files to read build tags.

bradfitz commented 9 years ago

Sorry, I haven't had any free time to test this.

josharian commented 9 years ago

No prob. I'm in the middle of running some tests now, actually, and it looks better than before. Expect a minimal CL and more comments here later today.

josharian commented 9 years ago

My earlier optimism was the result of a bug.

I have 8 logical (4 real) CPUs on my laptop. It spent the afternoon running three build tests in a loop: go build -a std, go test -short std, all.bash. I had everything else shut down, measured wall time, and ran the results through a t-test. At 50 iterations (20 for all.bash, which is SLOW), the CL showed no statistically significant improvement, and the means were within 1%.

I then hacked in the best available predictor: Store previous action times on disk. Even with almost perfect weights, I got pretty much the same results.

Furthermore, this re-ordering hurts the experience of running all.bash, as there are long stalls between test outputs, which then come in spurts.

Maybe this helps with lots more cores than I have--Brad, I guess I would like your help confirming that after all. As it is, though, I don't feel comfortable mailing the CL, since I have no way to demonstrate its value.

minux commented 9 years ago

What has changed? Dmitry's initial post reported an up to 2x improvement on go test -a std. Is it because the compiler is now written in Go and is slower?

josharian commented 9 years ago

Dmitry's initial post said 2x speed-up on a 32 core machine. Maybe my machine (which is probably typical) doesn't have enough cores to benefit.

minux commented 9 years ago

Where is the CL? I can test on a 16-way machine.

josharian commented 9 years ago

CL 8918. It is a straight port of Dmitry's original CL.

minux commented 9 years ago

My initial observations: the time taken for go test -a std is not a good performance measure for this change, because there are some packages (e.g. math/big, runtime) that take a very long time to test, so the serial portion of the runtime is significantly skewed by those packages.

Comparing the time of go install -a std cmd on the 16-way machine showed no significant impact on run time (~28s, CPU usage ~700%).

Comparing the time of go install -a std cmd golang.org/x/tools/cmd/godoc on the same machine still doesn't show much impact (~34s, CPU usage ~600%).

Critical path scheduling will only show a benefit when there are many floating tasks with positive slack (tasks whose start can be delayed without lengthening the critical path). Perhaps the dependency graph has changed a lot after the Go rewrite of the toolchain? Or perhaps there are unneeded dependency edges in the graph?

dvyukov commented 9 years ago

I've done some testing using time go build -a -p 4 std. Before:

real    0m11.527s
real    0m11.478s
real    0m11.353s
real    0m11.544s
real    0m11.586s

After:

real    0m10.443s
real    0m10.496s
real    0m10.487s
real    0m10.348s
real    0m10.451s

There is a clear 10% win. I've also modified cmd/go to print the number of executing actions every 500ms to confirm the win. Before:

EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 2
EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 2

After:

EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 1
EXECUTING 1
EXECUTING 4

There is a clear difference towards the end of the build -- the old code leads to lower parallelism because earlier actions were scheduled badly.

Also, the win depends heavily on the structure of the work graph. It is possible to come up with a dependency structure that the old code handles arbitrarily inefficiently, so some users may see a higher speedup than we see on the standard library. (For example, with one long chain of actions plus many independent ones, scheduling the chain last adds roughly the chain's whole length to the build, whereas starting it first hides it behind the independent work.) Moreover, scheduling currently depends on package names (or how does it order them now?), which is just weird: you can increase or reduce build speed by renaming packages.

Also, better scheduling will have more impact when we make actions finer-grained, in particular splitting cgo compilation and maybe asm compilation.

So I am sure that this is a positive change and we should do it. The additional complexity is minimal and very localized.

josharian commented 9 years ago

There is a clear 10% win.

I don't know why I can't replicate that speed-up. Maybe you should take the CL back over, with my apologies. It doesn't make sense for me to continue to work on it when I cannot observe its impact (and thus can't tell whether my changes hurt or help).

better scheduling will have more impact when we make actions finer-grained.

Agreed. It is very coarse right now. And having thought about it more, simply saving the execution time of past identical actions might not be a bad long-term strategy for weight estimation.
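
A minimal sketch of that long-term strategy, with an invented key scheme and file format (nothing here reflects an actual cmd/go mechanism): persist how long each action took, then use the last observed duration as that action's weight the next time it is scheduled.

```go
package timings

import (
	"encoding/json"
	"os"
	"time"
)

// history maps an action key (e.g. "gc:net/http" or "test:runtime") to the
// duration observed the last time that action ran.
type history map[string]time.Duration

// load reads previously recorded durations; a missing or corrupt file just
// yields an empty history.
func load(path string) history {
	h := history{}
	if data, err := os.ReadFile(path); err == nil {
		_ = json.Unmarshal(data, &h)
	}
	return h
}

// save writes the recorded durations back to disk.
func (h history) save(path string) error {
	data, err := json.MarshalIndent(h, "", "\t")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// weight returns a scheduling weight for an action, falling back to an
// arbitrary default when there is no recorded timing yet.
func (h history) weight(key string) time.Duration {
	if d, ok := h[key]; ok {
		return d
	}
	return 100 * time.Millisecond
}

// record stores the measured duration after an action finishes.
func (h history) record(key string, d time.Duration) {
	h[key] = d
}
```

Under this scheme the runtime test, for example, would get a large recorded duration after one run and be started earlier on subsequent builds.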

In particular split cgo compilation and maybe asm compilation.

asm compilation is very fast. At least in the stdlib, there's very little cgo, so I don't think this will impact all.bash much, but it could make a big difference for heavy cgo users.

dvyukov commented 9 years ago

@josharian have you tried running exactly "time go build -a -p 4 std"?

What does the following debug output show on your machine?

diff --git a/src/cmd/go/build.go b/src/cmd/go/build.go
index fda126b..ceca4c5 100644
--- a/src/cmd/go/build.go
+++ b/src/cmd/go/build.go
@@ -18,6 +18,7 @@ import (
        "os"
        "os/exec"
        "path"
+       "sync/atomic"
        "path/filepath"
        "regexp"
        "runtime"
@@ -1139,6 +1140,12 @@ func (b *builder) do(root *action) {
        if buildN {
                par = 1
        }
+       var executing uint32
+       go func() {
+               for range time.NewTicker(500*time.Millisecond).C {
+                       fmt.Printf("EXECUTING %v\n", atomic.LoadUint32(&executing))
+               }
+       }()
        for i := 0; i < par; i++ {
                wg.Add(1)
                go func() {
@@ -1154,7 +1161,9 @@ func (b *builder) do(root *action) {
                                        b.exec.Lock()
                                        a := b.ready.pop()
                                        b.exec.Unlock()
+                                       atomic.AddUint32(&executing, 1)
                                        handle(a)
+                                       atomic.AddUint32(&executing, ^uint32(0))
                                case <-interrupted:
                                        setExitStatus(1)
                                        return
rsc commented 9 years ago

Too late for Go 1.5.

josharian commented 9 years ago

I'm spending what time I have on SSA. Hopefully someone else wants to pick this back up.

josharian commented 9 years ago

My WIP CL is 8918.

ALTree commented 8 years ago

I fooled around with josharian's CL during a slow evening. I got nowhere, but I'll report my findings, just to add another data point. This is on a 4-core Intel machine.

First of all, I applied dvyukov's patch (the one that prints the number of currently executing jobs every 500ms) and ran go build -a -p 4 std.

Before:

EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 3
EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 1

After:

EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 1
EXECUTING 4

The parallelism in the last part of the job is indeed higher.

Unfortunately, like minux and josharian, I wasn't able to observe any significant speedup. For time go build -a -p 4 std there's no statistically significant difference between tip and the CL. For make.bash I observed a modest 1.5% speedup. I didn't try all.bash.

rsc commented 8 years ago

Thanks for the null results. They're helpful. Deprioritizing.