bradfitz commented 7 years ago

(Split off #8930, "repair perf builders", the https://perf.golang.org/ work)

This is a tracking bug for running benchmarks in the coordinator, both pre-submit (trybots) and post-submit.

quentinmit commented 7 years ago

I anticipate that pre- and post-submit benchmark runs will be somewhat different (e.g. expected latency, # of iterations, etc.), and post-submit benchmark runs require trend analysis, so I am focusing on pre-submit trybot benchmark runs at the moment.

We've already had several design discussions about this; I'm going to attempt to replay them into this issue for posterity.

bradfitz commented 7 years ago

For trybots:

Currently we:

run make.bash on one buildlet
snapshot the make.bash-built world to GCS
and then shard the tests (go tool dist test --list work units) across N buildlets. The creation of those helper buildlets is timed considering the expected make.bash duration and the expected buildlet creation time, the idea being that we don't hold on to buildlets when we won't need them, but ask for them earlier enough that they're available by the time make.bash is done and we want to run tests.

If machines die, they can resume later just from snapshots, bypassing make.bash. The untarring of the snapshot to the tmpfs is fast.

I think benchmarks should work similarly. They should be another work unit(s), like unit tests, sharded over machines.

quentinmit commented 7 years ago

Perf Trybots Sketch

(reviewed Mar 3 by @bradfitz)

We want to collect benchmark data as part of the normal Gerrit CL process. To effect this, I propose the following modifications to the build coordinator:

Change buildStatus.build() to add if remoteErr == nil && err == nil && st.shouldBench() {st.runBench()} after the tests finish

Implement buildStatus.shouldBench to check:

st.isTry() &&
!st.isSubrepo() &&
benchBuilders[st.name]  // map listing builders to run benchmarks on

Implement buildStatus.runBench() to:
- Download snapshots for the try commit (if not already present) and its base commit
- Build bench binaries for all affected packages
- Run bench binaries and capture the results
- Upload the results to perfdata.golang.org
Initially, have this only happen on the staging coordinator and not on the prod coordinator. I do not want to introduce a separate "Run-Perfbots" Gerrit label, because I want to quickly get to the point where it's not a burden to run for every CL. (Back of the envelope calculations suggest we're already there: running benchmarks require less time than running tests.)
Once we're happy with performance, turn this on in prod and include the result links in the "trybots complete" message that is posted on the commits

Builder choice

Some builders are probably too slow to run benchmarks on. I suggest we initially run benchmarks on:

freebsd-amd64-gce101
linux-386
linux-amd64
linux-arm
windows-amd64-gce This gives us a good swath of goos and goarch, but avoids the slowest of our builders.

@bradfitz's comments: Q: why download a snapshot if you just completed a build & tests? it's already there in $WORKDIR, no? A: I wasn't sure if I was going to be reusing the buildlet that had done the tests. You're right it might be there already.

Q: there's no guarantee [the base commit's snapshot] exists. is that a problem? A: I think we should just not try to run benchmarks if the base commit didn't pass tests. Is there another way snapshots can not exist?

Q: how many iterations are you going to run to get statistically interesting numbers? A: TBD? I'm going to start with 5.

Q: [freebsd-amd64-gce101] is just another amd64 so I'm not sure it helps so much. Seems overkill for trybots. I even think Windows is a bit redundant but justifiable due to its drastic difference. Linux and FreeBSD are relatively similar. Mostly seeing three different architectures is the interesting bit. I think running builders are three different Intel CPU families (in different GCE zones would be even more interesting) A: I think we have enough different in the runtime on BSDs that it's worth running benchmarks, no? Do we as a user get to control which CPU family we get? Also, keep in mind this is for trybots, so a quick sanity check is more important than exhaustive coverage of goos/goarch/CPU type.

quentinmit commented 7 years ago

Email thread with @bradfitz about how to fit benchmarks into the existing work units for make and test.

@quentinmit: I'd like your ideas on how to best handle sequencing for the trySet.

I know you have your goal of posting trybot results in <5m. Obviously, that's simply not enough time to post trybot results, and we can't (I think?) parallelize the benchmarks because we'd like them to come from the same machine. Currently with 5 iterations of the benchmarks it's taking 10-20m to complete.

So I think that leads me to "we should post make + test results first, and then follow up with bench results in a separate Gerrit comment". Do you think that's a reasonable UI? Alternatively, we could include in the existing Gerrit post "benchmarks are being run, ", and have the benchmark results just stream in at that page without making another post when they're done.

The question becomes then how to fit that into trySet. There's an ominous comment in buildStatus.build:

    // bc (aka st.bc) may be invalid past this point, so let's
    // close it to make sure we we don't accidentally use it.

Ideally, we would like to log the eventDone with the make+test results, and then continue on to run benchmarks on the same bc. I think what I'd like to do is move the call to ts.noteBuildComplete from its current location in awaitTryBuild to call it directly from bs.build() right after eventDone is logged. AFAICT the ominous comment above has no teeth, and the bc doesn't actually become invalid until build returns and the context is canceled.

Sorry for the long-winded e-mail. Does this all sound okay?

@bradfitz: There's an open bug (https://github.com/golang/go/issues/13076) about making permanent URLs that work either during or after a build, whether it's successful or failed.

There's been some progress towards that in the past year (unique build IDs and try IDs used mostly for BigQuery), but if we finished that, then we'd have a great place to reference perf data too.

So we'd record the Trybot-Result + comment on Gerrit and then keep running benchmarks. People interested in benchmarks can get to it from the original already-spammed comment on Gerrit. Perhaps it should only do a follow-up comment if the benchmark found notable change.

@quentinmit: What do you think about my proposal for how to implement this, then?

@bradfitz: I would prefer not to modify not complicate the design by putting in reuse at that layer. It would also complicate scheduling later. (https://github.com/golang/go/issues/19178)

I'd prefer it to be a logically separate work unit, but do the buildlet sharing as an optimization via the buildlet pool layer. There's an open bug about that somewhere--- if a buildlet is no longer being used (especially ones that are slow to create) and it's never seen a failure and has no new processes running from when it started, then add it back to the free pool, at least for everything but building a release where we really want things to be hermetic.

@quentinmit: It's not just an optimization to reuse the buildlet, though - we need a buildlet that already has the built go unpacked on disk. I agree that my suggestion is complicated, though.

I could certainly build a mechanism to pass off the buildlet client to a logically separate work unit, but then I also run into the question of how to handle logging. Right now it seems like there's an assumption that there is only one buildStatus for each commit and buildlet type. I think I would have to break that assumption if we want to have the "bench" buildStatus separate from the "make and test" buildStatus.

(Code-wise, the benchmark stuff is already logically separate, it has a single entry point of bs.runBench().)

@bradfitz: Let's not make reusing the files on disk a requirement. The cost of untarring them to the tmpfs from GCS is super tiny.

I'd rather have a simple design (benchmarks always start with a fresh buildlet, as a new work unit) and pay the cost of untarring the work that was just done a moment ago in the uncontended case.

But in the contended case (like this morning), we'd free up the buildlets for important work (trybots, building the release-branch.go1.8 most recent commit, etc), and catch up on benchmarking later when idle.

@quentinmit: I'm talking only about trybot runs - benchmarks are just as important for as test results for trybots. I completely agree for non-trybots that they should be separate and lower-priority than builds.

Okay, with that, I believe that brings this issue up to the current state of discussion.

quentinmit commented 7 years ago

Okay, so, if I understand your proposal correctly, you want us to treat benchmarks as another type of helper buildlet, so we untar the snapshot built by make.bash and then run the benchmarks in a fresh buildlet.

Understood, and that makes sense (bearing in mind the overhead of starting another VM, which can be mitigated with your reuse proposal). In the meantime until we do that, I'd like to just put that logic here into buildStatus, so that if the parent commit is ready, we pass the buildlet client off to runBench to reuse instead of getting a new one. This is much simpler than the generic reuse you proposed, because we already know the buildlet is in an okay state because we know the make and test just finished.

Now, given that, I'm still unsure of exactly how to tie that into the existing buildStatus struct and logging infrastructure. Do you have any suggestions that differ from what I proposed above in e-mail?

bradfitz commented 7 years ago

Okay, so, if I understand your proposal correctly, you want us to treat benchmarks as another type of helper buildlet, so we untar the snapshot built by make.bash and then run the benchmarks in a fresh buildlet.

I want to treat each benchmark as a unit of work that needs to be done, just like a "go tool dist test --list" work unit. I don't care whether it runs on the initial buildlet or on one of the helper buildlets. Ideally they'd run on all N of them, including the original one, in parallel. (I assume we're running more than 1 benchmark, and running it more than 1 time.)

Understood, and that makes sense (bearing in mind the overhead of starting another VM, which can be mitigated with your reuse proposal).

We already have the buildlets available. We just created ~N=5 to run unit tests on, and now they're idle. (Also, the Kubernetes buildlets create in seconds. There's not much overhead there.)

You'd need to modify runAllSharded to not destroy the helper buildlets when they're done with unit tests.

In the meantime until we do that,

Let's not cut corners. I have to maintain this. If there's a dependency to do, do that first.

I'd like to just put that logic here into buildStatus,

The logic for what?

so that if the parent commit is ready, we pass the buildlet client off to runBench to reuse instead of getting a new one.

Reusing the idle buildlets is fine.

Other question: are you only doing benchmarks for the main repo ("go") or also for subrepos?

bradfitz commented 7 years ago

What work does the benchmarking need to do, and what environment does it need?

It needs the output from make.bash. How many benchmarks does it run, and which, and how many iterations? Can the iterations run on different VMs if they're the same processors/sizes?

Does the parent commit need to be submitted? Can we use benchmark output computed previously for the parent, if it's from the same VM processors/sizes?

quentinmit commented 7 years ago

Okay, so, if I understand your proposal correctly, you want us to treat benchmarks as another type of helper buildlet, so we untar the snapshot built by make.bash and then run the benchmarks in a fresh buildlet.

I want to treat each benchmark as a unit of work that needs to be done, just like a "go tool dist test --list" work unit. I don't care whether it runs on the initial buildlet or on one of the helper buildlets. Ideally they'd run on all N of them, including the original one, in parallel. (I assume we're running more than 1 benchmark, and running it more than 1 time.)

Okay, yes. I am starting with the assumption that we need to run all benchmarks on a single buildlet, so that the results are comparable. With the CL as currently implemented, we are running 4 separate benchmark binaries, and we are running each of them 5 times. At a minimum, I think we need to make sure we run the same set of benchmarks on each buildlet (so we could shard it 5 ways, but not 20 ways). I don't know yet how much variance that introduces.

Understood, and that makes sense (bearing in mind the overhead of starting another VM, which can be mitigated with your reuse proposal).

We already have the buildlets available. We just created ~N=5 to run unit tests on, and now they're idle. (Also, the Kubernetes buildlets create in seconds. There's not much overhead there.)

You'd need to modify runAllSharded to not destroy the helper buildlets when they're done with unit tests.

I think this conflicts with the fact that we may need to wait for the parent commit to make + test, and we don't want to tie up buildlets while waiting.

I'd like to just put that logic here into buildStatus,

The logic for what?

The logic for deciding whether a buildlet can be reused for benchmarks.

Other question: are you only doing benchmarks for the main repo ("go") or also for subrepos?

At the moment, only the go repo is in scope. Running benchmarks in subrepos is also interesting, but I'm not yet tackling that because I haven't designed a way to decide which benchmarks to run.

quentinmit commented 7 years ago

What work does the benchmarking need to do, and what environment does it need?

We need to have built benchmark binaries (which could have been cross-compiled, I suppose), and then we just need to run them on a buildlet. We do require a GOROOT, because some of the benchmarks try to (re)compile parts of the go distribution.

It needs the output from make.bash. How many benchmarks does it run, and which, and how many iterations? Can the iterations run on different VMs if they're the same processors/sizes?

I have the assumption that the iterations can't run on different VMs, but see above.

Does the parent commit need to be submitted? Can we use benchmark output computed previously for the parent, if it's from the same VM processors/sizes?

The parent commit does not need to be submitted. It does need to have been built and tested. We can't use previously computed benchmark output because there is far too much variation on GCE. (We could of course use a pre-compiled benchmark binary for the parent, however.)

josharian commented 7 years ago

bradfitz commented 7 years ago

How long do each of the ~5 benchmark binaries take to run?

For a given binary, do you want to run the old-vs-new binaries as:

./old -benchtime=3s
./new -benchtime=3s

Or interleaved old/new like:

./old -benchtime=1s
./new -benchtime=1s
./old -benchtime=1s
./new -benchtime=1s
./old -benchtime=1s
./new -benchtime=1s

? I can see some advantage of interleaving them if the run times are long and something unrelated on the VM (different customer) changes performance somehow.

Implementation wise: if you want this part of trybots, we really need to work to minimize overlap what runs when.

The benchmarks can be sharded, and since we already have sharded testing buildlets, that won't add much.

I'm more concerned about the extra make.bash time (~90 seconds) needed to build the parent commit if it's not built already. I'd like to overlap that make.bash on a separate buildlet alongside the main make.bash. But because Windows VMs take ~45 seconds to boot and become usable, you really need to ask for the second buildlet about the same time you ask for the first one. You can't delay your check of whether the parent git commit's make.bash is cached on GCS until later. It would need to be done in (*buildStatus).build in parallel with the existing check. If you find that the cached build isn't available, then you'd need to ask for a second buildlet (from the same pool as (*buildStatus).getHelpersReadySoon later creates).

Refactoring the helper buildlets code from being just about test sharding to being earlier and more general (and letting you ask for just 1 immediately vs the timer creating all N helpers at once) will be the trickiest part.

As for sharding the benchmarks out over buildlets, I would treat them as normal test units, modifying (*buildStatus).distTestList somehow to either return a better "work unit" type instead of the current []string of dist test names, or keep it as is and make a higher level "(*buildStatus).getShardedTestWork() []tbItem" sort of test-or-bench work unit, and havegetShardedTestWorkcalldistTestList` and then append its ~5 benchmark work units.

Then you'd modify (*buildStatus).runTests (the thing that coordinates the execution over sharded buildlets and shutting down buildlets at the end) to instead call getShardedTestWork. Some work units will be tests, and some will be benchmarks. The first benchmark work unit would wait for the make.bash to be done, and the first benchmark work unit per buildlet would untar the snapshotted parent-commit.tar.gz into a different top-level $WORKDIR directory, like $WORKDIR/parent-go instead of $WORKDIR/go.

I drew a picture of how the sharding works for tests earlier today and @adams-sarah took a picture of it but I don't think it was a great drawing on its own. I can probably make a Gantt chart of each buildlet and what it's doing if that'd help.

s-mang commented 7 years ago

josharian commented 7 years ago

As they say, a caption is worth a thousand words.

bradfitz commented 7 years ago

I'll recreate it with words. Thanks, @adams-sarah.

quentinmit commented 7 years ago

How long do each of the ~5 benchmark binaries take to run?

I've been logging that with spans and we have a fair amount of data now. Do you have any reporting tools or should I just dig in the datastore myself?

From a quick glance, it looks like the go1 binary is the worst, at ~60s per run. The other benchmark binaries are around 10-20s each.

(go1 contains many benchmarks, of course, so we can split it in half to get 30s, or in three to get 20s.)

For a given binary, do you want to run the old-vs-new binaries or interleaved old/new like: ? I can see some advantage of interleaving them if the run times are long and something unrelated on the VM (different customer) changes performance somehow.

I agree, we definitely want to interleave them.

Implementation wise: if you want this part of trybots, we really need to work to minimize overlap what runs when.

I don't know what you mean by "minimize overlap". Do you mean that we should try to schedule Kube pods on different hosts?

I'm more concerned about the extra make.bash time (~90 seconds) needed to build the parent commit if it's not built already. I'd like to overlap that make.bash on a separate buildlet alongside the main make.bash. But because Windows VMs take ~45 seconds to boot and become usable, you really need to ask for the second buildlet about the same time you ask for the first one. You can't delay your check of whether the parent git commit's make.bash is cached on GCS until later. It would need to be done in (*buildStatus).build in parallel with the existing check. If you find that the cached build isn't available, then you'd need to ask for a second buildlet (from the same pool as (*buildStatus).getHelpersReadySoon later creates).

I don't think it makes sense to rebuild the parent commit if a snapshot is not available. We don't even necessarily know where to fetch it from (since it could be in a separate Gerrit CL, or on master). I think we should just wait until the parent commit has been built normally. This means we might take longer than 5m if a whole train of CLs is pushed at once, but I think that's a reasonable tradeoff (and we might wait a while in that case anyway before we can get a buildlet to run tests, too). When buildlet scheduling is implemented, we can teach the scheduling to run CLs in order from oldest to newest.

Refactoring the helper buildlets code from being just about test sharding to being earlier and more general (and letting you ask for just 1 immediately vs the timer creating all N helpers at once) will be the trickiest part.

Yeah, that makes sense. Initially, I'd like to punt on this change and instead just use a completely separate buildlet for benchmarks in parallel with the sharded test buildlets. Then reusing the same buildlets just becomes an optimization.

bradfitz commented 7 years ago

I've been logging that with spans and we have a fair amount of data now. Do you have any reporting tools or should I just dig in the datastore myself?

x/build/cmd/buildstats -sync will sync from datastore to bigquery, and then you can write SQL.

(go1 contains many benchmarks, of course, so we can split it in half to get 30s, or in three to get 20s.)

Yeah, we'll want to do that.

I don't know what you mean by "minimize overlap". Do you mean that we should try to schedule Kube pods on different hosts?

Er, sorry, that sentence made little sense. I meant we want to maximize overlap, to not have idle buildlets while we do many steps in serial. We want to take advantage of potential parallelism, as it currently does.

I don't think it makes sense to rebuild the parent commit if a snapshot is not available. We don't even necessarily know where to fetch it from (since it could be in a separate Gerrit CL, or on master).

You just fetch it from the gitmirror, which will return it regardless of where it is.

I think we should just wait until the parent commit has been built normally.

Idling while waiting for a buildlet isn't acceptable. We're not going to do that.

When buildlet scheduling is implemented, we can teach the scheduling to run CLs in order from oldest to newest.

That's actually not the priority that was proposed in #19178. Recent things are more important than things 5 commits back.

Yeah, that makes sense. Initially, I'd like to punt on this change and instead just use a completely separate buildlet for benchmarks in parallel with the sharded test buildlets.

Sure, how about your use one buildlet (maybe total for now, until #19178 can prioritize buildlet users) and do your work totally async with the trybots. The trybots can write "This will eventually be available at $URL", and users can just poll that URL, or your process can leave a comment saying the results are available, if it detected anything interesting.