servo / saltfs

Salt Stack Filesystem
Apache License 2.0

Handle Buildbot restarting more robustly #304

Open aneeshusa opened 8 years ago

aneeshusa commented 8 years ago

When updating the Buildbot configuration, we need to wait for Buildbot to not be executing any jobs before we can safely restart it.

See discussion in #300.

Apparently there is a way to reload the Buildbot configuration (via SIGHUP or buildbot reconfig) instead of restarting Buildbot, but it's fragile, so I'd prefer not to do that: http://docs.buildbot.net/current/manual/cfg-intro.html?highlight=reconfig#reloading-the-config-file-reconfig

Just to be clear, this is all for the Buildbot master config + service, not the builder machines, yes?

larsbergstrom commented 8 years ago

That's correct - I'm not as worried about the builder machines (personally) as we don't change their configuration very often.

cc @edunham

aneeshusa commented 8 years ago

A key component of a robust automated solution will likely be waiting until Buildbot has no open jobs. A few questions:

If the latency here is short, I'd look towards a solution that integrates the waiting time into the highstate sequence. If the latency is longer (or likely to increase in the future), I'd prefer to do this more asynchronously - the Salt event bus should make this easy to do.
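The "wait until no jobs are running" step can be sketched as a small polling loop. This is only an illustration under stated assumptions: `wait_until_idle` and `count_running_builds` are hypothetical names, and how the running-build count is actually obtained from the Buildbot master is left to the caller. The defaults for `poll_interval` and `timeout` are placeholders, not measured values.

```python
import time
from typing import Callable


def wait_until_idle(
    count_running_builds: Callable[[], int],
    poll_interval: float = 30.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until no builds are running.

    Returns True once the (caller-supplied, hypothetical) counter reports
    zero running builds, or False if the timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        if count_running_builds() == 0:
            return True  # safe to restart now
        if clock() >= deadline:
            return False  # still busy; let the caller decide what to do
        sleep(poll_interval)
```

A short-latency setup could call this inline before the restart state runs; with longer latency, the same check could run out of band and report back when it returns.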

Buildbot masters also seem to have a multimaster mode that could help make these transitions more seamless: https://docs.buildbot.net/current/manual/cfg-global.html#multi-master-mode

Bonus points if we can rig up a "Buildbot is restarting..." message to be shown via nginx (i.e. also inform nginx of Buildbot up/down times).

aneeshusa commented 8 years ago

Another consideration is that the Ubuntu machines (running Trusty) currently use Upstart for service management, but newer Ubuntu releases use systemd instead. It would be ideal if the chosen solution were init-agnostic, or at least had minimal coupling.

larsbergstrom commented 8 years ago

I've had more luck with:

# su - servo
# buildbot restart --clean --nodaemon /home/servo/buildbot/master &

The only issue has been ensuring that it really is run as the correct user, which I think is much easier to do in Salt? This leaves a process running that will send a SIGHUP once the current job finishes, which I think is the most foolproof way to get the changes rolled out.
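One way to guard against running the restart as the wrong account is to check the effective user before building the command. The sketch below is an assumption-laden illustration, not how saltfs does it: `current_username` and `restart_command` are hypothetical helpers, and the only Buildbot arguments used are the ones from the command above.

```python
import os
import pwd
from typing import List, Optional


def current_username() -> str:
    """Return the login name for this process's effective UID."""
    return pwd.getpwuid(os.geteuid()).pw_name


def restart_command(
    expected_user: str,
    master_dir: str,
    actual_user: Optional[str] = None,
) -> List[str]:
    """Build the restart argv, refusing to proceed under the wrong account.

    `actual_user` can be passed explicitly (e.g. in tests); by default it is
    looked up from the effective UID.
    """
    if actual_user is None:
        actual_user = current_username()
    if actual_user != expected_user:
        raise PermissionError(
            f"running as {actual_user!r}, expected {expected_user!r}"
        )
    # Same arguments as the manual command quoted above.
    return ["buildbot", "restart", "--clean", "--nodaemon", master_dir]
```

The returned list could then be handed to something like `subprocess.Popen` so the clean restart keeps waiting in the background.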

It usually takes about 45-50 minutes for a given job to complete. Our homu job queue is between empty and 10 items deep at any time, and it's hard to predict when those times are :-) I'm a little afraid of something that takes down homu to let the buildbot job end, because homu also handles all of the other queues on our other servo org repos.

Does that sound reasonable? I don't think it plays well with upstart/systemd, though.