Goal:
Do a two-step bot upgrade to ensure broken bot code is not used on the next
reboot. This avoids losing bots when broken bot code is deployed and lets
them self-heal after good code is uploaded.
How:
Define LKGBC (Last Known Good Bot Code) that is updated only after a "basic
health test" passes. In our case the basic health test is that it can poll for
a new task successfully.
- LKGBC is swarming_bot.zip.
- Current candidate code is swarming_bot.1.zip.
Failures can happen due to multiple factors:
- Bot code is broken.
- Server is broken and sent a corrupted or non-functional zip.
- Admin-provided start_slave.py is broken, causing the bot to fail to start.
Flow:
- Bootstrapping is always done with swarming_bot.zip. The bot no longer starts
another script, as it currently does.
- When swarming_bot.zip starts, it runs normally.
- When UpdateSlave is received, the bot downloads the new code as
swarming_bot.staging.zip and runs it. swarming_bot.zip is not touched.
- swarming_bot.staging.zip tries to query for a task (or does whatever is
necessary to perform a health check). Only *after* the query succeeds
(regardless of whether a task was retrieved), it copies itself back to
swarming_bot.zip as a successful upgrade and continues running.
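
The promotion step in the flow above could look roughly like the sketch below. This is only an illustration under assumptions: `promote_candidate` and its parameters are hypothetical names, not existing bot code, and the copy-then-rename approach is one possible way to avoid leaving a truncated swarming_bot.zip behind if the bot dies mid-copy.

```python
import os
import shutil


def promote_candidate(candidate_zip, lkgbc_zip, health_check):
    """Copies candidate_zip over lkgbc_zip only if health_check() passes.

    All names here are hypothetical. Returns True when the candidate was
    promoted, False when the last known good code was left untouched.
    """
    try:
        healthy = bool(health_check())
    except Exception:
        # Any failure in the health check means the candidate is not trusted.
        healthy = False
    if not healthy:
        return False
    # Copy to a temporary name first, then rename over the target, so an
    # interrupted copy never replaces the good zip with a partial file.
    tmp = lkgbc_zip + '.tmp'
    shutil.copyfile(candidate_zip, tmp)
    os.replace(tmp, lkgbc_zip)
    return True
```

With this shape, a reboot at any point either finds the old swarming_bot.zip or the fully copied new one.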
It's a bit tricky to define the health check. It must be good enough to confirm
that the new code can:
- Start without crashing.
- Access the server successfully.
- Perform an upgrade afterward.
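
As a rough sketch of the simplest form of such a check, the function below treats any poll that completes without raising as a pass, whether or not a task came back. `basic_health_check` and `poll_for_task` are hypothetical names used for illustration. It only covers the first two criteria (the code ran and reached the server); it does not verify that a later upgrade would work.

```python
def basic_health_check(poll_for_task):
    """Returns True if polling the server completes without raising.

    poll_for_task is a hypothetical callable that contacts the server and
    returns a task or None. The return value is deliberately ignored:
    getting no task still proves the bot can talk to the server.
    """
    try:
        poll_for_task()
    except Exception:
        return False
    return True
```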
Original issue reported on code.google.com by maruel@chromium.org on 10 Jun 2014 at 1:17