madecoste / swarming

Automatically exported from code.google.com/p/swarming
Apache License 2.0

Implement LKGBC #112

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Goal:
Do a two-step bot upgrade to ensure broken bot code is not used on the next
reboot. This avoids losing bots when broken bot code is deployed and lets
them self-heal after good code is uploaded.

How:
Define an LKGBC (Last Known Good Bot Code) that is updated only after a "basic
health test" passes. In our case the basic health test is that the bot can poll
for a new task successfully.

- LKGBC is swarming_bot.zip.
- Current candidate code is swarming_bot.1.zip.

Failures can happen due to multiple factors:
- Bot code is broken.
- Server is broken and sent a corrupted/non-functional zip.
- Admin-provided start_slave.py is broken, causing the bot to fail to start.

Flow:
- Bootstrapping is always done with swarming_bot.zip. No other script is
started first, unlike what is currently done.
- When swarming_bot.zip starts, it runs normally.
- When UpdateSlave is received, the bot downloads the new code as
swarming_bot.staging.zip and runs it. swarming_bot.zip is not touched.
- swarming_bot.staging.zip tries to query for a task (or does whatever is
necessary to perform a health check). Only *after* the query succeeds
(regardless of whether a task was retrieved), it copies itself back to
swarming_bot.zip as a successful upgrade and continues running.
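
The promotion step in the flow above could be sketched roughly as follows. This is only an illustration under assumptions, not the actual swarming code: `promote_if_healthy` and the `health_check` callable are hypothetical names. Copying through a temporary file and renaming it into place is a choice made for this sketch so that swarming_bot.zip is never left half-written.

```python
import os
import shutil
import tempfile


def promote_if_healthy(staging_path, lkgbc_path, health_check):
    """Copies the staging zip over the LKGBC only if health_check() passes.

    health_check is a hypothetical callable that returns True when the bot
    polled the server successfully (whether or not a task was returned).
    """
    if not health_check():
        # Leave the LKGBC untouched so the next reboot still uses good code.
        return False
    # Copy to a temporary file in the same directory, then rename it into
    # place, so a crash mid-copy cannot corrupt the LKGBC.
    dirname = os.path.dirname(os.path.abspath(lkgbc_path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix='.tmp')
    os.close(fd)
    try:
        shutil.copyfile(staging_path, tmp_path)
        os.replace(tmp_path, lkgbc_path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

A caller running from swarming_bot.staging.zip would invoke something like `promote_if_healthy('swarming_bot.staging.zip', 'swarming_bot.zip', check)` after its first poll; if the check fails, a reboot falls back to the untouched swarming_bot.zip.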

It's a bit tricky to define the health check. It must be good enough to show
that the new code can:
- Start without crashing.
- Access the server successfully.
- Perform an upgrade afterward.
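
One possible shape for such a check, as a hedged sketch: `basic_health_check`, `poll_server`, and `can_fetch_update` are hypothetical names standing in for real bot logic, and how to verify the upgrade path is left open here just as it is in the issue.

```python
def basic_health_check(poll_server, can_fetch_update):
    """Returns True if the candidate bot code looks usable.

    Both arguments are hypothetical callables:
    - poll_server() asks the server for a task; it may return None when no
      task is available and raises on any communication failure.
    - can_fetch_update() returns True if the upgrade path still works.
    Reaching this function at all implies the code did not crash on startup.
    """
    try:
        poll_server()  # The result is ignored; only success matters.
        return bool(can_fetch_update())
    except Exception:
        # Any failure means the candidate must not replace the LKGBC.
        return False
```

Treating every exception as a failed check is deliberately conservative: a false negative only delays the upgrade, while a false positive could overwrite the LKGBC with broken code.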

Original issue reported on code.google.com by maruel@chromium.org on 10 Jun 2014 at 1:17

GoogleCodeExporter commented 9 years ago

Original comment by maruel@chromium.org on 5 Feb 2015 at 12:16

GoogleCodeExporter commented 9 years ago
In f9f4032d6919df928d69ef9a13d7e5640987144a

Original comment by maruel@chromium.org on 25 Feb 2015 at 2:26