madecoste / swarming

Automatically exported from code.google.com/p/swarming
Apache License 2.0

Swarming server to schedule "prewarming on idle" task to idle bots to speed up expected-in-the-future tasks #145

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Goal:
Reduce the perceived latency of running a task: results should arrive sooner
than they would if the bot had to both prepare its local cache before the task
is assigned AND clean up the bot/device after the task.

Use case:
After a bot has run a task, its state may be "dirty". There are two ways to
address this:
- Upon startup of the task, always clean up the state.
- Upon completion of the task, clean up the state.

Doing it on completion doesn't work in Real Life(tm) because tasks fail,
hardware spontaneously reboots, etc., so there is always going to be a moment
where a task is distributed to a bot that is "dirty".

Examples:
- For a Swarming bot that builds a project, a checkout at a recent revision
could be done ahead of time.
- For a bot representing a device, it could be flashed to a known good revision 
when idle, stale files found on the device could be removed, etc.
- The run_isolated local cache could be warmed up with files usually required 
to run frequent tests.

Implementation:
Under a scenario of <100% utilization, the server should track the state of
each bot. A state machine in the server would record each bot's state and
would try to distribute "background work tasks" that "clean up" the bot in
advance, preparing it to execute "tasks that are expected in the future".
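A minimal sketch of what such a server-side state machine could look like. The
state names, event names and the `next_state` helper are all hypothetical and
only illustrate the idea; none of them exist in Swarming.

```python
import enum


class BotState(enum.Enum):
    # Hypothetical states; the issue only says bots should be "stateful".
    DIRTY = "dirty"            # just ran a task, or state unknown
    PREWARMING = "prewarming"  # running a background cleanup/prewarm task
    READY = "ready"            # cleaned up and warmed for expected tasks
    BUSY = "busy"              # running a real task


# Allowed (state, event) -> new state transitions. Event names are assumptions.
TRANSITIONS = {
    (BotState.DIRTY, "start_prewarm"): BotState.PREWARMING,
    (BotState.DIRTY, "start_task"): BotState.BUSY,
    (BotState.PREWARMING, "prewarm_done"): BotState.READY,
    (BotState.PREWARMING, "prewarm_failed"): BotState.DIRTY,
    (BotState.READY, "start_task"): BotState.BUSY,
    (BotState.BUSY, "task_done"): BotState.DIRTY,
    (BotState.BUSY, "task_failed"): BotState.DIRTY,
}


def next_state(state, event):
    """Returns the state a bot moves to when `event` is observed."""
    # A reboot can happen at any time and always leaves the bot dirty.
    if event == "reboot":
        return BotState.DIRTY
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError("invalid transition: %s on %r" % (state.name, event))
    return TRANSITIONS[key]
```

Every path out of `BUSY` lands in `DIRTY`, which matches the point above that
cleanup on completion cannot be relied upon; only a successful background task
moves a bot to `READY`.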

This is expected to work in tandem with issue 144 (device affinity) so that not
all devices end up in a single state. Instead, multiple states would stabilize
automatically based on the utilization.

This requires multiple things:
- Describe pre-warm tasks.
  - Have the client part describe the task, archive it on the isolate server and ensure the files are pinned on the isolate server.
  - Have the server part describe these background tasks.
- Run background tasks.
  - Have task_to_run.py assign background tasks when utilization is <100%.
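
The "run background tasks" step could be sketched roughly as below: a polling
bot always gets a matching real task if one is pending, and only falls back to
a background task otherwise. The `Task` class, its fields and the `assign`
function are hypothetical and are not the actual task_to_run.py API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Task:
    # Hypothetical task description; field names are assumptions.
    name: str
    dimensions: Dict[str, str] = field(default_factory=dict)


def matches(task: Task, bot_dimensions: Dict[str, str]) -> bool:
    """True if the bot has every dimension the task requires."""
    return all(bot_dimensions.get(k) == v for k, v in task.dimensions.items())


def assign(bot_dimensions: Dict[str, str],
           real_queue: List[Task],
           background_queue: List[Task]) -> Optional[Task]:
    """Pops and returns the task a polling bot should run, if any.

    Real tasks always win; a background task is only handed out when no
    pending real task matches this bot, i.e. the bot would otherwise idle.
    """
    for queue in (real_queue, background_queue):
        for index, task in enumerate(queue):
            if matches(task, bot_dimensions):
                return queue.pop(index)
    return None
```

For example, a bot with `{"os": "linux"}` polling while only a Windows real
task is queued would receive a matching Linux background task instead of
staying idle.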

This must be pretty conservative; the background tasks must not become a
bottleneck when a surge of incoming real tasks happens, especially when
background tasks are slow (for example checking out the chromium source tree).
In addition, this should respect device affinity, and assign new affinity. It
must not start to prewarm for an old version of a firmware when all the bots
start to use a newer firmware, etc.
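
One possible shape for such a conservative check, as a sketch only. The
`should_prewarm` function, its parameters and the idea of keeping a fixed
reserve of idle bots are assumptions made for illustration, not part of the
proposal.

```python
def should_prewarm(idle_bots, pending_real_tasks, reserve,
                   target_version, versions_in_demand):
    """Decides whether one more idle bot may start a background task.

    All parameters are hypothetical:
      idle_bots: number of bots currently doing nothing.
      pending_real_tasks: number of real tasks waiting for a bot.
      reserve: idle bots to keep free to absorb a surge of real tasks.
      target_version: firmware/affinity the background task would prepare.
      versions_in_demand: versions that recent real tasks actually requested.
    """
    # Never compete with real work.
    if pending_real_tasks > 0:
        return False
    # Keep some bots free so a slow prewarm cannot block a sudden surge.
    if idle_bots <= reserve:
        return False
    # Do not prewarm for a version nobody is asking for anymore.
    if target_version not in versions_in_demand:
        return False
    return True
```

With `reserve=2`, a pool with 5 idle bots and no pending real tasks would be
allowed to prewarm for a firmware version that is still in demand, but not for
one that recent tasks have stopped requesting.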

Original issue reported on code.google.com by maruel@chromium.org on 15 Aug 2014 at 4:47