madecoste / swarming

Automatically exported from code.google.com/p/swarming
Apache License 2.0

Automatically retry tasks once when the first try has status BOT_DIED #108

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
Use case:
- User triggers a task.
- A scenario occurs that makes the server retry the task automatically. This 
can be when the TaskRunResult enters the State.BOT_DIED state, or when 
automatic retry on task failure is enabled.

Original issue reported on code.google.com by maruel@chromium.org on 22 May 2014 at 5:00

GoogleCodeExporter commented 9 years ago
This is an important task for increasing reliability to >99.9%. When a bot 
dies, the task should be retried once.
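
The retry-once rule can be sketched roughly as below. This is only an illustration in modern Python, not the scheduler's actual code: `RunResult`, `MAX_TRIES` and `on_bot_died()` are hypothetical names, and the `State` enum lists only the values this sketch needs.

```python
import enum
from dataclasses import dataclass


class State(enum.Enum):
    # Only the states this sketch needs; the real State has more values.
    RUNNING = "RUNNING"
    BOT_DIED = "BOT_DIED"


@dataclass
class RunResult:
    # Hypothetical stand-in for TaskRunResult.
    try_number: int
    state: State = State.RUNNING


# Assumption: the first try plus a single automatic retry.
MAX_TRIES = 2


def on_bot_died(run):
    """Marks the run as BOT_DIED; returns a new try, or None if exhausted."""
    run.state = State.BOT_DIED
    if run.try_number >= MAX_TRIES:
        return None
    # The second try is a separate object, so each try keeps its own data.
    return RunResult(try_number=run.try_number + 1)
```

Calling `on_bot_died()` on the first try yields a second try with `try_number == 2`; calling it again on that second try returns `None`, so the task is never retried more than once.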

The code tagging a task as BOT_DIED is bot_kill_task() at 
https://code.google.com/p/swarming/source/browse/services/swarming/server/task_scheduler.py#411

In particular, we'd want a new TaskRunResult to be created for the second try
https://code.google.com/p/swarming/source/browse/services/swarming/server/task_result.py#364
so that the data for each try is saved independently. The overall design 
already supports this, but the control bits are missing and some tuning of the 
entities may be required. For example, 
result_summary_key_to_run_result_key() refuses try_number != 1.
https://code.google.com/p/swarming/source/browse/services/swarming/server/task_result.py#654
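
A relaxed version of that check might look something like the sketch below. It does not use the real datastore API: keys are modeled as plain tuples of (kind, id) pairs, the function name `summary_key_to_run_key()` is hypothetical, and the parent chain shown in the test values is illustrative only (the README has the real entity tree).

```python
def summary_key_to_run_key(summary_key, try_number):
    """Returns a per-try key nested under the given summary key.

    Keys are modeled as tuples of (kind, id) pairs, not real datastore keys.
    """
    # Assumption: one automatic retry, so only tries 1 and 2 are valid.
    if try_number not in (1, 2):
        raise ValueError("try_number must be 1 or 2, got %r" % (try_number,))
    # Each try gets its own child key, so its data is stored independently.
    return summary_key + (("TaskRunResult", try_number),)
```

With this shape, both tries share the same parent and differ only in the last key element, which keeps the results of the first try intact when the second one is written.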

See the entity tree at 
https://code.google.com/p/swarming/source/browse/services/swarming/server/README.md

Original comment by maruel@chromium.org on 6 Aug 2014 at 4:36

GoogleCodeExporter commented 9 years ago
Surfacing the results properly is likely blocked on the new client API, issue 
118. That said, the retry itself could still work fine without the new client 
API.

Original comment by maruel@chromium.org on 6 Aug 2014 at 4:39

GoogleCodeExporter commented 9 years ago

Original comment by maruel@chromium.org on 6 Aug 2014 at 4:40

GoogleCodeExporter commented 9 years ago

Original comment by maruel@chromium.org on 7 Aug 2014 at 1:50

GoogleCodeExporter commented 9 years ago
This task includes adding a new .idempotent flag to TaskProperties, to 
differentiate tasks that can be safely retried from the ones that have side 
effects (like accessing a remote server and setting properties on it).
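
As a rough sketch of how such a flag could gate the retry: the `TaskProperties` class below is a hypothetical stand-in that models only the proposed flag, `may_retry()` is an invented helper, and defaulting the flag to False is an assumption made here so that tasks with side effects are not retried unless the user opts in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProperties:
    # Hypothetical stand-in; only the proposed flag is modeled.
    # Assumption: tasks are treated as non-idempotent unless marked otherwise.
    idempotent: bool = False


def may_retry(properties, try_number):
    """Returns True if an automatic retry is safe and still available."""
    # Only the first try may be retried, and only for idempotent tasks.
    return properties.idempotent and try_number == 1
```

Here `may_retry(TaskProperties(idempotent=True), 1)` is true, while a task left at the default, or one already on its second try, is not retried.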

Original comment by maruel@chromium.org on 14 Aug 2014 at 9:07

GoogleCodeExporter commented 9 years ago

Original comment by maruel@chromium.org on 18 Sep 2014 at 7:16