madecoste / swarming

Automatically exported from code.google.com/p/swarming
Apache License 2.0

Add monitoring/alert system on critical situations #150

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Critical situations can be defined as:
- More than X pending tasks, globally or for a specific set of dimensions.
- A task that has been pending for more than X minutes. This could be categorized by priority or tags; e.g. a high-priority task left pending is an issue.
- X bots offline (likely as a percentage); also notify per dimension.
- Abnormal number of BOT_DIED events
- Abnormal number of task expirations
- Abnormal number of execution timeouts
- Abnormal number of task failures

[Add more]
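
As a rough illustration of how the conditions listed above could be evaluated, here is a minimal sketch. Everything in it is an assumption made for the example: `Snapshot`, `Thresholds`, `check_alerts`, the default limits, and the reading of "abnormal" as "more than a fixed multiple of a baseline count" are hypothetical and are not part of Swarming.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Snapshot:
    """Hypothetical point-in-time view of the system; not a Swarming API."""
    pending_tasks: int
    oldest_pending_minutes: float
    bots_total: int
    bots_offline: int
    # Event counts over the last window, keyed by a label such as "BOT_DIED".
    event_counts: Dict[str, int] = field(default_factory=dict)


@dataclass
class Thresholds:
    """Hypothetical limits standing in for the X values in the list."""
    max_pending_tasks: int = 1000
    max_pending_minutes: float = 30.0
    max_offline_ratio: float = 0.10
    # Expected counts per window. "Abnormal" is assumed here to mean
    # more than abnormal_factor times the baseline.
    baseline_counts: Dict[str, int] = field(default_factory=dict)
    abnormal_factor: float = 3.0


def check_alerts(snap: Snapshot, limits: Thresholds) -> List[str]:
    """Return one message per critical situation found in the snapshot."""
    alerts: List[str] = []
    if snap.pending_tasks > limits.max_pending_tasks:
        alerts.append("too many pending tasks")
    if snap.oldest_pending_minutes > limits.max_pending_minutes:
        alerts.append("task pending too long")
    # Skip the ratio check when no bots are known, to avoid dividing by zero.
    if snap.bots_total and snap.bots_offline / snap.bots_total > limits.max_offline_ratio:
        alerts.append("too many bots offline")
    for name, baseline in limits.baseline_counts.items():
        if snap.event_counts.get(name, 0) > baseline * limits.abnormal_factor:
            alerts.append(f"abnormal number of {name}")
    return alerts
```

The same check could be run once on a global snapshot and once per set of dimensions (or per priority or tag), with a separate `Thresholds` for each, which would cover the per-dimension notifications mentioned in the list.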

Original issue reported on code.google.com by maruel@chromium.org on 28 Aug 2014 at 2:32

GoogleCodeExporter commented 9 years ago

Original comment by maruel@chromium.org on 28 Aug 2014 at 2:34