trink closed 5 years ago
Initial prototype (ignore the data before 8:20 PT): https://earthangel-b40313e5.influxcloud.net/d/Q_Z67GhWz/taskcluster-bitbar-gw-perf-p2?orgId=1&refresh=30s&from=now-3h&to=now
At the moment the lag data covers everything seen so far (there is no windowing; the hover-over shows the total task count to give a sense of scale). That much history is good for stabilizing the alerts, but in the long term it may make them too unresponsive. Depending on the workerType and its task frequency, the time it takes the p99 and p99.9 values to diverge could be anywhere from minutes to days. The question is: how quickly do we want to alert if the lag starts creeping up? @aerickson
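To illustrate the trade-off, here is a minimal sketch comparing an all-time p99 with a windowed p99. The function names, the `(timestamp, lag_seconds)` sample layout, and the nearest-rank percentile method are all assumptions for illustration; this is not how the dashboard computes its values.

```python
import math
from datetime import datetime, timedelta, timezone


def percentile(values, pct):
    """Nearest-rank percentile of a list; returns None for empty input."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def windowed_percentile(samples, pct, now, window):
    """Percentile over samples no older than `window`.

    `samples` is a hypothetical list of (timestamp, lag_seconds) tuples.
    """
    cutoff = now - window
    recent = [lag for ts, lag in samples if ts >= cutoff]
    return percentile(recent, pct)


# Hypothetical data: a long, healthy history followed by a recent slowdown.
now = datetime(2020, 1, 4, tzinfo=timezone.utc)
history = [(now - timedelta(days=3), 10.0)] * 1000
recent = [(now - timedelta(hours=1), 600.0)] * 10
samples = history + recent

all_time_p99 = percentile([lag for _, lag in samples], 99)
last_day_p99 = windowed_percentile(samples, 99, now, timedelta(hours=24))
```

With this made-up data the all-time p99 still reflects the old healthy history, while the 24-hour window picks up the recent slowdown, which is the responsiveness concern described above.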
That looks great. :)
We don't need to alert very quickly. 6, 12, or 24 hours? 1-2 days wouldn't be the worst. Right now we have nothing automated, so anything is an improvement.
We get an immediate sense of how the workers are currently doing by watching the queue count and its velocity/derivative (could be nice to have that).
Lag is defined as the time from when a task is scheduled to when it starts (task started time minus task scheduled time): https://docs.google.com/document/d/1QjuTzD0IR3Xa_8U7_q59kZu7SckHpH14CZKHz969A38/edit#heading=h.af80tl7prv5v
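A rough sketch of that calculation follows. It assumes the two times arrive as ISO 8601 UTC strings and takes lag as the delay from scheduling to start, so a task that waits longer has a larger positive value. The function names and the timestamp format are assumptions, not taken from the linked document.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def lag_seconds(scheduled, started):
    """Seconds a task waited between being scheduled and being started."""
    return (parse_ts(started) - parse_ts(scheduled)).total_seconds()


# Hypothetical task that waited three and a half minutes for a worker.
example_lag = lag_seconds("2020-05-01T12:00:00.000Z", "2020-05-01T12:03:30.000Z")
```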