taskcluster - add lag to the influxdb time series output

trink commented 5 years ago

Lag is defined as the task started time minus the task scheduled time, i.e. how long a scheduled task waits before a worker starts it: https://docs.google.com/document/d/1QjuTzD0IR3Xa_8U7_q59kZu7SckHpH14CZKHz969A38/edit#heading=h.af80tl7prv5v
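
A minimal sketch of the lag computation, taking lag as the wait between when a task was scheduled and when it started, so the value is non-negative. It assumes both times arrive as ISO 8601 strings; the function names `parse_ts` and `lag_seconds` are hypothetical and not part of the actual plugin.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp into a timezone-aware datetime."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def lag_seconds(scheduled: str, started: str) -> float:
    """Seconds a task waited between being scheduled and being started."""
    return (parse_ts(started) - parse_ts(scheduled)).total_seconds()


# Hypothetical timestamps, for illustration only.
example_lag = lag_seconds("2019-10-01T15:00:00.000Z", "2019-10-01T15:02:30.500Z")
```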

trink commented 5 years ago

Initial prototype (ignore the data before 8:20 PT): https://earthangel-b40313e5.influxcloud.net/d/Q_Z67GhWz/taskcluster-bitbar-gw-perf-p2?orgId=1&refresh=30s&from=now-3h&to=now

At the moment the lag data covers everything seen, with no windowing; the hover-over shows the total task count to give a sense of scale. That much history is good for stabilizing the alerts, but in the long term it may make the alerting too unresponsive. Depending on the workerType's task frequency, it could take anywhere from minutes to days for the p99 and p99.9 values to diverge. The question is: how quickly do we want to alert if the lag starts creeping up? @aerickson
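
To illustrate the responsiveness trade-off, here is a rough sketch comparing an all-time p99 with a p99 over only the most recent tasks. It is not the prototype's implementation: the `LagTracker` class, the nearest-rank percentile method, and the window size are all assumptions chosen for the example.

```python
import math
from collections import deque


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


class LagTracker:
    """Keeps every lag seen plus a fixed-size window of the latest ones."""

    def __init__(self, window_size):
        self.all_lags = []                       # unbounded history
        self.recent = deque(maxlen=window_size)  # old values fall off the left

    def add(self, lag):
        self.all_lags.append(lag)
        self.recent.append(lag)

    def p99_all(self):
        return percentile(self.all_lags, 99)

    def p99_window(self):
        return percentile(self.recent, 99)


# Synthetic data: a long healthy history, then lag jumps for the last 50 tasks.
tracker = LagTracker(window_size=100)
for _ in range(10_000):
    tracker.add(10)
for _ in range(50):
    tracker.add(600)

all_time = tracker.p99_all()     # still dominated by the healthy history
windowed = tracker.p99_window()  # already reflects the jump
```

With this synthetic data the all-time p99 stays at the old level while the windowed p99 has already moved, which is the kind of delay described above for busy workerTypes.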

aerickson commented 5 years ago

That looks great. :)

We don't need to alert very quickly. 6, 12, or 24 hours? 1-2 days wouldn't be the worst. Right now we have nothing automated, so anything is an improvement.

We currently get an immediate feel for how the workers are doing by watching the queue count and estimating its velocity/derivative by eye (could be nice to have that).
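
The velocity/derivative idea could be approximated with a finite difference over successive queue-count samples. This is only a sketch: the sample format of `(unix_seconds, pending_count)` pairs and the `queue_velocity` name are assumptions, not something the existing output provides.

```python
def queue_velocity(samples):
    """Rate of change of the queue count, in tasks per minute.

    `samples` is a time-ordered list of (unix_seconds, pending_count) pairs.
    Returns one rate per consecutive pair; positive means the queue is growing.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            # Skip duplicate or out-of-order samples rather than divide by zero.
            continue
        rates.append((c1 - c0) / dt * 60)
    return rates


# Hypothetical samples taken one minute apart.
rates = queue_velocity([(0, 100), (60, 130), (120, 115)])
```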

mozilla-services / lua_sandbox_extensions

taskcluster - add lag to the influxdb time series output #471