scoutapp / roadmap

The public roadmap for Scout application monitoring.
https://scoutapp.com

Dynamic app health check #1

Open itsderek23 opened 7 years ago

itsderek23 commented 7 years ago

We currently provide static alerting thresholds on key performance metrics. Example: error rate >= 10 per-minute for 5 minutes.
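For illustration, here's a minimal sketch (not Scout's actual implementation; the threshold and window are just the example numbers above) of how such a static check is typically evaluated:

```python
from collections import deque

THRESHOLD = 10        # errors per minute (the example value above)
WINDOW_MINUTES = 5    # the rate must stay at/above the threshold this long

recent = deque(maxlen=WINDOW_MINUTES)

def record_minute(errors_per_minute: float) -> bool:
    """Record the latest per-minute error rate; return True when an alert should fire."""
    recent.append(errors_per_minute)
    return len(recent) == WINDOW_MINUTES and all(v >= THRESHOLD for v in recent)
```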

However, this has downsides: a useful threshold is frequently unique per-app.

Rather than this approach, I propose a two-step flow:

  1. A general health check that runs continuously, looking for abnormal behavior
  2. A performance diff UI that is viewed after an alert is generated. This highlights parts of the app that have the greatest changes.

Analogy: you go to the doctor when you have a fever, and the doctor prescribes medication to treat the fever, which could be caused by many things. Detecting the fever is the health check. The doctor appointment is the performance diff.

Health Check

We'd evaluate 3 key metrics against a baseline. If any of them deviates from its baseline by more than X standard deviations, an alert is generated.

To handle seasonal patterns, the current values are compared against 3 baseline time periods; the threshold must be exceeded in all 3 periods to generate an alert.
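As a rough sketch of how that check could work (the specific metrics, baseline periods, and value of X below are my assumptions, not a spec):

```python
import statistics

# Assumed baseline periods for seasonality (not a spec): the trailing hour,
# the same hour yesterday, and the same hour last week.
PERIODS = ["trailing_hour", "same_hour_yesterday", "same_hour_last_week"]
X = 3.0  # alert when a metric sits more than X standard deviations above baseline

def exceeds_baseline(current: float, baseline_samples: list[float]) -> bool:
    """True if `current` is more than X std devs above the baseline mean."""
    mean = statistics.mean(baseline_samples)
    stdev = statistics.pstdev(baseline_samples) or 1e-9  # guard against zero variance
    return (current - mean) / stdev > X

def should_alert(current: float, baselines: dict[str, list[float]]) -> bool:
    """Alert only when the threshold is exceeded in *all* baseline periods."""
    return all(exceeds_baseline(current, baselines[p]) for p in PERIODS)
```

`should_alert` would run once per key metric; any single metric tripping it in all 3 periods generates the alert, matching the "any of them" behavior described above.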

Example UI w/the configuration:

(image: img_4677, the health check configuration UI)

Performance Diff

This will likely evolve into a dedicated issue, but in brief: after an alert is generated, the diff highlights the parts of the app with the greatest changes versus the baseline period.

Example:

(image: img_4680, an example performance diff)

Note - I don't know why the above images are rotated when embedded. Clicking shows them in the proper orientation.
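To make the diff idea concrete, here's a rough sketch (the data shapes and window choices are my assumptions) of ranking endpoints by change between a baseline window and the window that fired the alert:

```python
def performance_diff(baseline: dict[str, float], current: dict[str, float], top_n: int = 10):
    """Rank endpoints by change in mean response time (ms) between two windows.

    `baseline` and `current` are hypothetical maps of endpoint name -> mean
    response time for the baseline window and the window that fired the alert.
    """
    changes = []
    for endpoint, current_ms in current.items():
        baseline_ms = baseline.get(endpoint)
        if baseline_ms is None:
            continue  # endpoint with no baseline data; skipped in this sketch
        changes.append((endpoint, current_ms - baseline_ms))
    # Largest regressions first, so the biggest changes surface at the top
    return sorted(changes, key=lambda pair: pair[1], reverse=True)[:top_n]
```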

Downsides

The main one: this approach requires sufficient volume to establish a reliable baseline.

drewblas commented 7 years ago

I'm definitely interested in the concept and I'm excited to try it out.

You mention "This is frequently unique per-app." - Intuition tells me this is also per-endpoint. Every endpoint/job's behavior is different.

You also mentioned having "sufficient volume". I feel we have plenty of overall volume, but where we suffer is that individual endpoints can exhibit widely variable behavior. For 90% of users, a certain endpoint is fine, but it may suffer in performance/response time/etc. when hit by a user with an unusual usage pattern. So we see variation, spikes, and "abnormalities" as part of "normal" app behavior that is not attributable to a recent deploy.

I'm not sure how to account for that, but I'm bringing up the problem that constantly "clouds" our analysis: 10 days in a row an endpoint can be just fine, and then one day "Joe" comes along and hits a certain page hard. Page performance is worse for him because of any number of weird situations, so a page that renders in 200ms for most users takes 600ms for him. And if he hits it a whole bunch that day, the avg/95th response time really suffers because it's a page that isn't used all that much.
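To put numbers on that skew (the figures below are made up, reusing the 200ms/600ms example):

```python
import statistics

# Made-up traffic for a lightly used page: 100 "normal" requests at ~200ms,
# plus 40 requests from one heavy user ("Joe") at ~600ms.
normal = [200.0] * 100
with_joe = normal + [600.0] * 40

def p95(samples: list[float]) -> float:
    return statistics.quantiles(samples, n=100)[94]

print(statistics.mean(normal), p95(normal))        # 200.0 200.0
print(statistics.mean(with_joe), p95(with_joe))    # ~314.3 600.0
```

One heavy user on a low-volume page moves both the average and the 95th percentile, even though nothing about the app itself changed.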

I know it's a hard problem, and I'm not offering solutions. But I'd certainly love to try any attempt you make at helping us sort the wheat from the chaff here!

itsderek23 commented 7 years ago

For 90% of users, a certain endpoint is fine, but it may suffer in performance/response time/etc. when hit by a user with an unusual usage pattern.

💯.

I'm not sure how to account for that, but I'm bringing up the problem that constantly "clouds" our analysis: 10 days in a row an endpoint can be just fine, and then one day "Joe" comes along and hits a certain page hard.

I think context could play a stronger role here: seeing that "Joe" is involved in many of the 95th percentile+ requests but not at the median would make that clear.
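A rough sketch of that kind of per-user breakdown (the data model and field names here are my assumptions, just to illustrate the idea):

```python
from collections import Counter

def slow_request_share(requests: list[dict], p95_ms: float) -> Counter:
    """Count how often each user shows up among requests slower than the p95.

    `requests` is a hypothetical list of {"user": ..., "duration_ms": ...}
    dicts; in practice this context would come from the agent's trace data.
    """
    return Counter(r["user"] for r in requests if r["duration_ms"] > p95_ms)

# If "joe" dominates this counter while being a small share of overall traffic,
# the slowdown points at his unusual usage pattern rather than a regression.
```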

For example: