scoutapp / roadmap

The public roadmap for Scout application monitoring.
https://scoutapp.com

Dynamic app health check #1

Open itsderek23 opened 7 years ago

itsderek23 commented 7 years ago

We currently provide static alerting thresholds on key performance metrics. Example: error rate >= 10 per-minute for 5 minutes.
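For illustration, here's a minimal sketch (not Scout's actual implementation; the threshold and window are just the example numbers above) of how such a static check is typically evaluated:

```python
from collections import deque

THRESHOLD = 10        # errors per minute (the example value above)
WINDOW_MINUTES = 5    # the rate must stay at/above the threshold this long

recent = deque(maxlen=WINDOW_MINUTES)

def record_minute(errors_per_minute: float) -> bool:
    """Record the latest per-minute error rate; return True when an alert should fire."""
    recent.append(errors_per_minute)
    return len(recent) == WINDOW_MINUTES and all(v >= THRESHOLD for v in recent)
```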

However, this has downsides: a useful threshold is frequently unique per-app.

Rather than this approach, I propose a two-step flow:

  1. A general health check that runs continuously, looking for abnormal behavior
  2. A performance diff UI that is viewed after an alert is generated. This highlights parts of the app that have the greatest changes.

Analogy: you go to the doctor when you have a fever, and the doctor prescribes medication to treat the fever, which could be caused by many things. Detecting the fever is the health check. The doctor appointment is the performance diff.

Health Check

We'd evaluate 3 key metrics against a baseline. If any of them deviates from its baseline by more than X standard deviations, an alert is generated.

To handle seasonal patterns, the current values are compared against 3 baseline time periods; the threshold must be exceeded in all 3 periods to generate an alert.
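As a rough sketch of how that check could work (the specific metrics, baseline periods, and value of X below are my assumptions, not a spec):

```python
import statistics

# Assumed baseline periods for seasonality (not a spec): the trailing hour,
# the same hour yesterday, and the same hour last week.
PERIODS = ["trailing_hour", "same_hour_yesterday", "same_hour_last_week"]
X = 3.0  # alert when a metric sits more than X standard deviations above baseline

def exceeds_baseline(current: float, baseline_samples: list[float]) -> bool:
    """True if `current` is more than X std devs above the baseline mean."""
    mean = statistics.mean(baseline_samples)
    stdev = statistics.pstdev(baseline_samples) or 1e-9  # guard against zero variance
    return (current - mean) / stdev > X

def should_alert(current: float, baselines: dict[str, list[float]]) -> bool:
    """Alert only when the threshold is exceeded in *all* baseline periods."""
    return all(exceeds_baseline(current, baselines[p]) for p in PERIODS)
```

`should_alert` would run once per key metric; any single metric tripping it in all 3 periods generates the alert, matching the "any of them" behavior described above.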

Example UI w/the configuration:

(image: img_4677, the health check configuration UI)

Performance Diff

This will likely evolve into a dedicated issue, but in brief: after an alert is generated, the diff highlights the parts of the app with the greatest changes versus the baseline period.

Example:

(image: img_4680, an example performance diff)

Note - I don't know why the above images are rotated when embedded. Clicking shows them in the proper orientation.
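To make the diff idea concrete, here's a rough sketch (the data shapes and window choices are my assumptions) of ranking endpoints by change between a baseline window and the window that fired the alert:

```python
def performance_diff(baseline: dict[str, float], current: dict[str, float], top_n: int = 10):
    """Rank endpoints by change in mean response time (ms) between two windows.

    `baseline` and `current` are hypothetical maps of endpoint name -> mean
    response time for the baseline window and the window that fired the alert.
    """
    changes = []
    for endpoint, current_ms in current.items():
        baseline_ms = baseline.get(endpoint)
        if baseline_ms is None:
            continue  # endpoint with no baseline data; skipped in this sketch
        changes.append((endpoint, current_ms - baseline_ms))
    # Largest regressions first, so the biggest changes surface at the top
    return sorted(changes, key=lambda pair: pair[1], reverse=True)[:top_n]
```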

Downsides

The main one: this approach requires sufficient volume to establish a reliable baseline.

drewblas commented 7 years ago

I'm definitely interested in the concept and I'm excited to try it out.

You mention "This is frequently unique per-app." - Intuition tells me this is also per-endpoint. Every endpoint/job's behavior is different.

You also mentioned having "sufficient volume". I feel we have plenty of overall volume, but where we suffer is that individual endpoints can exhibit widely variable behavior. For 90% of users, a certain endpoint is fine, but it may suffer in performance/response time/etc. when hit by a user with an unusual usage pattern. So we see variation, spikes, and "abnormalities" as part of "normal" app behavior that is not attributable to a recent deploy.

I'm not sure how to account for that, but I'm bringing up the problem that constantly "clouds" our analysis: 10 days in a row an endpoint can be just fine, and then one day "Joe" comes along and hits a certain page hard. Page performance is worse for him because of any number of weird situations, so a page that renders in 200ms for most users takes 600ms for him. And if he hits it a whole bunch that day, the avg/95th response time really suffers because it's a page that isn't used all that much.
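To put numbers on that skew (the figures below are made up, reusing the 200ms/600ms example):

```python
import statistics

# Made-up traffic for a lightly used page: 100 "normal" requests at ~200ms,
# plus 40 requests from one heavy user ("Joe") at ~600ms.
normal = [200.0] * 100
with_joe = normal + [600.0] * 40

def p95(samples: list[float]) -> float:
    return statistics.quantiles(samples, n=100)[94]

print(statistics.mean(normal), p95(normal))        # 200.0 200.0
print(statistics.mean(with_joe), p95(with_joe))    # ~314.3 600.0
```

One heavy user on a low-volume page moves both the average and the 95th percentile, even though nothing about the app itself changed.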

I know it's a hard problem, and I'm not offering solutions. But I'd certainly love to try any attempt you make at helping us sort the wheat from the chaff here!

itsderek23 commented 7 years ago

For 90% of users, a certain endpoint is fine, but it may suffer in performance/response time/etc. when hit by a user with an unusual usage pattern.

💯.

I'm not sure how to account for that, but I'm bringing up the problem that constantly "clouds" our analysis: 10 days in a row an endpoint can be just fine, and then one day "Joe" comes along and hits a certain page hard.

I think context could play a stronger role here: seeing that "Joe" is involved in many of the 95th percentile+ requests but not at the median would make that clear.
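A rough sketch of that kind of per-user breakdown (the data model and field names here are my assumptions, just to illustrate the idea):

```python
from collections import Counter

def slow_request_share(requests: list[dict], p95_ms: float) -> Counter:
    """Count how often each user shows up among requests slower than the p95.

    `requests` is a hypothetical list of {"user": ..., "duration_ms": ...}
    dicts; in practice this context would come from the agent's trace data.
    """
    return Counter(r["user"] for r in requests if r["duration_ms"] > p95_ms)

# If "joe" dominates this counter while being a small share of overall traffic,
# the slowdown points at his unusual usage pattern rather than a regression.
```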

For example: