Alerts Discussion: how should we implement application-level monitoring and alerts?

livepeer / test-harness

Alerts Discussion: how should we implement application-level monitoring and alerts? #16

Closed by eladmallel 5 years ago

eladmallel commented 5 years ago

What we want to achieve is an alert (email/SMS) whenever a test network's success rate drops below a given threshold (e.g. 95%), both at the individual node level and across the entire network.
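To make the goal concrete, here is a minimal sketch of the check itself, independent of where it ends up running. `NodeStats`, `success_rate`, and `find_alerts` are hypothetical names for illustration, not part of the node or the metrics server, and the counters are assumed to be simple success/total tallies per node.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeStats:
    """Hypothetical per-node counters (assumed shape, not the real metrics schema)."""
    node_id: str
    succeeded: int
    total: int


def success_rate(succeeded: int, total: int) -> float:
    # A node with no traffic is treated as healthy so it does not trigger alerts.
    return 1.0 if total == 0 else succeeded / total


def find_alerts(nodes: List[NodeStats], threshold: float = 0.95) -> List[str]:
    """Return one message per node below the threshold, plus one for the network."""
    alerts = []
    for n in nodes:
        rate = success_rate(n.succeeded, n.total)
        if rate < threshold:
            alerts.append(f"node {n.node_id}: success rate {rate:.1%} below {threshold:.0%}")
    # Network-level rate is computed from summed counters, not averaged per-node rates,
    # so busy nodes weigh more than idle ones.
    net_rate = success_rate(sum(n.succeeded for n in nodes), sum(n.total for n in nodes))
    if net_rate < threshold:
        alerts.append(f"network: success rate {net_rate:.1%} below {threshold:.0%}")
    return alerts
```

Delivery (email/SMS) is deliberately left out; the returned messages would be handed to whichever alerting channel we settle on.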

What is the best way to achieve that? Some thoughts to kick off the brainstorm:

1. We could add new code to the node that emits the relevant data to something standard like statsd or collectd, and hook that data up to a Graphite service, which can then connect to PagerDuty.
2. We could rely on our current metrics server implementation and add features to it: either it hosts the threshold and alerting rules itself, or it emits raw data to another service (e.g. GCP monitoring, Graphite) where we would configure alerts.
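For a sense of how little code the statsd route needs on the emitting side, here is a rough sketch. It uses the plain-text StatsD counter format (`<name>:<value>|c`) over UDP; the metric names are made up for illustration, and the host and port are assumptions (8125 is the conventional StatsD port).

```python
import socket


def statsd_counter(name: str, value: int = 1) -> bytes:
    # StatsD plain-text line for a counter: <metric name>:<value>|c
    return f"{name}:{value}|c".encode("ascii")


def emit(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    # Fire-and-forget UDP datagram; a missing collector does not block the caller.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names; the real node would choose its own naming scheme.
ok_line = statsd_counter("testnet.node1.request.success")
fail_line = statsd_counter("testnet.node1.request.failure")
```

Calling `emit(ok_line)` after each successful request would be enough for a downstream service to derive a success rate from the two counters.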

Any other approaches you have in mind?

Would love to have a productive discussion here to land on an approach that makes the most sense to all of us.

@darkdarkdragon @j0sh @ericxtang

j0sh commented 5 years ago

Since we already have application-specific reporting via the metrics service, I'm inclined to go with option 2.

Option 1 would mean writing and running code for two concurrent metrics services, which doesn't seem ideal.

There's also another option, which is to process the (perhaps aggregated) logs, extract the metrics from them, and alert based on those numbers. This is the approach I would take if I had designed the metrics system, because it's non-invasive and doesn't require anything within the application except for logging (which is already generally useful on its own). I've typically used homegrown tools for this, but there are things such as ELK, Graylog, Splunk, etc.
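As a rough illustration of the log-based approach, a homegrown extractor can be quite small. The log line format below (`node=<id> status=ok|fail`) is an assumption made up for this sketch, not the node's actual log format, and `success_rates` is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Assumed line shape: "<timestamp> node=<id> status=ok" or "... status=fail".
LINE_RE = re.compile(r"node=(?P<node>\S+)\s+status=(?P<status>ok|fail)\b")


def success_rates(lines: Iterable[str]) -> Dict[str, float]:
    """Compute a per-node success rate from log lines that report a result."""
    counts = defaultdict(lambda: [0, 0])  # node -> [ok count, total count]
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that do not carry a result
        entry = counts[match.group("node")]
        entry[1] += 1
        if match.group("status") == "ok":
            entry[0] += 1
    return {node: ok / total for node, (ok, total) in counts.items()}
```

The resulting rates can then be compared against a threshold by a separate alerting step, so nothing in the application changes beyond what it already logs.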

eladmallel commented 5 years ago

@j0sh interesting thoughts! I've used Splunk in the past and it's really powerful and useful (and expensive!).

I'm curious to better understand how you and @darkdarkdragon are thinking we can use the metrics server to generate alerts. For example, where should we define the threshold and the check against it? How should we wire it up to some alerting SaaS (e.g. GCP, PagerDuty)?

darkdarkdragon commented 5 years ago

@eladmallel

I think we shouldn't look at any hosted solutions - we're making an open source project, and we shouldn't tie users to any commercial provider.
I looked at Graphite some time ago and didn't like it, but I don't remember why 😄
The ideal solution would be to find an open source project in which we can (easily) visualize the same data we have in our own metrics server, and use it. One possible candidate is Prometheus, but I personally haven't tried it yet.