Closed eladmallel closed 5 years ago
Since we already have application-specific reporting via the metrics service, I'm inclined to go with option 2.
Option 1 would mean writing and running code for two concurrent metrics services, which doesn't seem ideal.
There's also another option: process the (perhaps aggregated) logs, extract the metrics from them, and alert based on those numbers. This is the approach I would take if I had designed the metrics system, because it's non-invasive and doesn't require anything within the application except logging (which is already generally useful on its own). I've typically used homegrown tools for this, but there are existing tools such as ELK, Graylog, and Splunk.
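To illustrate the log-based idea, here is a minimal sketch in Python. The log line format (`event=transcode status=success|failure`), the function names, and the 95% default are all assumptions made up for illustration; the real logs and any real tooling would look different.

```python
import re
from typing import Iterable, Optional

# Hypothetical log format, invented for this sketch; real log lines will differ.
LINE_RE = re.compile(r"event=transcode\s+status=(?P<status>success|failure)")


def success_rate_from_logs(lines: Iterable[str]) -> Optional[float]:
    """Return the fraction of successful events, or None if no event lines matched."""
    ok = 0
    total = 0
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # unrelated log line, skip it
        total += 1
        if match.group("status") == "success":
            ok += 1
    return ok / total if total else None


def should_alert(rate: Optional[float], threshold: float = 0.95) -> bool:
    """Alert only when there is data and the rate is below the threshold."""
    return rate is not None and rate < threshold


sample_lines = [
    "node=node-1 event=transcode status=success",
    "node=node-1 event=transcode status=success",
    "node=node-1 event=startup",
    "node=node-1 event=transcode status=failure",
    "node=node-1 event=transcode status=success",
]
sample_rate = success_rate_from_logs(sample_lines)
```

With the sample lines above, three of the four matching events succeeded, so `should_alert(sample_rate)` is true against a 95% threshold. The application itself only has to keep logging; the parsing and the threshold live outside it.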
@j0sh interesting thoughts! I've used Splunk in the past and it's really powerful and useful (and expensive!).
I'm curious to better understand how you and @darkdarkdragon think we could use the metrics server to generate alerts. For example, where should we define the threshold and check against it? How should we wire it up to an alerting service (e.g. GCP, PagerDuty)?
@eladmallel
The best way to articulate what we want to achieve: receive an alert (email/SMS) whenever a test network's success rate drops below a given threshold (e.g. 95%), both at the individual node level and at the level of the entire network.
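To make that requirement concrete, the check itself could be sketched roughly as follows. This is only an illustration: `NodeStats`, `find_alerts`, and the node names are hypothetical, and it says nothing about where the counts come from or how the alert is delivered.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeStats:
    """Hypothetical per-node counters for one evaluation window."""
    succeeded: int
    total: int


def find_alerts(stats: Dict[str, NodeStats], threshold: float = 0.95) -> List[str]:
    """Return alert messages for each node, and for the network, below the threshold."""
    alerts: List[str] = []
    net_ok = 0
    net_total = 0
    for node, s in sorted(stats.items()):
        net_ok += s.succeeded
        net_total += s.total
        # Nodes with no traffic in the window are skipped rather than alerted on.
        if s.total and s.succeeded / s.total < threshold:
            alerts.append(f"node {node}: success rate {s.succeeded / s.total:.1%}")
    # The network-level rate is computed over all events, not as a mean of node rates.
    if net_total and net_ok / net_total < threshold:
        alerts.append(f"network: success rate {net_ok / net_total:.1%}")
    return alerts


example_alerts = find_alerts({
    "node-a": NodeStats(succeeded=99, total=100),
    "node-b": NodeStats(succeeded=80, total=100),
})
```

In this example `node-a` is healthy, `node-b` is below 95%, and the network as a whole (179 of 200) is below 95% too, so two alerts come back. Each returned message would then be handed to whatever delivers the email/SMS.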
What is the best way to achieve that? Some thoughts to kick off the brainstorm:
Any other approaches you have in mind?
Would love to have a productive discussion here to land on an approach that makes the most sense to all of us.
@darkdarkdragon @j0sh @ericxtang