Open rigtorp opened 5 years ago
I can take this one and also https://github.com/prometheus/node_exporter/issues/1398 so that the alerts and dashboards are consistent.
@SuperQ do we have any starting points here, or should I go ahead and send a PR with the alerts that I currently have for Prod?
Let's start out by thinking about what kind of alerts we should provide as examples. There are quite a lot of bad node alerts out there. I'd like to avoid recommending the typical "my disk is X% full" kind of alerts that are noisy or non-actionable.
Alerts that we do include should follow best practices laid out by the SRE Book, RED/USE methods, etc.
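To make the distinction concrete, here is a minimal sketch (not taken from any rule set in this thread) contrasting a static "disk is X% full" check with a trend-based one that asks whether the disk is on course to run out of space. The function names, the 10% threshold, and the 24-hour horizon are all illustrative assumptions. The local `predict_linear` helper only mimics the idea behind the PromQL function of the same name, using a plain least-squares fit extrapolated from the last sample.

```python
from typing import Sequence


def static_threshold_fires(avail_bytes: float, size_bytes: float,
                           min_free_ratio: float = 0.10) -> bool:
    """Classic 'disk is X% full' check: fires whenever free space is below a fixed ratio."""
    return avail_bytes / size_bytes < min_free_ratio


def predict_linear(timestamps: Sequence[float], values: Sequence[float],
                   horizon_s: float) -> float:
    """Fit a least-squares line through the samples and extrapolate it
    horizon_s seconds past the last sample (a simplification for illustration)."""
    n = len(timestamps)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples and matching lengths")
    mean_t = sum(timestamps) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    slope = sum((t - mean_t) * (v - mean_v)
                for t, v in zip(timestamps, values)) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (timestamps[-1] + horizon_s)


def filling_up_fires(timestamps: Sequence[float], avail_samples: Sequence[float],
                     horizon_s: float = 24 * 3600) -> bool:
    """Trend-based check: fires only if free space is predicted to drop below zero
    within the horizon (hypothetical default: 24 hours)."""
    return predict_linear(timestamps, avail_samples, horizon_s) < 0


if __name__ == "__main__":
    hours = [i * 3600.0 for i in range(6)]
    size = 100e9

    # A disk that sits at 5% free and never changes: noisy for the static check.
    stable = [5e9] * 6
    print(static_threshold_fires(stable[-1], size), filling_up_fires(hours, stable))

    # A disk at 25% free that is losing 5 GB per hour: missed by the static check.
    shrinking = [50e9 - 5e9 * i for i in range(6)]
    print(static_threshold_fires(shrinking[-1], size), filling_up_fires(hours, shrinking))
```

In this sketch the static check pages for a disk that is nearly full but stable, while staying silent for one that will be full within hours. The trend-based check does the opposite, which is closer to an actionable alert.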
I'll do a first draft tomorrow and we can then iterate on it.
I looked at the current set of alerts I have, and you make a good point: we have some 'disk space <10%' kind of alerts too (they shall be removed) ;)
PS: After discussing this with @brancz and a quick search later, I found out that @tomwilkie already has a PR here: https://github.com/prometheus/node_exporter/pull/941
I'm okay to let Tom finish the PR, and even happy to pick up the PR if he's busy with something else.
You're welcome to take a look at the rules we use. The memory pressure one is pretty useful.
+1 builtin grafana dashboard
There is already work for a monitoring mixin for node exporter: https://github.com/prometheus/node_exporter/pull/941
@beorn7 recently finished the work for the Prometheus monitoring mixin and it looks like he’s picking up the above PR judging by the last few comments.
If everybody is fine with providing the examples in the form of jsonnet mixins, we should merge all our wisdom in #941. Note, however, that this is different from providing a plain example alert rule file (as you have to install some parts and run jsonnet to create those from the mixins). Mixins are the more flexible and powerful solution, though. (Power users can do a lot with them. Naive users just run make to get the plain YAML file and take it from there.)
@beorn7 +1 to merging efforts towards #941. Does it also make sense to generate the files in an 'examples' directory for naive users to consume, or shall we leave it up to the users to clone locally and run make?
As a super naive user just getting started, it would be perhaps easier to just get some sample alerts from the 'documentation'.
We could of course check in the result of the make run, too. Let's keep that in mind to decide once the mixin PR is in a workable shape.
Ack, thanks a lot @beorn7. Do let me know if there's anything I can help with for the PR, but I'll leave it to you so that we don't duplicate efforts :)
To provide a better out-of-the-box experience, node_exporter should come with a recommended set of alert rules that provide useful alerts for common Linux system issues.
Currently, when deploying Prometheus and node_exporter, the user needs to build up their alerts from scratch. This can be challenging when someone is new to Prometheus and not yet familiar with all the capabilities of PromQL.