Provide useful alert rule samples

prometheus / node_exporter

Exporter for machine metrics

https://prometheus.io/

Apache License 2.0


Provide useful alert rule samples #1397

Open rigtorp opened 5 years ago

rigtorp commented 5 years ago

To provide a better out-of-the-box experience, node_exporter should come with a recommended set of alert rules that provide useful alerts for common Linux system issues.

Currently, when deploying Prometheus and node_exporter, the user needs to build up their alerts from scratch. This can be challenging for someone who is new to Prometheus and not yet familiar with all the capabilities of PromQL.

aditya-konarde commented 5 years ago

I can take this one and also https://github.com/prometheus/node_exporter/issues/1398 so that the alerts and dashboards are consistent.

@SuperQ do we have any starting points here, or should I go ahead and send a PR with the alerts that I currently have for Prod?

SuperQ commented 5 years ago

Let's start out by thinking about what kind of alerts we should provide as examples. There are quite a lot of bad node alerts out there. I'd like to avoid recommending the typical "my disk is X% full" kind of alerts that are noisy or non-actionable.

Alerts that we do include should follow best practices laid out by the SRE Book, RED/USE methods, etc.
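To make the distinction above concrete, here is a minimal sketch of a rule that alerts on where disk usage is heading rather than on a static "X% full" threshold. It is an illustration only, not a rule taken from this thread or from any linked PR: the group name, alert name, `fstype` filter, the 6h lookback, the 24h horizon, and the `for` duration are all assumptions to be tuned per environment. It relies on the `node_filesystem_avail_bytes` metric exported by node_exporter and on PromQL's `predict_linear` function.

```yaml
groups:
  - name: node-example-alerts            # hypothetical group name
    rules:
      - alert: NodeFilesystemFillingUp   # hypothetical alert name
        # Extrapolate the last 6h of available space linearly and fire
        # only if the filesystem is predicted to reach zero within 24h.
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24 * 60 * 60) < 0
        # Require the prediction to hold for an hour to reduce flapping.
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to run out of space within 24 hours."
```

A rule shaped like this tends to stay quiet on a large disk that sits at a stable high utilization, while still firing early on a disk that is filling quickly, which is closer to an actionable signal than a fixed percentage.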

aditya-konarde commented 5 years ago

I'll do a first draft tomorrow and we can then iterate on it.

I looked at the current set of alerts I have, and you make a good point: we have some 'disk space <10%' kind of alerts too (they shall be removed) ;)

aditya-konarde commented 5 years ago

PS: After discussing this with @brancz and doing a quick search, I found out that @tomwilkie already has a PR here: https://github.com/prometheus/node_exporter/pull/941

I'm okay with letting Tom finish the PR, and happy to pick it up if he's busy with something else.

SuperQ commented 5 years ago

You're welcome to take a look at the rules we use. The memory pressure one is pretty useful.

detailyang commented 5 years ago

+1 builtin grafana dashboard

brancz commented 5 years ago

There is already work for a monitoring mixin for node exporter: https://github.com/prometheus/node_exporter/pull/941

@beorn7 recently finished the work for the Prometheus monitoring mixin and it looks like he’s picking up the above PR judging by the last few comments.

beorn7 commented 5 years ago

If everybody is fine with providing the examples in the form of jsonnet mixins, we should merge all our wisdom in #941. Note, however, that this is different from providing a plain example alert rule file (as you have to install some parts and run jsonnet to create those from the mixins). Mixins are the more flexible and powerful solution, though. (Power users can do a lot with them. Naive users just run make to get the plain YAML file and take it from there.)

aditya-konarde commented 5 years ago

@beorn7 +1 to merging efforts towards #941. Does it also make sense to generate the files in an 'examples' directory for naive users to consume, or shall we leave it up to the users to clone locally and run make?

For a super naive user just getting started, it would perhaps be easier to just get some sample alerts from the 'documentation'.

beorn7 commented 5 years ago

We could of course check in the result of the make run, too. Let's keep that in mind and decide once the mixin PR is in a workable shape.

aditya-konarde commented 5 years ago

Ack, thanks a lot @beorn7. Do let me know if there's anything I can help with for the PR, but I'll leave it to you so that we don't duplicate efforts :)