prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Alertmanager: annotations: description with a custom calculation result, or a function to give more information on an alert. #3607

Open stratanic opened 10 months ago

stratanic commented 10 months ago

Hello,

I use Prometheus and Alertmanager, but Alertmanager feels very limited, either because of the missing templating in Alertmanager (custom webhook) or because of the design of the tools. Is there a way to put the results of 2 or 3 queries in (annotations: description:), to give more information on an alert?

Also, I don't like some of the suffixes of Prometheus's "humanize" function (for disk data I prefer "Mo" or "Mega" over "Mi" or "Gi").

Sample : description : Warning on the server 'SRV212': Only 253 Mi of free disk space remains, representing 3% of the total 10 GB disk space.

dswarbrick commented 10 months ago

Alertmanager itself cannot perform queries; however, Prometheus can perform queries within templates. See https://prometheus.io/docs/prometheus/latest/configuration/template_reference/#queries

Using that, you could write an alerting rule which included the additional values that you want in the annotation.

The humanize / humanize1024 function prefixes are internationally recognized, as defined by SI and ISO. You should probably add a "B" to your annotation for your disk alerting rule, so that it produces "MiB" rather than just the prefix alone.

stratanic commented 10 months ago

Thanks, but sometimes the space left comes out in gigabytes and sometimes in bytes. Sample: {VALUE |HUMANIZE1024}M = 5 giM....

stratanic commented 9 months ago

Where can I find a sample of "perform queries within templates"? I’ve spent several days trying to understand how templates work, but the examples are always for existing integrations: Slack, email, WeChat, etc. When it comes to "generic webhooks", I’m not sure how they function. I haven’t been able to find any samples. Can anyone help?

My goal is to send some external labels as key/value pairs to Alerta. I think everything must go in the "annotations" section of the alert rule, and the annotation has 2 values:

Sample: annotations: summary: "253 Mi represents 3% space left"

"253 Mi" and "3%" are two separate values. This type of annotation is not easy to display in Alertmanager, but very easy in other monitoring tools....

dswarbrick commented 9 months ago

Thanks, but sometimes the space left comes out in gigabytes and sometimes in bytes. Sample: {VALUE |HUMANIZE1024}M = 5 giM....

Try {value | humanize1024}B.

The humanize / humanize1024 functions are only intended to provide a prefix. It's up to you to supply the unit, e.g. {value | humanize}J for an energy measurement (Joules), or {value | humanize}Pa for pressure (Pascals).
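To illustrate the prefix-only behaviour described above, here is a minimal Python sketch that approximates what humanize1024 does. It is not the Prometheus implementation: the function name, the prefix list, and the four-significant-digit formatting are assumptions chosen to match the sample values seen in this thread.

```python
import math

# Binary prefixes, smallest first (assumed to mirror what humanize1024 emits).
BINARY_PREFIXES = ["Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi", "Yi"]


def humanize1024(value: float) -> str:
    """Approximate humanize1024: scale by 1024 and append a prefix, no unit."""
    if math.isnan(value) or math.isinf(value) or abs(value) <= 1:
        return f"{value:.4g}"
    prefix = ""
    for candidate in BINARY_PREFIXES:
        if abs(value) < 1024:
            break
        prefix = candidate
        value /= 1024
    return f"{value:.4g}{prefix}"


# The unit ("B" for bytes) is appended by the caller, as in the rule annotation.
free_bytes = 253 * 1024**2
print(humanize1024(free_bytes) + "B")
```

Because the function picks the prefix from the magnitude of the value, the same annotation yields "Mi", "Gi", or no prefix at all depending on how much space is left. Only the trailing unit is fixed.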

dswarbrick commented 9 months ago

Where can I find a sample of "perform queries within templates"?

https://prometheus.io/docs/prometheus/latest/configuration/template_examples/#display-one-value

I’ve spent several days trying to understand how templates work, but the examples are always for existing integrations: Slack, email, WeChat, etc.

Prometheus knows nothing of Slack, WeChat, or even email integration. It merely fires alerts at Alertmanager via an API. Alertmanager is where that alert is forwarded to some kind of notification provider. It sounds like you might be confusing Prometheus templating with Alertmanager notification templates.

When it comes to "generic webhooks", I’m not sure how they function. I haven’t been able to find any samples. Can anyone help?

The JSON payload format of the HTTP POST that Alertmanager sends to webhook receivers is documented in the https://prometheus.io/docs/alerting/latest/configuration/#webhook_config section.
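As a rough illustration of what a generic webhook receiver might do with that payload, the Python sketch below parses a POST body and pulls out the annotations of each alert. The function name and the sample body are hypothetical, and the sample is trimmed to a few fields. The linked documentation remains the reference for the full format.

```python
import json


def extract_annotations(body: bytes) -> list[dict]:
    """Return one dict per alert with its status, alertname, and annotations."""
    payload = json.loads(body)
    results = []
    for alert in payload.get("alerts", []):
        results.append({
            "status": alert.get("status"),
            "alertname": alert.get("labels", {}).get("alertname"),
            "annotations": alert.get("annotations", {}),
        })
    return results


# Hypothetical, trimmed-down example body; real payloads carry more fields.
sample_body = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "NodeDiskSpaceCritical", "instance": "foohost:9100"},
        "annotations": {"filesystem_available": "14.2GiB", "filesystem_size": "159.1GiB"},
    }],
}).encode()

for item in extract_annotations(sample_body):
    print(item["alertname"], item["annotations"])
```

An HTTP handler would call a function like this on the request body of each POST. Since every annotation arrives as its own key/value pair, separate values are easier to forward to another tool than one combined sentence.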

My goal is to send some external labels as key/value pairs to Alerta. I think everything must go in the "annotations" section of the alert rule, and the annotation has 2 values:

Sample: annotations: summary: "253 Mi represents 3% space left"

"253 Mi" and "3%" are two separate values. This type of annotation is not easy to display in Alertmanager, but very easy in other monitoring tools....

The format of the alert JSON payload that Prometheus sends to Alertmanager is described at https://prometheus.io/docs/alerting/latest/clients/

(this request for help really ought to have been posted in Discussions, or in the prometheus-users Google group)

dswarbrick commented 9 months ago

Below is an adaptation of a generic disk space rule that I've used in production for monitoring a few thousand hosts of various types / role.

  - alert: NodeDiskSpaceCritical
    expr: |
      node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
    labels:
      severity: critical
    annotations:
      summary: Critical disk space {{ $labels.mountpoint }} on host {{ $labels.instance }}
      description: >-
        Mountpoint {{ $labels.mountpoint }} on host {{ $labels.instance }} has {{ humanizePercentage $value }} disk space remaining.
      filesystem_size: "{{ with printf \"node_filesystem_size_bytes{instance='%s',mountpoint='%s'}\" $labels.instance $labels.mountpoint
        | query }}{{ . | first | value | humanize1024 }}B{{ end }}"
      filesystem_available: "{{ with printf \"node_filesystem_avail_bytes{instance='%s',mountpoint='%s'}\" $labels.instance $labels.mountpoint
        | query }}{{ . | first | value | humanize1024 }}B{{ end }}"

In brief, it will fire if the available filesystem space is less than 10% (regardless of absolute size). The filesystem_size and filesystem_available annotations show how you can use queries in Prometheus templates. This will result in alerts such as the following:

[
  {
    "annotations": {
      "description": "Mountpoint /srv on host foohost:9100 has 8.98% disk space remaining.",
      "filesystem_available": "14.2GiB",
      "filesystem_size": "159.1GiB",
      "summary": "Critical disk space /srv on host foohost:9100"
    },
    "endsAt": "2023-11-24T00:48:45.921Z",
    "fingerprint": "3800078d23f72e50",
    "startsAt": "2023-11-24T00:36:00.921Z",
    "status": {
      "inhibitedBy": [],
      "silencedBy": [],
      "state": "active"
    },
    "updatedAt": "2023-11-24T01:44:45.925+01:00",
    "labels": {
      "alertname": "NodeDiskSpaceCritical",
      "device": "/dev/sda2",
      "fstype": "ext4",
      "instance": "foohost:9100",
      "job": "node",
      "mountpoint": "/srv",
      "severity": "critical"
    }
  }
]
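To make the template query step more concrete: the printf call in the filesystem_size annotation only builds a PromQL selector string from the alert's labels, which is then passed to query. The Python sketch below reproduces just that string-building step, using the label values from the example alert above. The helper name is hypothetical, and Python's %s substitution stands in for the Go template printf.

```python
# Same format string as in the filesystem_size annotation of the rule.
SIZE_QUERY = "node_filesystem_size_bytes{instance='%s',mountpoint='%s'}"


def build_query(template: str, labels: dict) -> str:
    """Fill the selector with the instance and mountpoint labels of the alert."""
    return template % (labels["instance"], labels["mountpoint"])


alert_labels = {"instance": "foohost:9100", "mountpoint": "/srv"}
print(build_query(SIZE_QUERY, alert_labels))
```

The resulting selector matches a single series, which is why the template takes the first result and reads its value before humanizing it.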

I can't really say I would recommend this, and it feels like an anti-pattern to me. But I have used template queries in the past to populate labels (as opposed to annotations) in alerts.

If your goal is to convey additional information such that the human receiver can decide whether the alert is urgent (e.g. if your alert threshold is 10%, but since it's a 500 GB filesystem, it still has 50 GB free), then this already violates alerting best practices, as it leads to alert fatigue. I would recommend instead something like my original rule:

  - alert: NodeDiskSpaceCritical
    expr: |
      node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
      unless node_filesystem_avail_bytes > 15e9
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Critical disk space {{ $labels.mountpoint }} on host {{ $labels.instance }}
      description: >-
        Mountpoint {{ $labels.mountpoint }} on host {{ $labels.instance }} has
        {{ humanizePercentage $value }} disk space remaining.

This will fire if the available filesystem space is less than 10% unless there are more than 15 GB remaining.
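The combined condition can be checked by hand with a small Python sketch. It models a single filesystem only. In PromQL, `unless` is a set operation that matches series by their labels, and the `for: 5m` hold time is not modelled here. The function and parameter names are hypothetical.

```python
def should_fire(avail_bytes: float, size_bytes: float,
                ratio_threshold: float = 0.1,
                abs_threshold: float = 15e9) -> bool:
    """Mirror the rule: below 10% free, unless more than 15 GB is still available."""
    below_ratio = avail_bytes / size_bytes < ratio_threshold
    plenty_left = avail_bytes > abs_threshold
    return below_ratio and not plenty_left


# 100 GB filesystem with 8 GB free: 8% free and under 15 GB, so the alert fires.
print(should_fire(8e9, 100e9))
# 500 GB filesystem with 40 GB free: 8% free, but more than 15 GB remain.
print(should_fire(40e9, 500e9))
```

The second case is the large-filesystem situation mentioned earlier: the percentage alone would trigger the alert, and the absolute threshold suppresses it.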

To learn more advanced alerting techniques, I suggest you head to the prometheus-users Google group, https://groups.google.com/g/prometheus-users/.

stratanic commented 9 months ago

Thank you so much, you are great. I'm going to test this 😁 and share my alert for disk space 👍

stratanic commented 9 months ago

It works great 👍

Here are my rules. I made rules for Windows VMs and Linux VMs, and excluded some big disks.

- name: VirtualMachine
  rules:
    - alert: Win espace critique
      expr: bottomk(5, (vmware_vm_guest_disk_free{partition=~'(C|D|E|F|G|H):.?'}) < 2200000000 )
      for: 2m
      labels:
        severity: critical
        environment: Production
        event: "{{ $labels.vm_name | toUpper }}"
        service: "Espace Disque"
        instance: "{{ $labels.host_name }}"
      annotations:
        summary:  "{{ $labels.vm_name }}"
        description: "Critique sur le serveur: {{ $labels.vm_name | toUpper }} sur  {{ $labels.partition }} il reste  {{ $value | humanize }}B" 
        value: "{{ $value | humanize }}B"
    - alert: Windows Espace Warning
      expr: bottomk(5, (vmware_vm_guest_disk_free{partition=~'(C|D|E|F|G|H):.?'}) > 2200000000 and (vmware_vm_guest_disk_free{partition=~'(C|D|E|F|G|H):.?'}) < 3500000000 )
      for: 10m
      labels:
        severity: warning
        environment: Production
        event: "{{ $labels.vm_name | toUpper }}"
        service: "Espace Disque"
        instance: "{{ $labels.host_name }}"
      annotations:
        summary:  "{{ $labels.vm_name }}"
        description: "en Warning sur le serveur: {{ $labels.vm_name | toUpper }} sur  {{ $labels.partition }}  il reste  {{ $value | humanize }}B"
        value: "{{ $value | humanize }}B"
    - alert: Linux Espace Critique
      expr: bottomk( 6, ( vmware_vm_guest_disk_free{partition=~"^/.*", vm_name!="exp-c"} < 200000000 ) and vmware_vm_guest_disk_capacity > 370000000 )
      for: 10m
      labels:
        severity: critical
        environment: Production
        event: "{{ $labels.vm_name | toUpper }}"
        service: "Espace Disque"
        instance: "{{ $labels.host_name }}"
      annotations:
        summary:  "{{ $labels.vm_name }}"
        description: "Critique sur le serveur: {{ $labels.vm_name | toUpper }} sur  {{ $labels.partition }}  il reste  {{ $value | humanize }}B soit : {{ with printf \"vmware_vm_guest_disk_free{vm_name='%s',partition='%s'} /\ vmware_vm_guest_disk_capacity{vm_name='%s',partition='%s'}\" $labels.vm_name $labels.partition $labels.vm_name $labels.partition | query }}{{ . | first | value | humanizePercentage }}{{ end }} Taille total : {{ with printf \"vmware_vm_guest_disk_capacity{vm_name='%s',partition='%s'}\" $labels.vm_name $labels.partition | query }}{{ . | first | value | humanize1024 }}B{{ end }}"
        value: "{{ $value | humanize }}B"

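Sample output description (in French, as configured in the rule above):

description Critique sur le serveur: xxdxdxdxREC sur /xdxdxd/intra/rdbms il reste 76.12Mi soit : 0.38% Taille total : 19.56GiB

In English: "Critical on server xxdxdxdxREC on /xdxdxd/intra/rdbms, 76.12Mi remaining, i.e. 0.38%. Total size: 19.56GiB"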

stratanic commented 9 months ago

Another question: the description is very long now....

How can I shorten the rule with reusable templates? https://prometheus.io/docs/prometheus/latest/configuration/template_examples/#defining-reusable-templates

I create a template file and declare it in my Alertmanager configuration (.../alertmanager.yml), right? (Prometheus templating)

templates:

I always come back to this sample on the official Prometheus website, but it doesn't show exactly how to do it...

dswarbrick commented 7 months ago

@RichiH May I quietly suggest that you convert this issue to a discussion?