prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Alert grouping test with amtool #3003

Open freeseacher opened 2 years ago

freeseacher commented 2 years ago

What did you do? We are actively using amtool config routes test and find it extremely useful, but we recently realized that we should also check that alert grouping is as expected. For example, we currently check that

% amtool config routes test --config.file alertmanager.yaml --tree \
--verify.receivers wire-team-opsgenie 'team=wire'
Matching routes:
.
└── default-route
    └── {team=~"^(?:^(wire)$)$"}  receiver: wire-team-opsgenie
wire-team-opsgenie

It would be useful if we could pass something like

% amtool config routes test --config.file alertmanager.yaml \
--tree --verify.receivers wire-team-opsgenie \
--verify.grouping=env,cluster,priority 'team=wire'

Matching routes:
.
└── default-route
    └── {team=~"^(?:^(wire)$)$"}  receiver: wire-team-opsgenie
wire-team-opsgenie, grouping: [env,cluster,priority]

gotjosh commented 2 years ago

I'm not sure I follow the usefulness of this. In your example where you include the grouping, what changed?

freeseacher commented 2 years ago

The main reason for it is routing with custom subroutes. For example, I have something like

- receiver: wire-team-opsgenie
  group_by:
    - env
    - cluster
    - priority
  match_re:
    team: ^(wire)$
  routes:
    - receiver: wire-team-opsgenie
      group_by:
        - alertname
        - cve
        - cluster
      match:
        alert_topic: security
    - receiver: wire-team-opsgenie
      group_by:
        - alertname
        - service
        - project
        - team
      match:
        alertname: QuotaCanBeReached

You can see that each alert is sent to the same receiver but with different grouping. After Opsgenie, we create a Jira issue, and the alert grouping is the key for telling whether we have already had the same incident. So instead of opening a new Jira issue, we can append to the existing one. That is why it's crucial to check that the grouping is correct when changing Alertmanager configs.
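To make the "same receiver, different grouping" point concrete, here is a minimal Python sketch that walks the route above and reports which group_by applies to a given label set. It is an assumption-laden simplification, not Alertmanager's code: the route is hand-copied into a dict, the helper names (`matches`, `resolve`) are hypothetical, only `match`, `match_re`, `continue`, and inheritance of `receiver`/`group_by` are modelled, and a missing label is treated as an empty string.

```python
import re

# The route from the comment above, hand-copied into plain dicts.
ROUTE = {
    "receiver": "wire-team-opsgenie",
    "group_by": ["env", "cluster", "priority"],
    "match_re": {"team": "^(wire)$"},
    "routes": [
        {"receiver": "wire-team-opsgenie",
         "group_by": ["alertname", "cve", "cluster"],
         "match": {"alert_topic": "security"}},
        {"receiver": "wire-team-opsgenie",
         "group_by": ["alertname", "service", "project", "team"],
         "match": {"alertname": "QuotaCanBeReached"}},
    ],
}


def matches(route, labels):
    """True if every equality and regex matcher of the route is satisfied."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        # Regexes are treated as fully anchored in this sketch.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def resolve(route, labels, parent=None):
    """Return (receiver, group_by) for the deepest matching route(s)."""
    if not matches(route, labels):
        return []
    # Options not set on this route are inherited from the parent.
    opts = dict(parent or {})
    for key in ("receiver", "group_by"):
        if key in route:
            opts[key] = route[key]
    results = []
    for child in route.get("routes", []):
        child_results = resolve(child, labels, opts)
        results.extend(child_results)
        # Stop at the first matching child unless it sets continue: true.
        if child_results and not child.get("continue", False):
            break
    if not results:
        results.append((opts.get("receiver"), opts.get("group_by", [])))
    return results


for labels in ({"team": "wire"},
               {"team": "wire", "alert_topic": "security"},
               {"team": "wire", "alertname": "QuotaCanBeReached"}):
    print(labels, "->", resolve(ROUTE, labels))
```

All three label sets resolve to the same receiver, so a receiver-only check cannot tell them apart; only the group_by differs.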

I propose two things:

  1. Show the receiver grouping when displaying the routing tree, maybe here: https://github.com/prometheus/alertmanager/blob/main/cli/routing.go#L89
    {team=~"^(?:^(wire)$)$"}  receiver: wire-team-opsgenie
    wire-team-opsgenie, *grouping: [env,cluster,priority]*
  2. Add a new key, verify.grouping, that checks whether the receiver got the expected grouping. Maybe something like --verify.grouping[0]=[alertname,cve,cluster] will do the trick.
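As a rough illustration of what such a check could do, here is a small Python sketch. Everything in it is hypothetical: `--verify.grouping` does not exist in amtool today, the function names are made up for this example, and the comparison assumes the order of group_by labels does not matter.

```python
def parse_grouping(value):
    """Parse 'a,b,c' or '[a,b,c]' into a set of label names."""
    inner = value.strip().strip("[]")
    return {part.strip() for part in inner.split(",") if part.strip()}


def verify_grouping(expected, actual_group_by):
    """Compare the expected labels with a route's group_by, ignoring order."""
    want = parse_grouping(expected)
    got = set(actual_group_by)
    if want == got:
        return True, "grouping ok"
    missing = sorted(want - got)
    unexpected = sorted(got - want)
    return False, f"missing: {missing}, unexpected: {unexpected}"


# The security subroute groups by alertname, cve and cluster.
print(verify_grouping("[alertname,cve,cluster]", ["alertname", "cve", "cluster"]))
# The same route checked against the top-level grouping fails.
print(verify_grouping("env,cluster,priority", ["alertname", "cve", "cluster"]))
```

Reporting both missing and unexpected labels, rather than a bare pass/fail, would make a failed config check easier to act on.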