
Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

AlertManager not sending all alerts to Webhook endpoint. #2404

Open andrewipmtl opened 4 years ago

andrewipmtl commented 4 years ago

Hi,

I'm using a webhook receiver for AlertManager to store alerts for pagination etc. For the most part, the webhook seems to be working just fine, but for some alerts, the webhook doesn't seem to receive a POST call at all from AlertManager.

Is there any way to troubleshoot this? For example, a way to trace alertmanager's outgoing HTTP calls to the webhook receiver?

The webhook endpoint is a Rails application server which also logs all incoming traffic, and after investigating, the missing alerts never show up in the logs (a POST request is never received).

I've attached a partial configuration, omitting redundant receivers etc. They're almost all the same.

Thanks,

Environment


global:
  resolve_timeout: 5m
  http_config: {}
  smtp_from: no-reply@testsite.com
  smtp_hello: localhost
  smtp_smarthost: smtp.office365.com:587
  smtp_auth_username: no-reply@testsite.com
  smtp_auth_password: <secret>
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  routes:
  - receiver: Test Presence Offline Notification Name
    match_re:
      alertname: ^(Test Presence Offline Alert Name)$
      group_id: 460599d4-3c4a-4311-a7d6-bdce6058672a
      tenant_name: ^(vle)$
    continue: true
    repeat_interval: 10y

  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
receivers:
- name: device-alerts.hook
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://127.0.0.1/v1/webhook
    max_alerts: 0
- name: Test Presence Offline Notification Name
  email_configs:
  - send_resolved: false
    to: testuser@testsite.com
    from: no-reply@testsite.com
    hello: localhost
    smarthost: smtp.office365.com:587
    auth_username: no-reply@testsite.com
    auth_password: <secret>
    headers:
      From: no-reply@testsite.com
      Smtp_from: no-reply@testsite.com
      Subject: 'Alert: {{ range .Alerts }}{{ .Labels.device_name }}{{ end }} | {{ range .Alerts }}{{ .Annotations.description }}{{ end }} | {{ range .Alerts }}{{ .Labels.uid }}{{ end }}'
      To: Test.dooling@testsite.onmicrosoft.com
      X-SES-CONFIGURATION-SET: ses-kibana
    html: '{{ template "email.default.html" . }}'
    text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}Rule: {{ range .Alerts }}{{ .Labels.alertname }}{{ end }}Group: {{ range .Alerts }}{{ .Labels.group_name }}{{ end }}Device Name: {{ range .Alerts }}{{ .Labels.device_name }}{{ end }}Serial Number: {{ range .Alerts }}{{ .Labels.uid }}{{ end }}'
    require_tls: true
templates:
- /etc/alertmanager/templates/default.tmpl
simonpasquier commented 4 years ago

Your best bet is to turn on debug logs (--log.level=debug). How do you know for sure that notifications are missing?
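
If Alertmanager runs as a container, the flag can be added to the container command. A minimal docker-compose sketch (the image tag, port and volume paths here are assumptions; adjust them to your deployment):

services:
  alertmanager:
    image: quay.io/prometheus/alertmanager:v0.21.0
    command:
    - --config.file=/etc/alertmanager/alertmanager.yml
    - --log.level=debug          # turn on debug logging
    ports:
    - "9093:9093"
    volumes:
    - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro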

andrewipmtl commented 4 years ago

@simonpasquier We can see the alerts in Prometheus as well as in Alertmanager, so the alerts fire properly. On our webhook application side we've logged everything, and we've noticed that not every alert that fires in Alertmanager makes its way to our webhook endpoint. We can see the POST requests from Alertmanager to our webhook for some of the alerts, but others are completely missing.

andrewipmtl commented 4 years ago

Honestly, the only reason we're using the webhook in the first place is that Alertmanager doesn't support pagination when querying for alerts/groups. So we're using the webhook to receive all alerts/resolutions and storing them ourselves so we can paginate them manually. Our applications and metrics can generate tens of thousands of alerts, which sometimes causes requests to Alertmanager to time out when the payloads are too large.

andrewipmtl commented 4 years ago

Your best bet is to turn on debug logs (--log.level=debug). How do you know for sure that notifications are missing?

@simonpasquier I've run Alertmanager with debug logging, and I can confirm that alerts are received by Alertmanager but not sent to the webhook; the email integration does get sent, though.

Shouldn't all alerts route to the default route (which is set as the webhook)?


level=debug ts=2020-10-30T18:03:04.334Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{alertname=~\"^(?:^(Jesse Logo Showing Alert Name)$)$\",group_id=~\"^(?:2223-4343-34333)$\",tenant_name=~\"^(?:^(test)$)$\"}:{alertname=\"Jesse Logo Showing Alert Name\", group_id=\"2223-4343-34333\", rule_stack=\"dev\", rule_tenant=\"test\", stack_name=\"dev\", tenant_id=\"1\", tenant_name=\"test\", uid=\"TEST-UNIT-001\"}" msg=flushing alerts="[Jesse Logo Showing Alert Name[d34154b][active]]"
level=debug ts=2020-10-30T18:03:05.592Z caller=notify.go:685 component=dispatcher receiver="IP Show Logo Alert Notif Name" integration=email[0] msg="Notify success" attempts=1
level=debug ts=2020-10-30T18:04:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
level=debug ts=2020-10-30T18:06:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
level=debug ts=2020-10-30T18:08:04.334Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{alertname=~\"^(?:^(Jesse Logo Showing Alert Name)$)$\",group_id=~\"^(?:2223-4343-34333)$\",tenant_name=~\"^(?:^(test)$)$\"}:{alertname=\"Jesse Logo Showing Alert Name\", group_id=\"2223-4343-34333\", rule_stack=\"dev\", rule_tenant=\"test\", stack_name=\"dev\", tenant_id=\"1\", tenant_name=\"test\", uid=\"TEST-UNIT-001\"}" msg=flushing alerts="[Jesse Logo Showing Alert Name[d34154b][active]]"
level=debug ts=2020-10-30T18:08:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
mvineza commented 4 years ago

I'm encountering a similar issue with the same Alertmanager version (0.21), but our Prometheus is on v2.19. In our case, some POST requests are missing even though there are active alerts.

Another issue is that some POST requests seem to have missing information.

For example, suppose there is an active alert group containing 5 nodes.

We will receive 2 POST requests. The first one is incomplete because it is missing nodes.

{
    "alerts": [
        {
        ...
                "instance": "node1.demo.com:9100",
                ...
                "instance": "node2.demo.com:9100",
                ...
                "instance": "node5.demo.com:9100",
                ...
            "status": "firing"
        }
    ],
    ...
}

And the second POST request is the complete one, with all 5 nodes.

{
    "alerts": [
        {
        ...
                "instance": "node1.demo.com:9100",
                ...
                "instance": "node2.demo.com:9100",
                ...
                "instance": "node3.demo.com:9100",
                ...
                "instance": "node4.demo.com:9100",
                ...
                "instance": "node5.demo.com:9100",
                ...
            "status": "firing"
        }
    ],
    ...
}
simonpasquier commented 4 years ago

@andrewipmtl

Shouldn't all alerts route to the default route (which is set as the webhook)?

No, alerts that match the Test Presence Offline Notification Name sub-route won't fall back to the top-level route's receiver; the default receiver is only used when no sub-route matches.
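
One way to make sure the webhook still receives those alerts is to declare it as the first sub-route with a match-all matcher and continue: true, so every alert is delivered to the webhook and then keeps matching the more specific sibling routes. A sketch based on the configuration above (receiver definitions unchanged, timing options omitted):

route:
  receiver: device-alerts.hook            # default receiver, used when no sub-route matches
  routes:
  - receiver: device-alerts.hook          # catch-all: every alert is sent to the webhook first...
    match_re:
      alertname: .*
    continue: true                        # ...and then continues to the more specific routes below
  - receiver: Test Presence Offline Notification Name
    match_re:
      alertname: ^(Test Presence Offline Alert Name)$
    continue: true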

simonpasquier commented 4 years ago

@mvineza this seems a different problem.

For example, if there is an active alert containing 5 nodes grouped together.

We will receive 2 POST request. The first one is incomplete because it has missing nodes.

In that case you have 5 alerts, and it may be that Prometheus doesn't send them all to Alertmanager at the same time.

andrewipmtl commented 4 years ago

@andrewipmtl

Shouldn't all alerts route to the default route (which is set as the webhook)?

No, alerts that match the Test Presence Offline Notification Name sub-route won't fall back to the top-level route's receiver; the default receiver is only used when no sub-route matches.

Even though it has the continue flag set to 'true'? Is there any way to make all alerts hit the webhook no matter what?

We have a system where we want to store the alerts so that we can paginate them (webhook) but also only send notifications out for specific ones. Even if we configure an email notification for one of the alerts, we still want it to hit the webhook.

linkingli commented 4 years ago

Has this problem been solved? I have encountered the same problem: when grouping alarms, the webhook loses some of them. My configuration is as follows:

Image:         quay.io/prometheus/alertmanager:v0.21.0

route:
  receiver: webhook
  group_by:
  - alertname
  routes:
  - receiver: webhook
    continue: true
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 4h
receivers:
- name: webhook
  webhook_configs:
  - send_resolved: true
    url: http://os-alertmgt-svc.prometheus-monitoring.svc:3000/api/v1/alert/webhook
templates:
- /etc/alertmanager/config/email.tmpl

In the Alertmanager page I can see the following alerts, but after they pass through the webhook I hardly ever see the complete set:

alertname="aa"
4 alerts
alertname="we"
114 alerts
alertname="wewqd"
171 alerts
andrewipmtl commented 3 years ago

I've configured the webhook as another route on top of being the default route, and I'm still seeing some alerts not being sent through to the webhook.

simonpasquier commented 3 years ago

@andrewipmtl can you share the new config?

0rest commented 3 years ago

Hello! I'm running Prometheus 2.22.0 and Alertmanager v0.16.2 for OpenShift Platform monitoring and am also observing that some messages are not sent to the webhook endpoint. I use only one default route for all messages in Alertmanager. Alertmanager runs in debug mode so I can easily follow all events, and at the webhook endpoint I log everything that arrives from Alertmanager. Here are my findings:

andrewipmtl commented 3 years ago

@andrewipmtl can you share the new config?

global:
  resolve_timeout: 5m
  http_config: {}
  smtp_from: no-reply@mywebsite.com
  smtp_hello: localhost
  smtp_smarthost: smtp.office365.com:587
  smtp_auth_username: no-reply@videri.com
  smtp_auth_password: <secret>
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  routes:
  - receiver: device-alerts.hook
    match_re:
      alertname: .*
    continue: true
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
receivers:
- name: device-alerts.hook
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://127.0.0.1/v1/webhook
    max_alerts: 0
templates:
- /etc/alertmanager/templates/default.tmpl
simonpasquier commented 3 years ago

@andrewipmtl hmm not sure why you configured a subroute. AFAICT this would work the same?

route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
andrewipmtl commented 3 years ago

@andrewipmtl hmm not sure why you configured a subroute. AFAICT this would work the same?

route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m

I've tried it without the subroute as well, and the webhook still doesn't receive all the alerts; some still go missing.

simonpasquier commented 3 years ago

Ok not sure why this happens but the only thing I can recommend is to run with --log.level=debug and investigate what happens when no notification is sent while you expect some.

andrewipmtl commented 3 years ago

Ok not sure why this happens but the only thing I can recommend is to run with --log.level=debug and investigate what happens when no notification is sent while you expect some.

The exact same thing happens as when I tested it in an earlier debug session: https://github.com/prometheus/alertmanager/issues/2404#issuecomment-719715603

Alerts show up, but aren't sent to the webhook endpoint.

rodrigoscferraz commented 3 years ago

same with me

shishirkh commented 2 years ago

Facing same.

rmartinsjr commented 2 years ago

Forgive me if I misunderstood your initial question, but I think y'all didn't get the point.

The default receiver of a route node (including the top-level node) is only used if your alarm didn't match any of the matchers declared at that level of the routing tree. Your alarms enter the routing tree from the top and traverse it downwards until they match some matcher, and then that node's receiver receives the alarm.

If you set "continue: true", the alarm will continue matching the siblings, meaning that it will try to match another matcher at the same level.

Therefore, if you want your webhook to receive all the alarms, it must be declared, in combination with "continue: true", at every level that your alarms can match.

Use amtool to test your routes, as described in prometheus/alertmanager

andrewipmtl commented 2 years ago

@rmartinsjr I'm not sure what you mean by sibling routes; all the routes that should alert are at the same level, including the one for the webhook, and all of them have continue: true defined, yet I'm still seeing this behavior.

It's also intermittent: some alerts go through and many do not. There's no pattern either; it's not always the same alerts that make it through to the webhook.

rmartinsjr commented 2 years ago

@andrewipmtl, reviewing all posted configurations, I believe you're using the simpler one that simonpasquier posted... With that supposition, are you sure it isn't the group_by that is grouping multiple alerts into one?

andrewipmtl commented 2 years ago

@rmartinsjr, yes I'm sure. The example I posted is a simplified version for demonstration. The actual version has many more alerts set up, all with continue: true defined as well. We have dozens of alerts configured in this manner, and all of them have different naming and firing criteria.

rmartinsjr commented 2 years ago

I've never seen anything like that... Have you tried the routing tree visual tool? https://www.prometheus.io/webtools/alerting/routing-tree-editor/

andrewipmtl commented 2 years ago

@rmartinsjr I had never used it before, but after trying it for the first time just now, the generated "tree" map shows every single alert branching from a single node, which is device-alerts.hook. So unless I'm wrong, every single alert should be hitting the webhook.

luislhl commented 2 years ago

In case this helps anyone, I was running AlertManager through prometheus-operator, and I experienced the exact same problem.

In my case the cause was that Alertmanager was only matching alerts that contained the right namespace label. There is an issue about that: https://github.com/prometheus-operator/prometheus-operator/issues/3737

andrewipmtl commented 2 years ago

In case this helps anyone, I was running AlertManager through prometheus-operator, and I experienced the exact same problem.

In my case the cause was that Alertmanager was only matching alerts that contained the right namespace label. There is an issue about that: prometheus-operator/prometheus-operator#3737

@luislhl, what exactly do you mean by namespace? I have no namespaces defined in my config file; is that the issue? I wasn't aware of any namespace matching happening if none were provided.

luislhl commented 2 years ago

@luislhl, what exactly do you mean by namespace? I have no namespaces defined in my config file; is that the issue? I wasn't aware of any namespace matching happening if none were provided.

Hey, @andrewipmtl

By namespace I mean a Kubernetes namespace; my bad for not making that clearer.

I have deployed Alertmanager in a Kubernetes cluster by using the Prometheus Operator.

The final Alertmanager config I get has this matcher to select only alerts containing a namespace label with value kube-prometheus:

global:
  resolve_timeout: 5m
route:
  receiver: "null"
  group_by:
  - job
  routes:
  - receiver: kube-prometheus-slack-alerts-slack-alerts-warning
    group_by:
    - alertname
    matchers:
    - namespace="kube-prometheus"
[...]

I had some alerts from other namespaces that were ignored because of this matcher. The issue I linked in my previous comment has more info about this behavior.
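
If you manage the Alertmanager configuration directly rather than through operator-generated AlertmanagerConfig objects, a hypothetical workaround is to widen that matcher so alerts from other namespaces are not silently dropped, for example:

route:
  receiver: "null"
  routes:
  - receiver: kube-prometheus-slack-alerts-slack-alerts-warning
    group_by:
    - alertname
    matchers:
    - namespace=~"kube-prometheus|app-namespace"   # hypothetical: also accept alerts from app-namespace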

bsozin commented 2 years ago

We have a similar issue: some alerts are not posted to the webhook. I have a feeling this is because the alert is resolved within the group_wait interval. For example, group_wait is set to 30s and the alert lasts just 20s. Is that possible?

P.S. Alertmanager v0.21.0, send_resolved not specified (supposed to be true by default).
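
One way to test that hypothesis is to lower group_wait on the affected route and make send_resolved explicit; a minimal sketch reusing the webhook receiver from earlier in the thread:

route:
  receiver: device-alerts.hook
  group_wait: 5s           # flush new groups quickly so short-lived alerts still get a firing notification
  group_interval: 1m
  repeat_interval: 30m
receivers:
- name: device-alerts.hook
  webhook_configs:
  - url: http://127.0.0.1/v1/webhook
    send_resolved: true    # explicit here, although true is the default for webhook_configs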

mhf-ir commented 1 year ago

Same problem: Alertmanager and Prometheus show the alert, but the data is not sent to the webhook:

ts=2023-01-08T12:31:28.676Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=InstanceDown[c136526][active]
ts=2023-01-08T12:31:38.677Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}/{}:{job=\"node\"}" msg=flushing alerts=[InstanceDown[c136526][active]]
global:
receivers:
  - name: "n8n"
    webhook_configs:
      - url: https://sample.tld/webhook/alertmanager
        send_resolved: true
        http_config:
          basic_auth:
            username: alertmanager
            password: securePassword
          tls_config:
            insecure_skip_verify: true
route:
  receiver: n8n
  group_by: ['job']
  group_wait: 10s
  group_interval: 4m
  repeat_interval: 2h
  routes:
  - receiver: n8n
    continue: true
alonperel commented 1 year ago

@mhf-ir Did you find a solution for this? I'm facing the same issue now on prometheus/alertmanager:v0.24.0.

andrewipmtl commented 1 year ago

@mhf-ir Did you find a solution for this? I'm facing the same issue now on prometheus/alertmanager:v0.24.0.

Still no solutions provided.

alonperel commented 1 year ago

Still no solutions provided.

Thanks @andrewipmtl. I'm mainly having issues sending the alert through AWS SNS.

hkapasi commented 1 year ago

I am facing the same issue while sending Slack messages via webhook, and with email as well. @andrewipmtl, did you get past this issue? I don't see any solution or further suggestions in the thread.

grobinson-grafana commented 1 year ago

@andrewipmtl I see group_interval is 5m; are you certain that all alerts fire for at least 5 minutes? Otherwise some alerts might get resolved before a notification is sent.

sahaniarungod commented 1 year ago

The issue with the webhook is resolved on our side; the Rocket.Chat .js file had to be updated.

reddy2018 commented 1 year ago

Thanks @andrewipmtl. I'm mainly having issues sending the alert through AWS SNS.

Has anyone found a solution to integrate Prometheus with AWS SNS?

@alonperel

oll-aroundarmor commented 1 year ago

The issue with the webhook is resolved on our side; the Rocket.Chat .js file had to be updated.

@sahaniarungod can you please elaborate?

rsuplina commented 9 months ago

Is this still an issue? I'm using version 0.26.0. I'm seeing alerts in Prometheus and Alertmanager, but in Slack I sometimes get an almost empty alert and sometimes the values are populated.

oll-aroundarmor commented 9 months ago

@rsuplina can you try using the group_by parameter to group your alerts?

group_by:
- severity
- alertname
- team
- clusterName
- container
- exported_container
rsuplina commented 9 months ago

Hi @aroundarmor-cldcvr, thanks for your reply. I have tried changing to:

[Screenshot 2024-01-31 at 09 33 30: updated group_by configuration]

But what still happens is that some alarms arrive with full data, and then, if I change to some other instance that also doesn't exist and restart the Docker services (Prometheus and Alertmanager), it again starts sending notifications that are missing values.

Here is a full description of my error; I wonder if this is a bug in Alertmanager:

https://stackoverflow.com/questions/77908650/prometheus-alertmanager-slack-messages-not-fully-showing

sonvuthai commented 5 months ago

Facing the same issue today. I'm seeing all the alerts in every channel except the webhook, where some of them are missing.

xDarekM commented 5 months ago

Hi everyone, I have exactly the same case. Due to the characteristics of the monitoring system, I cannot send one aggregated alarm and have to send them as individual webhooks. With a large number of alarms, say 50, maybe 7 are delivered; checking the logs on the recipient side shows that the requests simply never arrive. Have any of you observed this problem in version v0.27? Perhaps the solution is to send the alerts to some queue like AWS SNS?

grobinson-grafana commented 5 months ago

@xDarekM I would suggest running Alertmanager with debug logging (--log.level=debug) to see what is happening to those additional webhooks. You should look for lines such as "Notify attempt failed" and also check the notifications_failed_total metric. If there are no failed notification attempts, then it's possible the number of notifications being sent doesn't match your expectations because of the group_wait, group_interval and repeat_interval timers.
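
The metric Alertmanager exposes for this is alertmanager_notifications_failed_total, labelled by integration, so a Prometheus alerting-rule sketch for catching failed notification attempts could look like the following (rule name, threshold and severity are placeholders):

groups:
- name: alertmanager-meta
  rules:
  - alert: AlertmanagerNotificationsFailing
    expr: sum by (integration) (rate(alertmanager_notifications_failed_total[5m])) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      description: 'Alertmanager is failing to send notifications via the {{ $labels.integration }} integration.'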