andrewipmtl opened this issue 4 years ago
Your best bet is to turn on debug logs (--log.level=debug). How do you know for sure that notifications are missing?
@simonpasquier We can see the alerts in Prometheus as well as in Alertmanager, so the alerts fire properly. On our webhook application side, we've logged everything, and we notice that not every alert that fires in Alertmanager makes its way to our webhook endpoint. We can see the POST requests from Alertmanager to our webhook for some of the alerts, but others are completely missing.
Honestly, the only reason we're using the webhook in the first place is because alertmanager doesn't support pagination when querying for alerts/groups. So we're using the webhook to receive all alerts/resolutions and storing them ourselves so we can manually paginate them. Our applications and metrics can generate tens of thousands of alerts which causes requests to alertmanager to sometimes timeout when the payloads are too large.
Your best bet is to turn on debug logs (--log.level=debug). How do you know for sure that notifications are missing?
@simonpasquier I've run debug logs on alertmanager, and can confirm that alerts are received by alertmanager, but not sent to the webhook; email integration does get sent though.
Shouldn't all alerts route to the default route (which is set as the webhook)?
```
level=debug ts=2020-10-30T18:03:04.334Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{alertname=~\"^(?:^(Jesse Logo Showing Alert Name)$)$\",group_id=~\"^(?:2223-4343-34333)$\",tenant_name=~\"^(?:^(test)$)$\"}:{alertname=\"Jesse Logo Showing Alert Name\", group_id=\"2223-4343-34333\", rule_stack=\"dev\", rule_tenant=\"test\", stack_name=\"dev\", tenant_id=\"1\", tenant_name=\"test\", uid=\"TEST-UNIT-001\"}" msg=flushing alerts="[Jesse Logo Showing Alert Name[d34154b][active]]"
level=debug ts=2020-10-30T18:03:05.592Z caller=notify.go:685 component=dispatcher receiver="IP Show Logo Alert Notif Name" integration=email[0] msg="Notify success" attempts=1
level=debug ts=2020-10-30T18:04:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
level=debug ts=2020-10-30T18:06:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
level=debug ts=2020-10-30T18:08:04.334Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{alertname=~\"^(?:^(Jesse Logo Showing Alert Name)$)$\",group_id=~\"^(?:2223-4343-34333)$\",tenant_name=~\"^(?:^(test)$)$\"}:{alertname=\"Jesse Logo Showing Alert Name\", group_id=\"2223-4343-34333\", rule_stack=\"dev\", rule_tenant=\"test\", stack_name=\"dev\", tenant_id=\"1\", tenant_name=\"test\", uid=\"TEST-UNIT-001\"}" msg=flushing alerts="[Jesse Logo Showing Alert Name[d34154b][active]]"
level=debug ts=2020-10-30T18:08:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
```
I encountered a similar issue using the same Alertmanager version (0.21), but our Prometheus is on v2.19. In our case, some POST requests are missing even though there are active alerts.
Another issue is that some POST requests seem to have missing information.
For example, if there is an active alert group containing 5 nodes grouped together, we will receive 2 POST requests. The first one is incomplete because it has missing nodes.
```
{
  "alerts": [
    {
      ...
      "instance": "node1.demo.com:9100",
      ...
      "instance": "node2.demo.com:9100",
      ...
      "instance": "node5.demo.com:9100",
      ...
      "status": "firing"
    }
  ],
  ...
}
```
And the second POST request is the complete one, with 5 nodes.
```
{
  "alerts": [
    {
      ...
      "instance": "node1.demo.com:9100",
      ...
      "instance": "node2.demo.com:9100",
      ...
      "instance": "node3.demo.com:9100",
      ...
      "instance": "node4.demo.com:9100",
      ...
      "instance": "node5.demo.com:9100",
      ...
      "status": "firing"
    }
  ],
  ...
}
```
@andrewipmtl
Shouldn't all alerts route to the default route (which is set as the webhook)?
no, alerts that will match the Test Presence Offline Notification Name receiver won't go through the top-level route.
@mvineza this seems like a different problem.
For example, if there is an active alert group containing 5 nodes grouped together, we will receive 2 POST requests. The first one is incomplete because it has missing nodes.
You have 5 alerts then, and it may be that they are not sent at the same time by Prometheus.
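The two POSTs are most likely just the normal grouping behaviour: the first flush happens group_wait after the first alert of the group arrives and contains only the alerts known at that point, and a follow-up notification with the full group goes out group_interval later once the remaining alerts have arrived. Roughly (illustrative values and receiver name):
```
route:
  receiver: webhook    # illustrative receiver name
  group_by:
  - alertname
  group_wait: 30s      # first POST: whatever alerts of the group arrived within 30s
  group_interval: 5m   # follow-up POST once the group changes (e.g. the late alerts arrive)
  repeat_interval: 4h  # re-send an unchanged group after this long
```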
@andrewipmtl
Shouldn't all alerts route to the default route (which is set as the webhook)?
no, alerts that will match the Test Presence Offline Notification Name receiver won't go through the top-level route.
Even though it has the continue flag set to 'true'? Is there any way to make all alerts hit the webhook no matter what?
We have a system where we want to store the alerts so that we can paginate them (webhook) but also only send notifications out for specific ones. Even if we configure an email notification for one of the alerts, we still want it to hit the webhook.
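Roughly what we're after is something like this (a sketch only; receiver and alert names are made up): the webhook catch-all sits first in the routing tree with continue: true so every alert is posted to it, and the notification-specific routes come after it as siblings:
```
route:
  receiver: device-alerts.hook          # default / catch-all receiver (the webhook)
  routes:
  # every alert matches this route (it has no matchers), gets posted to the
  # webhook, and continue: true lets evaluation move on to the siblings below
  - receiver: device-alerts.hook
    continue: true
  # notification-specific routes follow, e.g. an email for one particular alert
  - receiver: some-email-receiver       # illustrative name
    match:
      alertname: Some Specific Alert    # illustrative matcher
```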
Has this problem been solved? I have encountered the same problem: when grouping alerts, the webhook loses part of the alerts. My configuration is as follows:
Image: quay.io/prometheus/alertmanager:v0.21.0
```
route:
  receiver: webhook
  group_by:
  - alertname
  routes:
  - receiver: webhook
    continue: true
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 4h
receivers:
- name: webhook
  webhook_configs:
  - send_resolved: true
    url: http://os-alertmgt-svc.prometheus-monitoring.svc:3000/api/v1/alert/webhook
templates:
- /etc/alertmanager/config/email.tmpl
```
In the Alertmanager page, I saw the following alerts. After passing through the webhook, I could hardly see the complete set of alerts:
alertname="aa"
4 alerts
alertname="we"
114 alerts
alertname="wewqd"
171 alerts
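I wonder if grouping on more labels would help here, so that each webhook POST carries far fewer alerts than 114 or 171 (a sketch only; the extra label name is illustrative):
```
route:
  receiver: webhook
  group_by:
  - alertname
  - instance      # illustrative: splits each alertname group per instance
```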
I've configured the webhook as another route on top of being the default route, and I'm still seeing some alerts not being sent through to the webhook.
@andrewipmtl can you share the new config?
Hello! I'm running Prometheus 2.22.0 and Alertmanager v0.16.2 for OpenShift platform monitoring and am also observing some messages not being sent to the webhook endpoint. I use only one default route for all messages in Alertmanager. Alertmanager runs in debug mode so I can easily follow all events. At the webhook endpoint level I log all events from Alertmanager. Here are my findings:
@andrewipmtl can you share the new config?
```
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_from: no-reply@mywebsite.com
  smtp_hello: localhost
  smtp_smarthost: smtp.office365.com:587
  smtp_auth_username: no-reply@videri.com
  smtp_auth_password: <secret>
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  routes:
  - receiver: device-alerts.hook
    match_re:
      alertname: .*
    continue: true
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 30m
receivers:
- name: device-alerts.hook
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://127.0.0.1/v1/webhook
    max_alerts: 0
templates:
- /etc/alertmanager/templates/default.tmpl
```
@andrewipmtl hmm not sure why you configured a subroute. AFAICT this would work the same?
```
route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
```
@andrewipmtl hmm not sure why you configured a subroute. AFAICT this would work the same?
I've tried it without the subroute as well, and it still doesn't receive all the alerts via the webhook; some still go missing.
Ok not sure why this happens but the only thing I can recommend is to run with --log.level=debug and investigate what happens when no notification is sent while you expect some.
Ok not sure why this happens but the only thing I can recommend is to run with --log.level=debug and investigate what happens when no notification is sent while you expect some.
The exact same thing happens as when I tested it in an earlier debug session: https://github.com/prometheus/alertmanager/issues/2404#issuecomment-719715603
Alerts show up, but aren't sent to the webhook endpoint.
same with me
Facing the same issue.
Forgive me if I misunderstood your initial question, but I think y'all are missing the point.
The default receiver of a route node (including the top-level node) is only used if your alert didn't match any of the matchers declared at that level of the routing tree. Your alerts enter the routing tree at the top and traverse it down until they match some matcher, and then that node's receiver receives the alert.
If you set "continue: true", the alert will continue matching against the siblings, meaning that it will try to match another matcher at the same level.
Therefore, if you want your webhook to receive all the alerts, it must be declared properly, in combination with "continue: true", at every level that your alerts match.
Use amtool to test your routes, as described in prometheus/alertmanager
@rmartinsjr I'm not sure what you mean by sibling routes; all the routes that *should* alert are at the same level, including the one for the webhook, and all routes have continue: true defined, yet I'm still seeing this behavior.
It's also intermittent: some alerts go through and many do not. There's no pattern either; it does not always seem to be the same alerts that randomly pass through to the webhook.
@andrewipmtl, reviewing all posted configurations, I believe you're using the simpler one that simonpasquier posted... With that supposition, are you sure it isn't the group_by that is grouping multiple alerts into one?
@rmartinsjr, yes I'm sure. The example I posted is a simplified version for demonstration. The actual config has a lot more alerts set up, all with continue: true defined as a parameter as well. We have dozens of alerts configured in the same manner. All the alerts have different naming criteria as well as different firing criteria.
Have never seen anything like that... Have you tried the routing tree visual tool? https://www.prometheus.io/webtools/alerting/routing-tree-editor/
@rmartinsjr I have never used it before -- but after using it for the first time just now, I get a "tree" map generated where it looks like every single alert branches from a single node which is the device-alerts.hook. So unless I'm wrong -- every single alert should be hitting the webhook.
In case this helps anyone, I was running AlertManager through prometheus-operator, and I experienced the exact same problem.
In my case the cause was that Alertmanager was matching only alerts that contained the right namespace label. There is an issue about that in https://github.com/prometheus-operator/prometheus-operator/issues/3737
In case this helps anyone, I was running AlertManager through prometheus-operator, and I experienced the exact same problem.
In my case the cause was that Alertmanager was matching only alerts that contained the right namespace label. There is an issue about that in prometheus-operator/prometheus-operator#3737
@luislhl , by namespace what exactly do you mean? I have no namespaces defined in my config file, is that the issue? I wasn't aware of any namespace matching if none were provided.
@luislhl , by namespace what exactly do you mean? I have no namespaces defined in my config file, is that the issue? I wasn't aware of any namespace matching if none were provided.
Hey, @andrewipmtl
By namespace I mean a Kubernetes namespace; my bad, I didn't make it clearer.
I have deployed Alertmanager in a Kubernetes cluster by using the Prometheus Operator.
The final Alertmanager config I get has this matcher to select only alerts containing a namespace label with the value kube-prometheus:
```
global:
  resolve_timeout: 5m
route:
  receiver: "null"
  group_by:
  - job
  routes:
  - receiver: kube-prometheus-slack-alerts-slack-alerts-warning
    group_by:
    - alertname
    matchers:
    - namespace="kube-prometheus"
[...]
```
I had some alerts from other namespaces that were ignored because of this matcher. The issue I linked in my previous comment has more info about this behavior.
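To illustrate with raw Alertmanager config: a catch-all route with no namespace matcher would pick those alerts up, as long as the namespace-scoped route in front of it has continue: true (a sketch only; the webhook receiver name is made up):
```
route:
  receiver: "null"
  group_by:
  - job
  routes:
  # the namespace-scoped route, with continue so the catch-all below still sees the alert
  - receiver: kube-prometheus-slack-alerts-slack-alerts-warning
    group_by:
    - alertname
    matchers:
    - namespace="kube-prometheus"
    continue: true
  # catch-all: no matchers, so alerts from any namespace land here
  - receiver: all-alerts-webhook        # illustrative receiver name
```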
We have a similar issue: some alerts are not posted to the webhook.
And I have a feeling that this is because the alert is resolved within the group_wait interval.
As an example, group_wait is set to 30s and the alert lasts just 20s.
Is that possible?
P.S. Alertmanager v0.21.0, send_resolved not specified (supposed to be true by default).
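A quick way to test that theory is to temporarily shrink group_wait so that short-lived alerts get flushed while they are still firing; if I understand the dedup behaviour correctly, an alert that resolves before its first firing notification was ever sent will not produce a resolved webhook call either. A sketch (illustrative values and receiver name):
```
route:
  receiver: webhook
  group_by:
  - alertname
  group_wait: 5s       # was 30s; a ~20s alert now gets flushed while still firing
  group_interval: 1m
```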
Same problem: Alertmanager and Prometheus show the alert, but the data is not sent to the webhook:
```
ts=2023-01-08T12:31:28.676Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=InstanceDown[c136526][active]
ts=2023-01-08T12:31:38.677Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}/{}:{job=\"node\"}" msg=flushing alerts=[InstanceDown[c136526][active]]
```
```
global:
receivers:
- name: "n8n"
  webhook_configs:
  - url: https://sample.tld/webhook/alertmanager
    send_resolved: true
    http_config:
      basic_auth:
        username: alertmanager
        password: securePassword
      tls_config:
        insecure_skip_verify: true
route:
  receiver: n8n
  group_by: ['job']
  group_wait: 10s
  group_interval: 4m
  repeat_interval: 2h
  routes:
  - receiver: n8n
    continue: true
```
@mhf-ir Did you find a solution for this? Facing the same issue now on prometheus/alertmanager:v0.24.0.
@mhf-ir Did you find a solution for this? Facing the same issue now on prometheus/alertmanager:v0.24.0.
Still no solutions provided.
Still no solutions provided.
Thanks @andrewipmtl, I'm mainly having issues sending the alert through AWS SNS.
I am facing the same issues while sending Slack messages with the webhook, and with email as well. @andrewipmtl, did you get past this issue? I do not see any solution or further suggestions on the thread.
@andrewipmtl I see group_interval is 5m; are you certain that all alerts are firing for at least 5 minutes? Otherwise some alerts might get resolved before a notification is sent.
Issue with the webhook resolved. Actually, the Rocket.Chat .js file had to be updated.
Thanks @andrewipmtl, I'm mainly having issues sending the alert through AWS SNS.
Has anyone found a solution to integrate Prometheus with AWS SNS?
@alonperel
Issue with the webhook resolved. Actually, the Rocket.Chat .js file had to be updated.
@sahaniarungod can you please elaborate?
Is this still an issue? I'm using version 0.26.0. I'm seeing alerts in Prometheus and Alertmanager, but in Slack I sometimes get an almost empty alert and sometimes the values are populated.
@rsuplina can you try using group_by parameters and grouping your alerts?
```
group_by:
- severity
- alertname
- team
- clusterName
- container
- exported_container
```
Hi @aroundarmor-cldcvr, thanks for your reply. I have tried to change to:
But still, what happens is that some alarms arrive with full data, and then if I change to some other instance that also does not exist and restart the Docker services (Prometheus, Alertmanager), it will again start sending notifications which are missing values.
Here is a full description of my error. I wonder if this is a bug with Alertmanager?
Facing the same issue today. I'm seeing all the alerts in every channel except the webhook, where some of them are missing.
Hi everyone, I have exactly the same case. Due to the characteristics of the monitoring system, I cannot send one aggregate alert but have to send them as individual webhooks. With a large number of alerts, such as 50, maybe 7 are delivered; checking the logs on the recipient side shows that the requests do not arrive. Has any of you observed this problem in version v0.27? Perhaps the solution is to send the alerts to some queue like AWS SNS?
@xDarekM I would suggest running Alertmanager with debug logging (--log.level=debug) to see what is happening to those additional webhooks. You should look for lines such as "Notify attempt failed" and also check the notifications_failed_total metric. If there are no failed notification attempts then it's possible the number of notifications being sent doesn't match your expectations because of the Group wait, Group interval and Repeat interval timers.
Hi,
I'm using a webhook receiver for AlertManager to store alerts for pagination etc. For the most part, the webhook seems to be working just fine, but for some alerts, the webhook doesn't seem to receive a POST call at all from AlertManager.
Is there any way to troubleshoot this? For example, a way to trace alertmanager's outgoing HTTP calls to the webhook receiver?
The webhook endpoint is a Rails application server which also logs all incoming traffic, and after investigating, the missing alerts never show up in the logs (a POST request is never received).
I've attached a partial configuration, omitting redundant receivers etc. They're almost all the same.
Thanks,
Environment
System information:
Linux 4.14.186-146.268.amzn2.x86_64 x86_64
Alertmanager version:
Prometheus version:
Alertmanager configuration file: