Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Alertmanager is sending a non-existent alert #2744

Closed: 23ewrdtf closed this issue 2 years ago

23ewrdtf commented 3 years ago

What did you do? I installed Alertmanager with Prometheus using this chart: https://github.com/prometheus-community/helm-charts/blob/main/charts/prometheus/values.yaml

prometheus-community chart 14.10.0 (app version 2.26.0)

What did you expect to see? I have an alert in Prometheus:

    - alert: PrometheusTargetMissing
      expr: up == 0
      for: 0m
      labels:
        severity: critical
      annotations:
        identifier: '{{ $labels.instance }}'
        summary: Prometheus target missing `{{ $labels.instance }}`
        description: "A Prometheus target has disappeared. An exporter might be crashed. `{{ $labels.cluster }}`"

AlertManagerAPP 12:41
[FIRING:21] Monitoring Event Notification
Alert: Prometheus target missing xxx - critical
Description: A Prometheus target has disappeared. An exporter might be crashed. ``
Details:
 • alertname: PrometheusTargetMissing
 • endpoint: metrics
(truncated)

AlertManagerAPP 12:46
[FIRING:25] Monitoring Event Notification
Alert: Prometheus target missing xxx - critical
Description: A Prometheus target has disappeared. An exporter might be crashed. ``
Details:
 • alertname: PrometheusTargetMissing
 • endpoint: metrics
(truncated)


What did you see instead? Under which circumstances?
Alertmanager shouldn't be sending this alert.

Environment

* System information:

/alertmanager $ uname -srm
Linux 4.14.243-185.433.amzn2.x86_64 x86_64


* Alertmanager version:

/alertmanager $ alertmanager --version
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
  build user:       root@dee35927357f
  build date:       20200617-08:54:02
  go version:       go1.14.4


* Prometheus version:

/prometheus $ prometheus --version
prometheus, version 2.26.0 (branch: HEAD, revision: 3cafc58827d1ebd1a67749f88be4218f0bab3d8d)
  build user:       root@a67cafebe6d0
  build date:       20210331-11:56:23
  go version:       go1.16.2
  platform:         linux/amd64


* Alertmanager configuration file:

global:
  resolve_timeout: 5m
  http_config: {}
  smtp_hello: localhost
  smtp_require_tls: true
  slack_api_url:
  pagerduty_url: xxx
  opsgenie_api_url: xxx
  wechat_api_url: xxx
  victorops_api_url: xxx
route:
  receiver: slack
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
receivers:

nabilrad commented 3 years ago

We are seeing the same scenario: we get false alarms every repeat_interval, and no matching events are recorded in either the Alertmanager or Prometheus UI.

$ ./alertmanager --version
alertmanager, version 0.23.0 (branch: HEAD, revision: 61046b17771a57cfd4c4a51be370ab930a4d7d54)
  build user:       root@e21a959be8d2
  build date:       20210825-10:48:55
  go version:       go1.16.7

$ ./prometheus --version
prometheus, version 2.26.0 (branch: HEAD, revision: 3cafc58827d1ebd1a67749f88be4218f0bab3d8d)
  build user:       root@a67cafebe6d0
  build date:       20210331-11:56:23
  go version:       go1.16.2
  platform:         linux/amd64

$ uname -a
Linux seigwpdevmon01 5.4.17-2102.205.7.3.el7uek.x86_64 #2 SMP Fri Sep 17 16:52:13 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux

roidelapluie commented 3 years ago

If you go into Prometheus and type ALERTS in the "Graph" section, do you see the alert?

nabilrad commented 3 years ago

This is exactly the kicker. I don't see any alerts when executing "ALERTS" in the graph or console (period = last 2 days, for instance). I usually get the email alerts every 3 hours for each of the monitored resources (repeat_interval: 3h), followed by no resolved emails (send_resolved: true). We have not seen this issue in any other environment running alertmanager version 0.16.0 and prometheus version 2.6.1.
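The same check can be made against Prometheus's HTTP API instead of the graph page. A minimal sketch, assuming Prometheus listens on localhost:9090; if this query returns no series while Alertmanager still notifies, the alerts are not coming from this Prometheus:

```python
from urllib.parse import urlencode

# Assumed local Prometheus address; adjust to your setup.
base = "http://localhost:9090/api/v1/query"

# Query the ALERTS metric over the HTTP API, equivalent to
# typing ALERTS{alertstate="firing"} into the Graph page.
params = {"query": 'ALERTS{alertstate="firing"}'}
url = f"{base}?{urlencode(params)}"
print(url)
# Fetch with e.g.: curl "$url"
```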

23ewrdtf commented 3 years ago

Example screenshots:

Prometheus alerts: [screenshot, 2021-10-26 16:35:25]

Alertmanager alerts: [screenshot, 2021-10-26 16:35:35]

Slack alerts: [screenshot, 2021-10-26 16:36:43]

nabilrad commented 3 years ago

No more false alarms after upgrading alertmanager to version 0.23.0.

23ewrdtf commented 3 years ago

No more false alarms after upgrading alertmanager to version 0.23.0.

Thanks, will try that.

23ewrdtf commented 3 years ago

That didn't help. Still getting those alerts.

23ewrdtf commented 3 years ago

Just got this alert in the Alertmanager logs and in Slack (Slack time 14:16, Prometheus server time 13:16; the two are in different time zones).

There are no alerts in Prometheus. It seems that all the PrometheusTargetMissing[8b68fd2][resolved] alerts are being sent as normal alerts.

level=debug 
ts=2021-10-28T13:16:52.129Z 
caller=dispatch.go:516 
component=dispatcher 
aggrGroup={}:{} 
msg=flushing 
alerts="[
  PrometheusTargetMissing[8b68fd2][resolved] 
  PrometheusTargetMissing[9c8d752][resolved] 
  PrometheusTargetMissing[0b67966][resolved] 
  PrometheusTargetMissing[484babe][resolved] 
  PrometheusTargetMissing[d7be54b][resolved] 
  PrometheusTargetMissing[713778d][resolved] 
  PrometheusTargetMissing[9a65983][resolved] 
  PrometheusTargetMissing[4d27dea][resolved] 
  ContainerMemoryUsage[f61fabf][active] 
  PrometheusAlertmanagerE2eDeadManSwitch[5d03b49][active]
]"
AlertmanagerAPP  14:16
[FIRING:1] Monitoring Event Notification
Alert: Prometheus target missing xxxxxxxxxxxxx - critical
 Description: A Prometheus target has disappeared. An exporter might be crashed. ``
 Details:
  • alertname: PrometheusTargetMissing
  • endpoint: metrics
  • instance: xxxxxxxxxxxxx
  • job: xxxxxxxxxxxxx
  • namespace: default
  • pod: xxxxxxxxxxxxx-xxx
  • service: xxxxxxxxxxxxx
  • severity: critical

 Alert: Prometheus target missing xxxxxxxxxxxxx - critical
 Description: A Prometheus target has disappeared. An exporter might be crashed. ``
 Details:
  • alertname: PrometheusTargetMissing
  • endpoint: metrics
  • instance: xxxxxxxxxxxxx
  • job: xxxxxxxxxxxxx
  • namespace: default
  • pod: xxxxxxxxxxxxx-xxxxxxxxxxxxx
  • service: xxxxxxxxxxxxx
  • severity: critical

 Alert: Prometheus target missing xxxxxxxxxxxxx - critical
 Description: A Prometheus target has disappeared. An exporter might be crashed. xxxxxxxxxxxxx
 Details:
  • alertname: PrometheusTargetMissing
  • beta_kubernetes_io_arch: amd64
  • beta_kubernetes_io_instance_type: t2.medium
  • beta_kubernetes_io_os: linux
  • cluster: xxxxxxxxxxxxx
  • failure_domain_beta_kubernetes_io_region: xxxxxxxxxxxxx
  • failure_domain_beta_kubernetes_io_zone: xxxxxxxxxxxxxa
  • instance: xxxxxxxxxxxxx
  • job: kubernetes-nodes
  • kubernetes_io_arch: amd64
  • kubernetes_io_hostname: xxxxxxxxxxxxx
  • kubernetes_io_os: linux
  • node_kubernetes_io_instance_type: t2.medium
  • xxxxxxxxxxxxx
  • severity: critical
  • topology_kubernetes_io_region: xxxxxxxxxxxxx
  • topology_kubernetes_io_zone: xxxxxxxxxxxxxa
  • type: xxxxxxxxxxxxxa

 Alert: Prometheus target missing xxxxxxxxxxxxx.xxxxxxxxxxxxx.compute.internal - critical
 Description: A Prometheus target has disappeared. An exporter might be crashed. xxxxxxxxxxxxx
 Details:
  • alertname: PrometheusTargetMissing
  • beta_kubernetes_io_arch: amd64
  • beta_kubernetes_io_instance_type: z1d.xlarge
  • beta_kubernetes_io_os: linux
  • cluster: xxxxxxxxxxxxx
  • failure_domain_beta_kubernetes_io_region: xxxxxxxxxxxxx
  • failure_domain_beta_kubernetes_io_zone: xxxxxxxxxxxxxa
  • instance: xxxxxxxxxxxxx.xxxxxxxxxxxxx.compute.internal
  • job: kubernetes-nodes
  • kubernetes_io_arch: amd64
  • kubernetes_io_hostname: xxxxxxxxxxxxx.xxxxxxxxxxxxx.compute.internal
  • kubernetes_io_os: linux
  • node_kubernetes_io_instance_type: z1d.xlarge
  • xxxxxxxxxxxxx
  • severity: critical
  • topology_kubernetes_io_region: xxxxxxxxxxxxx
  • topology_kubernetes_io_zone: xxxxxxxxxxxxxa
  • xxxxxxxxxxxxx

 Alert: Prometheus target missing xxxxxxxxxxxxx - critical
 Description: A Prometheus target has disappeared. An exporter might be crashed. xxxxxxxxxxxxx
 Details:
  • alertname: PrometheusTargetMissing
  • beta_kubernetes_io_arch: amd64
  • beta_kubernetes_io_instance_type: t2.medium
  • beta_kubernetes_io_os: linux
  • cluster: xxxxxxxxxxxxx
  • failure_domain_beta_kubernetes_io_region: xxxxxxxxxxxxx
  • failure_domain_beta_kubernetes_io_zone: xxxxxxxxxxxxxa
  • instance: xxxxxxxxxxxxx
  • job: kubernetes-nodes-cadvisor
  • kubernetes_io_arch: amd64
  • kubernetes_io_hostname: xxxxxxxxxxxxx
  • kubernetes_io_os: linux
  • node_kubernetes_io_instance_type: t2.medium
  • xxxxxxxxxxxxx
  • severity: critical
  • topology_kubernetes_io_region: xxxxxxxxxxxxx
  • topology_kubernetes_io_zone: xxxxxxxxxxxxxa
  • type: xxxxxxxxxxxxxa

 Alert: Prometheus target missing xxxxxxxxxxxxx.xxxxxxxxxxxxx.compute.internal - critical
 Description: A Prometheus target has disappeared. An exporter might be crashed. xxxxxxxxxxxxx
 Details:
  • alertname: PrometheusTargetMissing
  • beta_kubernetes_io_arch: amd64
  • beta_kubernetes_io_instance_type: z1d.xlarge
  • beta_kubernetes_io_os: linux
  • cluster: xxxxxxxxxxxxx
  • failure_domain_beta_kubernetes_io_region: xxxxxxxxxxxxx
  • failure_domain_beta_kubernetes_io_zone: xxxxxxxxxxxxxa
  • instance: xxxxxxxxxxxxx.xxxxxxxxxxxxx.compute.internal
  • job: kubernetes-nodes-cadvisor
  • kubernetes_io_arch: amd64
  • kubernetes_io_hostname: xxxxxxxxxxxxx.xxxxxxxxxxxxx.compute.internal
  • kubernetes_io_os: linux
  • node_kubernetes_io_instance_type: z1d.xlarge
  • xxxxxxxxxxxxx
  • severity: critical
  • topology_kubernetes_io_region: xxxxxxxxxxxxx
  • topology_kubernetes_io_zone: xxxxxxxxxxxxxa
  • xxxxxxxxxxxxx

 Alert: Prometheus target missing xxxxxxxxxxxxx - critical
 Description: A Prometheus target has disappeared. An exporter might be crashed. ``
 Details:
  • alertname: PrometheusTargetMissing
  • app: xxxxxxxxxxxxx
  • chart: xxxxxxxxxxxxx-xxxxxxxxxxxxx
  • component: agent
  • heritage: Tiller
  • instance: xxxxxxxxxxxxx
  • job: kubernetes-service-endpoints
  • kubernetes_name: xxxxxxxxxxxxx
  • kubernetes_namespace: default
  • kubernetes_node: xxxxxxxxxxxxx
  • release: xxxxxxxxxxxxx
  • severity: critical

 Alert: Prometheus target missing xxxxxxxxxxxxx - critical
 Description: A Prometheus target has disappeared. An exporter might be crashed. ``
 Details:
  • alertname: PrometheusTargetMissing
  • app: xxxxxxxxxxxxx
  • chart: xxxxxxxxxxxxx-xxxxxxxxxxxxx
  • component: agent
  • heritage: Tiller
  • instance: xxxxxxxxxxxxx
  • job: kubernetes-service-endpoints
  • kubernetes_name: xxxxxxxxxxxxx
  • kubernetes_namespace: default
  • kubernetes_node: xxxxxxxxxxxxx.xxxxxxxxxxxxx.compute.internal
  • release: xxxxxxxxxxxxx
  • severity: critical

 Alert: Prometheus AlertManager E2E dead man switch - critical
 Description: Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.
 VALUE = 1
 Details:
  • alertname: PrometheusAlertmanagerE2eDeadManSwitch
  • severity: critical

Then, again, two more alerts in Slack and in the Alertmanager logs:

level=debug 
ts=2021-10-28T13:36:52.168Z 
caller=dispatch.go:516 
component=dispatcher 
aggrGroup={}:{} 
msg=flushing 
alerts="[
PrometheusTargetMissing[930a732][resolved] PrometheusTargetMissing[63b2c77][resolved] 
PrometheusTargetMissing[2c9e4fa][active] PrometheusTargetMissing[320a208][resolved] 
PrometheusTargetMissing[b6047c3][resolved] PrometheusTargetMissing[e287914][resolved] 
PrometheusTargetMissing[4e752ca][active] PrometheusTargetMissing[169f6fa][resolved] 
PrometheusTargetMissing[b3c6b47][resolved] PrometheusTargetMissing[0c2d5d9][resolved] 
PrometheusTargetMissing[ee627b6][active] PrometheusTargetMissing[cb47392][resolved] 
PrometheusTargetMissing[5e3164d][resolved] PrometheusTargetMissing[48f2962][resolved] 
PrometheusTargetMissing[2712c95][active] PrometheusTargetMissing[6d0d40f][resolved] 
ContainerMemoryUsage[f61fabf][active] 
PrometheusAlertmanagerE2eDeadManSwitch[5d03b49][active]]"
level=debug 
ts=2021-10-28T13:41:52.168Z 
caller=dispatch.go:516 
component=dispatcher 
aggrGroup={}:{} 
msg=flushing 
alerts="[
PrometheusTargetMissing[930a732][resolved] PrometheusTargetMissing[63b2c77][resolved] 
PrometheusTargetMissing[2c9e4fa][resolved] PrometheusTargetMissing[320a208][resolved] 
PrometheusTargetMissing[b6047c3][resolved] PrometheusTargetMissing[e287914][resolved] 
PrometheusTargetMissing[4e752ca][resolved] PrometheusTargetMissing[169f6fa][resolved] 
PrometheusTargetMissing[b3c6b47][resolved] PrometheusTargetMissing[0c2d5d9][resolved] 
PrometheusTargetMissing[ee627b6][resolved] PrometheusTargetMissing[cb47392][resolved] 
PrometheusTargetMissing[5e3164d][resolved] PrometheusTargetMissing[48f2962][resolved] 
PrometheusTargetMissing[2712c95][resolved] PrometheusTargetMissing[6d0d40f][resolved] 
ContainerMemoryUsage[f61fabf][active] 
PrometheusAlertmanagerE2eDeadManSwitch[5d03b49][active]]"

The beginning of the Alertmanager logs:

k logs prometheus-community-alertmanager-776696c9d4-llfjh -c prometheus-alertmanager
level=info ts=2021-10-28T13:10:46.567Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=HEAD, revision=61046b17771a57cfd4c4a51be370ab930a4d7d54)"
level=info ts=2021-10-28T13:10:46.567Z caller=main.go:226 build_context="(go=go1.16.7, user=root@e21a959be8d2, date=20210825-10:48:55)"
level=debug ts=2021-10-28T13:10:51.069Z caller=main.go:372 externalURL=http://localhost:9093
level=info ts=2021-10-28T13:10:51.069Z caller=coordinator.go:113 component=configuration msg="Loading configuration file" file=/etc/config/alertmanager.yml
level=info ts=2021-10-28T13:10:51.466Z caller=coordinator.go:126 component=configuration msg="Completed loading of configuration file" file=/etc/config/alertmanager.yml
level=debug ts=2021-10-28T13:10:51.567Z caller=main.go:498 routePrefix=/
level=info ts=2021-10-28T13:10:51.666Z caller=main.go:518 msg=Listening address=:9093
level=info ts=2021-10-28T13:10:51.666Z caller=tls_config.go:191 msg="TLS is disabled." http2=false
level=debug ts=2021-10-28T13:10:52.129Z caller=dispatch.go:165 component=dispatcher msg="Received alert" alert=PrometheusTargetMissing[484babe][resolved]
level=debug ts=2021-10-28T13:10:52.130Z caller=dispatch.go:165 component=dispatcher msg="Received alert" alert=PrometheusTargetMissing[d7be54b][resolved]
.
.
.
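For reference, the dispatcher flush lines above can be parsed mechanically to count resolved vs. active entries. A minimal sketch; the regex is an assumption about the `Name[fingerprint][state]` format seen in these logs:

```python
import re
from collections import Counter

# Matches entries like PrometheusTargetMissing[8b68fd2][resolved]
ENTRY = re.compile(r"(\w+)\[([0-9a-f]+)\]\[(\w+)\]")

def parse_flush(log_line):
    """Return (alertname, fingerprint, state) tuples from a dispatcher flush line."""
    return ENTRY.findall(log_line)

line = ('alerts="[PrometheusTargetMissing[8b68fd2][resolved] '
        'ContainerMemoryUsage[f61fabf][active]]"')
states = Counter(state for _, _, state in parse_flush(line))
print(states)  # Counter({'resolved': 1, 'active': 1})
```

Applied to the 13:16 flush above, this would show 8 resolved PrometheusTargetMissing entries alongside 2 active alerts.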
roidelapluie commented 3 years ago

Do you specify any parameters to prometheus?

23ewrdtf commented 3 years ago

prometheus-server pod arguments:

prometheus-server:
    Image:         quay.io/prometheus/prometheus:v2.26.0
    Args:
      --storage.tsdb.retention.time=15d
      --config.file=/etc/config/prometheus.yml
      --storage.tsdb.path=/data
      --web.console.libraries=/etc/prometheus/console_libraries
      --web.console.templates=/etc/prometheus/consoles
      --web.enable-lifecycle
      --web.external-url=https://xxxxxxx
    State:          Running

prometheus-server command-line flags:

alertmanager.notification-queue-capacity    10000
alertmanager.timeout    
config.file /etc/config/prometheus.yml
enable-feature  
log.format  logfmt
log.level   info
query.lookback-delta    5m
query.max-concurrency   20
query.max-samples   50000000
query.timeout   2m
rules.alert.for-grace-period    10m
rules.alert.for-outage-tolerance    1h
rules.alert.resend-delay    1m
scrape.adjust-timestamps    true
storage.exemplars.exemplars-limit   0
storage.remote.flush-deadline   1m
storage.remote.read-concurrent-limit    10
storage.remote.read-max-bytes-in-frame  1048576
storage.remote.read-sample-limit    50000000
storage.tsdb.allow-overlapping-blocks   false
storage.tsdb.max-block-duration 1d12h
storage.tsdb.min-block-duration 2h
storage.tsdb.no-lockfile    false
storage.tsdb.path   /data
storage.tsdb.retention  0s
storage.tsdb.retention.size 0B
storage.tsdb.retention.time 15d
storage.tsdb.wal-compression    true
storage.tsdb.wal-segment-size   0B
web.config.file 
web.console.libraries   /etc/prometheus/console_libraries
web.console.templates   /etc/prometheus/consoles
web.cors.origin .*
web.enable-admin-api    false
web.enable-lifecycle    true
web.external-url    https://xxx
web.listen-address  0.0.0.0:9090
web.max-connections 512
web.page-title  Prometheus Time Series Collection and Processing Server
web.read-timeout    5m
web.route-prefix    /
web.user-assets 

alertmanager pod arguments:

  prometheus-alertmanager:
    Image:         quay.io/prometheus/alertmanager:v0.23.0
    Args:
      --config.file=/etc/config/alertmanager.yml
      --storage.path=/data
      --cluster.listen-address=
      --log.level=info
      --web.external-url=http://xxxxxxx
    State:          Running

alertmanager config:
global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
route:
  receiver: slack
  continue: false
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
receivers:
- name: slack
  slack_configs:
  - send_resolved: true
    http_config:
      follow_redirects: true
    api_url: <secret>
    channel: '#xxx'
    username: '{{ template "slack.default.username" . }}'
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing
      | len }}{{ end }}] Monitoring Event Notification'
    title_link: '{{ template "slack.default.titlelink" . }}'
    pretext: '{{ template "slack.default.pretext" . }}'
    text: |-
      {{ range .Alerts }}
        *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
        *Description:* {{ .Annotations.description }}
        *Details:*
        {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
        {{ end }}
      {{ end }}
    short_fields: false
    footer: '{{ template "slack.default.footer" . }}'
    fallback: '{{ template "slack.default.fallback" . }}'
    callback_id: '{{ template "slack.default.callbackid" . }}'
    icon_emoji: '{{ template "slack.default.iconemoji" . }}'
    icon_url: https://avatars3.githubusercontent.com/u/3380462
    link_names: false
templates: []
xenofree commented 2 years ago

I have the same issue. Alertmanager is sending alerts for rules that have been deleted or updated.

alertmanager, version 0.23.0 (branch: HEAD, revision: 61046b17771a57cfd4c4a51be370ab930a4d7d54)
  build user:       root@e21a959be8d2
  build date:       20210825-10:48:55
  go version:       go1.16.7
  platform:         linux/amd64
prometheus, version 2.33.1 (branch: HEAD, revision: 4e08110891fd5177f9174c4179bc38d789985a13)
  build user:       root@37fc1ebac798
  build date:       20220202-15:23:18
  go version:       go1.17.6
  platform:         linux/amd64
mutagenspree commented 2 years ago

Same issue: Alertmanager sends alerts that no longer exist in Prometheus, including ones from deleted rules. Upgrading to 0.23 doesn't help.

simonpasquier commented 2 years ago

Alertmanager can't create alerts by itself. There must be something somewhere firing an alert at Alertmanager.
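One way to check this: Alertmanager's v2 API accepts alert POSTs from any HTTP client, so a second Prometheus instance, a federated setup, or a stray script could be the real sender. A sketch of the payload shape that API expects (the label values here are placeholders, not from this issue):

```python
import json
from datetime import datetime, timedelta, timezone

# Any client can push alerts to Alertmanager via POST /api/v2/alerts;
# the sender does not have to be the Prometheus you are looking at.
now = datetime.now(timezone.utc)
alerts = [{
    "labels": {"alertname": "PrometheusTargetMissing", "severity": "critical"},
    "annotations": {"summary": "example"},
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=5)).isoformat(),
}]
payload = json.dumps(alerts)
# e.g. curl -XPOST -H 'Content-Type: application/json' \
#        -d "$payload" http://localhost:9093/api/v2/alerts
print(payload)
```

Checking Alertmanager's debug logs for "Received alert" entries (as shown earlier in this thread) reveals every alert it has been handed, whatever the source.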

MMGeri commented 6 months ago

Why is this closed?