
Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Alertmanager sends resolved notification when problem is not solved #3871

Open ysfnsrv opened 3 months ago

ysfnsrv commented 3 months ago

I want to monitor the status of Docker containers. The problem is as follows: I stop a test Docker container and get a notification in Slack that there is a stopped container. Great! But after exactly 5 minutes, I get a message that the problem is resolved, as if the container were up again.

If I change group_interval: 5m to group_interval: 15m in alertmanager.yml, I get the same wrong notification, but now after 15 minutes. Confused, I commented out the group_interval: 15m line, and the notification came after 5 minutes again. The problem is that the Docker container is stopped and has not been started, yet for some reason this erroneous resolved notification arrives.
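
For context, group_interval is one of three timing settings on the route in alertmanager.yml; it controls how long Alertmanager waits before sending an updated notification (including a resolved one) for an already-notified group. A minimal sketch of that section; the receiver name and values are illustrative, not the reporter's actual config:

```yaml
route:
  receiver: slack-notifications   # hypothetical receiver name
  group_by: ['alertname']
  group_wait: 30s        # wait before the first notification for a new alert group
  group_interval: 5m     # wait before sending updates (firing or resolved) for the group
  repeat_interval: 4h    # wait before re-sending a notification for a still-firing alert
```

Changing group_interval therefore only shifts when the resolved notification is delivered; it does not cause the alert to resolve, which matches the 5m vs. 15m behavior described above.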


APP 12:47 PM
[FIRING:1] ContainerKilled (NGINX Docker Maintainers docker-maint@nginx.com /docker/084586b71d3605ea6657d2cb4530348438226d14af7d0a563427bb8bc6a51e46 nginx 192.168.100.1:8080 cadvisor-intra myngin5 warning)

12:52 PM
[RESOLVED] ContainerKilled (NGINX Docker Maintainers docker-maint@nginx.com /docker/084586b71d3605ea6657d2cb4530348438226d14af7d0a563427bb8bc6a51e46 nginx 192.168.100.1:8080 cadvisor-intra myngin5 warning)

```
alertmanager, version 0.23.0 (branch: debian/sid, revision: 0.23.0-4ubuntu0.2) build user: team+pkg-go@tracker.debian.org build date: 20230502-12:28:45 go version: go1.18.1 platform: linux/amd64
prometheus, version 2.31.2+ds1 (branch: debian/sid, revision: 2.31.2+ds1-1ubuntu1.22.04.2) build user: team+pkg-go@tracker.debian.org build date: 20230502-12:17:56 go version: go1.18.1 platform: linux/amd64
```

alertmanager.yml (excerpt):

```yaml
receivers:

templates:

inhibit_rules:
```
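
The receivers section above was posted without its contents. For completeness: whether resolved notifications are sent at all is controlled per receiver. A hedged sketch of a Slack receiver; the webhook URL and channel are placeholders:

```yaml
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder webhook
        channel: '#alerts'
        send_resolved: true   # must be true for [RESOLVED] messages to appear; Slack's default is false
```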

prometheus.yml:

```yaml
global:
  external_labels:
    monitor: ''
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - "/etc/prometheus/rules/prod.yml"
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 60s
    scrape_timeout: 60s
  - job_name: 'cadvisor-intra'
    static_configs:
      - targets: ['192.168.100.1:8080']
```

/etc/prometheus/rules/prod.yml:

```yaml
groups:
```
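
The groups in prod.yml were not posted. A rule of this kind is usually built around the expression quoted later in the thread; a sketch, with the group name, severity, and annotation as illustrative placeholders:

```yaml
groups:
  - name: containers
    rules:
      - alert: ContainerKilled
        # Fires when cAdvisor last reported the container more than 60s ago.
        expr: time() - container_last_seen > 60
        labels:
          severity: warning
        annotations:
          summary: 'Container killed ({{ $labels.name }})'
```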


* Logs:

```
Jun 11 12:46:36 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:46:36.395Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:47:06 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:47:06.396Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{alertname=\"ContainerKilled\"}" msg=flushing alerts=[ContainerKilled[5029f3e][active]]
Jun 11 12:48:16 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:48:16.391Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:49:56 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:49:56.392Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:50:46 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:50:46.393Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][resolved]
Jun 11 12:52:06 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:52:06.396Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{alertname=\"ContainerKilled\"}" msg=flushing alerts=[ContainerKilled[5029f3e][resolved]]
```

Read against the timestamps: the dispatcher receives the alert as active at 12:46:36 and flushes it at 12:47:06; it receives the alert as resolved at 12:50:46 and flushes the resolved notification at 12:52:06, one group_interval (5m) after the previous flush.

grobinson-grafana commented 3 months ago

Your logs show that the alert was resolved, and this resolved alert was sent to the Alertmanager, which then sent a resolved notification.

Jun 11 12:50:46 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:50:46.393Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][resolved]

You need to understand why your alert resolved by looking at the query. I'm afraid this isn't an issue with Alertmanager.

ysfnsrv commented 3 months ago

Yes, I totally agree with you, and that is why I don't understand what the problem is... I stopped the Docker container and didn't start it again. So I don't understand where this message saying the problem is resolved comes from, or what sends it.

grobinson-grafana commented 3 months ago

You need to look at your ContainerKilled alert in Prometheus and understand why it resolved. My guess is that the query `time() - container_last_seen > 60` returned 0 (false). If you need additional help, you can ask on the prometheus-users mailing list.
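
A likely mechanism, offered as an assumption rather than a confirmed diagnosis: cAdvisor typically stops exporting metrics for a container that is no longer running, so the container_last_seen series disappears and Prometheus marks it stale after roughly 5 minutes. Once the series is gone, `time() - container_last_seen > 60` returns no results and the alert resolves, which would explain the 5-minute timing reported above. If that is the cause, a range-vector variant keeps the alert firing after the series vanishes; the 1h window here is illustrative:

```yaml
groups:
  - name: containers
    rules:
      - alert: ContainerKilled
        # max_over_time() still sees samples for up to 1h after the series
        # disappears, so the alert does not auto-resolve the moment cAdvisor
        # stops exporting container_last_seen for the stopped container.
        expr: time() - max_over_time(container_last_seen[1h]) > 60
        labels:
          severity: warning
```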