
Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Alertmanager sends resolved notification when problem is not solved #3871

Open ysfnsrv opened 3 months ago

ysfnsrv commented 3 months ago

I want to monitor the status of Docker containers. The problem is as follows: I stop a test Docker container and get a notification in Slack that there is a stopped container. Great! But after exactly 5 minutes, I get a message that the problem is resolved, as if the container were up again.

If I change group_interval: 5m to group_interval: 15m in alertmanager.yml, I get the same wrong notification, but now after 15 minutes. Confused, I commented out the group_interval: 15m line, and the notification came after 5 minutes again. The problem is that the Docker container is stopped and has not been started, yet for some reason this erroneous resolved notification arrives.
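
For context, group_interval is one of three timing settings on the route in alertmanager.yml; it controls how long Alertmanager waits before sending an updated notification (including a resolved one) for an already-notified group. A minimal sketch of that section; the receiver name and values are illustrative, not the reporter's actual config:

```yaml
route:
  receiver: slack-notifications   # hypothetical receiver name
  group_by: ['alertname']
  group_wait: 30s        # wait before the first notification for a new alert group
  group_interval: 5m     # wait before sending updates (firing or resolved) for the group
  repeat_interval: 4h    # wait before re-sending a notification for a still-firing alert
```

Changing group_interval therefore only shifts when the resolved notification is delivered; it does not cause the alert to resolve, which matches the 5m vs. 15m behavior described above.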


APP 12:47 PM
[FIRING:1] ContainerKilled (NGINX Docker Maintainers docker-maint@nginx.com /docker/084586b71d3605ea6657d2cb4530348438226d14af7d0a563427bb8bc6a51e46 nginx 192.168.100.1:8080 cadvisor-intra myngin5 warning)

12:52 PM
[RESOLVED] ContainerKilled (NGINX Docker Maintainers docker-maint@nginx.com /docker/084586b71d3605ea6657d2cb4530348438226d14af7d0a563427bb8bc6a51e46 nginx 192.168.100.1:8080 cadvisor-intra myngin5 warning)

```
alertmanager, version 0.23.0 (branch: debian/sid, revision: 0.23.0-4ubuntu0.2) build user: team+pkg-go@tracker.debian.org build date: 20230502-12:28:45 go version: go1.18.1 platform: linux/amd64
prometheus, version 2.31.2+ds1 (branch: debian/sid, revision: 2.31.2+ds1-1ubuntu1.22.04.2) build user: team+pkg-go@tracker.debian.org build date: 20230502-12:17:56 go version: go1.18.1 platform: linux/amd64
```

alertmanager.yml (excerpt):

```yaml
receivers:

templates:

inhibit_rules:
```
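
The receivers section above was posted without its contents. For completeness: whether resolved notifications are sent at all is controlled per receiver. A hedged sketch of a Slack receiver; the webhook URL and channel are placeholders:

```yaml
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder webhook
        channel: '#alerts'
        send_resolved: true   # must be true for [RESOLVED] messages to appear; Slack's default is false
```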

prometheus.yml:

```yaml
global:
  external_labels:
    monitor: ''
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - "/etc/prometheus/rules/prod.yml"
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 60s
    scrape_timeout: 60s
  - job_name: 'cadvisor-intra'
    static_configs:
      - targets: ['192.168.100.1:8080']
```

/etc/prometheus/rules/prod.yml:

```yaml
groups:
```
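
The groups in prod.yml were not posted. A rule of this kind is usually built around the expression quoted later in the thread; a sketch, with the group name, severity, and annotation as illustrative placeholders:

```yaml
groups:
  - name: containers
    rules:
      - alert: ContainerKilled
        # Fires when cAdvisor last reported the container more than 60s ago.
        expr: time() - container_last_seen > 60
        labels:
          severity: warning
        annotations:
          summary: 'Container killed ({{ $labels.name }})'
```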


* Logs:

```
Jun 11 12:46:36 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:46:36.395Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:47:06 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:47:06.396Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{alertname=\"ContainerKilled\"}" msg=flushing alerts=[ContainerKilled[5029f3e][active]]
Jun 11 12:48:16 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:48:16.391Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:49:56 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:49:56.392Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:50:46 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:50:46.393Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][resolved]
Jun 11 12:52:06 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:52:06.396Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{alertname=\"ContainerKilled\"}" msg=flushing alerts=[ContainerKilled[5029f3e][resolved]]
```

Read against the timestamps: the dispatcher receives the alert as active at 12:46:36 and flushes it at 12:47:06; it receives the alert as resolved at 12:50:46 and flushes the resolved notification at 12:52:06, one group_interval (5m) after the previous flush.

grobinson-grafana commented 3 months ago

Your logs show that the alert was resolved, and this resolved alert was sent to the Alertmanager, which then sent a resolved notification.

Jun 11 12:50:46 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:50:46.393Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][resolved]

You need to understand why your alert resolved by looking at the query. I'm afraid this isn't an issue with Alertmanager.

ysfnsrv commented 3 months ago

Yes, I totally agree with you, and that is why I don't understand what the problem is... I stopped the Docker container and didn't start it again. So I don't understand where this message saying the problem is resolved comes from, or what sends it.

grobinson-grafana commented 3 months ago

You need to look at your ContainerKilled alert in Prometheus and understand why it resolved. My guess is that the query `time() - container_last_seen > 60` returned 0 (false). If you need additional help, you can ask on the prometheus-users mailing list.
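
A likely mechanism, offered as an assumption rather than a confirmed diagnosis: cAdvisor typically stops exporting metrics for a container that is no longer running, so the container_last_seen series disappears and Prometheus marks it stale after roughly 5 minutes. Once the series is gone, `time() - container_last_seen > 60` returns no results and the alert resolves, which would explain the 5-minute timing reported above. If that is the cause, a range-vector variant keeps the alert firing after the series vanishes; the 1h window here is illustrative:

```yaml
groups:
  - name: containers
    rules:
      - alert: ContainerKilled
        # max_over_time() still sees samples for up to 1h after the series
        # disappears, so the alert does not auto-resolve the moment cAdvisor
        # stops exporting container_last_seen for the stopped container.
        expr: time() - max_over_time(container_last_seen[1h]) > 60
        labels:
          severity: warning
```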