Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Assistance Needed with Prometheus and Alertmanager Configuration #3781

Open · Trio-official opened this issue 5 months ago

Trio-official commented 5 months ago

I am running into problems configuring Prometheus and Alertmanager for my application's alerting system. Below are the configurations I am currently using:

prometheus.yml:

global:
  scrape_interval: 1h

rules.yml:

groups:
  - name: recording-rule
    interval: 1h
    rules:
      - record: myRecord
        expr: expression… (a ratio of two metrics, filtered to values greater than a threshold)

  - name: alerting-rule
    interval: 4h
    rules:
      - alert: myAlert
        expr: max_over_time(myRecord[4h])
        labels:
          severity: warning
        annotations:
          summary: "summary"

alertmanager.yml:

group_by: ['alertname']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
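
For context, in a complete alertmanager.yml these keys live under the top-level route block; a minimal sketch, with a hypothetical receiver name and webhook URL:

route:
  receiver: 'default'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'          # hypothetical receiver, for illustration only
    webhook_configs:
      - url: 'http://example.com/alert-hook'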

Issues:

The number of alerts firing in Prometheus does not match the number of alerts shown in Alertmanager.

Request for Assistance:

I am seeking guidance on configuring Prometheus and Alertmanager so that the alerts displayed in Alertmanager reliably match the alerts generated by Prometheus.

Thanks in advance.

TheMeier commented 5 months ago

Hi there, generally the issue tracker is not the right place for questions like this. Please consider taking it to https://groups.google.com/g/prometheus-users or similar forums.

Very likely the issue you are facing here is staleness. Prometheus considers a series stale (and thus nonexistent for instant queries) once roughly 5 minutes have passed since its last sample, so if you only scrape every hour your metric will be absent for about 55 minutes of every hour. From a prometheus-users post:

Either way, Prometheus is not going to handle hourly scraping well, the practical upper limit of scrape interval is 2 minutes. I would recommend changing the way your exporter works, I would probably do something like run it as a cron job and use the pushgateway or node_exporter textfile collector. https://groups.google.com/g/prometheus-users/c/2DDL7FKMeVk/m/N5WJ8hUnAAAJ
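
If you take the Pushgateway route from that post, the corresponding scrape job on the Prometheus side is straightforward; a minimal sketch, with a hypothetical target address:

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true  # keep the job/instance labels set by the pushing jobs
    static_configs:
      - targets: ['pushgateway.example.com:9091']  # placeholder address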

So that basically means your configuration is not supported.

Please close this issue as it is not a bug or feature request for alertmanager.

Trio-official commented 5 months ago

Thank you for your prompt response and guidance on addressing the metric staleness issue.

Regarding the suggestion (from the link you shared) to use a range selector in the recording and alerting rules, I confirm that I have already implemented this approach, e.g. max_over_time(metric[1h]). However, the main challenge persists: the number of alerts generated by Prometheus does not match the number displayed in Alertmanager.

To illustrate: in Prometheus I may see approximately 25,000 alerts triggered within a given period, but when I review the corresponding alerts in Alertmanager, the count often deviates significantly, showing figures such as 10,000 or 18,000 rather than the expected 25,000.
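
For reference, the Prometheus-side count in these comparisons comes from the built-in ALERTS series; a sketch of the kind of query used, assuming the alert name from the rules above:

count(ALERTS{alertname="myAlert", alertstate="firing"})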

This inconsistency poses a significant challenge in our alert management process, causing confusion and risking critical alerts being overlooked.

I would greatly appreciate any further insights or recommendations you may have to address this issue and ensure alignment between Prometheus and Alertmanager in terms of the number of alerts generated and displayed.

grobinson-grafana commented 4 months ago

As @TheMeier said, https://groups.google.com/g/prometheus-users is the best place to ask such questions. Could you please close this issue?