grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Alertmanager feature, testing and logging. #9286

Closed naguam closed 1 week ago

naguam commented 1 week ago

Hello,

I'm using a set of rules that previously worked with the standard Alertmanager and that also works with the Mimir Alertmanager (imported with mimirtool): the Grafana UI shows the alerts as firing when they trigger.

name: alert.rules
rules:
    - alert: PrometheusTargetMissing
      expr: up{job="prometheus.scrape.default"} == 0
      labels:
        severity: critical
      annotations:
        description: |-
            A Prometheus target has disappeared. An exporter might have crashed.
              VALUE = {{ $value }}
              LABELS = {{ $labels }}            
        summary: Prometheus target missing (instance {{ $labels.instance }})
    - alert: HostOutOfDiskSpace
      expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
      for: 2m
      labels:
        severity: warning
      annotations:
        description: |-
            Disk is almost full (< 10% left)
              VALUE = {{ $value }}
              LABELS = {{ $labels }}            
        summary: Host out of disk space (instance {{ $labels.instance }})
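
For reference, this group was pushed into Mimir with mimirtool. A minimal sketch of the kind of command I mean, assuming the group above is saved in rules.yaml and the default anonymous tenant of my no-auth setup (depending on the mimirtool version, the file may also need a namespace/groups wrapper):

# Hypothetical file name; --address and --id match my local single-node, no-auth setup.
mimirtool rules load rules.yaml --address=http://localhost:9009 --id=anonymous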

In the Grafana UI Alerting settings, I configured the Mimir Alertmanager as the default Alertmanager (even for Grafana-managed alerts).

I configured the global config and the default contact point for the default notification policy, to check that I receive emails before building a fancier configuration.

First problem: weirdly, and contrary to all of Grafana's documentation, there was no test button for the contact point. Second problem: I checked both the Mimir and Grafana logs, and there were no SMTP errors or anything related to sending. Third problem: mimirtool does not have a test/send-test command, and amtool does not seem compatible for that specific feature (it did not find my group, or maybe I don't know how to do it without a group, with amtool config group test or something like that).

On the :9009/alertmanager page, I don't see the alerts that Grafana reports as firing.
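
To be explicit about what I checked: I would expect the firing alerts to also show up when querying that tenant's Alertmanager API directly. A sketch of such a request, assuming the default anonymous tenant and the usual Alertmanager v2 path under /alertmanager (both assumptions based on my no-auth setup):

# Assumed tenant header and API path; adjust if multitenancy is configured differently.
curl -H "X-Scope-OrgID: anonymous" http://localhost:9009/alertmanager/api/v2/alerts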

Because of the lack of useful information from the suite (the logs are mostly repeating status messages), I am unable to diagnose what I've been doing wrong.

For the configuration set up in the UI, mimirtool alertmanager get shows this (email addresses/passwords redacted):

global:
    http_config:
        enable_http2: true
        follow_redirects: true
        tls_config:
            insecure_skip_verify: false
    resolve_timeout: 1m
    smtp_auth_identity: <hidden>
    smtp_auth_password: <hidden>
    smtp_auth_username: <hidden>
    smtp_from: <hidden>
    smtp_require_tls: true
    smtp_smarthost: <hidden>
receivers:
    - email_configs:
        - auth_identity: <hidden>
          auth_password: <hidden>
          auth_username: <hidden>
          from: <hidden>
          require_tls: true
          send_resolved: true
          smarthost: <hidden>
          tls_config:
            insecure_skip_verify: false
          to: <hidden>
      name: default-receiver
route:
    continue: false
    matchers: []
    mute_time_intervals: []
    receiver: default-receiver
    repeat_interval: 15m
    routes: []

Templates:
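
For clarity, the dump above comes from a command along these lines (the address and tenant ID reflect my default single-node setup and may need adjusting); the same config can be pushed back with mimirtool alertmanager load using the same flags:

mimirtool alertmanager get --address=http://localhost:9009 --id=anonymous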

Do you know what I've been doing wrong?

Thanks

Otherwise, I think there is room for improvement in the Alertmanager integration documentation and testing features.

Version of Grafana: 11.2.0 (official deb repo)
Version of Mimir: 2.13.0 (official deb repo)
Version of Mimirtool: 2.13.0 (official deb repo)

Both Mimir and Grafana run on the same node with default auth (anonymous tenant, http://localhost:9009 as the Alertmanager source, etc.).
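
By "default auth" I mean that multitenancy is effectively off and everything runs under the anonymous tenant; in Mimir configuration terms that corresponds to something like the following (a sketch, not my exact file):

# Assumed setting; with multitenancy disabled, Mimir attributes requests to the anonymous tenant.
multitenancy_enabled: false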

naguam commented 1 week ago

OK, from my research I found the solution: https://github.com/grafana/mimir/discussions/3297.

Still, I believe https://localhost:9009/alertmanager should be the default unless stated otherwise, and this needs to be much better documented.
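
In other words, the Grafana Alertmanager data source needs to point at the /alertmanager path on the Mimir base URL. A provisioning-style sketch of such a data source (field names as I understand them from the Grafana data source docs; I actually configured it through the UI, and http is used here because my setup has no TLS):

apiVersion: 1
datasources:
  - name: Mimir Alertmanager
    type: alertmanager
    access: proxy
    # Note the /alertmanager path appended to the Mimir base URL.
    url: http://localhost:9009/alertmanager
    jsonData:
      # Assumed fields: implementation and handleGrafanaManagedAlerts.
      implementation: mimir
      handleGrafanaManagedAlerts: true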