prometheus-community / helm-charts

Prometheus community Helm charts

[HELP] - Alertmanager Email and Config Issues #4830

Closed: RichieRogers closed this issue 1 month ago

RichieRogers commented 1 month ago

Hi,

I've got Prometheus, Grafana and Alertmanager running OK on a Raspberry Pi, sending alert emails etc. I'm now setting up a Kubernetes cluster and have installed the Prometheus/Grafana/Alertmanager stack. Everything there seems to be working: Prometheus exporters for things inside and outside the Kubernetes environment are logging fine, and alerts are flagging up in Alertmanager as expected.

However, emails are not being sent to my SMTP smarthost (a physical Windows server on my network, the same server my other Prometheus/Alertmanager setup uses). I've spent well over a week trawling the internet and trying out many suggested ways of configuring Alertmanager to send emails, but nothing has worked. I can telnet to my SMTP server from inside the Alertmanager pod, so I'm assuming the issue must be in my Alertmanager config. Can someone shed some light on where I'm going wrong?
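For reference, the connectivity check from inside the pod can be run roughly like this; the namespace, pod name and availability of telnet in the container are assumptions based on the default naming for this chart:

# Hypothetical connectivity check from inside the Alertmanager pod
# (namespace and pod name are assumptions; adjust to your release)
kubectl exec -n monitoring -it alertmanager-kube-prometheus-stack-alertmanager-0 -- \
  telnet smtpserver 25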

Info:
Helm version: version.BuildInfo{Version:"v3.15.3", GitCommit:"3bb50bbbdd9c946ba9989fbe4fb4104766302a64", GitTreeState:"clean", GoVersion:"go1.22.5"}
Chart version: kube-prometheus-stack-62.3.1
Kubectl version: Client Version: v1.30.3, Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version: v1.30.3

Alertmanager Config (passed in with helm install)

alertmanager:
  enabled: true
  alertmanagerSpec:
    alertmanagerConfigMatcherStrategy:
      type: None
    storage: 
      volumeClaimTemplate:
        spec:
          storageClassName: rook-ceph-block
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
  config:
    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'smtpserver:25'
      smtp_from: 'alertmanager@int.domain.com'
      smtp_require_tls: false

    # The directory from which notification templates are read.
    templates:
      - '/etc/alertmanager/template/*.tmpl'

    # The root route on which each incoming alert enters.
    route:
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      #
      # To aggregate by all possible labels use '...' as the sole label name.
      # This effectively disables aggregation entirely, passing through all
      # alerts as-is. This is unlikely to be what you want, unless you have
      # a very low alert volume or your upstream notification system performs
      # its own grouping. Example: group_by: [...]
      group_by: ['alertname', 'cluster', 'service']

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification.
      group_wait: 30s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 5m

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 6h

      # A default receiver
      receiver: sys-admin-email

    # Inhibition rules allow to mute a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
      - source_matchers: [severity="critical"]
        target_matchers: [severity="warning"]
        # Apply inhibition if the alertname is the same.
        # CAUTION:
        #   If all label names listed in `equal` are missing
        #   from both the source and target alerts,
        #   the inhibition rule will apply!
        equal: [alertname, cluster, service]

    receivers:
      - name: 'sys-admin-email'
        email_configs:
          - to: 'sys.admin@domain.com'
            require_tls: false
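For context, the values above are passed in roughly like this; the release name, namespace and values file name are assumptions rather than taken verbatim from my setup:

# Hypothetical install command; release name, namespace and file name are assumptions
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f alertmanager-values.yaml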

Alertmanager Secret (kubectl create secret generic -n monitoring alertmanager-kube-prometheus-stack-alertmanager --from-file=alertmanager.yaml --dry-run=client -o yaml | kubectl apply -f - )

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtpserver:25'
  smtp_from: 'alertmanager@domain.com'
  smtp_require_tls: false

# The directory from which notification templates are read.
templates:
  - '/etc/alertmanager/template/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  #
  # To aggregate by all possible labels use '...' as the sole label name.
  # This effectively disables aggregation entirely, passing through all
  # alerts as-is. This is unlikely to be what you want, unless you have
  # a very low alert volume or your upstream notification system performs
  # its own grouping. Example: group_by: [...]
  group_by: ['alertname', 'cluster', 'service']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 6h

  # A default receiver
  receiver: sys-admin-email

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    # Apply inhibition if the alertname is the same.
    # CAUTION:
    #   If all label names listed in `equal` are missing
    #   from both the source and target alerts,
    #   the inhibition rule will apply!
    equal: [alertname, cluster, service]

receivers:
  - name: 'sys-admin-email'
    email_configs:
      - to: 'sys.admin@domain.com'
        require_tls: false
        send_resolved: true
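One way to double-check that the secret really contains the intended config is to pull it back out of the cluster and validate it with amtool; the key name below assumes the file passed to --from-file was called alertmanager.yaml, as in the command above:

# Extract the config from the secret and validate it with amtool
kubectl get secret -n monitoring alertmanager-kube-prometheus-stack-alertmanager \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > /tmp/alertmanager.yaml
amtool check-config /tmp/alertmanager.yaml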

Alertmanager Alerts: [screenshot attached]

I'm getting frustrated now :(

Thanks, Richie

zeritti commented 1 month ago

Your configuration seems to have been accepted by the Prometheus operator (at least once, that is) and passed to Alertmanager, as seen in the console. Furthermore, it also validates successfully with amtool. I'd suggest you have a look at Alertmanager's log, preferably after setting the log level to debug:

alertmanager:
  alertmanagerSpec:
    logLevel: "debug"

You should see it attempting to send an email notification. Taking a look at the operator's log helps as well.
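A sketch of the two log checks; the pod and deployment names are assumptions based on the default naming for a release called kube-prometheus-stack:

# Alertmanager log (at debug level it shows notification attempts and SMTP errors)
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager

# Prometheus operator log (shows whether the Alertmanager config was provisioned)
kubectl logs -n monitoring deploy/kube-prometheus-stack-operator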

A note aside: if the 'null' receiver is excluded from the configuration, the Prometheus operator should fail to provision the Alertmanager config on install and, hence, Alertmanager should not start. On upgrade this works if Alertmanager is already running, but Alertmanager does not get the new/latest config and keeps running with the last successfully loaded, possibly outdated, one.
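To illustrate, a minimal sketch of keeping the 'null' receiver alongside the custom one, roughly following the chart's default values layout (exact defaults differ between chart versions):

alertmanager:
  config:
    route:
      receiver: sys-admin-email
      routes:
        # Keep the default child route so the Watchdog alert still has a receiver
        - receiver: 'null'
          matchers:
            - alertname = "Watchdog"
    receivers:
      # Keep the 'null' receiver so the operator can provision the config
      - name: 'null'
      - name: 'sys-admin-email'
        email_configs:
          - to: 'sys.admin@domain.com'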

RichieRogers commented 1 month ago

Hi,

Thanks for checking it over. I'm now very annoyed with myself: after further investigation I found that the issue was the "from" email address being blocked by my mail relay :( Once I resolved that, I started getting emails (after blowing away my existing config). I now have some basic emails coming through; at some point I'll have to see if there are any "pretty" email templates for Alertmanager.

Thanks, Richie
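As a pointer on the "pretty" templates mentioned above, a hedged sketch of wiring a custom HTML template into the email receiver; the template name and file are hypothetical, and the template file also has to be made available to the pod and matched by the templates glob in the config:

# Hypothetical custom template, e.g. defined in /etc/alertmanager/template/email.tmpl as
#   {{ define "email.custom.html" }} ... {{ end }}
receivers:
  - name: 'sys-admin-email'
    email_configs:
      - to: 'sys.admin@domain.com'
        send_resolved: true
        # Render the named template instead of Alertmanager's default email body
        html: '{{ template "email.custom.html" . }}'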