PromQL alert error - Vector contains metrics with the same labelset after applying rule labels #3428

Open: edtshuma opened this issue 1 year ago

edtshuma commented 1 year ago

I have set up a CronJob whose container exits with a non-zero code. I have a PrometheusRule and an AlertmanagerConfig set up against this CronJob, but the alert is not firing as expected. The alerting is based on this example.

This is the CronJob definition:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: exitjob
  namespace: monitoring
spec:
  schedule: "*/4 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - command:
                # run a shell that exits with a non-zero code so the Job is marked failed
                - sh
                - -c
                - exit 1
              image: alpine
              imagePullPolicy: Always
              name: main
          restartPolicy: Never
          terminationGracePeriodSeconds: 30
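
Before looking at the rules, it can help to confirm that the failed Job is visible to Prometheus at all. A minimal sketch, assuming kube-state-metrics is scraped (it is part of the kube-prometheus stack) and Prometheus is reachable on localhost:9090 via a hypothetical port-forward; both queries should return at least one series for the exitjob runs:

# check the raw kube-state-metrics series the recording rules build on
promtool query instant http://localhost:9090 'kube_job_status_failed{namespace="monitoring"} > 0'
promtool query instant http://localhost:9090 'kube_job_owner{namespace="monitoring", owner_name="exitjob"}'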

And this is the PrometheusRule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: failing-job-alert
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: kube-cron
      rules:
        - record: job:kube_job_status_start_time:max
          expr: |
            label_replace(
              label_replace(
                max(
                  kube_job_status_start_time
                  * ON(job_name, namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (job_name, owner_name, namespace)
                == ON(owner_name) GROUP_LEFT()
                max(
                  kube_job_status_start_time
                  * ON(job_name, namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (owner_name),
              "job", "$1", "job_name", "(.+)"),
            "cronjob", "$1", "owner_name", "(.+)")

        - record: job:kube_job_status_failed:sum
          expr: |
            clamp_max(
              job:kube_job_status_start_time:max,1)
              * ON(job, namespace) GROUP_LEFT()
              label_replace(
                label_replace(
                  (kube_job_status_failed != 0),
                  "job", "$1", "job_name", "(.+)"),
                "cronjob", "$1", "owner_name", "(.+)")
        - alert: CronJobStatusFailed
          expr: |
            job_cronjob:kube_job_status_failed:sum
            * ON(job, namespace) GROUP_RIGHT()
            kube_cronjob_labels
            > 0
          labels:
            severity: critical
            job: cron-failure
            namespace: monitoring
          for: 1m
          annotations:
            summary: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'
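
One way to narrow this down is to evaluate the recording rule outputs and the alert expression by hand and see which step stops returning series. Note that the alert expression references job_cronjob:kube_job_status_failed:sum while the recording rule above is named job:kube_job_status_failed:sum, so it is worth checking both names. A sketch, again assuming a hypothetical port-forward to localhost:9090:

# recorded series from the two recording rules
promtool query instant http://localhost:9090 'job:kube_job_status_start_time:max'
promtool query instant http://localhost:9090 'job:kube_job_status_failed:sum'
# the series name the alert actually references
promtool query instant http://localhost:9090 'job_cronjob:kube_job_status_failed:sum'
# the full alert expression
promtool query instant http://localhost:9090 'job_cronjob:kube_job_status_failed:sum * ON(job, namespace) GROUP_RIGHT() kube_cronjob_labels > 0'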

And the associated AlertmanagerConfig:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: cronjob-failure-receiver
  namespace: monitoring
  labels:
    release: prometheus
spec:
  route:
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 2m
    repeatInterval: 2m
    receiver: cron-email
    routes:
      - matchers:
        - name: job
          value: cron-failure
        receiver: cron-email
  receivers:
    - name: cron-email
      emailConfigs:
        - to: 'etshuma@mycompany.com'
          from: 'devops@mycompany.com'
          smarthost: 'mail2.mycompany.com:25'
          requireTLS: false
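
The route and receiver can be exercised independently of the PromQL side by injecting a synthetic alert carrying the job=cron-failure label. A minimal sketch using amtool (which ships with Alertmanager), assuming a hypothetical port-forward to localhost:9093:

amtool alert add alertname=CronJobStatusFailed job=cron-failure namespace=monitoring severity=critical \
  --annotation=summary='manual routing test' \
  --alertmanager.url=http://localhost:9093

If the test email arrives, routing and SMTP settings are fine and the problem is on the Prometheus rule side.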

What did you expect to see?

I expected alerts to be delivered to the specified email address. Email delivery itself is already working, as I have another alert set up that sends notifications successfully.

Environment: Kubernetes

Versions: the Alertmanager and Prometheus instances are both installed via the Bitnami kube-prometheus stack.

Prometheus: v2.44.0
Alertmanager: v0.25.0
Helm chart: 8.14.0, rendered with:

helm template prometheus bitnami/kube-prometheus --namespace monitoring --version 8.14.0 -f prometheus-values.yaml > ./output/values.yaml

Alertmanager logs:

ts=2023-07-24T13:30:44.192Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T13:30:44.218Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T15:27:44.721Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T15:27:44.730Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T16:57:45.192Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T16:57:45.229Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
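
These log lines only show configuration reloads, so they do not tell us whether Prometheus ever delivered an alert. A quick way to check is to ask Alertmanager directly for its current alerts (again assuming a hypothetical port-forward to localhost:9093); empty output means Prometheus has not pushed anything:

curl -s http://localhost:9093/api/v2/alerts
# or, equivalently, with amtool
amtool alert query --alertmanager.url=http://localhost:9093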

In the Prometheus UI the alert appears as inactive:

[screenshot: the CronJobStatusFailed alert listed as inactive]

What am I missing?

simonpasquier commented 1 year ago

This is more of a PromQL question than an Alertmanager one.