prometheus-operator / prometheus-operator

Prometheus Operator creates/configures/manages Prometheus clusters atop Kubernetes
https://prometheus-operator.dev
Apache License 2.0

Issues creating AlertmanagerConfig. Prometheus Operator keeps trying to get a Secret #5327

Closed Jcardoso96 closed 1 year ago

Jcardoso96 commented 1 year ago

What did you do? I wanted to start using AlertmanagerConfig and the alertmanagerConfigSelector/matchLabels option in Alertmanager to set up the latter's configuration.

I created an AlertmanagerConfig resource and expected Prometheus Operator to merge this configuration with the default one. However, this does not happen, and the operator logs state that no secret has been found, as if it were trying to use a secret to get the Alertmanager configuration.

I explored two different avenues to debug: either the operator was still trying to deploy a secret and ignoring the AlertmanagerConfig CRD, or the config itself was invalid, so the operator ignored it and fell back to deploying a configuration from a non-existent secret as the default.

However, I can't seem to make it work. Has anybody else had a similar issue?

Environment

image: quay.io/prometheus-operator/prometheus-operator:v0.62.0

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: config-example
  labels:
    alertmanagerConfig: teamlabel
spec:
  receivers:
    - name: default-receiver
    - name: abc
      pagerdutyConfigs:
      - routingKey: 
          key: "randomKey"
        description: 'hgd'
        severity: 'critical'
        client: Grafana
        clientURL: https://testing.com
    - name: abc-officehours
      pagerdutyConfigs:
      - routingKey: 
          key: "randomKey"
        description: 'hgd'
        severity: 'critical'
        client: Grafana
        clientURL: https://testing.com

  route:
    groupBy: ['job']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: default-receiver

    routes:
    - matchers:
      - name: pagerduty
        value: abc
      receiver: 'abc'
    - matchers:
      - name: pagerduty
        value: abc-officehours
      receiver: 'abc-officehours'
level=info ts=2023-02-08T23:13:41.381476262Z caller=operator.go:457 component=alertmanageroperator msg="connection established" cluster-version=v1.25.3
level=info ts=2023-02-08T23:13:41.381631471Z caller=operator.go:466 component=alertmanageroperator msg="CRD API endpoints ready"
level=info ts=2023-02-08T23:13:41.481771096Z caller=operator.go:297 component=alertmanageroperator msg="successfully synced all caches"
level=info ts=2023-02-08T23:13:41.481987429Z caller=operator.go:638 component=alertmanageroperator key=monitoring/prometheus-operator-alertmanager msg="sync alertmanager"
level=warn ts=2023-02-08T23:13:41.482175554Z caller=operator.go:1003 component=alertmanageroperator msg="skipping alertmanagerconfig" error="unable to get secret \"\": resource name may not be empty" alertmanagerconfig=monitoring/config-example namespace=monitoring alertmanager=prometheus-operator-alertmanager
level=info ts=2023-02-08T23:13:41.523531429Z caller=operator.go:638 component=alertmanageroperator key=monitoring/prometheus-operator-alertmanager msg="sync alertmanager"
level=warn ts=2023-02-08T23:13:41.523767637Z caller=operator.go:1003 component=alertmanageroperator msg="skipping alertmanagerconfig" error="unable to get secret \"\": resource name may not be empty" alertmanagerconfig=monitoring/config-example namespace=monitoring alertmanager=prometheus-operator-alertmanager
JoaoBraveCoding commented 1 year ago

Hello @Jcardoso96 👋 I believe you have to set Alertmanager.spec.alertmanagerConfiguration.name to match the name of your AlertmanagerConfig. That might be what's missing, if I'm not mistaken.
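
For reference, a minimal sketch of what that suggestion would look like (resource names taken from this thread; field availability depends on the operator version):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: prometheus-operator-alertmanager
  namespace: monitoring
spec:
  # Points at an AlertmanagerConfig in the same namespace to use as the
  # global/base configuration, instead of the default generated secret.
  alertmanagerConfiguration:
    name: config-example
```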

Jcardoso96 commented 1 year ago

Hi @JoaoBraveCoding, I forgot to mention it, but in the values.yaml used to deploy Prometheus Operator I am setting alertmanagerConfigSelector.matchLabels to the label defined in the AlertmanagerConfig.

prometheus-operator:
  alertmanager:
    verticalPodAutoscaler:
      enabled: false
      updatePolicy:
        updateMode: "Off"
    alertmanagerSpec:
      alertmanagerConfigSelector:
        matchLabels:
          alertmanagerConfig: teamlabel
      storage:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 10Gi

Doing it with the name directly unfortunately would not work in our use case, because we need to merge different configs when installing Prometheus. The idea is for each team to have their own Prometheus instance: all teams share a base config, but each team can also set up configs specific to them. Thus we need to merge the base config and the team's configs (selected via the appropriate label) for their Prometheus instance.
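
To illustrate the merge-by-label approach described above, a sketch of a team-specific config (the name and namespace here are hypothetical); the operator merges every AlertmanagerConfig matched by alertmanagerConfigSelector into the final Alertmanager configuration:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: team-a-config        # hypothetical team config
  namespace: team-a          # hypothetical team namespace
  labels:
    # Same label as the base config, so the selector picks it up too.
    alertmanagerConfig: teamlabel
spec:
  route:
    receiver: team-a-receiver
  receivers:
    - name: team-a-receiver
```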

JoaoBraveCoding commented 1 year ago

I see, then if I'm not mistaken you also have to set alertmanagerConfigNamespaceSelector to {} to select all the namespaces; if it's nil (which is the default if you don't configure it) the operator will not look for AlertmanagerConfig in other namespaces.
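
In the Alertmanager spec that would look like (a sketch, reusing the selector from the values.yaml above):

```yaml
spec:
  alertmanagerConfigSelector:
    matchLabels:
      alertmanagerConfig: teamlabel
  # An empty selector ({}) matches all namespaces; leaving it unset (nil)
  # restricts the search to the Alertmanager's own namespace.
  alertmanagerConfigNamespaceSelector: {}
```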

Jcardoso96 commented 1 year ago

Did not seem to work, unfortunately :(

I might be wrong, but I think the operator is able to find the AlertmanagerConfig: if I change the alertmanagerConfig label to something different from the alertmanagerConfigSelector in Alertmanager, then I don't have issues with the configuration. It's only when the labels match and the operator finds the configuration that I get the error. Could I possibly have some YAML error in the configuration I posted, or does it look okay?

JoaoBraveCoding commented 1 year ago

Sorry, @Jcardoso96, I misunderstood the problem 🤦 If we look at this log line

level=warn ts=2023-02-08T23:13:41.482175554Z caller=operator.go:1003 component=alertmanageroperator msg="skipping alertmanagerconfig" error="unable to get secret \"\": resource name may not be empty" alertmanagerconfig=monitoring/config-example namespace=monitoring alertmanager=prometheus-operator-alertmanager

it seems that you don't provide a secret name under routingKey, and if I'm not mistaken you need to; otherwise the operator is not able to fetch that key.
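
In other words, routingKey is a secret key selector, so it needs both the name of a Secret and the key inside it. A sketch of the fix, assuming a Secret named pagerduty-secret (hypothetical) holds the PagerDuty integration key in the same namespace:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: pagerduty-secret     # hypothetical Secret name
  namespace: monitoring
stringData:
  randomKey: <pagerduty-integration-key>
---
# In each AlertmanagerConfig receiver:
pagerdutyConfigs:
  - routingKey:
      name: pagerduty-secret # Secret to read from (this was missing)
      key: "randomKey"       # key inside that Secret
```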

Jcardoso96 commented 1 year ago

That seemed to fix it, thank you João!