hashicorp / vault-helm

Helm chart to install Vault and other associated components.
Mozilla Public License 2.0

Prometheus metrics disappear in HA setup when all Vault pods are sealed #990

Open cascadia-sati opened 5 months ago

cascadia-sati commented 5 months ago

Describe the bug

I'm deploying an HA Vault setup in our Kubernetes cluster with three replicas. While working on monitoring for the seal status of the Vault pods, I noticed that the Prometheus metrics go away when all Vault pods are sealed, which makes it impossible to trigger an alert for this state.

This appears to happen because the vault ServiceMonitor selects the vault-active Service, which in turn selects only the Vault pod carrying the vault-active: "true" label. When all Vault pods are sealed, they all carry vault-active: "false", so the Service has no matching pod and returns 503 when the ServiceMonitor attempts to fetch metrics.

To Reproduce

Simply configure Prometheus metrics, then seal all the Vault pods by restarting them.

Expected behavior

We should be able to get metrics and monitor the seal state via the vault_core_unsealed metric even when all Vault pods are sealed.
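
For illustration, the kind of alert this would enable might look like the PrometheusRule below. This is only a sketch: the rule name, namespace, alert names, durations, and severity values are made up, and the pod label is assumed to be attached by the Prometheus Operator's endpoint discovery. The second rule uses absent() to catch the case described in this issue, where the series disappears entirely.

```yaml
# Hypothetical PrometheusRule; names, namespace, and thresholds are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vault-seal-status        # hypothetical name
  namespace: vault               # assumed release namespace
spec:
  groups:
    - name: vault.seal
      rules:
        # Fires for each pod that reports itself as sealed.
        - alert: VaultPodSealed
          expr: vault_core_unsealed == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Vault pod {{ $labels.pod }} is sealed"
        # Fires when no vault_core_unsealed series is scraped at all,
        # which is what currently happens when every pod is sealed.
        - alert: VaultSealMetricMissing
          expr: absent(vault_core_unsealed)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "No vault_core_unsealed series is being scraped"
```

The first rule only becomes useful once sealed pods are actually scraped; until then, only the absent() rule can fire in the all-sealed state.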

We achieved this by removing vault-active: "true" from the ServiceMonitor matchLabels field and adding a new unique label both there and to the vault Service object. This ensures the ServiceMonitor uses only the vault Service object, which routes to the Vault pods regardless of their active status.
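
Roughly, the workaround looks like the sketch below. The label vault-metrics: "true" is a placeholder for whatever unique label you choose. The release name vault, the namespace vault, the port name http, and the scrape settings are assumptions based on our setup, not values taken verbatim from the chart.

```yaml
# Sketch of the workaround; the vault-metrics label is a hypothetical name.
apiVersion: v1
kind: Service
metadata:
  name: vault                    # the chart's general Service (assumed release name "vault")
  namespace: vault               # assumed namespace
  labels:
    app.kubernetes.io/name: vault
    app.kubernetes.io/instance: vault
    vault-metrics: "true"        # new unique label (hypothetical)
spec:
  selector:                      # matches all server pods, active or not
    app.kubernetes.io/name: vault
    app.kubernetes.io/instance: vault
    component: server
  ports:
    - name: http                 # assumed port name, since TLS is disabled
      port: 8200
      targetPort: 8200
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vault
  namespace: vault
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: vault
      app.kubernetes.io/instance: vault
      vault-metrics: "true"      # replaces vault-active: "true"
  namespaceSelector:
    matchNames:
      - vault
  endpoints:
    - port: http
      scheme: http
      path: /v1/sys/metrics
      params:
        format:
          - prometheus
      interval: 30s              # assumed scrape interval
```

The unique label is needed because dropping vault-active: "true" alone would leave the remaining matchLabels matching every Service that carries the chart's common name and instance labels, not just the vault Service.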

Environment

Chart values:

global:
  serverTelemetry:
    prometheusOperator: true
injector:
  enabled: false
server:
  ha:
    enabled: true
    replicas: 3
    # Enable HA for integrated storage
    raft:
      enabled: true
      setNodeId: true
      config: |
        ui = true

        listener "tcp" {
          tls_disable = 1
          address = "[::]:8200"
          cluster_address = "[::]:8201"

          # Enable unauthenticated metrics access for Prometheus Operator
          telemetry {
            unauthenticated_metrics_access = "true"
          }
        }

        telemetry {
          prometheus_retention_time = "30m"
          disable_hostname = true
        }

        storage "raft" {
          path = "/vault/data"
        }

        # For integrated raft storage and security
        # https://developer.hashicorp.com/vault/docs/configuration#disable_mlock
        disable_mlock = true

        service_registration "kubernetes" {}
  serverTelemetry:
    serviceMonitor:
      enabled: true
  dataStorage:
    enabled: true
    size: 5Gi
    storageClass: ebs-gp3
  affinity: |
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: {{ template "vault.name" . }}
              app.kubernetes.io/instance: "{{ .Release.Name }}"
              component: server
          topologyKey: topology.kubernetes.io/zone