hashicorp / consul-k8s

First-class support for Consul Service Mesh on Kubernetes
https://www.consul.io/docs/k8s
Mozilla Public License 2.0
667 stars 316 forks

Possible circular dependency in ACL management #3738

Open schoeppi5 opened 6 months ago

schoeppi5 commented 6 months ago


Overview of the Issue

I ran into several issues when installing a new Consul cluster on Kubernetes using the Helm chart. When configuring the Helm chart to manage ACLs for the cluster, I found a circular dependency, which prevented me from installing the cluster without modifying the manifests using kustomize.

Reproduction Steps

Steps to reproduce:

Install the Helm chart using the following values:

global:
  name: consul
  secretsBackend:
    vault:
      enabled: false
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: bootstrap-token-secret
      secretKey: bootstrap-token
server:
  replicas: 3
  connect: false
  exposeService:
    enabled: false
ui:
  enabled: true
  service:
    enabled: true
  ingress:
    enabled: false
syncCatalog:
  enabled: false
connectInject:
  enabled: false
meshGateway:
  enabled: false
ingressGateways:
  enabled: false
terminatingGateways:
  enabled: false
tests:
  enabled: false
telemetryCollector:
  enabled: false

Actual behavior

This will lead to two things:

  1. The servers' StatefulSet tries to load the bootstrap token from the secret.
    • If the secret does not exist (which it doesn't on a clean install), the server pods cannot start.
  2. The consul-k8s-control-plane server-acl-init job never completes: because no server pods are running, it cannot resolve the DNS name of the headless server service.

I realized that the .global.acls.bootstrapToken option is probably meant to be set only if you already have a bootstrap token. If that is the case, the docs should make it clearer.

One possible workaround I tried was creating an empty secret with the same name. This allows the Consul servers to start and the server-acl-init job to initialize the ACL system, but the job ultimately fails because it always tries to create the secret rather than update it, which contradicts the documentation.

The other possibility is to remove .global.acls.bootstrapToken, which removes the env var from the StatefulSet. This works: the job initializes the ACL system and creates the secret with the token. However, I ran into another issue with the server-acl-init job. The job resolves the IPs of the Consul servers from the DNS name of the headless service, which introduces a race condition, because DNS only returns the IP addresses of Pods that have already started. In my testing, the job almost always received only one or two of the three IP addresses, so only those Consul servers received a server token and the remaining servers were unable to communicate using ACLs. A fix would be for the job to know how many servers to expect (like -bootstrap-expect).
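To illustrate the suggested fix of waiting for a known number of servers, here is a minimal sketch in POSIX sh. It is not how server-acl-init works; the function names, the retry defaults, and the assumption that the first "Address:" line of the nslookup output is the resolver are all hypothetical.

```sh
#!/bin/sh
# Hypothetical helpers; none of these names come from consul-k8s.

# Print one address per line for DNS name $1, skipping the first
# "Address:" line, which is assumed to be the resolver itself.
resolve_addresses() {
  nslookup "$1" | awk '/^Address:/ { n++; if (n > 1) print $2 }'
}

# wait_for_servers NAME EXPECTED [RETRIES] [DELAY_SECONDS]
# Poll until exactly EXPECTED addresses resolve, then print them
# space-separated. Return 1 if that never happens within RETRIES attempts.
wait_for_servers() {
  name=$1
  expected=$2
  retries=${3:-30}
  delay=${4:-2}
  attempt=1
  count=0
  while [ "$attempt" -le "$retries" ]; do
    addrs=$(resolve_addresses "$name")
    # Count non-empty lines; prints 0 when nothing resolved yet.
    count=$(printf '%s\n' "$addrs" | awk 'NF { c++ } END { print c + 0 }')
    if [ "$count" -eq "$expected" ]; then
      printf '%s\n' "$addrs" | tr '\n' ' '
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "expected $expected addresses for $name, last saw $count" >&2
  return 1
}
```

Calling `wait_for_servers consul-server 3` would then block until all three server Pods are visible in DNS instead of proceeding with whichever subset happened to be up.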

Workaround solution

I solved these issues with a kustomization patch to the acl init job:

apiVersion: batch/v1
kind: Job
metadata:
  name: kustomize
spec:
  template:
    spec:
      containers:
      - name: server-acl-init-job
        env: 
        - name: CONSUL_ADDRESSES
          value: "exec=/nslookup.sh consul-server"
        - name: CONSUL_TLS_SERVER_NAME
          value: "consul-server"
        volumeMounts:
        - name: nslookup
          mountPath: /nslookup.sh
          subPath: nslookup.sh
      volumes:
      - name: nslookup
        configMap:
          name: consul-nslookup-exec
          items:
          - key: nslookup.sh
            path: nslookup.sh
          defaultMode: 0777

and the nslookup.sh script:

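#!/bin/sh
# Print the addresses behind the DNS name in $1, separated by spaces.
# The first "Address:" line is assumed to be the resolver and is skipped.
# Exit non-zero unless exactly three addresses are returned.

nslookup "$1" | awk 'BEGIN {count=0} /^Address:/ {count++; if (count > 1) printf "%s ", $2} END {if (count-1 != 3) {printf "Less than three (%s) addresses found for consul", count-1; exit 1}}'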
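The script above hard-codes three servers. As a sketch only, the same parsing could take the expected count as an argument and report the mismatch on stderr, so the message is not mixed into the address list. `parse_addresses` is a hypothetical name, and the nslookup output format (resolver first, one "Address:" line per record) is an assumption.

```sh
#!/bin/sh
# parse_addresses EXPECTED
# Reads nslookup output on stdin, skips the first "Address:" line (assumed
# to be the resolver), prints the remaining addresses separated by spaces,
# and fails with a message on stderr unless exactly EXPECTED were found.
parse_addresses() {
  awk -v want="$1" '
    /^Address:/ { n++; if (n > 1) printf "%s ", $2 }
    END {
      found = (n > 0) ? n - 1 : 0
      if (found != want) {
        printf "expected %d addresses, found %d\n", want, found > "/dev/stderr"
        exit 1
      }
    }'
}
```

It would be used as `nslookup consul-server | parse_addresses 3`, with the count kept in sync with server.replicas.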

Philipp Schöppner [philipp.schoeppner@mercedes-benz.com](mailto:philipp.schoeppner@mercedes-benz.com), Mercedes-Benz Tech Innovation GmbH (Provider Information)

schreibergeorg commented 5 months ago

:+1: