hashicorp / consul-k8s

First-class support for Consul Service Mesh on Kubernetes
https://www.consul.io/docs/k8s
Mozilla Public License 2.0

Servers go through rolling-deploy for config changes on Consul servers #1516

Closed dnlopes closed 8 months ago

dnlopes commented 2 years ago

Hello,

I have been trying out multiple deployment modes for Consul. I have successfully deployed a multi-DC setup via WAN federation using autoscaling groups in AWS, and now I’m moving on to trying a deployment on top of K8s. The experience was pretty smooth with ASGs; in the end I got a pretty stable setup where I could add/remove nodes at will and the datacenters reacted smoothly (e.g., consensus was impeccable).

However, I keep having a lot of stability issues with Raft consensus on top of K8s:

Question

  1. increasing from 3 replicas to 5 replicas for some reason makes Raft lose consensus. I don’t understand this: why would Raft lose consensus when adding replicas?
  2. with a 5-node deployment, changing a Consul setting (e.g., log rotation) and then running helm upgrade once again causes consensus to be lost.

One of the issues I believe I detected is that, because pods are named consul-server-1, consul-server-2, etc., a replaced pod comes back up with the same name as the one it replaced (instead of getting a random suffix, for example). This causes some members of the consensus protocol to detect two nodes with the same name (e.g., consul-server-2) running on different IPs (one from the new pod, and one from the old pod that was just replaced). Because this happens very quickly, the other nodes don’t have time to “forget” the old pod, which causes a naming conflict.
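For anyone trying to reproduce this, these are the checks I’ve been running to spot the duplicate/stale entries while a pod is being replaced. It’s only a rough sketch: it assumes the pod names from above and kubectl access to the right namespace, and it leaves out the -http-addr/-ca-file/-token flags (or the matching CONSUL_* environment variables) that the CLI needs when tls.httpsOnly and ACLs are enabled.

# Serf view: duplicate names or "failed" entries show up here for the old pod IP.
kubectl exec consul-server-0 -- consul members

# Raft view: shows which servers are voters and who the current leader is.
kubectl exec consul-server-0 -- consul operator raft list-peers

# If a dead server lingers in the peer set, it can be removed by hand.
kubectl exec consul-server-0 -- consul operator raft remove-peer -id=<stale-server-id>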

Other more general questions:

  1. how can I automate the server upgrade process without downtime? The official documentation mentions that I should manipulate the server.updatePartition setting in multiple phases (see the sketch after this list). However, in a “real” scenario, in which deploys are managed by CI/CD tools, does that mean I need to do multiple commits and multiple deploys to ensure all servers receive the upgrade? That sounds a bit unproductive. Are there any alternatives to this while still using the official Helm chart?
  2. the Helm chart is using deprecated settings, both on the Consul side and on the K8s side (e.g., TLS settings and PodSecurityPolicy). Is this a known issue?
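For reference, this is roughly what I understand the multi-phase procedure to look like when driven by hand. It’s a sketch based on my reading of the docs, assuming 5 replicas, a release named consul, and the hashicorp/consul chart; the exact value names should be double-checked against the chart version in use.

# Phase 1: apply the new config but hold back every server except the highest ordinal.
helm upgrade consul hashicorp/consul -f values.yaml --set server.updatePartition=4

# Wait for the upgraded server to rejoin and for the cluster to report a leader,
# then lower the partition by one and upgrade again; repeat until it reaches 0.
helm upgrade consul hashicorp/consul -f values.yaml --set server.updatePartition=3
# ...
helm upgrade consul hashicorp/consul -f values.yaml --set server.updatePartition=0

# Health check between phases.
kubectl exec consul-server-0 -- consul operator raft list-peers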

I’m probably missing something that could explain these issues, as I am fairly new to working with K8s.

Helm Configuration

global:
  enabled: true
  logLevel: "debug"
  logJSON: false
  name: "dlo"
  datacenter: "dlo"
  consulAPITimeout: "5s"
  enablePodSecurityPolicies: true
  recursors: []
  tls:
    enabled: true
    enableAutoEncrypt: true
    serverAdditionalDNSSANs: []
    serverAdditionalIPSANs: []
    verify: true
    httpsOnly: true
    caCert:
      secretName: null
      secretKey: null
    caKey:
      secretName: null
      secretKey: null
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: null
      secretKey: null
    createReplicationToken: true
    replicationToken:
      secretName: null
      secretKey: null
  gossipEncryption:
    autoGenerate: true
  federation:
    enabled: false
    createFederationSecret: false
    primaryDatacenter: null
    primaryGateways: []
    k8sAuthMethodHost: null
  metrics:
    enabled: false
    enableAgentMetrics: false
    agentMetricsRetentionTime: "1m"
    enableGatewayMetrics: true

server:
  replicas: 5
  #affinity: null # for minikube, set null
  connect: true # setup root CA and certificates
  extraConfig: |
    {
      "log_level": "DEBUG",
      "log_file": "/consul/",
      "log_rotate_duration": "24h",
      "log_rotate_max_files": 7
    }

client:
  enabled: false
  affinity: null
  updateStrategy: |
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  extraConfig: |
    {
      "log_level": "DEBUG"
    }

ui:
  enabled: true
  service:
    enabled: true
    type: LoadBalancer
    port:
      http: 80
      https: 443
  metrics:
    enabled: false
  ingress:
    enabled: false

dns:
  enabled: false

externalServers:
  enabled: false

syncCatalog:
  enabled: false

connectInject:
  enabled: false

controller:
  enabled: false

meshGateway:
  enabled: false

ingressGateways:
  enabled: false

terminatingGateways:
  enabled: false

apiGateway:
  enabled: false

webhookCertManager:
  tolerations: null

prometheus:
  enabled: false

(I know most of the values above are the defaults, but I wanted a YAML file with the full configuration so I could tweak it incrementally)

Steps to reproduce this issue

  1. helm install with 3 replicas and wait for healthy nodes
  2. change the config to 5 replicas and upgrade the helm installation (exact commands sketched below)
  3. consensus is lost and the nodes take a long time (> 5 minutes) to regain it
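Concretely, these are the commands I’m running; the release name and the values.yaml filename are just what I use locally, and the StatefulSet name may differ depending on global.name.

# Assumes the values file above is saved as values.yaml and the HashiCorp repo is added.
helm repo add hashicorp https://helm.releases.hashicorp.com

# Step 1: install with 3 server replicas and wait for the pods to become ready.
helm install consul hashicorp/consul -f values.yaml --set server.replicas=3
kubectl rollout status statefulset/consul-server --timeout=10m

# Step 2: scale the servers to 5 via helm upgrade.
helm upgrade consul hashicorp/consul -f values.yaml --set server.replicas=5

# Step 3: watch the peer set; this is where consensus is lost for several minutes.
kubectl exec consul-server-0 -- consul operator raft list-peers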

Current understanding and Expected behavior

  1. When adding nodes, consensus should not be lost
  2. When changing node configuration, pod replacement should be done carefully in order to keep consensus and avoid re-elections.

Environment details

I have tested this setup both in minikube and in AWS EKS, with the same outcome in both.