kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!

System upgrade fails due to PodDisruptionBudget #1507

Closed (Talinx closed this issue 2 weeks ago)

Talinx commented 3 weeks ago

Description

When PodDisruptionBudgets apply to multiple pods, the system upgrade can fail because a node cannot be completely drained.

Here is what happens:

- The system-upgrade-controller cordons a node and starts draining it.
- Some pods cannot be evicted, because evicting them would violate a PodDisruptionBudget.
- The drain never completes, so the node stays cordoned and the upgrade for that node never finishes.

The result is a cluster that is stuck wanting to upgrade, with cordoned nodes that no longer accept new pods and from which as many pods as possible have already been evicted. This effectively means downtime for the hosted application until the situation is resolved manually (unless every critical pod is itself protected by a PodDisruptionBudget and therefore never gets evicted).

This process is somewhat random. For example, with 2 worker nodes, if one node happens to be left with no pods once the disruption budgets are reached, that node can still be upgraded.

(In this case the Bitnami Elasticsearch Helm chart "caused" the problem, but the same can happen with anything that introduces enough PodDisruptionBudgets.)
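
For illustration only, here is a minimal sketch of the kind of PodDisruptionBudget that triggers this, written with the hashicorp/kubernetes Terraform provider (not part of the kube.tf below; names and values are made up). When min_available equals the number of matching replicas, the Eviction API refuses every eviction, so any drain of a node hosting one of these pods hangs:

# Hypothetical example, not from the original configuration.
provider "kubernetes" {
  config_path = "~/.kube/config" # assumption: a local kubeconfig for the cluster
}

resource "kubernetes_pod_disruption_budget_v1" "elasticsearch" {
  metadata {
    name      = "elasticsearch-master" # illustrative name
    namespace = "default"
  }

  spec {
    # With min_available equal to the replica count behind the selector,
    # no voluntary eviction is ever allowed, so "kubectl drain" (and the
    # drain performed during the system upgrade) can never complete.
    min_available = 3

    selector {
      match_labels = {
        app = "elasticsearch-master"
      }
    }
  }
}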

Kube.tf file

module "kube-hetzner" {

  providers = {
    hcloud = hcloud
  }

  hcloud_token = var.hcloud_token

  source = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.14.5"

  ssh_public_key  = file(var.ssh_public_key)
  ssh_private_key = file(var.ssh_private_key)

  hcloud_ssh_key_id = var.ssh_key_id

  network_region = "eu-central"

  allow_scheduling_on_control_plane = false

  control_plane_nodepools = [
    {
      name        = "cax11-master",
      server_type = "cax11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 3
    }
  ]

  agent_nodepools = [
    {
      name        = "cax21-worker",
      server_type = "cax21",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      floating_ip = false,
      count       = 2
    }
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"

  create_kubeconfig    = false
  create_kustomization = false

  traefik_values = <<EOT
globalArguments:
  - "--global.sendanonymoususage=false"
  EOT

}

Screenshots

[Screenshot From 2024-10-26 11-34-52]

Platform

Linux

pat-s commented 3 weeks ago

Related: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/1004 (in that case the longhorn-manager PDB prevents the drain as long as there are replicas on the node).

I am currently testing whether the system-upgrade-controller even needs to perform a drain/cordon at all - I think an in-place update of the k3s agent should suffice, without draining all pods.

Talinx commented 3 weeks ago

It would be great if no drain were necessary. Maybe sequentially draining and updating the nodes could also work?

pat-s commented 3 weeks ago

For major updates you definitely want to drain all pods; for patch updates it is debatable. I guess that means a drain by default is still needed.

I think #1338 should help here: with system_upgrade_enable_eviction = false, PDBs should be ignored, allowing the upgrade process to succeed.
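
For reference, a minimal sketch of how that flag would be set on the module from this issue (assuming the variable name from #1338; only the relevant argument is shown, everything else stays as in the kube.tf above):

module "kube-hetzner" {
  # ... all existing arguments from the kube.tf above remain unchanged ...

  # With eviction disabled, the upgrade drain should ignore
  # PodDisruptionBudgets, so the upgrade plan can complete.
  system_upgrade_enable_eviction = false
}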

> Maybe sequentially draining and updating the nodes could also work?

That should actually already happen: at least for me, only one node is drained at a time. If the pods can be relocated to other nodes, there shouldn't be an issue. Could it be that some nodeSelector/spread restrictions are preventing this? In that case, the pods can't be evicted and the upgrade plan fails.

mysticaltech commented 2 weeks ago

@Talinx Please try @pat-s' tip above, if you feel confident enough, that is.