[Bug]: Node autoupgrades seem to lead to NotReady nodes that need manual reboots

sharkymcdongles commented 6 months ago

Description

I notice many nodes become NotReady and don't come back without manually rebooting them from the CLI or cloud console.

Kube.tf file

module "kube-hetzner" {

  providers = {
    hcloud = hcloud
  }

  hcloud_token = var.hcloud_token
  source       = "kube-hetzner/kube-hetzner/hcloud"
  version      = "2.13.3"

  ssh_port           = 22242
  ssh_public_key     = file("~/.ssh/id_ed25519.pub")
  ssh_private_key    = file("~/.ssh/id_ed25519")
  ssh_max_auth_tries = 10

  network_region = "eu-central"

  control_plane_nodepools = [
    {
      name        = "master-0",
      server_type = "ccx13",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1

      backups = true
    },
    {
      name        = "master-1",
      server_type = "ccx13",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1

      backups = true
    },
    {
      name        = "master-2",
      server_type = "ccx13",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1

      backups = true
    },
  ]

  agent_nodepools = [
    {
      name        = "large",
      server_type = "cpx31",
      location    = "fsn1",
      labels = [
        "nodepool.replaced.dev/name=large"
      ],
      taints          = [],
      count           = 10
      placement_group = "large"
      backups         = true
    },
    {
      name        = "longhorn",
      server_type = "ccx23",
      location    = "fsn1",
      labels = [
        "node.longhorn.io/create-default-disk=true",
        "nodepool.replaced.dev/name=longhorn"
      ],
      taints          = [],
      count           = 4
      placement_group = "longhorn"
      backups         = true
    },
    {
      name        = "egress",
      server_type = "cpx21",
      location    = "fsn1",
      labels = [
        "node.kubernetes.io/role=egress",
      ],
      taints = [
        "node.kubernetes.io/role=egress:NoSchedule"
      ],
      floating_ip     = true
      placement_group = "egress"
      count           = 1
    },
    {
      name        = "static",
      server_type = "ccx13",
      location    = "fsn1",
      labels = [
        "nodepool.replaced.dev/name=static"
      ],
      taints          = [],
      count           = 1
      placement_group = "static"
      backups         = true
      selinux         = false
    },
  ]

  # Add custom control plane configuration options here.
  # E.g to enable monitoring for etcd, proxy etc:
  # control_planes_custom_config = {
  #  etcd-expose-metrics = true,
  #  kube-controller-manager-arg = "bind-address=0.0.0.0",
  #  kube-proxy-arg ="metrics-bind-address=0.0.0.0",
  #  kube-scheduler-arg = "bind-address=0.0.0.0",
  # }

  k3s_registries = <<-EOT
    mirrors:
      registry.gitlab.com:
        endpoint:
        - "https://trow.replaced.dev"
        rewrite:
          "^gitlab-org/(.*)": "f/gitlab/gitlab-org/$1"
  EOT

  enable_wireguard = false

  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"

  load_balancer_disable_ipv6 = true

  load_balancer_health_check_interval = "5s"

  load_balancer_health_check_timeout = "3s"

  load_balancer_health_check_retries = 3

  autoscaler_nodepools = [
    {
      name        = "autoscaled-large"
      server_type = "cpx31"
      location    = "fsn1"
      labels = {
        "node.kubernetes.io/role" : "peak-workloads"
        "nodepool.replaced.dev/name" : "autoscaled-large"
      },
      taints    = []
      min_nodes = 1
      max_nodes = 5
    }
  ]
  autoscaler_labels = [
    "node.kubernetes.io/role=peak-workloads"
  ]

  cluster_autoscaler_image            = "registry.k8s.io/autoscaling/cluster-autoscaler"
  cluster_autoscaler_version          = "v1.27.3"
  cluster_autoscaler_log_level        = 4
  cluster_autoscaler_log_to_stderr    = true
  cluster_autoscaler_stderr_threshold = "WARNING"

  cluster_autoscaler_extra_args = [
    "--ignore-daemonsets-utilization=true",
    "--enforce-node-group-min-size=true",
    "--leader-elect=false"
  ]

  disable_hetzner_csi = false

  ingress_controller = "none"

  ingress_replica_count = 2

  enable_metrics_server = true

  allow_scheduling_on_control_plane = false

  automatically_upgrade_k3s = true

  automatically_upgrade_os = true

  enable_longhorn = true

  longhorn_namespace = "longhorn-system"

  longhorn_fstype = "ext4"

  longhorn_replica_count = 2

  # If you need more control over kured and the reboot behaviour, you can pass additional options to kured.
  # For example limiting reboots to certain timeframes. For all options see: https://kured.dev/docs/configuration/
  # The default options are: `--reboot-command=/usr/bin/systemctl reboot --pre-reboot-node-labels=kured=rebooting --post-reboot-node-labels=kured=done --period=5m`
  # Defaults can be overridden by using the same key.
  # kured_options = {
  #   "reboot-days": "su"
  #   "start-time": "3am"
  #   "end-time": "8am"
  #   "time-zone": "Local"
  # }

  initial_k3s_channel = "v1.27"

  cluster_name = "infra"

  use_cluster_name_in_node_name = true

  k3s_exec_agent_args = "--kubelet-arg kube-reserved=cpu=100m,memory=200Mi,ephemeral-storage=1Gi"

  restrict_outbound_traffic = false

  cni_plugin = "cilium"

  cilium_routing_mode = "native"

  cilium_egress_gateway_enabled = true

  disable_network_policy = true

  block_icmp_ping_in = true

  enable_cert_manager = false

  use_control_plane_lb = true

  control_plane_lb_type = "lb21"

  control_plane_lb_enable_public_interface = true

  cilium_values = <<EOT
ipam:
  mode: kubernetes
k8s:
  requireIPv4PodCIDR: true
kubeProxyReplacement: strict
routingMode: native
ipv4NativeRoutingCIDR: "10.0.0.0/8"
l7Proxy: false
endpointRoutes:
  enabled: true
loadBalancer:
  acceleration: native
bpf:
  masquerade: true
egressGateway:
  enabled: true
MTU: 1450
EOT

  longhorn_values = <<EOT
defaultSettings:
  createDefaultDiskLabeledNodes: true
  defaultDataPath: /var/lib/longhorn
  node-down-pod-deletion-policy: delete-both-statefulset-and-deployment-pod
  replicaAutoBalance: best-effort
  storageReservedPercentageForDefaultDisk: 15
persistence:
  defaultFsType: ext4
  defaultClassReplicaCount: 2
  defaultClass: false
  defaultNodeSelector:
    enable: true
    selector: "longhorn"
EOT

  enable_rancher = false

  extra_kustomize_parameters = {
    floating_ip            = "blah"
    tls_crt                = <<EOH
    ${data.google_secret_manager_secret_version.sealed-secrets-tls-crt.secret_data}
    EOH
    tls_key                = <<EOH
    ${data.google_secret_manager_secret_version.sealed-secrets-tls-key.secret_data}
    EOH
    cert_manager_values    = <<EOH
    prometheus:
      enabled: false
    serviceAccount:
      create: true
      name: cert-manager
    cainjector:
      enabled: true
    extraArgs:
    - --cluster-issuer-ambient-credentials=false
    - --issuer-ambient-credentials=false
    - --leader-elect=false
    installCRDs: true
    ingressShim:
      defaultIssuerKind: ClusterIssuer
      defaultIssuerName: google-dns-public-ca
    maxConcurrentChallenges: 60
    replicaCount: 1
    webhook:
      enabled: true
    EOH
    external_dns_values    = <<EOH
    google:
      project: blah
      serviceAccountSecret: google-dns-secrets
      serviceAccountSecretKey: key.json
    logLevel: info
    metrics:
      enabled: false
      serviceMonitor:
        enabled: false
    policy: sync
    registry: txt
    serviceAccount:
      create: true
      name: external-dns
    provider: google
    EOH
    gitlab_postgres_values = <<EOH
    fullnameOverride: "gitlab-postgres"
    image:
      registry: docker.io
      repository: bitnami/postgresql
      tag: 15.4.0-debian-11-r10
      digest: ""
    auth:
      enablePostgresUser: true
      username: "gitlab"
      database: "gitlab"
      existingSecret: "gitlab-secrets"
      secretKeys:
        adminPasswordKey: postgres-user-password
        userPasswordKey: postgres-gitlab-password
      usePasswordFiles: false
    architecture: standalone
    postgresqlDataDir: /postgresql/data
    postgresqlSharedPreloadLibraries: pgaudit
    shmVolume:
      enabled: true
      sizeLimit: 2Gi
    tls:
      enabled: false
    primary:
      name: primary
      pgHbaConfiguration: |-
        local all all trust
        host all all localhost trust
        host gitlab gitlab 10.0.0.0/8 md5
      extendedConfiguration: |-
        max_connections = 300
      standby:
        enabled: false
      resources:
        limits:
          memory: 4Gi
          cpu: 4
        requests:
          memory: 4Gi
          cpu: 250m
      service:
        externalTrafficPolicy: Local
      persistence:
        enabled: true
        mountPath: /postgresql
        storageClass: "hcloud-durable"
        accessModes:
          - ReadWriteOnce
        size: 32Gi
      persistentVolumeClaimRetentionPolicy:
        enabled: true
        whenScaled: Retain
        whenDeleted: Retain
      podAnnotations:
        backup.velero.io/backup-volumes: data
    EOH
    ingress_nginx_values   = <<EOH
    commonLabels:
      team: shared-infra
    controller:
      admissionWebhooks:
        enabled: true
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - ingress-nginx
              topologyKey: kubernetes.io/hostname
            weight: 100
      autoscaling:
        enabled: false
      ingressClass: nginx
      ingressClassByName: false
      ingressClassResource:
        controllerValue: k8s.io/ingress-nginx
        default: true
        enabled: true
        name: nginx
      keda:
        enabled: false
      config:
        client-body-buffer-size: 10M
        disable-ipv6-dns: 'true'
        disable-ipv6: 'true'
        enable-brotli: 'false'
        enable-modsecurity: 'false'
        enable-ocsp: 'true'
        enable-opentracing: 'false'
        enable-underscores-in-headers: 'true'
        enable-vts-status: 'false'
        hsts-include-subdomains: 'true'
        hsts-max-age: '31536000'
        hsts-preload: 'true'
        hsts: 'true'
        ignore-invalid-headers: 'false'
        keep-alive: '60'
        log-format-escape-json: 'false'
        log-format-escape-none: 'false'
        nginx-status-ipv4-whitelist: '127.0.0.1,10.0.0.0/8'
        no-auth-locations: /.well-known/acme-challenge, /-/healthy, /health
        proxy-body-size: '0'
        proxy-buffering: 'off'
        proxy-connect-timeout: '60'
        proxy-next-upstream-timeout: '60'
        proxy-next-upstream-tries: '4'
        proxy-read-timeout: '60'
        proxy-request-buffering: 'off'
        proxy-send-timeout: '60'
        server-tokens: 'false'
        ssl-early-data: 'true'
        ssl-redirect: 'true'
        ssl-reject-handshake: 'true'
        use-gzip: 'true'
        use-http2: 'true'
      enableMimalloc: true
      image:
        digest: ''
      lifecycle:
        preStop:
          exec:
            command:
            - /wait-shutdown
      livenessProbe:
        httpGet:
          path: "/healthz"
          port: 10254
          scheme: HTTP
        failureThreshold: 5
        initialDelaySeconds: 10
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      metrics:
        enabled: false
        serviceMonitor:
          enabled: false
      minAvailable: 1
      publishService:
        enabled: true
      replicaCount: 2
      resources:
        limits:
          memory: 1024Mi
        requests:
          cpu: 250m
          memory: 1024Mi
      service:
        enabled: true
        annotations:
          load-balancer.hetzner.cloud/disable-private-ingress: true
        externalTrafficPolicy: Local
      startupProbe:
        httpGet:
          path: "/healthz"
          port: 10254
          scheme: HTTP
        initialDelaySeconds: 5
        periodSeconds: 5
        timeoutSeconds: 2
        successThreshold: 1
        failureThreshold: 5
      terminationGracePeriodSeconds: 300
      useComponentLabel: true
    tcp:
      "22": "gitlab/gitlab-gitlab-shell:22"
    defaultBackend:
      enabled: true
    EOH
  }

  create_kubeconfig = true

  create_kustomization = false
}

Screenshots

No response

Platform

Linux

mysticaltech commented 6 months ago

@sharkymcdongles No idea what could be happening, but I would suggest using our default cilium config instead. So remove cilium_values and try again.

aleksasiriski commented 6 months ago

This is a recurring issue I have noticed over the last couple of weeks, and I am still investigating. It's most likely related to the combination of our custom networking, MicroOS, and Hetzner. For the time being, disable autoupgrades.
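As a stopgap, the two upgrade flags already present in the kube.tf above can be set to false. This is a minimal sketch, assuming everything else in the module block stays as posted; the placeholder comment stands in for the unchanged settings, so the fragment is not a complete module block on its own.

```hcl
module "kube-hetzner" {
  # ... all other settings from the kube.tf above stay unchanged ...

  # Turn off automatic k3s upgrades.
  automatically_upgrade_k3s = false

  # Turn off automatic OS upgrades.
  automatically_upgrade_os = false
}
```

As reported further down in this thread, changing these flags may not be enough for nodes that are already provisioned, so it is worth checking the nodes themselves after applying.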

mysticaltech commented 6 months ago

And getting cilium to work well on Hetzner is super tricky, hence my above suggestion.

sharkymcdongles commented 6 months ago

I took what is documented here:

https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/blob/master/README.md#examples

Sadly, given the Hetzner IP blacklist bs, using an egress gateway is the only way to make sure the cluster works with autoscaling, because many images are stored in ghcr and Hetzner IPs are randomly blocked there. It's also needed for things like SMTP, since many SMTP providers block Hetzner IPs too.

I will do some digging and see if I can find out why this happens. I have now disabled autoupgrades for both the nodes and k3s, yet some nodes still showed the same behavior.

So I did some digging, and it looks like a node still tried to run an upgrade, leading to the NotReady situation again. When I turn autoupgrades off, is the change not applied to already provisioned nodes? Do I need to remove kured?

EDIT: Okay, I figured out that I do need to change the nodes myself by running: systemctl --now disable transactional-update.timer

Checking the logs, I see the upgrade run, then a CPU locks up and NetworkManager gets stuck, which kills networking on the node completely since it never recovers. After that I see all these CPU stuck errors:

Mar 28 04:38:08 infra-large-btl kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [NetworkManager:2151]
Mar 28 04:38:08 infra-large-btl kernel: Modules linked in: algif_hash af_alg ext4 mbcache jbd2 udp_diag inet_diag ip_set xt_CT cls_bpf sch_ingre>
Mar 28 04:38:08 infra-large-btl kernel:  xhci_pci xhci_pci_renesas libata aesni_intel xhci_hcd virtio_scsi crypto_simd sd_mod cryptd t10_pi sg u>
Mar 28 04:38:08 infra-large-btl kernel: CPU: 3 PID: 2151 Comm: NetworkManager Not tainted 6.7.2-1-default #1 openSUSE Tumbleweed e152b88f51363d1>
Mar 28 04:38:08 infra-large-btl kernel: Hardware name: Hetzner vServer/Standard PC (Q35 + ICH9, 2009), BIOS 20171111 11/11/2017
Mar 28 04:38:08 infra-large-btl kernel: RIP: 0010:virtnet_send_command+0x106/0x170 [virtio_net]
Mar 28 04:38:08 infra-large-btl kernel: Code: 74 24 48 e8 fc 6b b8 c6 85 c0 78 60 48 8b 7b 08 e8 0f 4c b8 c6 84 c0 75 11 eb 22 48 8b 7b 08 e8 20>
Mar 28 04:38:08 infra-large-btl kernel: RSP: 0018:ffffbf9c40853a08 EFLAGS: 00000246
Mar 28 04:38:08 infra-large-btl kernel: RAX: 0000000000000000 RBX: ffff999ec1f229c0 RCX: 0000000000000001

I am attaching my full log file from when this happened, in case someone here can shed some light on it.

logs.txt

I wonder if maybe the networking can't handle autoupgrades or updates given my settings? No idea tbh, but it'd be nice if this were solved or someone knew. I will continue researching on my end, but I think more heads are better than 1.

mysticaltech commented 6 months ago

It's possible. So try switching to default networking settings, remove cilium_values and see if it works better.
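For reference, here is a sketch of what that suggestion amounts to in the posted kube.tf: the cilium_values heredoc is deleted entirely so the module's default Cilium values apply. It assumes the other Cilium-related settings are kept as they were, and the placeholder comment stands in for the rest of the unchanged block.

```hcl
module "kube-hetzner" {
  # ... all other settings from the kube.tf above stay unchanged ...

  cni_plugin = "cilium"

  # Kept from the original config. Whether these should also go back to
  # their defaults is not settled in this thread.
  cilium_routing_mode           = "native"
  cilium_egress_gateway_enabled = true

  # cilium_values is intentionally not set here, so the module's default
  # Cilium configuration is used instead of the custom heredoc.
}
```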

mysticaltech commented 6 months ago

@sharkymcdongles Any updates, did the suggestion work?

sharkymcdongles commented 6 months ago

@mysticaltech I haven't moved back to the default cilium settings yet because I am working on a new way to handle image pull failures without the egress gateway. I am currently evaluating a squid proxy as a replacement. Since disabling autoupgrades, though, I have had no further issues.

mysticaltech commented 6 months ago

Ok, great! The proxy solution sounds awesome. Don't hesitate to share in due time if you see fit.

We are narrowing down the automated upgrade issues in other threads, so I will close this one for now.
