kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!

[Bug]: ImagePullBackoff of system-upgrade controller #1321

Closed: dmorn closed this issue 6 months ago

dmorn commented 6 months ago

### Description

It looks like the system-upgrade-controller is no longer able to perform updates.

Here is an extract of the logs from the system-upgrade-controller:

stream logs failed container "system-upgrade-controller" in pod "system-upgrade-controller-fcc74f7fc-ztpd2" is waiting to start: trying and failing to pull image for system-upgrade/system-upgrade-controller-fcc74f7fc-ztpd2 (system-upgrade-controller)
system-upgrade-controller-cdddbd7bb-79bzf W0125 15:56:36.822527       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
system-upgrade-controller-cdddbd7bb-79bzf time="2024-01-25T15:56:36Z" level=info msg="Applying CRD plans.upgrade.cattle.io"
system-upgrade-controller-cdddbd7bb-79bzf time="2024-01-25T15:56:36Z" level=info msg="Waiting for CRD plans.upgrade.cattle.io to become available"
system-upgrade-controller-cdddbd7bb-79bzf time="2024-01-25T15:56:37Z" level=info msg="Done waiting for CRD plans.upgrade.cattle.io to become available"
system-upgrade-controller-cdddbd7bb-79bzf E0125 15:56:37.605988       1 memcache.go:196] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
system-upgrade-controller-cdddbd7bb-79bzf E0125 15:56:37.614097       1 memcache.go:101] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
system-upgrade-controller-cdddbd7bb-79bzf time="2024-01-25T15:56:37Z" level=info msg="Starting /v1, Kind=Node controller"
system-upgrade-controller-cdddbd7bb-79bzf time="2024-01-25T15:56:37Z" level=info msg="Starting /v1, Kind=Secret controller"
system-upgrade-controller-cdddbd7bb-79bzf E0125 15:56:37.623016       1 memcache.go:196] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
system-upgrade-controller-cdddbd7bb-79bzf E0125 15:56:37.627012       1 memcache.go:101] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
system-upgrade-controller-cdddbd7bb-79bzf time="2024-01-25T15:56:37Z" level=info msg="Starting batch/v1, Kind=Job controller"
system-upgrade-controller-cdddbd7bb-79bzf E0125 15:56:37.633057       1 memcache.go:196] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
system-upgrade-controller-cdddbd7bb-79bzf E0125 15:56:37.639328       1 memcache.go:101] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
system-upgrade-controller-cdddbd7bb-79bzf time="2024-01-25T15:56:37Z" level=info msg="Starting upgrade.cattle.io/v1, Kind=Plan controller"
system-upgrade-controller-cdddbd7bb-79bzf time="2024-01-25T16:12:09Z" level=error msg="error syncing 'system-upgrade/apply-k3s-server-on-k3s-control-fsn1-ews-with-e6be86ccf52-c9243': handler system-upgrade-controller: jobs.batch \"apply-k3s-server-on-k3s-control-fsn1-ews-with-e6be86ccf52-c9243\" not found, requeuing"
system-upgrade-controller-cdddbd7bb-79bzf time="2024-01-25T16:12:20Z" level=error msg="error syncing 'system-upgrade/apply-k3s-server-on-k3s-control-fsn1-fmc-with-e6be86ccf52-c9cfa': handler system-upgrade-controller: jobs.batch \"apply-k3s-server-on-k3s-control-fsn1-fmc-with-e6be86ccf52-c9cfa\" not found, requeuing"
system-upgrade-controller-cdddbd7bb-79bzf time="2024-01-25T16:13:02Z" level=error msg="error syncing 'system-upgrade/apply-k3s-agent-on-k3s-agent-fsn1-ktt-with-e6be86ccf52ef5-21c50': handler system-upgrade-controller: jobs.batch \"apply-k3s-agent-on-k3s-agent-fsn1-ktt-with-e6be86ccf52ef5-21c50\" not found, requeuing"
system-upgrade-controller-cdddbd7bb-79bzf time="2024-01-25T16:13:49Z" level=error msg="error syncing 'system-upgrade/apply-k3s-agent-on-k3s-agent-fsn1-swd-with-e6be86ccf52ef5-66be1': handler system-upgrade-controller: jobs.batch \"apply-k3s-agent-on-k3s-agent-fsn1-swd-with-e6be86ccf52ef5-66be1\" not found, requeuing"
....
stream logs failed container "system-upgrade-controller" in pod "system-upgrade-controller-fcc74f7fc-ztpd2" is waiting to start: trying and failing to pull image for system-upgrade/system-upgrade-controller-fcc74f7fc-ztpd2 (system-upgrade-controller)
stream logs failed container "system-upgrade-controller" in pod "system-upgrade-controller-fcc74f7fc-ztpd2" is waiting to start: trying and failing to pull image for system-upgrade/system-upgrade-controller-fcc74f7fc-ztpd2 (system-upgrade-controller)
stream logs failed container "system-upgrade-controller" in pod "system-upgrade-controller-fcc74f7fc-ztpd2" is waiting to start: trying and failing to pull image for system-upgrade/system-upgrade-controller-fcc74f7fc-ztpd2 (system-upgrade-controller)
...
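To confirm which image reference the kubelet is actually failing to pull, it can help to inspect the pod and the Deployment spec directly. A minimal diagnostic sketch (the pod name is the one from the logs above; adjust it to whatever `kubectl get pods -n system-upgrade` reports):

```sh
# Show the pull error and the exact image reference for the failing pod
kubectl -n system-upgrade describe pod system-upgrade-controller-fcc74f7fc-ztpd2

# Print just the image the Deployment is requesting
kubectl -n system-upgrade get deployment system-upgrade-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```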

The corresponding events:

Name:             system-upgrade-controller-fcc74f7fc-ztpd2.17c7a298b0ee9958
Namespace:        system-upgrade
Labels:           <none>
Annotations:      <none>
API Version:      v1
Count:            6
Event Time:       <nil>
First Timestamp:  2024-04-19T08:55:59Z
Involved Object:
  API Version:       v1
  Field Path:        spec.containers{system-upgrade-controller}
  Kind:              Pod
  Name:              system-upgrade-controller-fcc74f7fc-ztpd2
  Namespace:         system-upgrade
  Resource Version:  41962801
  UID:               85e60a4b-aa54-4388-8808-84e5d9a3f1b0
Kind:                Event
Last Timestamp:      2024-04-19T08:57:52Z
Message:             Error: ImagePullBackOff
Metadata:
  Creation Timestamp:  2024-04-19T08:55:59Z
  Resource Version:    41963430
  UID:                 85e8f484-2a19-4a07-a619-6ce5cc98b434
Reason:                Failed
Reporting Component:   kubelet
Reporting Instance:    k3s-control-fsn1-hpd
Source:
  Component:  kubelet
  Host:       k3s-control-fsn1-hpd
Type:         Warning
Events:       <none>
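For reference, a dump like the one above can be reproduced with the standard event listing, e.g.:

```sh
# List recent events in the namespace, newest last
kubectl -n system-upgrade get events --sort-by=.lastTimestamp
```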

### Kube.tf file

```terraform
module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
  source = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.11.8"
  ssh_public_key  = file("keys/ed25519.pub")
  ssh_private_key = file("keys/ed25519")
  control_plane_nodepools = [
    {
      name        = "control-fsn1",
      server_type = "cax21",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 3
    },
  ]

  agent_nodepools = [
    {
      name        = "egress",
      server_type = "cpx11",
      location    = "fsn1",
      labels = [
        "node.kubernetes.io/role=egress"
      ],
      taints = [
        "node.kubernetes.io/role=egress:NoSchedule"
      ],
      floating_ip = true
      count = 0
    },
    {
      name        = "agent-fsn1",
      server_type = "cax21",
      location    = "fsn1",
      labels      = [
      ],
      taints      = [],
      count       = 0
    },
    {
      name        = "agent-med-fsn1",
      server_type = "cax31",
      location    = "fsn1",
      labels      = [
      ],
      taints      = [],
      count       = 4
    },
    {
      name        = "agent-large-fsn1",
      server_type = "cax41",
      location    = "fsn1",
      labels      = [
      ],
      taints      = [],
      count       = 0
    },
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"
  autoscaler_nodepools = [
    {
      name        = "autoscaled-small"
      server_type = "cax11"
      location    = "fsn1"
      min_nodes   = 0
      max_nodes   = 10
      labels = {
        "node.kubernetes.io/role" = "peak-workloads"
      }
      taints = [{
        key    = "node.kubernetes.io/role"
        value  = "peak-workloads"
        effect = "NoExecute"
      }]
    }
  ]
  enable_longhorn = false
  disable_hetzner_csi = true
  traefik_additional_options = ["--log.level=DEBUG", "--api.dashboard=true", "--entryPoints.amqp.address=:5672"]
  traefik_additional_ports = [{name = "amqp", port = 5672, exposedPort = 5672}]

  extra_firewall_rules = [
    {
      description = "For RabbitMQ"
      direction   = "in"
      protocol    = "tcp"
      port        = "5672"
      source_ips  = ["0.0.0.0/0", "::/0"]
    },
  ]
}

provider "hcloud" {
  token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
}

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.43.0"
    }
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

variable "hcloud_token" {
  sensitive = true
  default   = ""
}
```

### Screenshots

No response

### Platform

Mac

mysticaltech commented 6 months ago

@dmorn you can fix it just by upgrading with `terraform init -upgrade`
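For anyone landing here: assuming the failing pull is caused by the pinned module release referencing a system-upgrade-controller image tag that is no longer available, upgrading the module replaces the manifest with one that points at a valid image. A sketch of the full cycle; note that with the exact `version = "2.11.8"` pin above, `terraform init -upgrade` alone will not move past that release, so you would typically bump or relax the constraint (e.g. `version = ">= 2.11.8"`) first:

```sh
# Re-resolve module and provider versions within the constraints in kube.tf
terraform init -upgrade

# Roll out the updated module, including the system-upgrade-controller manifest
terraform apply

# Optionally restart the deployment so it re-pulls the (now valid) image
kubectl -n system-upgrade rollout restart deployment system-upgrade-controller
```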