kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!

[Bug]: Kured communication gets timeout when talking to Kubernetes service #1221

Closed · tobiasehlert closed this issue 8 months ago

tobiasehlert commented 8 months ago

Description

I'm using the Terraform provider version v2.12.0 with Cilium v1.15.1 and K3s v1.28.6+k3s2.

I can't really get my head around why the Kured pod can't communicate with the kubernetes service running on https://10.20.144.1:443 in my cluster; the request times out and the pod ends up being restarted.

2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Kubernetes Reboot Daemon: 1.15.0"
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Node ID: k3s-01-agent-small-nbg1-vvj"
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Lock TTL not set, lock will remain until being released"
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting"
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="PreferNoSchedule taint: "
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Blocking Pod Selectors: []"
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Reboot schedule: SunMonTueWedThuFriSat between 00:00 and 23:59 UTC"
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Reboot check command: [test -f /sentinel/reboot-required] every 5m0s"
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Concurrency: 1"
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Reboot method: command"
2024-02-22T09:30:20+01:00 time="2024-02-22T08:30:20Z" level=info msg="Reboot signal: 39"
2024-02-22T09:37:31+01:00 time="2024-02-22T08:37:31Z" level=fatal msg="Error testing lock: timed out trying to get daemonset kured in namespace kube-system: Timed out trying to get daemonset kured in namespace kube-system: Get \"https://10.20.144.1:443/apis/apps/v1/namespaces/kube-system/daemonsets/kured\": dial tcp 10.20.144.1:443: i/o timeout"

Kube.tf file

module "kube-hetzner" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.12.0"

  // provider and hcloud token config
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token

  // ssh key parameters
  ssh_public_key    = hcloud_ssh_key.tibiadata_ssh_key["tobias_ed25519"].public_key
  ssh_private_key   = null
  hcloud_ssh_key_id = hcloud_ssh_key.tibiadata_ssh_key["tobias_ed25519"].id

  // network parameters
  network_ipv4_cidr = "10.20.128.0/17"
  cluster_ipv4_cidr = "10.20.128.0/20"
  service_ipv4_cidr = "10.20.144.0/20"
  cluster_dns_ipv4  = "10.20.144.10"
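  // NOTE: cluster_ipv4_cidr overlaps the start of network_ipv4_cidr, which is
  // where Hetzner allocates the node subnets -- this turns out to be the root
  // cause of the route errors (see the diagnosis further down in the thread)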

  // control plane nodepools
  control_plane_nodepools = [
    for location in ["fsn1", "hel1", "nbg1", ] : {
      name        = "control-plane-${location}",
      server_type = "cax11",
      location    = location,
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    for location in [for dc in data.hcloud_datacenter.ds : dc.location.name] : {
      // for location in ["fsn1", "hel1", "nbg1", ] : {
      name        = "agent-small-${location}",
      server_type = "cax11",
      location    = location,
      labels      = [],
      taints      = [],
      count       = 2
    }
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "hel1"

  cluster_name        = "k3s-01"
  base_domain         = "k3s-01.${var.fqdn_domain}"
  additional_tls_sans = ["k3s-01.${var.fqdn_domain}"]
  ingress_controller  = "none"

  firewall_kube_api_source = [for ip in tolist(var.firewall_whitelisting.ssh) : "${ip}/32"]
  firewall_ssh_source      = [for ip in tolist(var.firewall_whitelisting.ssh) : "${ip}/32"]

  cni_plugin          = "cilium"
  cilium_version      = "v1.15.1"
  cilium_routing_mode = "native"
  enable_wireguard    = true
  block_icmp_ping_in  = true

  enable_cert_manager = false
  create_kubeconfig   = false

}

Platform

Linux

tobiasehlert commented 8 months ago

I also found an event for one of my nodes (k3s-01-agent-small-nbg1-vvj) in the cluster that looks relevant:

Could not create route fac268fa-acab-4287-bc7f-5008bb1790cf 10.20.128.0/24 for node k3s-01-agent-small-nbg1-vvj after 398.38809ms: hcloud/CreateRoute: invalid gateway (invalid_input)

When looking at the pod logs of hcloud-cloud-controller-manager, it looks like there is some routing issue:

2024-02-22T13:42:11+01:00 I0222 12:42:11.829375       1 route_controller.go:216] action for Node "k3s-01-control-plane-hel1-iwm" with CIDR "10.20.132.0/24": "keep"
2024-02-22T13:42:11+01:00 I0222 12:42:11.829410       1 route_controller.go:216] action for Node "k3s-01-control-plane-nbg1-oze" with CIDR "10.20.131.0/24": "keep"
2024-02-22T13:42:11+01:00 I0222 12:42:11.829422       1 route_controller.go:216] action for Node "k3s-01-agent-small-nbg1-vvj" with CIDR "10.20.128.0/24": "add"
2024-02-22T13:42:11+01:00 I0222 12:42:11.829433       1 route_controller.go:216] action for Node "k3s-01-agent-small-nbg1-yiv" with CIDR "10.20.129.0/24": "keep"
2024-02-22T13:42:11+01:00 I0222 12:42:11.829445       1 route_controller.go:216] action for Node "k3s-01-control-plane-fsn1-ywt" with CIDR "10.20.130.0/24": "keep"
2024-02-22T13:42:11+01:00 I0222 12:42:11.829459       1 route_controller.go:290] route spec to be created: &{ k3s-01-agent-small-nbg1-vvj false [{InternalIP 10.20.128.101} {Hostname k3s-01-agent-small-nbg1-vvj} {ExternalIP XX.XX.XX.XX}] 10.20.128.0/24 false}
2024-02-22T13:42:11+01:00 I0222 12:42:11.829493       1 route_controller.go:304] Creating route for node k3s-01-agent-small-nbg1-vvj 10.20.128.0/24 with hint fac268fa-acab-4287-bc7f-5008bb1790cf, throttled 12.44µs
2024-02-22T13:42:12+01:00 E0222 12:42:12.401242       1 route_controller.go:329] Could not create route fac268fa-acab-4287-bc7f-5008bb1790cf 10.20.128.0/24 for node k3s-01-agent-small-nbg1-vvj: hcloud/CreateRoute: invalid gateway (invalid_input)
2024-02-22T13:42:12+01:00 I0222 12:42:12.401365       1 route_controller.go:387] Patching node status k3s-01-agent-small-nbg1-vvj with false previous condition was:&NodeCondition{Type:NetworkUnavailable,Status:False,LastHeartbeatTime:2024-02-22 12:42:00 +0000 UTC,LastTransitionTime:2024-02-22 12:42:00 +0000 UTC,Reason:CiliumIsUp,Message:Cilium is running on this node,}
2024-02-22T13:42:12+01:00 I0222 12:42:12.401535       1 event.go:307] "Event occurred" object="k3s-01-agent-small-nbg1-vvj" fieldPath="" kind="Node" apiVersion="" type="Warning" reason="FailedToCreateRoute" message="Could not create route fac268fa-acab-4287-bc7f-5008bb1790cf 10.20.128.0/24 for node k3s-01-agent-small-nbg1-vvj after 571.712557ms: hcloud/CreateRoute: invalid gateway (invalid_input)"

Has anyone experienced this before?

mysticaltech commented 8 months ago

Thanks for sharing @tobiasehlert. @M4t7e FYI, this is happening with Cilium.

I suspect it's because of the Cilium routing mode "native". @tobiasehlert, please remove that line and let us know 🙏

tobiasehlert commented 8 months ago

I suspect it's because of the Cilium routing mode "native". @tobiasehlert, please remove that line and let us know 🙏

Yes, from what I've seen so far it looks exactly like that. I just removed the whole cluster and created a new one with cilium_routing_mode set to tunnel, but it's still not working; there was no difference at all @mysticaltech.

To me it looks like the hcloud CSI components are the issue in this case, but I can't get my head around it.

mysticaltech commented 8 months ago

@tobiasehlert Weird, it's the first time we hear of that. Please inspect and share your hcloud CCM and CSI logs if you suspect they are causing the issue. Also have a look at the debug section of our readme and do some general node-level debugging, just in case. The hcloud CLI can also be useful here, to inspect the routes and such.

M4t7e commented 8 months ago

Hey @tobiasehlert, HCCM is already hinting at what's wrong here:

Could not create route fac268fa-acab-4287-bc7f-5008bb1790cf 10.20.128.0/24 for node k3s-01-agent-small-nbg1-vvj: hcloud/CreateRoute: invalid gateway (invalid_input)

Overview:

k3s-01-agent-small-nbg1-vvj: 10.20.128.101

HCCM's RouteController tried to add the Pod network route 10.20.128.0/24 (probably matching 1:1 with the subnet of the server itself) with 10.20.128.101 as the gateway. Two things go wrong here:

  1. The gateway IP cannot be contained in the destination range (the only exception is default routes with 0.0.0.0/0)
  2. The Pod IP range is probably clashing with the Hetzner Network subnets for the agent nodes

You have to leave enough space at the beginning and at the end of network_ipv4_cidr for Hetzner Networks, so that they don't collide with Pod and Service CIDRs (especially at the beginning of the ranges).
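
Concretely, putting together the pieces from your config and the HCCM logs (my reading of it):

network_ipv4_cidr = "10.20.128.0/17" // Hetzner node subnets are carved from the start: 10.20.128.0/24, 10.20.129.0/24, ...
cluster_ipv4_cidr = "10.20.128.0/20" // per-node Pod CIDRs are also handed out from the start: 10.20.128.0/24, ...

// The node got its IP 10.20.128.101 from the first Hetzner subnet, and the
// first /24 of cluster_ipv4_cidr (10.20.128.0/24) became its Pod CIDR.
// HCCM then tries to create the route
//   10.20.128.0/24 via 10.20.128.101
// where the gateway lies inside the destination range -> invalid gateway (invalid_input).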

tobiasehlert commented 8 months ago

Thanks for your response @M4t7e!

What size should the Service and Cluster CIDRs each be? Do you have any suggestions there?

M4t7e commented 8 months ago

@tobiasehlert Yeah, sure. Here are some considerations for the subnetting...

You need enough space for the Hetzner subnets. The total limit today is 50 subnets per network (see https://docs.hetzner.com/cloud/networks/faq#are-there-any-limits-on-how-networks-can-be-used).

For routing configuration simplicity, it's best if cluster_ipv4_cidr falls within network_ipv4_cidr. The cluster_ipv4_cidr will use the most IPs, since they are allocated to the Pods, and the Hetzner CCM reserves larger per-node ranges, adding the Pod routes with the corresponding node IP as the gateway. A maximum of 100 routes per network is possible (see the same Hetzner FAQ). service_ipv4_cidr typically requires much less space than the Pod range.

Hetzner subnets and Pod networks are both allocated in ascending order. Therefore, if we aim to save space, we can disregard the server node subnets at the end of the range, since it's highly unlikely they will ever be used.

One example could be like this:
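
network_ipv4_cidr = "10.20.128.0/17" // Hetzner Network: 10.20.128.0 - 10.20.255.255; node subnets are allocated from the start
service_ipv4_cidr = "10.20.160.0/19" // Services: 10.20.160.0 - 10.20.191.255
cluster_ipv4_cidr = "10.20.192.0/18" // Pods: 10.20.192.0 - 10.20.255.255, at the very end of the network range
cluster_dns_ipv4  = "10.20.160.10"   // must lie inside service_ipv4_cidr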

tobiasehlert commented 8 months ago

Thanks @M4t7e!

I'll go for this then :)

network_ipv4_cidr = "10.20.128.0/17"
service_ipv4_cidr = "10.20.160.0/19"
cluster_ipv4_cidr = "10.20.192.0/18"
cluster_dns_ipv4  = "10.20.160.10"

mysticaltech commented 8 months ago

Thanks @M4t7e, excellent! Should've had a better look at the kube.tf.

@tobiasehlert When you change IP ranges, you really have to know what you are doing and get a good look at what it affects within the code. For most scenarios, you can just keep the defaults as they are proven to work well.

tobiasehlert commented 8 months ago

@tobiasehlert When you change IP ranges, you really have to know what you are doing and get a good look at what it affects within the code. For most scenarios, you can just keep the defaults as they are proven to work well.

Yeah, I saw that note about changing CIDRs, but I had to because of some overlapping CIDRs :( But yeah, thanks to @M4t7e it works now. I was unaware of how to portion up the subnets, but now it rocks :D