vitobotta / hetzner-k3s

The easiest and fastest way to create and manage Kubernetes clusters in Hetzner Cloud using the lightweight distribution k3s by Rancher.
MIT License

Adding new node pool gets stuck on systemd: Starting k3s-agent #465

Closed · boris-savic closed this 1 month ago

boris-savic commented 1 month ago

Hi,

I've run into another issue that I can't find any existing reports for.

I set up my cluster with two worker pools and then tried to add a pool of ARM-based workers. The nodes get created and I can SSH into them, but they never become active in k3s: the create command gets stuck on Starting k3s-agent. Below are my config file and the relevant logs.

I have also tried with a non-ARM instance type and see the same behavior, so it's not specific to ARM.

Using hetzner-k3s version 2.0.8 on macOS.

My config is as follows:

cluster_name: bugkings
kubeconfig_path: "./kubeconfig"
k3s_version: v1.31.1+k3s1

networking:
  ssh:
    port: 22
    use_agent: false 
    public_key_path: "~/.ssh/id_rsa.pub"
    private_key_path: "~/.ssh/id_rsa"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api:
      - 0.0.0.0/0
  public_network:
    ipv4: true
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.0.0.0/16
  cni:
    enabled: true
    encryption: false
    mode: flannel

manifests:
   cloud_controller_manager_manifest_url: "https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/download/v1.20.0/ccm-networks.yaml"
   csi_driver_manifest_url: "https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.9.0/deploy/kubernetes/hcloud-csi.yml"
   system_upgrade_controller_deployment_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/system-upgrade-controller.yaml"
   system_upgrade_controller_crd_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/crd.yaml"
   cluster_autoscaler_manifest_url: "https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/hetzner/examples/cluster-autoscaler-run-on-master.yaml"

datastore:
  mode: etcd

schedule_workloads_on_masters: false

masters_pool:
  instance_type: cpx21
  instance_count: 1
  location: fsn1

worker_node_pools:
- name: worker-static
  instance_type: cpx31
  instance_count: 3
  location: fsn1
  labels:
    - key: server/worker
      value: true
- name: database
  instance_type: cpx31
  instance_count: 3
  location: fsn1
  labels:
    - key: server/database
      value: true
  taints:
    - key: server/database
      value: true:NoSchedule
- name: database-arm
  instance_type: cpx21
  instance_count: 3
  location: fsn1
  labels:
    - key: server/database
      value: true
    - key: server/dedicated-cpu
      value: true
  taints:
    - key: server/database
      value: true:NoSchedule

embedded_registry_mirror:
  enabled: true

Logs on the workers that are stuck:

root@bugkings-pool-database-arm-worker1:~# systemctl status  k3s-agent
● k3s-agent.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s-agent.service; enabled; preset: enabled)
     Active: activating (start) since Mon 2024-10-14 06:44:37 UTC; 41s ago
       Docs: https://k3s.io
    Process: 3307 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null (code=exited, status=0/SUCCESS)
    Process: 3309 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 3311 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 3314 (k3s-agent)
      Tasks: 36
     Memory: 518.4M (peak: 552.4M)
        CPU: 2.350s
     CGroup: /system.slice/k3s-agent.service
             ├─2695 /var/lib/rancher/k3s/data/f6cd8dbf9295d23c321d6f4e7d5c150148d4d8f17a0590bc7148a06c1e1634a6/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3704b8be58e680b65a9dec82015b699167fca4635d298983635f0d686bd40f41 -address /run/k3s/>
             ├─3314 "/usr/local/bin/k3s agent"
             └─3330 "containerd "

Oct 14 06:44:58 bugkings-pool-database-arm-worker1 k3s[3314]: time="2024-10-14T06:44:58Z" level=info msg="Network policy controller waiting for removal of node.cloudprovider.kubernetes.io/uninitialized taint"
Oct 14 06:45:00 bugkings-pool-database-arm-worker1 k3s[3314]: time="2024-10-14T06:45:00Z" level=info msg="Network policy controller waiting for removal of node.cloudprovider.kubernetes.io/uninitialized taint"
Oct 14 06:45:02 bugkings-pool-database-arm-worker1 k3s[3314]: time="2024-10-14T06:45:02Z" level=info msg="Network policy controller waiting for removal of node.cloudprovider.kubernetes.io/uninitialized taint"
Oct 14 06:45:04 bugkings-pool-database-arm-worker1 k3s[3314]: time="2024-10-14T06:45:04Z" level=info msg="Network policy controller waiting for removal of node.cloudprovider.kubernetes.io/uninitialized taint"
Oct 14 06:45:06 bugkings-pool-database-arm-worker1 k3s[3314]: time="2024-10-14T06:45:06Z" level=info msg="Network policy controller waiting for removal of node.cloudprovider.kubernetes.io/uninitialized taint"
Oct 14 06:45:08 bugkings-pool-database-arm-worker1 k3s[3314]: time="2024-10-14T06:45:08Z" level=info msg="Network policy controller waiting for removal of node.cloudprovider.kubernetes.io/uninitialized taint"
Oct 14 06:45:10 bugkings-pool-database-arm-worker1 k3s[3314]: time="2024-10-14T06:45:10Z" level=info msg="Network policy controller waiting for removal of node.cloudprovider.kubernetes.io/uninitialized taint"
Oct 14 06:45:12 bugkings-pool-database-arm-worker1 k3s[3314]: time="2024-10-14T06:45:12Z" level=info msg="Network policy controller waiting for removal of node.cloudprovider.kubernetes.io/uninitialized taint"
Oct 14 06:45:14 bugkings-pool-database-arm-worker1 k3s[3314]: time="2024-10-14T06:45:14Z" level=info msg="Network policy controller waiting for removal of node.cloudprovider.kubernetes.io/uninitialized taint"
Oct 14 06:45:16 bugkings-pool-database-arm-worker1 k3s[3314]: time="2024-10-14T06:45:16Z" level=info msg="Network policy controller waiting for removal of node.cloudprovider.kubernetes.io/uninitialized taint"

Journal logs

root@bugkings-pool-database-arm-worker1:~# journalctl -u k3s-agent
Oct 14 06:36:25 bugkings-pool-database-arm-worker1 systemd[1]: Starting k3s-agent.service - Lightweight Kubernetes...
Oct 14 06:36:25 bugkings-pool-database-arm-worker1 sh[2403]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Oct 14 06:36:25 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:25Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Oct 14 06:36:25 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:25Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/f6cd8dbf9295d23c321d6f4e7d5c150148d4d8f17a0590bc7148a06c1e1634a6"
Oct 14 06:36:27 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:27Z" level=info msg="Starting k3s agent v1.31.1+k3s1 (452dbbc1)"
Oct 14 06:36:27 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:27Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 10.0.0.2:6443"
Oct 14 06:36:27 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:27Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [10.0.0.2:6443] [default: 10.0.0.2:6443]"
Oct 14 06:36:27 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:27Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the serv>
Oct 14 06:36:28 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:28Z" level=info msg="Using private registry config file at /etc/rancher/k3s/registries.yaml"
Oct 14 06:36:28 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:28Z" level=info msg="Module overlay was already loaded"
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:29Z" level=info msg="Module br_netfilter was already loaded"
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:29Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400"
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:29Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600"
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:29Z" level=info msg="Set sysctl 'net/ipv4/conf/all/forwarding' to 1"
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:29Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_max' to 131072"
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:29Z" level=info msg="Starting distributed registry mirror at https://10.0.0.11:6443/v2 for registries [*]"
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:29Z" level=info msg="Starting distributed registry P2P node at 10.0.0.11:5001"
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.048Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:322        starting refreshing cpl 0 with key CIQAAAFIR4AAAAAAAAAAAAAAAAAAAAAA>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.048Z        WARN        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:187        failed when refreshing routing table2 errors occurred:
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]:         * failed to query for self, err=failed to find any peer in table
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]:         * failed to refresh cpl=0, err=failed to find any peer in table
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.049Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:322        starting refreshing cpl 0 with key CIQAAATTHQAAAAAAAAAAAAAAAAAAAAAA>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.049Z        WARN        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:233        failed when refreshing routing table        {"error": "2 errors occ>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.053Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:322        starting refreshing cpl 0 with key CIQAAADQKUAAAAAAAAAAAAAAAAAAAAAA>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.053Z        WARN        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:233        failed when refreshing routing table        {"error": "2 errors occ>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: I1014 06:36:29.059728    2409 secure_serving.go:213] Serving securely on [::]:6443
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: I1014 06:36:29.060187    2409 dynamic_cafile_content.go:160] "Starting controller" name="client-ca-bundle::/var/lib/rancher/k3s/agent/client-ca.crt"
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: I1014 06:36:29.060190    2409 dynamic_serving_content.go:135] "Starting controller" name="serving-cert::/var/lib/rancher/k3s/agent/serving-kubelet.crt::/var/lib/rancher/k3s/agent/ser>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: I1014 06:36:29.060788    2409 tlsconfig.go:243] "Starting DynamicServingCertificateController"
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:29Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the serv>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:29Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the serv>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: time="2024-10-14T06:36:29Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the serv>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.216Z        WARN        dht        go-libp2p-kad-dht@v0.25.2/lookup.go:43        network size estimator track peers: expected bucket size number of peers
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.216Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:322        starting refreshing cpl 0 with key CIQAAAFB4MAAAAAAAAAAAAAAAAAAAAAA>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.225Z        WARN        dht        go-libp2p-kad-dht@v0.25.2/lookup.go:43        network size estimator track peers: expected bucket size number of peers
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.225Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:331        finished refreshing cpl 0, routing table size is now 9
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.225Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:322        starting refreshing cpl 1 with key CIQAAAMCPAAAAAAAAAAAAAAAAAAAAAAA>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.231Z        WARN        dht        go-libp2p-kad-dht@v0.25.2/lookup.go:43        network size estimator track peers: expected bucket size number of peers
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.231Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:331        finished refreshing cpl 1, routing table size is now 9
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.231Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:322        starting refreshing cpl 2 with key CIQAAAHRKMAAAAAAAAAAAAAAAAAAAAAA>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.237Z        WARN        dht        go-libp2p-kad-dht@v0.25.2/lookup.go:43        network size estimator track peers: expected bucket size number of peers
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.237Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:331        finished refreshing cpl 2, routing table size is now 9
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.237Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:322        starting refreshing cpl 3 with key CIQAAAMFXAAAAAAAAAAAAAAAAAAAAAAA>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.241Z        WARN        dht        go-libp2p-kad-dht@v0.25.2/lookup.go:43        network size estimator track peers: expected bucket size number of peers
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.241Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:331        finished refreshing cpl 3, routing table size is now 9
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.241Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:322        starting refreshing cpl 4 with key CIQAAAADSYAAAAAAAAAAAAAAAAAAAAAA>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.245Z        WARN        dht        go-libp2p-kad-dht@v0.25.2/lookup.go:43        network size estimator track peers: expected bucket size number of peers
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.245Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:331        finished refreshing cpl 4, routing table size is now 9
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.245Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:322        starting refreshing cpl 5 with key CIQAAAC6QQAAAAAAAAAAAAAAAAAAAAAA>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.250Z        WARN        dht        go-libp2p-kad-dht@v0.25.2/lookup.go:43        network size estimator track peers: expected bucket size number of peers
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.250Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:331        finished refreshing cpl 5, routing table size is now 9
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.250Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:322        starting refreshing cpl 6 with key CIQAAADLQ4AAAAAAAAAAAAAAAAAAAAAA>
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.255Z        WARN        dht        go-libp2p-kad-dht@v0.25.2/lookup.go:43        network size estimator track peers: expected bucket size number of peers
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.255Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:331        finished refreshing cpl 6, routing table size is now 9
Oct 14 06:36:29 bugkings-pool-database-arm-worker1 k3s[2409]: 2024-10-14T06:36:29.255Z        INFO        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:322        starting refreshing cpl 7 with key CIQAAAA5UMAAAAAAAAAAAAAAAAAAAAAA>
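
For reference, the taint the agent keeps waiting on can be inspected from the first master; this is roughly how I'm checking it (adjust the node name as needed):

kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
kubectl describe node bugkings-pool-database-arm-worker1 | grep -A 3 Taints

On the stuck workers this keeps showing node.cloudprovider.kubernetes.io/uninitialized.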
boris-savic commented 1 month ago

If I remove the node.cloudprovider.kubernetes.io/uninitialized taint manually, the creation completes, but I seem to end up with non-functional nodes...

The nodes are missing their external IPs and topology labels in k3s. In the Hetzner console the nodes have public IPs and I can SSH to them, but when I rerun the hetzner-k3s create command, the process gets stuck on the SSH step: Waiting for successful ssh connectivity with instance.
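
Roughly what I'm looking at (node name is one of the stuck workers; the topology labels are the standard topology.kubernetes.io/region and topology.kubernetes.io/zone labels that the hcloud CCM normally sets):

kubectl get nodes -o wide                                          # EXTERNAL-IP is empty for the new workers
kubectl get node bugkings-pool-database-arm-worker1 --show-labels  # no topology.kubernetes.io/* labels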

boris-savic commented 1 month ago

Didn't change a thing. Retried a couple of days later with the same YAML and it worked...

vitobotta commented 1 month ago

It could have been a temporary problem with the Hetzner API. Something similar happened to me a few months ago, and the cause was that the cloud controller manager was unable to talk to Hetzner for a little while. Next time, please check the CCM logs for hints as to what the problem might be.

Also, don't remove that taint manually. It is removed automatically once networking on the node is configured correctly and the CCM is working.
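
For example, something along these lines should show whether the CCM is reaching the Hetzner API and whether it has initialized the node (the deployment name below is what the CCM manifest usually creates in kube-system; adjust it if yours differs):

kubectl -n kube-system logs deployment/hcloud-cloud-controller-manager --tail=200
kubectl get node bugkings-pool-database-arm-worker1 -o jsonpath='{.spec.taints}'   # should list only your own pool taints once the CCM has initialized the node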