syself / cluster-api-provider-hetzner

Cluster API Provider Hetzner :rocket: The best way to manage Kubernetes clusters on Hetzner, fully declarative, Kubernetes-native and with self-healing capabilities
https://caph.syself.com
Apache License 2.0

Control plane node unhealthy - NodeProvisioning waiting for matching ProviderID #1492

Closed by dannynodies 1 month ago

dannynodies commented 1 month ago

/kind bug

What steps did you take and what happened: I followed the guide to bootstrap the initial cluster, but the first control-plane node never becomes healthy. I manually checked that the hcloud:// provider ID matches, but for some reason the API hasn't picked up the machine.

What did you expect to happen: Expected bootstrapping to complete normally

Anything else you would like to add:

$ clusterctl describe cluster hetzner-cluster --show-conditions all
NAME                                                                READY  SEVERITY  REASON                                                            SINCE  MESSAGE                                              
Cluster/hetzner-cluster                                             False  Warning   NodeStartupTimeout @ Machine/hetzner-cluster-control-plane-rskqg  1s     0 of 1 completed                                      
│           ├─ControlPlaneInitialized                               True                                                                               15m                                                          
│           ├─ControlPlaneReady                                     False  Warning   NodeStartupTimeout @ Machine/hetzner-cluster-control-plane-rskqg  1s     0 of 1 completed                                      
│           └─InfrastructureReady                                   True                                                                               15m                                                          
├─ClusterInfrastructure - HetznerCluster/hetzner-cluster            True                                                                               15m                                                          
│             ├─ControlPlaneEndpointSet                             True                                                                               17m                                                          
│             ├─HCloudTokenAvailable                                True                                                                               17m                                                          
│             ├─LoadBalancerReady                                   True                                                                               17m                                                          
│             ├─PlacementGroupsSynced                               True                                                                               17m                                                          
│             ├─TargetClusterReady                                  True                                                                               15m                                                          
│             └─TargetClusterSecretReady                            True                                                                               15m                                                          
└─ControlPlane - KubeadmControlPlane/hetzner-cluster-control-plane  False  Warning   NodeStartupTimeout @ Machine/hetzner-cluster-control-plane-rskqg  1s     0 of 1 completed                                      
  │           ├─Available                                           True                                                                               15m                                                          
  │           ├─CertificatesAvailable                               True                                                                               17m                                                          
  │           ├─MachinesCreated                                     True                                                                               17m                                                          
  │           ├─MachinesReady                                       False  Warning   NodeStartupTimeout @ Machine/hetzner-cluster-control-plane-rskqg  1s     0 of 1 completed                                      
  │           └─Resized                                             True                                                                               17m                                                          
  └─Machine/hetzner-cluster-control-plane-rskqg                     False  Warning   NodeStartupTimeout                                                1s     Node failed to report startup in 15m0s                
                ├─BootstrapReady                                    True                                                                               17m                                                          
                ├─HealthCheckSucceeded                              False  Warning   NodeStartupTimeout                                                1s     Node failed to report startup in 15m0s                
                ├─InfrastructureReady                               True                                                                               16m                                                          
                └─NodeHealthy                                       False  Warning   NodeProvisioning                                                  15m    Waiting for a node with matching ProviderID to exist

It eventually fails:

$ clusterctl describe cluster hetzner-cluster 
NAME                                                                READY  SEVERITY  REASON                                                            SINCE  MESSAGE                                
Cluster/hetzner-cluster                                             False  Warning   NodeStartupTimeout @ Machine/hetzner-cluster-control-plane-bxjrh  123m   0 of 1 completed                        
├─ClusterInfrastructure - HetznerCluster/hetzner-cluster            True                                                                               138m                                           
└─ControlPlane - KubeadmControlPlane/hetzner-cluster-control-plane  False  Warning   NodeStartupTimeout @ Machine/hetzner-cluster-control-plane-bxjrh  123m   0 of 1 completed                        
  └─Machine/hetzner-cluster-control-plane-bxjrh                     False  Warning   NodeStartupTimeout                                                123m   Node failed to report startup in 15m0s

Environment:

janiskemper commented 1 month ago

Did you deploy the CCM and CNI?

guettli commented 1 month ago

IPs of Hetzner are sometimes blocked.

Then downloading kubeadm dpkg/rpm packages fails.

You can see here some commands to check if that is the case for your machines:

https://github.com/kubernetes/registry.k8s.io/issues/138#issuecomment-2013011206

You can ssh into that machine and execute these commands.

And you can look at /var/log/cloud-init-output.log.

Our caph controller creates an event if cloud-init failed. You can look at the events with k9s (for example).

You can use this command to get an overview of unhealthy conditions:

go  run github.com/guettli/check-conditions@latest all

What is the output of this command?

kubectl get events  --sort-by='.lastTimestamp' -A 

Did that help you?

dannynodies commented 1 month ago

Hi, thanks for the responses. I installed the CCM (I think this is part of the initial steps when bootstrapping) but I think the CNI comes after the control plane is up, unless I misunderstood. In this case the control-plane doesn't start up.

I ssh'd to the machine and verified that kubelet was running. I also checked /var/log/cloud-init-output.log and everything seems to have worked. Output of check-conditions:

$ go  run github.com/guettli/check-conditions@latest all
  default clusters hetzner-cluster Condition ControlPlaneReady=False NodeStartupTimeout @ Machine/hetzner-cluster-control-plane-bxjrh "0 of 1 completed" (2h8m33s)
  default kubeadmcontrolplanes hetzner-cluster-control-plane Condition MachinesReady=False NodeStartupTimeout @ Machine/hetzner-cluster-control-plane-bxjrh "0 of 1 completed" (2h8m33s)
  default machines hetzner-cluster-control-plane-bxjrh Condition HealthCheckSucceeded=False NodeStartupTimeout "Node failed to report startup in 15m0s" (2h8m33s)
  default machines hetzner-cluster-control-plane-bxjrh Condition NodeHealthy=False NodeProvisioning "Waiting for a node with matching ProviderID to exist" (2h23m33s)
  default machines hetzner-cluster-control-plane-bxjrh Condition OwnerRemediated=False WaitingForRemediation "KCP can't remediate if current replicas are less or equal to 1" (2h5m32s)
Checked 274 conditions of 500 resources of 84 types. Duration: 125ms

Output of kubectl get events is blank, but this might be due to the time elapsed. I will try again (this has occurred 5-6 times in a row, so I don't think any transient conditions are causing the issue).
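
The NodeHealthy message above means Cluster API has not yet found a Node in the workload cluster whose spec.providerID equals the Machine's spec.providerID. A rough way to compare the two is sketched below. The function names and the workload.kubeconfig file name are made up for illustration, the cluster and machine names are the ones from this thread, and the default namespace is assumed.

```shell
#!/usr/bin/env bash
# Sketch: compare the providerID recorded on the Machine (management cluster)
# with the providerIDs set on Nodes in the workload cluster.
# Function names and the kubeconfig file name are illustrative assumptions.

# Pure helper: succeeds only if the expected ID is non-empty and appears
# as a whole line in the list of node providerIDs read from stdin.
has_matching_provider_id() {
  local expected="$1"
  [ -n "$expected" ] && grep -Fxq -- "$expected"
}

# Cluster-touching part; nothing runs until this function is called.
check_machine_node_match() {
  local cluster="$1" machine="$2" kubeconfig="${3:-workload.kubeconfig}"
  local expected

  # Fetch the workload cluster's kubeconfig via the management cluster.
  clusterctl get kubeconfig "$cluster" > "$kubeconfig"

  # providerID that Cluster API expects, read from the Machine object.
  expected="$(kubectl get machine "$machine" -o jsonpath='{.spec.providerID}')"
  echo "Machine expects: ${expected:-<empty>}"

  # providerID of every Node in the workload cluster, one per line.
  if kubectl --kubeconfig "$kubeconfig" get nodes \
       -o jsonpath='{range .items[*]}{.spec.providerID}{"\n"}{end}' \
     | has_matching_provider_id "$expected"; then
    echo "A node with a matching providerID exists"
  else
    echo "No node with a matching providerID yet"
  fi
}
```

Example call: `check_machine_node_match hetzner-cluster hetzner-cluster-control-plane-bxjrh`. If no node matches, the Node either has not registered or has no providerID set yet, which would be consistent with the condition shown above.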

dannynodies commented 1 month ago

When I initially ran this, I created the cluster with 3 worker machines and 3 control-plane machines. I observed that it started a single control-plane machine, and then booted 3 worker machines. However, the deployment eventually failed with the same errors as above.

dannynodies commented 1 month ago

Rerun:

Looking ok so far:

$ clusterctl describe cluster hetzner-cluster --show-conditions all
NAME                                                                READY  SEVERITY  REASON                                                        SINCE  MESSAGE                                                                               
Cluster/hetzner-cluster                                             False  Info      ServerStarting @ Machine/hetzner-cluster-control-plane-z6hbh  50s    0 of 1 completed                                                                       
│           ├─ControlPlaneInitialized                               False  Info      WaitingForControlPlaneProviderInitialized                     78s    Waiting for control plane provider to indicate the control plane has been initialized  
│           ├─ControlPlaneReady                                     False  Info      ServerStarting @ Machine/hetzner-cluster-control-plane-z6hbh  50s    0 of 1 completed                                                                       
│           └─InfrastructureReady                                   False  Info      TargetClusterControlPlaneNotReady                             58s    target cluster not ready                                                               
├─ClusterInfrastructure - HetznerCluster/hetzner-cluster            False  Info      TargetClusterControlPlaneNotReady                             58s    target cluster not ready                                                               
│             ├─ControlPlaneEndpointSet                             True                                                                           77s                                                                                           
│             ├─HCloudTokenAvailable                                True                                                                           78s                                                                                           
│             ├─LoadBalancerReady                                   True                                                                           78s                                                                                           
│             ├─PlacementGroupsSynced                               True                                                                           78s                                                                                           
│             └─TargetClusterReady                                  False  Info      TargetClusterControlPlaneNotReady                             58s    target cluster not ready                                                               
└─ControlPlane - KubeadmControlPlane/hetzner-cluster-control-plane  False  Info      ServerStarting @ Machine/hetzner-cluster-control-plane-z6hbh  50s    0 of 1 completed                                                                       
  │           ├─Available                                           False  Info      WaitingForKubeadmInit                                         76s                                                                                           
  │           ├─CertificatesAvailable                               True                                                                           76s                                                                                           
  │           ├─MachinesCreated                                     True                                                                           66s                                                                                           
  │           ├─MachinesReady                                       False  Info      ServerStarting @ Machine/hetzner-cluster-control-plane-z6hbh  50s    0 of 1 completed                                                                       
  │           └─Resized                                             True                                                                           61s                                                                                           
  └─Machine/hetzner-cluster-control-plane-z6hbh                     True                                                                           5s                                                                                            
                ├─BootstrapReady                                    True                                                                           76s                                                                                           
                ├─InfrastructureReady                               True                                                                           5s                                                                                            
                └─NodeHealthy                                       False  Info      WaitingForNodeRef                                             76s                     

dannynodies commented 1 month ago

$ clusterctl describe cluster hetzner-cluster --show-conditions all
NAME                                                                READY  SEVERITY  REASON                             SINCE  MESSAGE                                              
Cluster/hetzner-cluster                                             False  Info      TargetClusterControlPlaneNotReady  6s     target cluster not ready                              
│           ├─ControlPlaneInitialized                               True                                                6s                                                           
│           ├─ControlPlaneReady                                     True                                                6s                                                           
│           └─InfrastructureReady                                   False  Info      TargetClusterControlPlaneNotReady  107s   target cluster not ready                              
├─ClusterInfrastructure - HetznerCluster/hetzner-cluster            False  Info      TargetClusterControlPlaneNotReady  107s   target cluster not ready                              
│             ├─ControlPlaneEndpointSet                             True                                                2m6s                                                         
│             ├─HCloudTokenAvailable                                True                                                2m7s                                                         
│             ├─LoadBalancerReady                                   True                                                2m7s                                                         
│             ├─PlacementGroupsSynced                               True                                                2m7s                                                         
│             └─TargetClusterReady                                  False  Info      TargetClusterControlPlaneNotReady  107s   target cluster not ready                              
└─ControlPlane - KubeadmControlPlane/hetzner-cluster-control-plane  True                                                6s                                                           
  │           ├─Available                                           True                                                6s                                                           
  │           ├─CertificatesAvailable                               True                                                2m5s                                                         
  │           ├─MachinesCreated                                     True                                                115s                                                         
  │           ├─MachinesReady                                       True                                                39s                                                          
  │           └─Resized                                             True                                                110s                                                         
  └─Machine/hetzner-cluster-control-plane-z6hbh                     True                                                54s                                                          
                ├─BootstrapReady                                    True                                                2m5s                                                         
                ├─InfrastructureReady                               True                                                54s                                                          
                └─NodeHealthy                                       False  Warning   NodeProvisioning                   7s     Waiting for a node with matching ProviderID to exist  

This is as far as it got last time before the 15m timeout

dannynodies commented 1 month ago

$ clusterctl describe cluster hetzner-cluster --show-conditions all
NAME                                                                READY  SEVERITY  REASON            SINCE  MESSAGE                                              
Cluster/hetzner-cluster                                             True                               7s                                                           
│           ├─ControlPlaneInitialized                               True                               73s                                                          
│           ├─ControlPlaneReady                                     True                               73s                                                          
│           └─InfrastructureReady                                   True                               7s                                                           
├─ClusterInfrastructure - HetznerCluster/hetzner-cluster            True                               7s                                                           
│             ├─ControlPlaneEndpointSet                             True                               3m13s                                                        
│             ├─HCloudTokenAvailable                                True                               3m14s                                                        
│             ├─LoadBalancerReady                                   True                               3m14s                                                        
│             ├─PlacementGroupsSynced                               True                               3m14s                                                        
│             ├─TargetClusterReady                                  True                               7s                                                           
│             └─TargetClusterSecretReady                            True                               7s                                                           
└─ControlPlane - KubeadmControlPlane/hetzner-cluster-control-plane  True                               73s                                                          
  │           ├─Available                                           True                               73s                                                          
  │           ├─CertificatesAvailable                               True                               3m12s                                                        
  │           ├─MachinesCreated                                     True                               3m2s                                                         
  │           ├─MachinesReady                                       True                               106s                                                         
  │           └─Resized                                             True                               2m57s                                                        
  └─Machine/hetzner-cluster-control-plane-z6hbh                     True                               2m1s                                                         
                ├─BootstrapReady                                    True                               3m12s                                                        
                ├─InfrastructureReady                               True                               2m1s                                                         
                └─NodeHealthy                                       False  Warning   NodeProvisioning  74s    Waiting for a node with matching ProviderID to exist

Is this actually healthy or do I need to wait? Running without showing conditions seems to indicate the node is ready? Do I need to install the CNI at this point (within 15m)?

batistein commented 1 month ago

Is it possible to create another cluster without deleting the current one? IPs are sometimes blocked on AWS, which hosts the mirror for k8s, and that could be the reason cloud-init is failing. You could also check the output of cloud-init to see where in the process the node provisioning fails.

dannynodies commented 1 month ago

As far as I can tell, cloud-init is successful.

root@hetzner-cluster-control-plane-2xhw9:~# grep success /var/log/cloud-init-output.log
Your Kubernetes control-plane has initialized successfully!
+ echo success

dannynodies commented 1 month ago

Looks like everything started up OK

root@hetzner-cluster-control-plane-2xhw9:~# systemctl status kubelet kubepods.slice
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: active (running) since Wed 2024-10-16 15:31:15 UTC; 4min 13s ago
       Docs: https://kubernetes.io/docs/
   Main PID: 3264 (kubelet)
      Tasks: 12 (limit: 4532)
     Memory: 32.1M
        CPU: 3.291s
     CGroup: /system.slice/kubelet.service
             └─3264 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
● kubepods.slice - libcontainer container kubepods.slice
     Loaded: loaded (/run/systemd/transient/kubepods.slice; transient)
  Transient: yes
    Drop-In: /run/systemd/transient/kubepods.slice.d
             └─50-CPUWeight.conf, 50-MemoryMax.conf, 50-TasksMax.conf
     Active: active since Wed 2024-10-16 15:30:26 UTC; 5min ago
         IO: 0B read, 12.9M written
      Tasks: 46 (limit: 30214)
     Memory: 263.3M (max: 3.7G available: 3.4G)
        CPU: 26.121s
     CGroup: /kubepods.slice
             ├─kubepods-besteffort.slice

dannynodies commented 1 month ago

Created another cluster (using the same kind cluster as before) with the same result:

$ clusterctl describe cluster hetzner-cluster-2 --show-conditions all
NAME                                                                  READY  SEVERITY  REASON            SINCE  MESSAGE                                              
Cluster/hetzner-cluster-2                                             True                               3m53s                                                        
│           ├─ControlPlaneInitialized                                 True                               3m53s                                                        
│           ├─ControlPlaneReady                                       True                               3m53s                                                        
│           └─InfrastructureReady                                     True                               4m11s                                                        
├─ClusterInfrastructure - HetznerCluster/hetzner-cluster-2            True                               4m11s                                                        
│             ├─ControlPlaneEndpointSet                               True                               7m14s                                                        
│             ├─HCloudTokenAvailable                                  True                               7m15s                                                        
│             ├─LoadBalancerReady                                     True                               7m15s                                                        
│             ├─PlacementGroupsSynced                                 True                               7m15s                                                        
│             ├─TargetClusterReady                                    True                               4m11s                                                        
│             └─TargetClusterSecretReady                              True                               4m11s                                                        
└─ControlPlane - KubeadmControlPlane/hetzner-cluster-2-control-plane  True                               3m53s                                                        
  │           ├─Available                                             True                               3m53s                                                        
  │           ├─CertificatesAvailable                                 True                               7m14s                                                        
  │           ├─MachinesCreated                                       True                               7m3s                                                         
  │           ├─MachinesReady                                         True                               5m48s                                                        
  │           └─Resized                                               True                               6m58s                                                        
  └─Machine/hetzner-cluster-2-control-plane-gsxj5                     True                               6m2s                                                         
                ├─BootstrapReady                                      True                               7m13s                                                        
                ├─InfrastructureReady                                 True                               6m2s                                                         
                └─NodeHealthy                                         False  Warning   NodeProvisioning  3m53s  Waiting for a node with matching ProviderID to exist

dannynodies commented 1 month ago

Created an entirely new kind cluster and repeated the tutorial, but got the same result.

Will try again with a new project and API key.

janiskemper commented 1 month ago

You should deploy the CNI immediately; see the docs: https://syself.com/docs/caph/getting-started/quickstart/creating-a-workload-cluster#deploying-the-cni-solution
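
A minimal sketch of that step, assuming Cilium installed via Helm with default values. The chart version and values that the project recommends are in the linked quickstart and are not reproduced here, so treat the install line as a placeholder. The function names and the workload.kubeconfig file name are also made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: deploy a CNI into the workload cluster as soon as the first
# control-plane machine is up, then check whether nodes become Ready.
# Assumptions: Cilium via Helm with default values, and a local file
# named workload.kubeconfig; neither is prescribed by this thread.

# Pure helper: reads `kubectl get nodes --no-headers` output on stdin and
# prints how many nodes have a STATUS other than exactly "Ready".
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Cluster-touching part; nothing runs until this function is called.
deploy_cni() {
  local cluster="$1" kubeconfig="${2:-workload.kubeconfig}"

  # Fetch the workload cluster's kubeconfig via the management cluster.
  clusterctl get kubeconfig "$cluster" > "$kubeconfig"

  # Install the CNI into the workload cluster (default values assumed).
  helm repo add cilium https://helm.cilium.io/
  helm repo update
  helm upgrade --install cilium cilium/cilium \
    --namespace kube-system --kubeconfig "$kubeconfig"

  # Nodes take a while to turn Ready; re-run this line until it prints 0.
  kubectl --kubeconfig "$kubeconfig" get nodes --no-headers | count_not_ready
}
```

Example call: `deploy_cni hetzner-cluster`. Running it well inside the 15m node startup timeout seen earlier in the thread avoids the NodeStartupTimeout failure described above.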

dannynodies commented 1 month ago

Thanks for the update! I can confirm that deploying the CNI allowed the rest of the cluster to deploy successfully! Thanks for the help, and sorry for the noise.

janiskemper commented 1 month ago

I'm very glad that it's working! Sorry for the many advanced ideas on how to solve this, without going into the easy ones first.