Did you deploy the CCM and CNI?
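(A quick way to check, assuming the standard hcloud CCM deployment name, a Cilium CNI, and the workload kubeconfig written to hetzner-cluster.kubeconfig:)
$ kubectl --kubeconfig hetzner-cluster.kubeconfig -n kube-system get deploy hcloud-cloud-controller-manager
$ kubectl --kubeconfig hetzner-cluster.kubeconfig -n kube-system get pods | grep -i cilium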
Hetzner IPs are sometimes blocked. When that happens, downloading the kubeadm dpkg/rpm packages fails.
This comment lists some commands to check whether that is the case for your machines:
https://github.com/kubernetes/registry.k8s.io/issues/138#issuecomment-2013011206
You can ssh into that machine and execute these commands.
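For illustration, checks along those lines (the exact commands are in the linked comment):
$ curl -sS -o /dev/null -w '%{http_code}\n' https://registry.k8s.io/
$ curl -sS -o /dev/null -w '%{http_code}\n' https://pkgs.k8s.io/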
You can also look at /var/log/cloud-init-output.log.
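For example, on the node:
$ cloud-init status --long
$ tail -n 50 /var/log/cloud-init-output.log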
Our caph controller creates an event if cloud-init fails. You can look at the events with k9s (for example).
You can use this command to get an overview of unhealthy Conditions:
go run github.com/guettli/check-conditions@latest all
What is the output of that? And what does this show:
kubectl get events --sort-by='.lastTimestamp' -A
Did that help you?
Hi, thanks for the responses. I installed the CCM (I think this is part of the initial bootstrapping steps), but I believe the CNI comes after the control plane is up, unless I misunderstood. In this case the control plane doesn't start up at all.
I ssh'd into the machine and verified that kubelet was running. I also checked /var/log/cloud-init-output.log, and everything seems to have worked. Output of check-conditions:
$ go run github.com/guettli/check-conditions@latest all
default clusters hetzner-cluster Condition ControlPlaneReady=False NodeStartupTimeout @ Machine/hetzner-cluster-control-plane-bxjrh "0 of 1 completed" (2h8m33s)
default kubeadmcontrolplanes hetzner-cluster-control-plane Condition MachinesReady=False NodeStartupTimeout @ Machine/hetzner-cluster-control-plane-bxjrh "0 of 1 completed" (2h8m33s)
default machines hetzner-cluster-control-plane-bxjrh Condition HealthCheckSucceeded=False NodeStartupTimeout "Node failed to report startup in 15m0s" (2h8m33s)
default machines hetzner-cluster-control-plane-bxjrh Condition NodeHealthy=False NodeProvisioning "Waiting for a node with matching ProviderID to exist" (2h23m33s)
default machines hetzner-cluster-control-plane-bxjrh Condition OwnerRemediated=False WaitingForRemediation "KCP can't remediate if current replicas are less or equal to 1" (2h5m32s)
Checked 274 conditions of 500 resources of 84 types. Duration: 125ms
Output of kubectl get events is blank, but this might be due to the time elapsed. I will try again (this has occurred 5-6 times in a row, so I don't think any transient conditions are causing the issue).
When I initially ran this, I created the cluster with 3 worker machines and 3 control-plane machines. I observed that it started a single control-plane machine and then booted the 3 worker machines. However, the deployment eventually failed with the same errors as above.
Rerun:
Looking ok so far:
$ clusterctl describe cluster hetzner-cluster --show-conditions all
NAME READY SEVERITY REASON SINCE MESSAGE
Cluster/hetzner-cluster False Info ServerStarting @ Machine/hetzner-cluster-control-plane-z6hbh 50s 0 of 1 completed
│ ├─ControlPlaneInitialized False Info WaitingForControlPlaneProviderInitialized 78s Waiting for control plane provider to indicate the control plane has been initialized
│ ├─ControlPlaneReady False Info ServerStarting @ Machine/hetzner-cluster-control-plane-z6hbh 50s 0 of 1 completed
│ └─InfrastructureReady False Info TargetClusterControlPlaneNotReady 58s target cluster not ready
├─ClusterInfrastructure - HetznerCluster/hetzner-cluster False Info TargetClusterControlPlaneNotReady 58s target cluster not ready
│ ├─ControlPlaneEndpointSet True 77s
│ ├─HCloudTokenAvailable True 78s
│ ├─LoadBalancerReady True 78s
│ ├─PlacementGroupsSynced True 78s
│ └─TargetClusterReady False Info TargetClusterControlPlaneNotReady 58s target cluster not ready
└─ControlPlane - KubeadmControlPlane/hetzner-cluster-control-plane False Info ServerStarting @ Machine/hetzner-cluster-control-plane-z6hbh 50s 0 of 1 completed
│ ├─Available False Info WaitingForKubeadmInit 76s
│ ├─CertificatesAvailable True 76s
│ ├─MachinesCreated True 66s
│ ├─MachinesReady False Info ServerStarting @ Machine/hetzner-cluster-control-plane-z6hbh 50s 0 of 1 completed
│ └─Resized True 61s
└─Machine/hetzner-cluster-control-plane-z6hbh True 5s
├─BootstrapReady True 76s
├─InfrastructureReady True 5s
└─NodeHealthy False Info WaitingForNodeRef 76s
$ clusterctl describe cluster hetzner-cluster --show-conditions all
NAME READY SEVERITY REASON SINCE MESSAGE
Cluster/hetzner-cluster False Info TargetClusterControlPlaneNotReady 6s target cluster not ready
│ ├─ControlPlaneInitialized True 6s
│ ├─ControlPlaneReady True 6s
│ └─InfrastructureReady False Info TargetClusterControlPlaneNotReady 107s target cluster not ready
├─ClusterInfrastructure - HetznerCluster/hetzner-cluster False Info TargetClusterControlPlaneNotReady 107s target cluster not ready
│ ├─ControlPlaneEndpointSet True 2m6s
│ ├─HCloudTokenAvailable True 2m7s
│ ├─LoadBalancerReady True 2m7s
│ ├─PlacementGroupsSynced True 2m7s
│ └─TargetClusterReady False Info TargetClusterControlPlaneNotReady 107s target cluster not ready
└─ControlPlane - KubeadmControlPlane/hetzner-cluster-control-plane True 6s
│ ├─Available True 6s
│ ├─CertificatesAvailable True 2m5s
│ ├─MachinesCreated True 115s
│ ├─MachinesReady True 39s
│ └─Resized True 110s
└─Machine/hetzner-cluster-control-plane-z6hbh True 54s
├─BootstrapReady True 2m5s
├─InfrastructureReady True 54s
└─NodeHealthy False Warning NodeProvisioning 7s Waiting for a node with matching ProviderID to exist
This is as far as it got last time before the 15m timeout:
$ clusterctl describe cluster hetzner-cluster --show-conditions all
NAME READY SEVERITY REASON SINCE MESSAGE
Cluster/hetzner-cluster True 7s
│ ├─ControlPlaneInitialized True 73s
│ ├─ControlPlaneReady True 73s
│ └─InfrastructureReady True 7s
├─ClusterInfrastructure - HetznerCluster/hetzner-cluster True 7s
│ ├─ControlPlaneEndpointSet True 3m13s
│ ├─HCloudTokenAvailable True 3m14s
│ ├─LoadBalancerReady True 3m14s
│ ├─PlacementGroupsSynced True 3m14s
│ ├─TargetClusterReady True 7s
│ └─TargetClusterSecretReady True 7s
└─ControlPlane - KubeadmControlPlane/hetzner-cluster-control-plane True 73s
│ ├─Available True 73s
│ ├─CertificatesAvailable True 3m12s
│ ├─MachinesCreated True 3m2s
│ ├─MachinesReady True 106s
│ └─Resized True 2m57s
└─Machine/hetzner-cluster-control-plane-z6hbh True 2m1s
├─BootstrapReady True 3m12s
├─InfrastructureReady True 2m1s
└─NodeHealthy False Warning NodeProvisioning 74s Waiting for a node with matching ProviderID to exist
Is this actually healthy, or do I need to wait? Running without --show-conditions seems to indicate the node is ready. Do I need to install the CNI at this point (within the 15m timeout)?
Is it possible to create another cluster without deleting the current one? There are sometimes IPs blocked on AWS (which hosts the mirror for k8s and could be the reason cloud-init is failing). You could also check the output of cloud-init to see where in the process the node provisioning fails.
As far as I can tell, cloud-init is successful.
root@hetzner-cluster-control-plane-2xhw9:~# grep success /var/log/cloud-init-output.log
Your Kubernetes control-plane has initialized successfully!
+ echo success
Looks like everything started up OK
root@hetzner-cluster-control-plane-2xhw9:~# systemctl status kubelet kubepods.slice
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Wed 2024-10-16 15:31:15 UTC; 4min 13s ago
Docs: https://kubernetes.io/docs/
Main PID: 3264 (kubelet)
Tasks: 12 (limit: 4532)
Memory: 32.1M
CPU: 3.291s
CGroup: /system.slice/kubelet.service
└─3264 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
● kubepods.slice - libcontainer container kubepods.slice
Loaded: loaded (/run/systemd/transient/kubepods.slice; transient)
Transient: yes
Drop-In: /run/systemd/transient/kubepods.slice.d
└─50-CPUWeight.conf, 50-MemoryMax.conf, 50-TasksMax.conf
Active: active since Wed 2024-10-16 15:30:26 UTC; 5min ago
IO: 0B read, 12.9M written
Tasks: 46 (limit: 30214)
Memory: 263.3M (max: 3.7G available: 3.4G)
CPU: 26.121s
CGroup: /kubepods.slice
├─kubepods-besteffort.slice
Created another cluster (using the same kind cluster as before) with the same result:
$ clusterctl describe cluster hetzner-cluster-2 --show-conditions all
NAME READY SEVERITY REASON SINCE MESSAGE
Cluster/hetzner-cluster-2 True 3m53s
│ ├─ControlPlaneInitialized True 3m53s
│ ├─ControlPlaneReady True 3m53s
│ └─InfrastructureReady True 4m11s
├─ClusterInfrastructure - HetznerCluster/hetzner-cluster-2 True 4m11s
│ ├─ControlPlaneEndpointSet True 7m14s
│ ├─HCloudTokenAvailable True 7m15s
│ ├─LoadBalancerReady True 7m15s
│ ├─PlacementGroupsSynced True 7m15s
│ ├─TargetClusterReady True 4m11s
│ └─TargetClusterSecretReady True 4m11s
└─ControlPlane - KubeadmControlPlane/hetzner-cluster-2-control-plane True 3m53s
│ ├─Available True 3m53s
│ ├─CertificatesAvailable True 7m14s
│ ├─MachinesCreated True 7m3s
│ ├─MachinesReady True 5m48s
│ └─Resized True 6m58s
└─Machine/hetzner-cluster-2-control-plane-gsxj5 True 6m2s
├─BootstrapReady True 7m13s
├─InfrastructureReady True 6m2s
└─NodeHealthy False Warning NodeProvisioning 3m53s Waiting for a node with matching ProviderID to exist
Created an entirely new kind cluster and repeated the tutorial, but got the same result.
Will try again with a new project and API key.
You should deploy the CNI immediately; see the docs: https://syself.com/docs/caph/getting-started/quickstart/creating-a-workload-cluster#deploying-the-cni-solution
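For reference, a minimal sketch along the lines of the linked quickstart (the Cilium chart and the cluster name are assumptions here; follow the docs for the exact steps):
$ clusterctl get kubeconfig hetzner-cluster > hetzner-cluster.kubeconfig
$ helm repo add cilium https://helm.cilium.io/
$ KUBECONFIG=hetzner-cluster.kubeconfig helm install cilium cilium/cilium --namespace kube-system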
Thanks for the update! I can confirm that deploying the CNI allowed the rest of the cluster to deploy successfully! Thanks for the help, and sorry for the noise.
I'm very glad that it's working! Sorry for jumping to the advanced debugging ideas without going through the easy ones first.
/kind bug
What steps did you take and what happened: Followed the guide to bootstrap the initial cluster, but the first control-plane node never becomes healthy. I manually checked that the hcloud:// id was matching but for some reason the API hasn't picked up the machine.
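(For reference, that check was along these lines; the machine name and kubeconfig path are assumptions:)
$ kubectl get machine hetzner-cluster-control-plane-bxjrh -o jsonpath='{.spec.providerID}'
$ kubectl --kubeconfig hetzner-cluster.kubeconfig get nodes -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID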
What did you expect to happen: Expected bootstrapping to complete normally
Anything else you would like to add: it eventually fails (see the conditions output above).
Environment:
- Version: 1.0.0
- OS (e.g. from /etc/os-release): Ubuntu 22.04