Closed: kisahm closed this issue 2 years ago
@kisahm this looks like the bootstrap provider (e.g. kubeadm) didn't create the kubeconfig secret... See: secrets "mycluster-kubeconfig" not found. Does the error persist after you recreate the cluster?
Edit: After revisiting your output, it seems that the secret is there. So the only remaining question is whether the error is persistent.
Could you SSH into the node and run crictl ps?
And did you install the CCM and a CNI? I see that the node status is NotReady.
A kubectl get pods -A could be very helpful.
@batistein I've dropped the cluster, but I'll recreate a new one and send you the output. Does the CAPI provider need the CNI and CCM to complete the deployment after the first control plane is up?
@kisahm every Kubernetes cluster needs a CCM and a CNI to work properly. In simple terms, the Cluster API takes over the lifecycle management of the cluster. It also checks whether the nodes are ready and whether the apiserver and etcd are functioning as intended. So if the initialization of the node is incomplete, it will not create any further control planes.
@batistein Thanks for your support. I misunderstood the docs. I deployed the CNI and CCM after the first control plane came up, and after that the other control-plane nodes were created.
I'm glad to help! Do you have a suggestion on how we can improve the documentation at this point? And if you like the project please give it a :star:
Re docs improvements: can you mention in https://github.com/syself/cluster-api-provider-hetzner/blob/main/docs/topics/quickstart.md#apply-the-workload-cluster that the following two steps (CNI and CCM deployment) are needed to finalize the bootstrap before the timeout for node deployment expires?
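For reference, those two steps might look roughly like this. This is only a sketch: the chart names, repo URLs, and namespaces are assumptions for illustration, not taken from this thread; the quickstart guide has the exact, current commands.

```shell
# Sketch only: deploy a CNI (Cilium here) and a CCM into the freshly
# bootstrapped workload cluster, before the node-deployment timeout expires.
# Chart names and repo URLs below are illustrative assumptions.
export KUBECONFIG=/path/to/workload-cluster.kubeconfig

# 1) CNI
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --namespace kube-system

# 2) CCM (here assumed to come from a Helm repo; adjust to the guide)
helm install ccm <your-ccm-chart> --namespace kube-system
```

Only after both are running do the remaining control-plane nodes get created.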
So basically I can't have 3 control planes in my initial setup, and can only add extra control planes once I move it to the new cluster?
Hey @ashish1099, can you give me more details on the following questions? Which version of CAPH are you using? Ideally, you can scale your control planes even if you have 3 control planes at the start. For example, if you started your cluster with 3 control planes and 1 worker node and want to have 4 control planes and 4 worker nodes, you can scale your KubeadmControlPlane and MachineDeployment objects respectively.
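Concretely, scaling could look like this; the resource names and namespace below are hypothetical placeholders, not taken from this thread:

```shell
# Sketch: scale the control plane and the workers of an existing cluster.
# "k8s-control-plane" and "k8s-md-0" are placeholder names; look up your own
# with: kubectl get kubeadmcontrolplane,machinedeployment -A
kubectl scale kubeadmcontrolplane k8s-control-plane --replicas=4
kubectl scale machinedeployment k8s-md-0 --replicas=4
```

This works because both objects expose the standard scale subresource, so no manifest edit is needed.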
once I move it to the new cluster ?
Do you mean moving from the bootstrap cluster to the management cluster?
I'm using v1.6.3 for cluster-api, hcloud-ccm 1.19.0, and CAPI v1.0.0-beta.33.
This is a simple control-plane setup starting with 3 nodes (1 node gets set up fine), and I can run kubectl get pods and install Cilium as well.
But the other two nodes never get picked up. All I see is this error:
Warning ControlPlaneUnhealthy 11m (x980 over 4h32m) kubeadm-control-plane-controller Waiting for control plane to pass preflight checks to continue reconciliation: Machine kcm-control-plane-mlrvl does not have a corresponding Node yet (Machine.status.nodeRef not set)
clusterctl describe cluster k8s -n caph-system
NAME READY SEVERITY REASON SINCE MESSAGE
Cluster/kcm False Warning ScalingUp 4h33m Scaling up control plane to 3 replicas (actual 1)
├─ClusterInfrastructure - HetznerCluster/kcm True 113m
└─ControlPlane - KubeadmControlPlane/kcm-control-plane False Warning ScalingUp 4h33m Scaling up control plane to 3 replicas (actual 1)
└─Machine/kcm-control-plane-mlrvl False Warning MachineHasFailure 32m FailureReason: UpdateError
Do you also install the CCM, in addition to the CNI?
@ashish1099 Thank you for the input.
Can you share the logs of hetzner-ccm? Also, it would be great if you could share how to reproduce the problem you're facing. What do your HcloudMachine objects look like (the ones that are not coming up; can you share those with status, please)?
This is my HetznerBareMetalHost
---
# Source: capi-hetzner/templates/HetznerBareMetalHost.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerBareMetalHost
metadata:
  name: htzhel1-ax41d.enableit.dk
  labels:
    role: kcm-control-plane
spec:
  serverID: 123456
  maintenanceMode: false
  rootDeviceHints:
    raid:
      wwn:
        - "0x50000397cb700acd"
        - "0x500003981be00179"
  description: "Cluster kcm control plane node htzhel1-ax41d.enableit.dk"
---
# Source: capi-hetzner/templates/HetznerBareMetalHost.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerBareMetalHost
metadata:
  name: htzhel1-ax41e.enableit.dk
  labels:
    role: kcm-control-plane
spec:
  serverID: 123457
  maintenanceMode: false
  rootDeviceHints:
    raid:
      wwn:
        - "0x5000cca25ed86494"
        - "0x5000cca25ecf496b"
  description: "Cluster kcm control plane node htzhel1-ax41e.enableit.dk"
---
# Source: capi-hetzner/templates/HetznerBareMetalHost.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerBareMetalHost
metadata:
  name: htzhel1-ax41f.enableit.dk
  labels:
    role: kcm-control-plane
spec:
  serverID: 123458
  maintenanceMode: false
  rootDeviceHints:
    raid:
      wwn:
        - "0x50014ee2097c6ca4"
        - "0x50014ee2097c9304"
  description: "Cluster kcm control plane node htzhel1-ax41f.enableit.dk"
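As an aside, the wwn values used in rootDeviceHints can be read off the server itself; one way to do that (an assumption for illustration, not a step from this thread) is lsblk, e.g. from the Hetzner rescue system:

```shell
# Sketch: list the WWNs of the physical disks on the server, to fill in
# spec.rootDeviceHints.raid.wwn. --nodeps hides partitions, showing only
# the top-level devices.
lsblk --nodeps --output NAME,WWN,SIZE,MODEL
```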
And here is the status of one of the nodes:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerBareMetalHost
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","kind":"HetznerBareMetalHost","metadata":{"annotations":{},"labels":{"argocd.argoproj.io/instance":"kcm-capi-hetzner","role":"kcm-control-plane"},"name":"htzhel1-ax41d.enableit.dk","namespace":"caph-system"},"spec":{"description":"Cluster kcm control plane node htzhel1-ax41d.enableit.dk","maintenanceMode":false,"rootDeviceHints":{"raid":{"wwn":["0x50000397cb700acd","0x500003981be00179"]}},"serverID":1866472}}
  creationTimestamp: "2024-04-02T11:00:55Z"
  finalizers:
    - hetznerbaremetalhost.infrastructure.cluster.x-k8s.io
  generation: 1
  labels:
    argocd.argoproj.io/instance: kcm-capi-hetzner
    role: kcm-control-plane
  name: htzhel1-ax41d.enableit.dk
  namespace: caph-system
  resourceVersion: "497386198"
  uid: 471425b9-f5b6-4a1d-9bd8-1a06c872423a
spec:
  description: Cluster kcm control plane node htzhel1-ax41d.enableit.dk
  maintenanceMode: false
  rootDeviceHints:
    raid:
      wwn:
        - "0x50000397cb700acd"
        - "0x500003981be00179"
  serverID: 1866472
status:
  errorCount: 0
  errorMessage: ""
  hetznerClusterRef: ""
  ipv4: ""
  ipv6: ""
  sshStatus: {}
I have tried deleting the cluster multiple times; it picks one of the 3 nodes, and that one gets installed fine on a fresh new cluster setup.
And these are the logs:
Flag --allow-untagged-cloud has been deprecated, This flag is deprecated and will be removed in a future release. A cluster-id will be required on cloud instances.
I0401 23:50:15.733829 1 serving.go:348] Generated self-signed cert in-memory
W0401 23:50:15.734085 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0401 23:50:15.943727 1 metrics.go:69] Starting metrics server at :8233
I0401 23:50:16.693611 1 cloud.go:123] Hetzner Cloud k8s cloud controller v1.19.0 started
W0401 23:50:16.693787 1 main.go:75] detected a cluster without a ClusterID. A ClusterID will be required in the future. Please tag your cluster to avoid any future issues
I0401 23:50:16.693870 1 controllermanager.go:168] Version: v0.0.0-master+$Format:%H$
I0401 23:50:16.709732 1 secure_serving.go:213] Serving securely on :10258
I0401 23:50:16.710062 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0401 23:50:16.710209 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0401 23:50:16.710329 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0401 23:50:16.710444 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0401 23:50:16.710457 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0401 23:50:16.710444 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0401 23:50:16.710469 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0401 23:50:16.732607 1 controllermanager.go:337] Started "cloud-node-controller"
I0401 23:50:16.732740 1 node_controller.go:165] Sending events to api server.
I0401 23:50:16.732781 1 controllermanager.go:337] Started "cloud-node-lifecycle-controller"
I0401 23:50:16.732836 1 node_controller.go:174] Waiting for informer caches to sync
I0401 23:50:16.733030 1 node_lifecycle_controller.go:113] Sending events to api server
I0401 23:50:16.733059 1 controllermanager.go:337] Started "service-lb-controller"
W0401 23:50:16.733080 1 core.go:111] --configure-cloud-routes is set, but cloud provider does not support routes. Will not configure cloud provider routes.
W0401 23:50:16.733086 1 controllermanager.go:325] Skipping "node-route-controller"
I0401 23:50:16.733228 1 controller.go:231] Starting service controller
I0401 23:50:16.733243 1 shared_informer.go:311] Waiting for caches to sync for service
I0401 23:50:16.810775 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I0401 23:50:16.810810 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0401 23:50:16.810775 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0401 23:50:16.833322 1 shared_informer.go:318] Caches are synced for service
E0402 04:06:57.267505 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41a.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:06:57.435053 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41b.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:06:57.579798 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41c.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:06:57.756797 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41na.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:06:57.865007 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nb.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:06:58.034098 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nc.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:11:58.299382 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41c.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:11:58.427021 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41na.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:11:58.544025 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nb.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:11:58.674332 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nc.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:11:58.782996 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41a.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:11:58.894625 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41b.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:16:59.127089 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nc.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:16:59.279204 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41a.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:16:59.397971 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41b.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:16:59.521534 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41c.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:16:59.638731 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41na.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:16:59.741794 1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nb.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
The nodes that show up in these log errors are ones I don't want to be set up, so that part looks fine to me.
Adding another piece of information:
{"level":"ERROR","time":"2024-04-02T10:43:41.402Z","file":"controller/controller.go:329","message":"Reconciler error","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","HetznerBareMetalHost":{"name":"htzhel1-ax41d.enableit.dk","namespace":"caph-system"},"namespace":"caph-system","name":"htzhel1-ax41d.enableit.dk","reconcileID":"c140622e-7c0a-47b0-826c-cfc5c140a2c4","error":"failed to get HetznerCluster: HetznerCluster.infrastructure.cluster.x-k8s.io \"\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.5/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.5/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.5/pkg/internal/controller/controller.go:227"}
Do you also install the CCM, in addition to the CNI?
Yes, the CCM is installed (version 1.19.0) and the CNI is on 1.14.4.
You get a 403 error there. Are you sure that there is one node that came up successfully?
And you're using the Hetzner CCM. Can you try our fork? I have never tried Hetzner's own CCM until now, so I'm not sure whether it works. Maybe the 403 is related to that.
Let me try the fork one; I hadn't tried that so far :)
With the hcloud-cloud-controller-manager you need to make sure that the correct secret is configured to access the Hetzner Cloud API. In the default configuration, the secret created by cluster-api-provider-hetzner in the workload cluster is called hetzner. hcloud-ccm expects a secret called hcloud by default, but it can be configured to use whatever secret you want.
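A minimal sketch of bridging that naming mismatch; the namespace and token value below are placeholders, not taken from this thread:

```shell
# Sketch: create the "hcloud" secret that hcloud-ccm expects by default in
# the workload cluster, using the same Hetzner Cloud API token that CAPH
# stores in its "hetzner" secret. Replace the placeholder with your token.
kubectl --namespace kube-system create secret generic hcloud \
  --from-literal=token='<your-hcloud-api-token>'
```

Alternatively, hcloud-ccm can be pointed at a differently named secret via its configuration instead of duplicating the token.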
I have my cluster running with CAPH & hcloud-ccm and can confirm that it works well for Hetzner-Cloud-only scenarios. The logs indicate invalid credentials and issues while talking to the Hetzner Robot API. During the timeframe indicated by the logs, there was a scheduled maintenance of the Robot API, which would explain the issues: https://status.hetzner.com/incident/e162e98c-61b8-4a5d-8ea8-1ea3a84a6ee1
The fork CCM worked for me.
One thing I noticed is that the first node got provisioned but the second node didn't, until I installed the CCM and CNI (Cilium 1.14.4) on the new cluster; then the second node got provisioned, and soon after that the third node got provisioned too.
So I'm not sure whether it was just a coincidence or whether it is a requirement; maybe a doc improvement?
Yes, it is a requirement (see above in this issue). @kranurag7, did you already include the CNI and CCM as requirements in your docs PR?
I didn't as part of my PR, but it's there in the quickstart guide. Ref:
Then please add a sentence saying that both CNI and CCM are required after the first control plane is ready; otherwise the cluster won't be usable.
/kind bug
What steps did you take and what happened: I tried to deploy a k8s cluster using the Hetzner CAPI provider, but the control plane is not able to get healthy. The deployment got stuck after the first control plane.
My steps:
The cluster is created, but the control plane is not able to get healthy:
I've tested hetzner:v1.0.0-alpha.19 and hetzner:v1.0.0-alpha.20, but I get the same result.
Environment: