syself / cluster-api-provider-hetzner

Cluster API Provider Hetzner :rocket: The best way to manage Kubernetes clusters on Hetzner, fully declarative, Kubernetes-native and with self-healing capabilities
https://caph.syself.com
Apache License 2.0

controlplane bootstrap fails/stuck after first controlplane node #252

Closed: kisahm closed this issue 2 years ago

kisahm commented 2 years ago

/kind bug

What steps did you take and what happened: I tried to deploy a Kubernetes cluster using the Hetzner CAPI provider, but the control plane never becomes healthy. The deployment gets stuck after the first control plane node.

my steps:

clusterctl generate provider --infrastructure hetzner:v1.0.0-alpha.20 > hetzner-capi.yml
kubectl apply -f hetzner-capi.yml

export HCLOUD_TOKEN=xxxxxxxxxxxx
export HCLOUD_SSH_KEY="mykey"
export CLUSTER_NAME="mycluster"
export HCLOUD_REGION="fsn1"
export CONTROL_PLANE_MACHINE_COUNT=3
export WORKER_MACHINE_COUNT=3
export KUBERNETES_VERSION=1.24.1
export HCLOUD_CONTROL_PLANE_MACHINE_TYPE=cpx31
export HCLOUD_WORKER_MACHINE_TYPE=cpx31

kubectl create secret generic hetzner --from-literal=hcloud=${HCLOUD_TOKEN}
kubectl patch secret hetzner -p '{"metadata":{"labels":{"clusterctl.cluster.x-k8s.io/move":""}}}'
clusterctl generate cluster --infrastructure hetzner:v1.0.0-alpha.20 ${CLUSTER_NAME} > ${CLUSTER_NAME}.yaml
kubectl apply -f ${CLUSTER_NAME}.yaml
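
(Not shown above: the workload-cluster kubeconfig used in the outputs below was presumably fetched along these lines; just a sketch, writing it to a file named after the cluster, as the --kubeconfig flag further down suggests.)

clusterctl get kubeconfig ${CLUSTER_NAME} > ${CLUSTER_NAME}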

The cluster is created, but the control plane is not able to become healthy:

$ clusterctl describe cluster ${CLUSTER_NAME}
NAME                                                                       READY  SEVERITY  REASON                       SINCE  MESSAGE                                                                  
Cluster/mycluster                                                          False  Warning   ScalingUp                    71m    Scaling up control plane to 3 replicas (actual 1)                        
├─ClusterInfrastructure - HetznerCluster/mycluster                                                                                                                                                       
├─ControlPlane - KubeadmControlPlane/mycluster-control-plane               False  Warning   ScalingUp                    71m    Scaling up control plane to 3 replicas (actual 1)                        
│ └─Machine/mycluster-control-plane-x5jb5                                  False  Warning   NodeStartupTimeout           49m    Node failed to report startup in &Duration{Duration:20m0s,}              
│   └─MachineInfrastructure - HCloudMachine/mycluster-control-plane-prxwq                                                                                                                                
└─Workers                                                                                                                                                                                                
  └─MachineDeployment/mycluster-md-0                                       False  Warning   WaitingForAvailableMachines  72m    Minimum availability requires 3 replicas, current 0 available            
    └─3 Machines...                                                        True                                          7m53s  See mycluster-md-0-59f5696b48-khjkp, mycluster-md-0-59f5696b48-v57kg, ...

$ kubectl get KubeadmControlPlane
NAME                          CLUSTER         INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
mycluster-control-plane       mycluster       true                                 1                  1         1             73m   v1.24.1

$ kubectl describe KubeadmControlPlane mycluster-control-plane
.....
Events:
  Type     Reason                 Age                    From                              Message
  ----     ------                 ----                   ----                              -------
  Warning  ControlPlaneUnhealthy  3m28s (x285 over 73m)  kubeadm-control-plane-controller  Waiting for control plane to pass preflight checks to continue reconciliation: [machine mycluster-control-plane-x5jb5 does not have APIServerPodHealthy condition, machine mycluster-control-plane-x5jb5 does not have ControllerManagerPodHealthy condition, machine mycluster-control-plane-x5jb5 does not have SchedulerPodHealthy condition, machine mycluster-control-plane-x5jb5 does not have EtcdPodHealthy condition, machine mycluster-control-plane-x5jb5 does not have EtcdMemberHealthy condition]

$ kubectl get MachineHealthCheck
NAME                                       CLUSTER         EXPECTEDMACHINES   MAXUNHEALTHY   CURRENTHEALTHY   AGE
mycluster-control-plane-unhealthy-5m       mycluster       1                  100%                            74m
mycluster-md-0-unhealthy-5m                mycluster       3                  100%                            74m

$ kubectl describe MachineHealthCheck mycluster-control-plane-unhealthy-5m
Events:
  Type     Reason          Age                 From                           Message
  ----     ------          ----                ----                           -------
  Warning  ReconcileError  75m (x13 over 75m)  machinehealthcheck-controller  error creating client and cache for remote cluster: error fetching REST client config for remote cluster "default/mycluster": failed to retrieve kubeconfig secret for Cluster default/mycluster: secrets "mycluster-kubeconfig" not found
  Warning  ReconcileError  74m                 machinehealthcheck-controller  error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster "default/mycluster": Get "https://142.132.240.114:443/api?timeout=10s": dial tcp 142.132.240.114:443: i/o timeout
  Warning  ReconcileError  73m (x4 over 74m)   machinehealthcheck-controller  error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster "default/mycluster": context deadline exceeded

$ kubectl get secrets
NAME                                TYPE                                  DATA   AGE
default-token-vcccj                 kubernetes.io/service-account-token   3      5h35m
hetzner                             Opaque                                1      78m
mycluster-ca                        cluster.x-k8s.io/secret               2      76m
mycluster-control-plane-xkvnf       cluster.x-k8s.io/secret               2      76m
mycluster-etcd                      cluster.x-k8s.io/secret               2      76m
mycluster-kubeconfig                cluster.x-k8s.io/secret               1      76m
mycluster-md-0-48wdl                cluster.x-k8s.io/secret               2      12m
mycluster-md-0-rlccb                cluster.x-k8s.io/secret               2      13m
mycluster-md-0-x2q8s                cluster.x-k8s.io/secret               2      12m
mycluster-proxy                     cluster.x-k8s.io/secret               2      76m
mycluster-sa                        cluster.x-k8s.io/secret               2      76m

# Get the node status of the deployed cluster
$ kubectl get no --kubeconfig mycluster
NAME                            STATUS   ROLES              AGE    VERSION
mycluster-control-plane-x5jb5   NotReady    control-plane   74m    v1.24.1
mycluster-md-0-khjkp            NotReady    <none>          72m    v1.24.1
mycluster-md-0-v57kg            NotReady    <none>          72m    v1.24.1
mycluster-md-0-x5d32            NotReady    <none>          72m    v1.24.1

# Try to fetch the API - it is reachable
$ curl https://142.132.240.114:443/api?timeout=10s -k
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/api\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

I tested both hetzner:v1.0.0-alpha.19 and hetzner:v1.0.0-alpha.20, but I get the same result.

Environment:

batistein commented 2 years ago

@kisahm this looks like the bootstrap provider (e.g. kubeadm) didn't create the kubeconfig secret... See: secrets "mycluster-kubeconfig" not found. Does the error persist after you recreate the cluster?

Edit: After revisiting your output, it seems that the secret is there. So the only question is whether the error is persistent.

batistein commented 2 years ago

Could you SSH to the node and run crictl ps? And did you install the CCM and a CNI? I see that the node status is NotReady. A kubectl get pods -A would be very helpful.
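
For example (the node IP and kubeconfig path are placeholders; just a sketch):

ssh root@<control-plane-node-ip>
crictl ps -a                                   # list all containers, including exited ones

# from the management machine, against the workload cluster
kubectl get pods -A --kubeconfig <workload-kubeconfig>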

kisahm commented 2 years ago

@batistein I've dropped the cluster, but I'll recreate a new one and send you the output. Does the CAPI provider need the CNI and CCM to complete the deployment after the first control plane is up?

batistein commented 2 years ago

@kisahm every Kubernetes cluster needs a CCM and a CNI to work properly. In simple terms, Cluster API takes over the lifecycle management of the cluster. It also checks whether the nodes are ready and whether the API server and etcd are functioning as intended. So if the initialization of the node never completes, it will not create any further control planes.
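
As a rough illustration only (chart sources and names are assumptions, not taken from this issue; the CAPH quickstart has the authoritative steps), installing a CNI and a CCM into the workload cluster could look like this:

export KUBECONFIG=<workload-kubeconfig>        # kubeconfig of the workload cluster (placeholder)

# CNI: Cilium from its public Helm chart
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --namespace kube-system

# CCM: the Hetzner cloud-controller-manager; it reads the API token from a secret
# (see the secret discussion further down in this issue)
helm repo add hcloud https://charts.hetzner.cloud
helm install hccm hcloud/hcloud-cloud-controller-manager --namespace kube-system

Once both are running, the first node should become Ready and Cluster API should continue creating the remaining control planes.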

kisahm commented 2 years ago

@batistein thanks for your support. I misunderstood the docs. I deployed the CNI and CCM after the first control plane node was up, and after that the other control plane nodes were created.

batistein commented 2 years ago

I'm glad to help! Do you have a suggestion on how we can improve the documentation at this point? And if you like the project please give it a :star:

radimk commented 1 year ago

Re docs improvements: can you mention in https://github.com/syself/cluster-api-provider-hetzner/blob/main/docs/topics/quickstart.md#apply-the-workload-cluster that the following two steps (CNI and CCM deployment) are needed to finalize the bootstrap before the node startup timeout expires?

ashish1099 commented 6 months ago

So basically I can't have 3 control planes in my initial setup and can only add extra control planes once I move it to the new cluster?

kranurag7 commented 6 months ago

So basically I can't have 3 control planes in my initial setup and can only add extra control planes once I move it to the new cluster?

Hey @ashish1099, can you give me more details on the following questions? Which version of CAPH are you using? Ideally, you can scale your control planes even if you have 3 control planes from the start. For example, if you started your cluster with 3 control planes and 1 worker node and want to have 4 control planes and 4 worker nodes, you can scale your KubeadmControlPlane and MachineDeployment objects respectively.

once I move it to the new cluster?

Do you mean moving from the bootstrap cluster to the management cluster?

ashish1099 commented 6 months ago

I'm using cluster-api v1.6.3, hcloud-ccm 1.19.0 and CAPH v1.0.0-beta.33.

This is a simple setup starting with 3 control-plane nodes (1 node gets set up fine), and I can run kubectl get pods and install Cilium as well.

But the other two nodes never get picked up. All I see is this error:

 Warning  ControlPlaneUnhealthy  11m (x980 over 4h32m)  kubeadm-control-plane-controller  Waiting for control plane to pass preflight checks to continue reconciliation: Machine kcm-control-plane-mlrvl does not have a corresponding Node yet (Machine.status.nodeRef not set)                                                                                    

clusterctl describe cluster k8s -n caph-system
NAME                                                    READY  SEVERITY  REASON             SINCE  MESSAGE                                           
Cluster/kcm                                             False  Warning   ScalingUp          4h33m  Scaling up control plane to 3 replicas (actual 1)  
├─ClusterInfrastructure - HetznerCluster/kcm           True                                113m                                                      
└─ControlPlane - KubeadmControlPlane/kcm-control-plane  False  Warning   ScalingUp          4h33m  Scaling up control plane to 3 replicas (actual 1)  
  └─Machine/kcm-control-plane-mlrvl                     False  Warning   MachineHasFailure  32m    FailureReason: UpdateError 
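
(For anyone debugging the same symptom: "Machine ... does not have a corresponding Node yet" generally means the Node never registered with the API server or was never initialized by the CCM. A few generic checks, sketched with a placeholder kubeconfig path:)

# management cluster: which Machines are missing a NODENAME
kubectl get machines -A
kubectl describe machine kcm-control-plane-mlrvl -n caph-system

# workload cluster: did the node register, and are the CNI/CCM pods running?
kubectl get nodes --kubeconfig <workload-kubeconfig>
kubectl -n kube-system get pods --kubeconfig <workload-kubeconfig>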
janiskemper commented 6 months ago

Do you also install the CCM, in addition to the CNI?

kranurag7 commented 6 months ago

@ashish1099 Thank you for the input.

Can you share the logs of the hetzner-ccm? Also, it would be great if you could share how to reproduce the problem you're facing. What does your HCloudMachine object look like (for the ones that are not coming up, can you share those with status, please)?

ashish1099 commented 6 months ago

This is my HetznerBareMetalHost

---
# Source: capi-hetzner/templates/HetznerBareMetalHost.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerBareMetalHost
metadata:
  name: htzhel1-ax41d.enableit.dk
  labels:
    role: kcm-control-plane
spec:
  serverID: 123456
  maintenanceMode: false
  rootDeviceHints:
    raid:
      wwn:
        - "0x50000397cb700acd"
        - "0x500003981be00179"
  description: "Cluster kcm control plane node htzhel1-ax41d.enableit.dk"
---
# Source: capi-hetzner/templates/HetznerBareMetalHost.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerBareMetalHost
metadata:
  name: htzhel1-ax41e.enableit.dk
  labels:
    role: kcm-control-plane
spec:
  serverID: 123457
  maintenanceMode: false
  rootDeviceHints:
    raid:
      wwn:
        - "0x5000cca25ed86494"
        - "0x5000cca25ecf496b"
  description: "Cluster kcm control plane node htzhel1-ax41e.enableit.dk"
---
# Source: capi-hetzner/templates/HetznerBareMetalHost.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerBareMetalHost
metadata:
  name: htzhel1-ax41f.enableit.dk
  labels:
    role: kcm-control-plane
spec:
  serverID: 123458
  maintenanceMode: false
  rootDeviceHints:
    raid:
      wwn:
        - "0x50014ee2097c6ca4"
        - "0x50014ee2097c9304"
  description: "Cluster kcm control plane node htzhel1-ax41f.enableit.dk"

and here is one of the hosts with its status:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerBareMetalHost
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","kind":"HetznerBareMetalHost","metadata":{"annotations":{},"labels":{"argocd.argoproj.io/instance":"kcm-capi-hetzner","role":"kcm-control-plane"},"name":"htzhel1-ax41d.enableit.dk","namespace":"caph-system"},"spec":{"description":"Cluster kcm control plane node htzhel1-ax41d.enableit.dk","maintenanceMode":false,"rootDeviceHints":{"raid":{"wwn":["0x50000397cb700acd","0x500003981be00179"[]}},"serverID":1866472}}
  creationTimestamp: "2024-04-02T11:00:55Z"
  finalizers:
  - hetznerbaremetalhost.infrastructure.cluster.x-k8s.io
  generation: 1
  labels:
    argocd.argoproj.io/instance: kcm-capi-hetzner
    role: kcm-control-plane
  name: htzhel1-ax41d.enableit.dk
  namespace: caph-system
  resourceVersion: "497386198"
  uid: 471425b9-f5b6-4a1d-9bd8-1a06c872423a
spec:
  description: Cluster kcm control plane node htzhel1-ax41d.enableit.dk
  maintenanceMode: false
  rootDeviceHints:
    raid:
      wwn:
      - "0x50000397cb700acd"
      - "0x500003981be00179"
  serverID: 1866472
  status:
    errorCount: 0
    errorMessage: ""
    hetznerClusterRef: ""
    ipv4: ""
    ipv6: ""
    sshStatus: {}

ashish1099 commented 6 months ago

I have tried deleting the cluster multiple times; on a fresh new cluster setup it picks up any one of the 3 nodes, and that one gets installed fine.

and these are the logs:

Flag --allow-untagged-cloud has been deprecated, This flag is deprecated and will be removed in a future release. A cluster-id will be required on cloud instances.
I0401 23:50:15.733829       1 serving.go:348] Generated self-signed cert in-memory
W0401 23:50:15.734085       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0401 23:50:15.943727       1 metrics.go:69] Starting metrics server at :8233
I0401 23:50:16.693611       1 cloud.go:123] Hetzner Cloud k8s cloud controller v1.19.0 started
W0401 23:50:16.693787       1 main.go:75] detected a cluster without a ClusterID.  A ClusterID will be required in the future.  Please tag your cluster to avoid any future issues
I0401 23:50:16.693870       1 controllermanager.go:168] Version: v0.0.0-master+$Format:%H$
I0401 23:50:16.709732       1 secure_serving.go:213] Serving securely on :10258
I0401 23:50:16.710062       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0401 23:50:16.710209       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0401 23:50:16.710329       1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0401 23:50:16.710444       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0401 23:50:16.710457       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0401 23:50:16.710444       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0401 23:50:16.710469       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0401 23:50:16.732607       1 controllermanager.go:337] Started "cloud-node-controller"
I0401 23:50:16.732740       1 node_controller.go:165] Sending events to api server.
I0401 23:50:16.732781       1 controllermanager.go:337] Started "cloud-node-lifecycle-controller"
I0401 23:50:16.732836       1 node_controller.go:174] Waiting for informer caches to sync
I0401 23:50:16.733030       1 node_lifecycle_controller.go:113] Sending events to api server
I0401 23:50:16.733059       1 controllermanager.go:337] Started "service-lb-controller"
W0401 23:50:16.733080       1 core.go:111] --configure-cloud-routes is set, but cloud provider does not support routes. Will not configure cloud provider routes.
W0401 23:50:16.733086       1 controllermanager.go:325] Skipping "node-route-controller"
I0401 23:50:16.733228       1 controller.go:231] Starting service controller
I0401 23:50:16.733243       1 shared_informer.go:311] Waiting for caches to sync for service
I0401 23:50:16.810775       1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I0401 23:50:16.810810       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0401 23:50:16.810775       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0401 23:50:16.833322       1 shared_informer.go:318] Caches are synced for service
E0402 04:06:57.267505       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41a.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:06:57.435053       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41b.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:06:57.579798       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41c.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:06:57.756797       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41na.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:06:57.865007       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nb.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:06:58.034098       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nc.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:11:58.299382       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41c.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:11:58.427021       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41na.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:11:58.544025       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nb.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:11:58.674332       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nc.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:11:58.782996       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41a.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:11:58.894625       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41b.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:16:59.127089       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nc.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:16:59.279204       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41a.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:16:59.397971       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41b.enableit.dk": hcloud/getRobotServerByName: Due to maintenance the robot is currently not available. (MAINTENANCE)
E0402 04:16:59.521534       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41c.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:16:59.638731       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41na.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403
E0402 04:16:59.741794       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to get robot server "htzhel1-ax41nb.enableit.dk": hcloud/getRobotServerByName: server responded with status code 403

The nodes showing up in these logs are ones I don't want to be set up, so that part looks fine to me.

Adding another piece of information:

{"level":"ERROR","time":"2024-04-02T10:43:41.402Z","file":"controller/controller.go:329","message":"Reconciler error","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","HetznerBareMetalHost":{"name":"htzhel1-ax41d.enableit.dk","namespace":"caph-system"},"namespace":"caph-system","name":"htzhel1-ax41d.enableit.dk","reconcileID":"c140622e-7c0a-47b0-826c-cfc5c140a2c4","error":"failed to get HetznerCluster: HetznerCluster.infrastructure.cluster.x-k8s.io \"\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.5/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.5/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.5/pkg/internal/controller/controller.go:227"}

ashish1099 commented 6 months ago

Do you also install the CCM, additionally to the CNI?

Yes, the CCM is installed (version 1.19.0) and the CNI is on 1.14.4.

janiskemper commented 6 months ago

You get a 403 error there. Are you sure that there is one node that came up successfully?

And you are using the Hetzner CCM. Can you try our fork? I have never tried Hetzner's one until now, so I'm not sure whether it works. Maybe the 403 is related to that.

ashish1099 commented 6 months ago

You get a 403 error there. Are you sure that there is one node that came up successfully?

And you are using the Hetzner CCM. Can you try our fork? I have never tried Hetzner's one until now, so I'm not sure whether it works. Maybe the 403 is related to that.

Let me try the fork, I haven't tried that so far :)

apricote commented 6 months ago

With the hcloud-cloud-controller-manager you need to make sure that the correct secret is configured to access the Hetzner Cloud API. In the default configuration the secret created by cluster-api-provider-hetzner in the workload cluster is called hetzner. hcloud-ccm expects a secret called hcloud by default, but can be configured to use whatever secret you want.

I have my cluster running with CAPH & hcloud-ccm and can confirm that it works well for Hetzner Cloud-only scenarios. The logs indicate invalid credentials and issues while talking to the Hetzner Robot API. During the timeframe indicated by the logs, there was a scheduled maintenance of the Robot API, which would explain the issues: https://status.hetzner.com/incident/e162e98c-61b8-4a5d-8ea8-1ea3a84a6ee1
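
(A sketch of one way to bridge the naming gap described above, reusing the secret created at the top of this issue, which stores the token under the key hcloud. hcloud-ccm by default looks for a secret named hcloud, typically with key token in kube-system, or the chart can be pointed at an existing secret as noted above.)

# read the token from the management-cluster secret created earlier (name "hetzner", key "hcloud")
TOKEN=$(kubectl get secret hetzner -o jsonpath='{.data.hcloud}' | base64 -d)

# create the secret the hcloud CCM expects by default, in the workload cluster
kubectl --kubeconfig <workload-kubeconfig> -n kube-system create secret generic hcloud --from-literal=token=$TOKEN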

ashish1099 commented 6 months ago

The fork of the CCM worked for me.

One thing I noticed is that the first node got provisioned but the second node didn't, until I installed the CCM + CNI (Cilium 1.14.4) on the new cluster; then the second node got provisioned, and soon after that the 3rd node got provisioned too.

So I'm not sure if it was just a coincidence or a requirement; maybe a doc improvement?
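
(To watch the remaining control planes follow once the CNI/CCM are in place, the standard views from the management cluster are enough; just a sketch, with placeholder names:)

kubectl get machines -A -w
kubectl get kubeadmcontrolplane -A
clusterctl describe cluster <cluster-name> -n <namespace>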

janiskemper commented 6 months ago

Yes, it is a requirement (see above in this issue). @kranurag7 did you already include the CNI and CCM as a requirement in your docs PR?

kranurag7 commented 6 months ago

did you already include the CNI and CCM as a requirement in your docs PR?

I didn't as part of my PR, but it's there in the quickstart guide. Ref:

janiskemper commented 6 months ago

Then add a sentence that both a CNI and a CCM are required after the first control plane is ready, otherwise the cluster won't be usable.