hetznercloud / hcloud-cloud-controller-manager

Kubernetes cloud-controller-manager for Hetzner Cloud
Apache License 2.0

Hetzner Cloud Control manager not connecting with Hetzner #663

Open abishekas opened 3 months ago

abishekas commented 3 months ago

TL;DR

Hi Everyone,

We are planning to move our production environment to Hetzner Cloud. We provisioned a self-managed Kubernetes cluster on Hetzner servers for our project and used the Hetzner cloud controller manager to connect the cluster to Hetzner Cloud. We set it up by following the tutorial below.

https://community.hetzner.com/tutorials/install-kubernetes-cluster#:~:text=Now%20deploy%20the%20Hetzner%20Cloud%20controller%20manager%20into%20the%20cluster

Expected behavior

We deployed this in March 2024 and everything worked as expected until yesterday. Today, when we create a new server in the Hetzner Cloud console and add it to the same cluster, the hcloud provider ID and the region topology labels are not added to that node. We also use ingress-nginx with a load balancer in this setup: when we apply ingress-nginx, it normally connects to the load balancer in the cloud automatically, but since today that connection is not working either.

Observed behavior

We tried to diagnose this using the logs from the Hetzner cloud controller manager, but we couldn't see any errors in the logs. I'm sharing the log data below for reference. We also tried provisioning a new setup to see if that works, but we hit the same issue. We verified network connectivity from our server to Hetzner Cloud with API calls and ping requests, and it works fine. We even created a new setup in another region, but the issue still persists.

We have planned our production migration for this weekend, so any quick help would be greatly appreciated. Thanks.

Minimal working example

No response

Log output

I0612 17:13:22.834269       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:13:22.834282       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:13:52.838954       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:13:52.839004       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:13:52.839018       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:13:52.839030       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:13:56.277820       1 load_balancers.go:137] "ensure Load Balancer" op="hcloud/loadBalancers.EnsureLoadBalancer" service="ingress-nginx-controller" nodes=["jenkins-server","postgresql-testing","hcloud-owrker"]
I0612 17:13:56.277968       1 event.go:307] "Event occurred" object="ingress-nginx/ingress-nginx-controller" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0612 17:13:56.777461       1 load_balancer.go:820] "update service" op="hcops/LoadBalancerOps.ReconcileHCLBServices" port=80 loadBalancerID=1798225
I0612 17:13:57.567504       1 load_balancer.go:820] "update service" op="hcops/LoadBalancerOps.ReconcileHCLBServices" port=443 loadBalancerID=1798225
E0612 17:13:58.576626       1 controller.go:298] error processing service ingress-nginx/ingress-nginx-controller (retrying with exponential backoff): failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): 
I0612 17:13:58.576714       1 event.go:307] "Event occurred" object="ingress-nginx/ingress-nginx-controller" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): "
I0612 17:14:22.848432       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:14:22.848502       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:14:22.848524       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:14:22.848543       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:14:52.888746       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:14:52.888807       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:14:52.888831       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:14:52.888856       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:15:22.824760       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:15:22.824815       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:15:22.824830       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:15:22.824843       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:15:52.824370       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:15:52.824415       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:15:52.824433       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:15:52.824448       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:16:22.965547       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:16:22.965596       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:16:22.965618       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:16:22.965637       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:16:52.946044       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:16:52.946077       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:16:52.946092       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:16:52.946105       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:17:22.863320       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:17:22.863365       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:17:22.863382       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:17:22.863402       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"

Additional information

No response

apricote commented 3 months ago

Hey @abishekas,

first off, there was an incident in our API yesterday around 17:00-17:40 CEST which might have been the cause for this. Can you try again today?

If it still does not work:

vigneshb118 commented 3 months ago

Hey @apricote ,

@abishekas and I are part of the same team. Here are the logs you requested: hccm-logs.txt

Screenshot:

Screenshot 2024-06-13 at 1 33 39 PM

P.S.: Below are the errors I got yesterday when I tried to install the Kubernetes packages on the new server inside the old project, which is where I hit the actual issue:

curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
curl: (22) The requested URL returned error: 403
gpg: no valid OpenPGP data found.

Because of this error on the older machine, I created a new project and tried again there by adding a new server.

apricote commented 3 months ago

The node does not have the uninitialized taint that HCCM expects. Are you sure you started the kubelet on that node with --cloud-provider=external? HCCM will only "adopt" the node if that taint is set.

You can try to re-add the taint with kubectl taint node master node.cloudprovider.kubernetes.io/uninitialized:NoSchedule
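To see which nodes HCCM has not adopted yet, it can help to check each node's spec.providerID against the prefixes named in the error message from the log above (hcloud://, hrobot://, hcloud://bm-). The following is only a minimal sketch: the function names provider_id_status and report_nodes are made up for illustration and are not part of HCCM or kubectl.

```shell
#!/bin/sh
# Hypothetical helpers (not part of HCCM or kubectl).
# Classify a providerID against the prefixes listed in the HCCM error:
# hcloud://, hrobot://, hcloud://bm-
provider_id_status() {
  case "$1" in
    "")                                  echo "missing" ;;
    hcloud://bm-*|hcloud://*|hrobot://*) echo "ok" ;;
    *)                                   echo "unexpected" ;;
  esac
}

# Read "NAME PROVIDER_ID" pairs from stdin, one node per line.
# A node without a providerID is a line containing only its name.
report_nodes() {
  while read -r name provider_id; do
    [ -n "$name" ] || continue
    printf '%s\t%s\n' "$name" "$(provider_id_status "$provider_id")"
  done
}
```

Assuming kubectl access to the cluster, the input could be produced like this:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.providerID}{"\n"}{end}' | report_nodes

Nodes reported as "missing" are the ones to check for the kubelet flag and the taint, for example with kubectl get node NODE_NAME -o jsonpath='{.spec.taints[*].key}'.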


curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
curl: (22) The requested URL returned error: 403
gpg: no valid OpenPGP data found.

This sounds like your IP is blocked by pkgs.k8s.io. Unfortunately this happens from time to time, and you will need to try with another IP. We recommend mirroring all assets you need for your production infrastructure to local services; you cannot rely on pkgs.k8s.io being available at all times. See this thread for previous discussions of this topic: https://github.com/kubernetes/registry.k8s.io/issues/138

vigneshb118 commented 3 months ago

Hi @apricote, thanks for your valuable response. We always bring up all nodes with the --cloud-provider=external flag in the kubelet configuration, and the taint is already present on my master machines. I am attaching a screenshot for your reference.

Screenshot 2024-06-13 at 10 48 33 PM

Today around 13:30 UTC we saw maintenance work on the Cloud API and Cloud Console on the Hetzner side, and after that our clusters were able to connect to Hetzner Cloud again. I am attaching a screenshot of that maintenance window for your reference.

Screenshot 2024-06-13 at 10 48 41 PM

The issue was resolved right after the maintenance window. We are not sure whether anything changed on your end. We want this setup to be future-proof: as a precaution, do you have any suggestions for handling this if such an issue happens again?

apricote commented 3 months ago

Good to hear that everything works now.

I am not really sure what the issue was, so I do not have any suggestions on what you can improve for the future.

If you ever encounter issues again, you can try to run HCCM with env variable HCLOUD_DEBUG=true and the flag -v=5 to get way more logs.
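As a rough sketch of how those two settings could be applied, the helper below only prints the kubectl commands so they can be reviewed before anything is changed. It makes several assumptions that may not match your installation: the function name hccm_debug_commands is hypothetical, the Deployment is assumed to be named hcloud-cloud-controller-manager in the kube-system namespace, and the first container is assumed to define its flags in a command list. If your manifest uses args instead, the patch path needs to be adjusted.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) kubectl commands that would
# enable verbose HCCM logging. The namespace and deployment name are assumed
# defaults; pass your own values as $1 and $2.
hccm_debug_commands() {
  ns="${1:-kube-system}"
  deploy="${2:-hcloud-cloud-controller-manager}"
  # Set the HCLOUD_DEBUG env variable on the deployment's containers.
  echo "kubectl -n $ns set env deployment/$deploy HCLOUD_DEBUG=true"
  # Append -v=5 to the first container's command list (assumes "command" is used).
  echo "kubectl -n $ns patch deployment $deploy --type=json -p '[{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/command/-\",\"value\":\"-v=5\"}]'"
  # Follow the logs of the restarted pod.
  echo "kubectl -n $ns logs deployment/$deploy --follow"
}
```

Calling hccm_debug_commands with no arguments prints the three commands for the assumed defaults; hccm_debug_commands my-namespace my-deployment prints them for a differently named installation.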

github-actions[bot] commented 3 weeks ago

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.