hetznercloud / hcloud-cloud-controller-manager

Kubernetes cloud-controller-manager for Hetzner Cloud
Apache License 2.0

Node Addresses won't get updated when using Wireguard for Cluster Creation #635

Open derfabianpeter opened 7 months ago

derfabianpeter commented 7 months ago

TL;DR

We're building a hybrid cluster from Cloud Servers and external Bare-Metal machines. For this to work properly we're using a Wireguard Network between all nodes and use the Wireguard IPs as the nodes' InternalIP and for connecting between them. We're not using HCLOUD internal Networks at all.

Expected behavior

HCCM is able to deduce the machine ID etc. from a node: if not from InternalIP, then at least from ExternalIP, which we correctly set to the public IP of the node.

Observed behavior

HCCM fails to get the machine ID from the HCLOUD API since it only uses InternalIP as the source for identification, which in our case is a Wireguard IP from the 172.16.187.0/24 range. As a result, nodes are not properly initialized and Load Balancers cannot be provisioned because the backend node information is missing.

Minimal working example

This is the Deployment.yml we use to install HCCM into our k3s Cluster with Cilium CNI:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hcloud-cloud-controller-manager
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hcloud-cloud-controller-manager
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: hcloud-cloud-controller-manager
    spec:
      containers:
        - name: hcloud-cloud-controller-manager
          image: hetznercloud/hcloud-cloud-controller-manager:v1.19.0
          command:
            - /bin/hcloud-cloud-controller-manager
            - '--allow-untagged-cloud'
            - '--cloud-provider=hcloud'
            - '--route-reconciliation-period=30s'
            - '--webhook-secure-port=0'
            - '--leader-elect=false'
          ports:
            - name: metrics
              containerPort: 8233
              protocol: TCP
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: HCLOUD_LOAD_BALANCERS_LOCATION
              value: fsn1
            - name: HCLOUD_LOAD_BALANCERS_USE_PRIVATE_IP
              value: 'False'
            - name: HCLOUD_LOAD_BALANCERS_ENABLED
              value: 'True'
            - name: HCLOUD_LOAD_BALANCERS_DISABLE_PRIVATE_INGRESS
              value: 'True'
            - name: HCLOUD_LOAD_BALANCERS_DISABLE_IPV6
              value: 'True'
            - name: HCLOUD_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hcloud
                  key: token
          resources:
            requests:
              cpu: 100m
              memory: 50Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: Default
      serviceAccountName: cloud-controller-manager
      serviceAccount: cloud-controller-manager
      hostNetwork: true
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
        - key: node.cloudprovider.kubernetes.io/uninitialized
          value: 'true'
          effect: NoSchedule
        - key: CriticalAddonsOnly
          operator: Exists
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        - key: node.kubernetes.io/not-ready
          effect: NoSchedule
        - key: node.kubernetes.io/not-ready
          effect: NoExecute
        - key: node.kubernetes.io/unreachable
          effect: NoSchedule
      priorityClassName: system-cluster-critical
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 2
  progressDeadlineSeconds: 600

Log output

I0428 15:37:13.679403       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0428 15:37:13.679800       1 secure_serving.go:213] Serving securely on [::]:10258
I0428 15:37:13.679836       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0428 15:37:13.715527       1 controllermanager.go:337] Started "cloud-node-controller"
I0428 15:37:13.715653       1 controllermanager.go:337] Started "cloud-node-lifecycle-controller"
I0428 15:37:13.715681       1 node_controller.go:165] Sending events to api server.
I0428 15:37:13.715701       1 node_lifecycle_controller.go:113] Sending events to api server
I0428 15:37:13.715765       1 node_controller.go:174] Waiting for informer caches to sync
I0428 15:37:13.715892       1 controllermanager.go:337] Started "service-lb-controller"
W0428 15:37:13.715905       1 core.go:111] --configure-cloud-routes is set, but cloud provider does not support routes. Will not configure cloud provider routes.
W0428 15:37:13.715910       1 controllermanager.go:325] Skipping "node-route-controller"
I0428 15:37:13.716039       1 controller.go:231] Starting service controller
I0428 15:37:13.716066       1 shared_informer.go:311] Waiting for caches to sync for service
I0428 15:37:13.780441       1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I0428 15:37:13.780484       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0428 15:37:13.780573       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0428 15:37:13.816152       1 shared_informer.go:318] Caches are synced for service
E0428 15:37:13.816182       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to convert provider id to server id: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): k3s://laser-1-controlplane-1
E0428 15:37:13.816206       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to convert provider id to server id: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): k3s://laser-1-controlplane-2
E0428 15:37:13.816225       1 node_controller.go:281] Error getting instance metadata for node addresses: hcloud/instancesv2.InstanceMetadata: failed to convert provider id to server id: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): k3s://laser-1-controlplane-3
I0428 15:37:13.816327       1 load_balancers.go:137] "ensure Load Balancer" op="hcloud/loadBalancers.EnsureLoadBalancer" service="ayedo-test" nodes=["laser-1-controlplane-2","laser-1-controlplane-3","laser-1-worker-1","laser-1-worker-2","laser-1-worker-3","laser-1-worker-4","laser-1-controlplane-1"]
I0428 15:37:13.816448       1 event.go:307] "Event occurred" object="ayedo-test/ayedo-test" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0428 15:37:14.103770       1 node_controller.go:394] Failed to update node addresses for node "laser-1-worker-1": failed to get node address from cloud provider that matches ip: 172.16.187.21
I0428 15:37:14.202850       1 load_balancer.go:820] "update service" op="hcops/LoadBalancerOps.ReconcileHCLBServices" port=6443 loadBalancerID=1819076
E0428 15:37:14.248593       1 node_controller.go:394] Failed to update node addresses for node "laser-1-worker-2": failed to get node address from cloud provider that matches ip: 172.16.187.22
E0428 15:37:14.518378       1 node_controller.go:394] Failed to update node addresses for node "laser-1-worker-3": failed to get node address from cloud provider that matches ip: 172.16.187.23
E0428 15:37:14.824094       1 node_controller.go:394] Failed to update node addresses for node "laser-1-worker-4": failed to get node address from cloud provider that matches ip: 172.16.187.24
E0428 15:37:14.973532       1 controller.go:298] error processing service ayedo-test/ayedo-test (retrying with exponential backoff): failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): k3s://laser-1-controlplane-2
I0428 15:37:14.973662       1 event.go:307] "Event occurred" object="ayedo-test/ayedo-test" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): k3s://laser-1-controlplane-2"
I0428 15:37:19.974750       1 load_balancers.go:137] "ensure Load Balancer" op="hcloud/loadBalancers.EnsureLoadBalancer" service="ayedo-test" nodes=["laser-1-worker-4","laser-1-controlplane-1","laser-1-controlplane-2","laser-1-controlplane-3","laser-1-worker-1","laser-1-worker-2","laser-1-worker-3"]
I0428 15:37:19.974853       1 event.go:307] "Event occurred" object="ayedo-test/ayedo-test" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0428 15:37:20.106580       1 load_balancer.go:820] "update service" op="hcops/LoadBalancerOps.ReconcileHCLBServices" port=6443 loadBalancerID=1819076
E0428 15:37:20.913237       1 controller.go:298] error processing service ayedo-test/ayedo-test (retrying with exponential backoff): failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): 
I0428 15:37:20.913325       1 event.go:307] "Event occurred" object="ayedo-test/ayedo-test" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): "
I0428 15:37:30.913891       1 load_balancers.go:137] "ensure Load Balancer" op="hcloud/loadBalancers.EnsureLoadBalancer" service="ayedo-test" nodes=["laser-1-controlplane-1","laser-1-controlplane-2","laser-1-controlplane-3","laser-1-worker-1","laser-1-worker-2","laser-1-worker-3","laser-1-worker-4"]
I0428 15:37:30.914034       1 event.go:307] "Event occurred" object="ayedo-test/ayedo-test" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0428 15:37:31.060101       1 load_balancer.go:820] "update service" op="hcops/LoadBalancerOps.ReconcileHCLBServices" port=6443 loadBalancerID=1819076
E0428 15:37:31.697080       1 controller.go:298] error processing service ayedo-test/ayedo-test (retrying with exponential backoff): failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): k3s://laser-1-controlplane-1
I0428 15:37:31.697173       1 event.go:307] "Event occurred" object="ayedo-test/ayedo-test" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): k3s://laser-1-controlplane-1"

vitobotta commented 7 months ago

Hi @derfabianpeter, have you found a workaround in the meantime? I am having the same problem right now (using Tailscale for the private network).

derfabianpeter commented 7 months ago

Hi @vitobotta - not yet, sorry

vitobotta commented 7 months ago

> Hi @vitobotta - not yet, sorry

Thanks for letting me know :) If you do find a solution please update this thread. I will do the same :)

apricote commented 7 months ago

Hey you two, sorry for the late response.


> HCCM is able to deduce the machine ID etc. from a node: if not from InternalIP, then at least from ExternalIP, which we correctly set to the public IP of the node.

HCCM actually uses the name of the Kubernetes Node object to find a Server in the Hetzner Cloud API that has the same name. This is not well explained for the "Cloud" part of HCCM, but we added a section on this for Robot.

While HCCM tries to "initialize" the node (and remove the uninitialized taint) it also compares the Node Addresses it gets from the Hetzner Cloud API to the IPs that are already specified on the Node object, and fails the initialization if there are conflicts. This is the error you are seeing.
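
To make the failing check concrete, here is a minimal Python sketch of the comparison described above. It is not code from HCCM or k/cloud-provider: `NodeAddress`, `match_node_ip` and the 203.0.113.10 address are made up for illustration, and only the error text is taken from the log output in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeAddress:
    type: str     # e.g. "InternalIP", "ExternalIP", "Hostname"
    address: str


def match_node_ip(node_ip: str, cloud_addresses: list[NodeAddress]) -> NodeAddress:
    """Return the cloud-reported address equal to the IP already set on the Node."""
    for addr in cloud_addresses:
        if addr.address == node_ip:
            return addr
    # No cloud-reported address matches the IP on the Node object: initialization fails.
    raise LookupError(
        f"failed to get node address from cloud provider that matches ip: {node_ip}"
    )


# Addresses as the cloud API might report them for a server without a private network.
# 203.0.113.10 is a documentation address, not a real server.
reported = [
    NodeAddress("Hostname", "laser-1-worker-1"),
    NodeAddress("ExternalIP", "203.0.113.10"),
]
```

With a Wireguard IP such as 172.16.187.21 on the Node and only the public address known to the API, the lookup has nothing to match, which is the failure logged for the worker nodes.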


Running supported nodes (Hetzner Cloud & Robot) together with "unsupported" nodes goes against the design of the kubernetes/cloud-provider library we use to interact with Kubernetes. I have spent some more time explaining this in https://github.com/hetznercloud/hcloud-cloud-controller-manager/issues/530#issuecomment-2060425409 if you are interested. To properly support mixed/hybrid clusters there need to be large changes to k/cloud-provider to allow for this.

The Node IP is another one of these bits, where k/cloud-provider assumes that it knows everything it should know and that the cloud provider can dictate the IP addresses. After all, on AWS you probably use the VPC & ENI to connect your nodes and don't need any external tools for this.


That said, maybe we can figure out how to make this usable for you. What parts of the functionality of hcloud-cloud-controller-manager do you want to use? There is a list of the general features in the README.

Just from the log @derfabianpeter posted, it might be the Load Balancer part.

The Load Balancer (service) controller depends on the Node controller (which currently fails) to set the Node.spec.providerID field.
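
For illustration, the providerID format check behind the "providerID does not have one of the the expected prefixes" error could look roughly like this in Python. The prefixes come from the log output above; the function name and return shape are assumptions, and the real check lives in HCCM's Go code.

```python
# Prefixes named in the HCCM error message. The longer "hcloud://bm-" is listed
# first because it also starts with "hcloud://".
KNOWN_PREFIXES = ("hcloud://bm-", "hrobot://", "hcloud://")


def parse_provider_id(provider_id: str) -> tuple[str, int]:
    """Split a providerID into (prefix, numeric server ID). Hypothetical helper."""
    for prefix in KNOWN_PREFIXES:
        if provider_id.startswith(prefix):
            raw_id = provider_id[len(prefix):]
            if not raw_id.isdecimal():
                raise ValueError(f"server ID is not numeric: {provider_id!r}")
            return prefix, int(raw_id)
    raise ValueError(
        f"providerID does not have one of the expected prefixes: {provider_id!r}"
    )
```

A value like k3s://laser-1-controlplane-1, which k3s sets on its own, falls through to the final error, so the Load Balancer controller cannot map that node to a server.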

Instead of using the Node controller to fill this field, you can also set it yourself when you start the kubelet by setting the flag --provider-id=hcloud://$SERVER_ID. You can even get the Server ID from the Metadata Service or through cloud-init directly on the server.
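
A minimal Python sketch of that lookup, assuming the Hetzner Cloud metadata service answers at http://169.254.169.254/hetzner/v1/metadata/instance-id from inside the server. The helper names are hypothetical, and the request only works when run on a Hetzner Cloud server.

```python
import urllib.request

# Assumed metadata endpoint (link-local, only reachable from the server itself).
METADATA_URL = "http://169.254.169.254/hetzner/v1/metadata/instance-id"


def fetch_server_id(url: str = METADATA_URL, timeout: float = 2.0) -> str:
    """Read the numeric server ID from the metadata service."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode().strip()


def kubelet_provider_id_flag(server_id: str) -> str:
    """Build the kubelet flag from a server ID, rejecting anything non-numeric."""
    if not server_id.isdecimal():
        raise ValueError(f"unexpected server ID: {server_id!r}")
    return f"--provider-id=hcloud://{server_id}"
```

On the server, kubelet_provider_id_flag(fetch_server_id()) yields the flag to pass to the kubelet. With k3s, kubelet flags are normally handed over through --kubelet-arg, so the value would be given as --kubelet-arg=provider-id=hcloud://$SERVER_ID.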

This would make it possible to disable the Node controller in your installs, but you will also miss out on the labels it sets. The Node controller is also responsible for removing the uninitialized taint, so you would need to remove the --cloud-provider=external flag from the Kubelet, which adds this taint.

If you depend on any other feature, I am happy to discuss your requirements.


Hope this gave you an initial insight into the way things are and what the options are to work around this.

vitobotta commented 6 months ago

Hi @apricote, thanks for the update :) In my case I only use cloud instances (no dedicated servers etc.), and I am mostly interested in the ability to provision load balancers. The configurations I am currently testing are larger clusters of over 100 nodes, so I cannot use the Hetzner private networks. Instead I am using Cilium with wireguard encryption to use the public internet for the communication between the nodes. So yeah, since the Hetzner networks are excluded from this kind of configuration, all I need the CCM for is basically just the load balancer provisioning. Did I understand correctly that I can just set the node provider ID directly without even installing the CCM? Or do I still need to install it? Thanks!

apricote commented 6 months ago

> Instead I am using Cilium with wireguard encryption to use the public internet for the communication between the nodes

I would like to check out how this interacts with the Node Addresses & HCCM. Is this your current Cilium configuration? If not, could you paste the values you use? https://github.com/vitobotta/hetzner-k3s/blob/d824c126f45071f72ff2686b59fd8ccc5825c5a2/src/kubernetes/software/cilium.cr

> Did I understand correctly that I can just set the node provider ID directly without even installing the CCM? Or do I still need to install it?

You can set the node provider id directly, but the Load Balancers are still created and managed by hcloud-cloud-controller-manager.

If I understand k/cloud-provider correctly, you can disable the Node controller by passing --controllers=-cloud-node-controller (note the - in front of the controller name) to HCCM. But I have never tested this configuration and we do not officially support it.

derfabianpeter commented 6 months ago

@apricote thanks for dealing with this so quickly and for the detailed explanations.

With regards to the features I'm interested in: only the provisioning of Loadbalancers backed by Cloud Nodes. I wanted to hook up a few bare metal machines that we operate in a dedicated datacenter with the cluster to make their compute available to the services we want to run in that cluster. But I found a way to make that happen without mixing Hetzner Cloud and external nodes while having the cluster backed by HCCM.

Thanks again for your thoughtful explanations. With that said, I guess this is more a Feature Request and not a bug.

vitobotta commented 6 months ago

Hi @apricote,

> Instead I am using Cilium with wireguard encryption to use the public internet for the communication between the nodes

> I would like to check out how this interacts with the Node Addresses & HCCM. Is this your current Cilium configuration? If not, could you paste the values you use? https://github.com/vitobotta/hetzner-k3s/blob/d824c126f45071f72ff2686b59fd8ccc5825c5a2/src/kubernetes/software/cilium.cr

Yep, that one. The chart version is currently v1.15.4 and the encryption is enabled. No other settings apart from what you see in that code :)

> Did I understand correctly that I can just set the node provider ID directly without even installing the CCM? Or do I still need to install it?

> You can set the node provider id directly, but the Load Balancers are still created and managed by hcloud-cloud-controller-manager.

> If I understand k/cloud-provider correctly, you can disable the Node controller by passing --controllers=-cloud-node-controller (note the - in front of the controller name) to HCCM. But I have never tested this configuration and we do not officially support it.

I see, thanks

elohmeier commented 6 months ago

I had the same problem using Hetzner Cloud VMs and Hetzner Robot servers connected via Wireguard. Since I'm using Consul service discovery, I can use DNS to look up the nodes' VPN IPs. This should also work fine with e.g. Tailscale Magic DNS. See this commit for my solution. Maybe this approach makes sense for others as well and could be turned into a more generic solution, e.g. a pair of flags such as --use-dns-for-internal-ip and --internal-ip-dns-suffix.
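
The idea can be sketched in a few lines of Python. This is not the code from the linked commit: the function name, the node.consul suffix used in the example and the injectable resolver are assumptions for illustration only.

```python
import socket
from typing import Callable


def resolve_internal_ip(
    node_name: str,
    dns_suffix: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Look up the VPN address of a node as <node_name>.<dns_suffix>."""
    # Tolerate a leading or trailing dot in the configured suffix.
    fqdn = f"{node_name}.{dns_suffix.strip('.')}"
    return resolver(fqdn)
```

For example, resolve_internal_ip("laser-1-worker-1", "node.consul") would ask the system resolver for laser-1-worker-1.node.consul and use the answer as the node's InternalIP. Passing a different resolver keeps the lookup testable without a running DNS server.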

jorikseldeslachts commented 4 months ago

Does this mean that HCCM cannot work for nodes that only have a private IPv4 address and no public one? All my nodes are on a private network behind a NAT gateway and do not have public addresses.

github-actions[bot] commented 1 month ago

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.