hetznercloud / hcloud-cloud-controller-manager

Kubernetes cloud-controller-manager for Hetzner Cloud
Apache License 2.0

Support for bare-metal workers #113

Closed ByteAlex closed 2 years ago

ByteAlex commented 3 years ago

Hello,

is it possible to add servers from the Hetzner Robot to the cluster created with the CCM?

I've been using a K3s cluster which I bootstrapped manually, and when I tried to install the hcloud CCM the hcloud:// provider was not working for any of the servers, whether they were Cloud or Robot servers.

Now I've bootstrapped a cluster using kubeadm and followed the instructions, and the hcloud:// provider seems to be working. However, I still have my bare-metal servers, and before I let them join the cluster and possibly break the CCM, I'd rather ask for clarification first.

My expectations would be:

Thank you!

malikkirchner commented 3 years ago

The bare metal support would be highly appreciated. A label, that causes CCM to ignore bare metal nodes, would be fine as an intermediate step. This would make CCM still functional and useful in the meantime.
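For illustration, if such a label existed, marking a bare metal node would be a one-liner. The label key below is the one the forks discussed later in this thread ended up using; the node name is a placeholder:

$ kubectl label node my-root-server-1 instance.hetzner.cloud/is-root-server=true
# a CCM that honours this label would then skip my-root-server-1 entirely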

LKaemmerling commented 3 years ago

Additional (already closed) issues: https://github.com/hetznercloud/hcloud-cloud-controller-manager/issues/9


There are a few problems with adding dedicated servers as real "nodes" to the k8s cluster.

  1. Dedicated servers have a completely different API; a Hetzner Cloud token does not allow getting the data about a root server.
  2. Based on the spec (https://kubernetes.io/docs/concepts/architecture/cloud-controller/#node-controller), k8s deactivates all nodes that are not known to the cloud provider.

We will look into how we can improve this, but I cannot promise anything.
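For context, the node controller identifies nodes by the providerID this CCM writes into the node spec (hcloud://<server-id>); dedicated servers never get one. A quick way to see what your nodes currently carry (node names below are placeholders):

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER-ID:.spec.providerID
NAME      PROVIDER-ID
cloud-1   hcloud://1234567
robot-1   <none>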

ctodea commented 3 years ago

Any update on this?

batistein commented 3 years ago

@ctodea you can have a look at this: https://github.com/cluster-api-provider-hcloud/hcloud-cloud-controller-manager

malikkirchner commented 3 years ago

@ctodea we managed to get a cluster working, where most nodes, including the master, are cloud servers. And some nodes are root servers, e.g. for databases. Basically the root servers should be mostly ignored by the CCM and CSI plugin. Maybe this helps:

You need to connect the root servers via vSwitch, though.
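For anyone trying this: on the Robot side the vSwitch shows up as a tagged VLAN on the dedicated server's uplink. A minimal sketch, assuming VLAN ID 4000, an uplink called enp0s31f6 and a placeholder private address of 10.240.1.2/24 (Hetzner recommends an MTU of 1400 for vSwitch traffic):

# create the VLAN interface for the vSwitch and assign the private address
$ ip link add link enp0s31f6 name enp0s31f6.4000 type vlan id 4000
$ ip link set enp0s31f6.4000 mtu 1400 up
$ ip addr add 10.240.1.2/24 dev enp0s31f6.4000
# routes towards the cloud subnets then go via the vSwitch gateway (see Hetzner's vSwitch docs)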

Maybe #172 results in a mainline solution ...

ctodea commented 3 years ago

Many thanks for the update @malikkirchner @batistein. Will give it a try, but unfortunately I guess it won't be any time soon.

identw commented 3 years ago

@ctodea we managed to get a cluster working, where most nodes, including the master, are cloud servers. And some nodes are root servers, e.g. for databases. Basically the root servers should be mostly ignored by the CCM and CSI plugin. Maybe this helps:

Hi @malikkirchner, I can see from the code that you are skipping route creation for root servers because the API doesn't allow it (https://github.com/xelonic/hcloud-cloud-controller-manager/blob/root-server-support/hcloud/routes.go#L104). But I don't understand how pod-to-pod communication between cloud and dedicated nodes works for you. For example:

10.240.0.2 - cloud node, 10.244.0.0/24 pod network on the cloud node
10.240.1.2 - dedicated node, 10.244.1.0/24 pod network on the dedicated node

But you can't create the route 10.244.1.0/24 via 10.240.1.2 in the API. So how does communication between pods in the 10.244.0.0/24 and 10.244.1.0/24 networks work?
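For reference, the route the CCM would normally create for that pod CIDR corresponds to an hcloud CLI call roughly like the one below (the network name is a placeholder, addresses taken from the example above); per the linked code this is exactly what gets skipped for root servers, because the API doesn't allow it for them:

$ hcloud network add-route my-network --destination 10.244.1.0/24 --gateway 10.240.1.2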

malikkirchner commented 3 years ago

Hi @identw,

That is an excellent point; I do not know and was wondering myself. According to https://github.com/hetznercloud/hcloud-cloud-controller-manager/issues/133#issuecomment-739257865 this should never have worked. We are using kubeadm to set up the cluster and Cilium as the CNI plugin. I am happy to share the exact config if you are interested.

I have two guesses how this can 'work': either the vSwitch does some routing that I do not understand, or Cilium somehow manages to route to the root server. Leakage over the public interface is ruled out by the root server's Hetzner firewall.

Though it is possible that this is a bug that will be fixed and stop working, like #133. If so, I was wondering whether it would make sense to use a WireGuard peer-to-peer layer between all nodes, kind of as a unified substrate for Cilium.

Any clarification on this topic is highly appreciated.

identw commented 3 years ago

@malikkirchner

That is an excellent point; I do not know and was wondering myself

Cilium uses an overlay network between nodes (VXLAN or Geneve) by default; maybe you haven't disabled it? Check your Cilium ConfigMap, for example:

$ kubectl -n kube-system get cm cilium-config -o yaml | grep "tunnel"
  tunnel: vxlan

This configuration will work either way, even without Hetzner Cloud networks and vSwitch.
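For contrast, a native-routing Cilium setup (the mode that would actually need those cloud routes) would show something roughly like the following in the ConfigMap; the key names are an assumption based on the Cilium 1.9-era config used in this thread and differ between versions:

$ kubectl -n kube-system get cm cilium-config -o yaml | grep -E "tunnel|native-routing"
  tunnel: disabled
  native-routing-cidr: 10.244.0.0/16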

I was wondering whether it would make sense to use a WireGuard peer-to-peer layer between all nodes, kind of as a unified substrate for Cilium

For Cilium this is not necessary, since it already knows how to build tunnels between nodes and does so by default. If encryption is required, Cilium supports IPsec (https://docs.cilium.io/en/v1.9/gettingstarted/encryption/).

Also, I recommend paying attention to latency when connecting a vSwitch to the cloud network:

ping from cloud node to dedicated node via public ip:

$ ping 135.181.96.131
PING 135.181.96.131 (135.181.96.131) 56(84) bytes of data.
64 bytes from 135.181.96.131: icmp_seq=1 ttl=59 time=0.442 ms
64 bytes from 135.181.96.131: icmp_seq=2 ttl=59 time=0.372 ms
64 bytes from 135.181.96.131: icmp_seq=3 ttl=59 time=0.460 ms
64 bytes from 135.181.96.131: icmp_seq=4 ttl=59 time=0.539 ms

ping from cloud node to same dedicated node via vswitch:

$ ping 10.240.1.2
PING 10.240.1.2 (10.240.1.2) 56(84) bytes of data.
64 bytes from 10.240.1.2: icmp_seq=1 ttl=63 time=47.4 ms
64 bytes from 10.240.1.2: icmp_seq=2 ttl=63 time=47.0 ms
64 bytes from 10.240.1.2: icmp_seq=3 ttl=63 time=46.9 ms
64 bytes from 10.240.1.2: icmp_seq=4 ttl=63 time=46.9 ms

~0.5ms via public network vs ~46.5ms via private network =(.

malikkirchner commented 3 years ago

@identw thank you for the hint, you are right: our Cilium uses VXLAN as the tunnel. That explains why it works. We deploy Istio on top of Cilium, so I guess there is no real need for Cilium encryption for us at the moment. As I understand it, enabling Cilium encryption also conflicts with some Istio features.

The ping from a cloud server to the dedicated server via vSwitch is not that bad for us:

# ping starfleet-janeway 
PING starfleet-janeway (10.0.1.2) 56(84) bytes of data.
64 bytes from starfleet-janeway (10.0.1.2): icmp_seq=1 ttl=63 time=3.70 ms
64 bytes from starfleet-janeway (10.0.1.2): icmp_seq=2 ttl=63 time=3.57 ms

Our cloud nodes are hosted in nbg1-dc3 and the dedicated server lives in fsn1-dc15. I guess that would be even better if we moved the cloud nodes to Falkenstein.

FYI, we encountered a problem with Cilium and systemd on Debian bullseye; buster is fine: https://github.com/cilium/cilium/issues/14658.
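We did not dig into the root cause ourselves; if you want to compare an affected bullseye node with a working buster one, the quickest things to look at are the systemd version and the reverse-path-filter sysctls (rp_filter being the culprit is only a guess on our side, see the linked issue for the real diagnosis):

$ systemctl --version | head -1
$ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.default.rp_filter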

identw commented 3 years ago

@malikkirchner

As I understand it, enabling Cilium encryption also conflicts with some Istio features.

I mentioned encryption because you wrote about WireGuard. Encryption is optional.

The ping from a cloud server to the dedicated server via vSwitch is not that bad for us:

Not so bad. I tested in the hel1 location (dedicated node from hel1-dc4, cloud node from hel1-dc2).

FYI, we encountered a problem with Cilium and systemd on Debian bullseye; buster is fine: cilium/cilium#14658.

Thank you, interesting. I actually also use Cilium without kube-proxy, but I have not seen this bug.

github-actions[bot] commented 3 years ago

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

Bessonov commented 3 years ago

further action occurs

Donatas-L commented 3 years ago

I saw that someone made a repo (https://github.com/identw/hetzner-cloud-controller-manager) to solve this, has anyone tried it?

randrusiak commented 3 years ago

Any updates here? @LKaemmerling are you going to implement support for root servers soon?

hendrikkiedrowski commented 3 years ago

@Donatas-L I tried it. It works great, with a few caveats. It would need a bit of attention from the community to keep pace with the development by the Hetzner team. @LKaemmerling you may also want to have a look here. Maybe you can take up this idea ;)

github-actions[bot] commented 2 years ago

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

acjohnson commented 2 years ago

I am also interested in using bare-metal workers via vSwitch and have it working with the Calico CNI. Any chance this could be mainlined in the hcloud-cloud-controller-manager?

wethinkagile commented 2 years ago

If we want to push the European cloud, we need to push the awesome Hetzner to grow beyond itself. This way many open-source cloud projects and startups with GDPR/DSGVO-compliant ISMSs will be able to get founded in Europe. tl;dr yes, I'm interested too.

acjohnson commented 2 years ago

I went ahead and rebased the work that @malikkirchner did against master from this repo and built a new image with a few fixes that seemed to be required to use Hetzner Robot servers via vSwitch/Cloud Networks.

src: https://github.com/acjohnson/hcloud-cloud-controller-manager/tree/root-server-support
image: https://hub.docker.com/r/acjohnson/hcloud-cloud-controller-manager

This seems to work almost perfectly, with only a couple of transient messages in the cloud controller's logs, such as:

I1117 01:31:27.718391       1 util.go:39] hcloud/getServerByName: server with name kube02 not found, are the name in the Hetzner Cloud and the node name identical?
E1117 01:31:27.718445       1 node_controller.go:245] Error getting node addresses for node "kube02": error fetching node by provider ID: hcloud/instances.NodeAddressesByProviderID: hcloud/providerIDToServerID: missing prefix hcloud://: , and error by node name: hcloud/instances.NodeAddresses: instance not found

...but otherwise load balancer creation works and ignores all nodes that have the instance.hetzner.cloud/is-root-server=true label set.
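To double-check which nodes the load balancer logic will skip, listing by that label works:

$ kubectl get nodes -l instance.hetzner.cloud/is-root-server=true
# should print exactly the Robot/bare-metal nodes that were labelled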

I'd file a PR but this really isn't my work, just a few fixes on top of what y'all have already done.

Hoping something more legit will make its way into this repo but for now this will have to do.
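If you want to try that image against an existing install, a sketch (the deployment and container names are assumptions, adjust them to your manifest, and pick a tag from the Docker Hub page above):

$ kubectl -n kube-system set image deployment/hcloud-cloud-controller-manager \
    hcloud-cloud-controller-manager=acjohnson/hcloud-cloud-controller-manager:<tag>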

acjohnson commented 2 years ago

@LKaemmerling would you consider reopening this issue, as there is a fair bit of support for this feature and quite a bit of hacking has already gone into it?

malikkirchner commented 2 years ago

@acjohnson thank you for improving on Boris' change.

maaft commented 1 year ago

Uhm, why is this closed? Currently it does not work. What can I do, please? Any step-by-step instructions on how I can provision an LB connected to my 3 root servers?

batistein commented 1 year ago

use this one: https://github.com/syself/hetzner-cloud-controller-manager

batistein commented 1 year ago

It's already fully integrated with: https://github.com/syself/cluster-api-provider-hetzner

maaft commented 1 year ago

Ah, yes. I've read about that CAPI a few days ago already. Thanks mate!

maaft commented 1 year ago

I'm getting Cloud provider could not be initialized: unknown cloud provider "hetzner" from the logs.

Any Idea how to fix this?

batistein commented 1 year ago

Sounds like you have the wrong provider argument in the deployment... Did you only replace the image? See: https://github.com/syself/hetzner-cloud-controller-manager/blob/master/deploy/ccm.yaml#L63
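A quick way to see which provider the running CCM was actually started with (the deployment name is an assumption, adjust to your install); upstream uses --cloud-provider=hcloud, while the syself fork expects --cloud-provider=hetzner:

$ kubectl -n kube-system get deployment hcloud-cloud-controller-manager \
    -o jsonpath='{.spec.template.spec.containers[0].command}'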

maaft commented 1 year ago

Well, after removing the "old" ccm, I installed the suggested one with:

kubectl apply -f https://github.com/syself/hetzner-cloud-controller-manager/releases/latest/download/ccm.yaml

Which contains:

containers:
        - image: quay.io/syself/hetzner-cloud-controller-manager:v1.13.0-0.0.1
          name: hcloud-cloud-controller-manager
          command:
            - "/bin/hetzner-cloud-controller-manager"
            - "--cloud-provider=hetzner"
            - "--leader-elect=false"
            - "--allow-untagged-cloud"

Any slack/discord channels available? Don't want to spam this issue here further.

batistein commented 1 year ago

Kubernetes Slack workspace, channel #hetzner.