rancher / elemental

Elemental is a software stack enabling centralized, full cloud-native OS management with Kubernetes.
https://elemental.docs.rancher.com/
Apache License 2.0

Elemental CAPI controlPlaneEndpoint provisioning #1011

Open anmazzotti opened 1 year ago

anmazzotti commented 1 year ago

This issue is about the controlPlaneEndpoint infrastructure logic.

For reference, check the Cluster provider contract.

In particular:

(The ElementalCluster) Must have a `spec` field with the following:

    Required fields:
        controlPlaneEndpoint (apiEndpoint): the endpoint for the cluster’s control plane. apiEndpoint is defined as:
            host (string): DNS name or IP address
            port (int32): TCP port

And:

If the provider created a load balancer for the control plane, record its hostname or IP in spec.controlPlaneEndpoint

In CAPI the controlPlaneEndpoint is the address of the downstream Kubernetes control plane. The CAPI operator installed on the management cluster will try to contact the downstream cluster using this endpoint, to confirm provisioning was successful. This implies the CAPI management cluster needs connectivity to the downstream cluster, through a public IP, internal network, VPN tunnel or other means.
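
As a rough illustration of that connectivity requirement, here is a minimal Go sketch of a preflight check one could run from the management cluster. The `APIEndpoint` type only mirrors the `host`/`port` fields quoted from the contract above; `Reachable` and `Address` are hypothetical helpers, not part of CAPI or Elemental. A plain TCP dial is a much weaker check than whatever CAPI does through the Kubernetes API, so treat it as a network sanity test only.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"time"
)

// APIEndpoint mirrors the two fields required by the Cluster provider
// contract: host (DNS name or IP address) and port (TCP port).
type APIEndpoint struct {
	Host string
	Port int32
}

// IsZero reports whether the endpoint was left blank.
func (e APIEndpoint) IsZero() bool { return e.Host == "" && e.Port == 0 }

// Address formats the endpoint as host:port (IPv6 hosts get brackets).
func (e APIEndpoint) Address() string {
	return net.JoinHostPort(e.Host, strconv.Itoa(int(e.Port)))
}

// Reachable (hypothetical helper) opens and closes a TCP connection to the
// endpoint. It only proves the network path exists, nothing more.
func Reachable(e APIEndpoint, timeout time.Duration) error {
	if e.IsZero() {
		return fmt.Errorf("controlPlaneEndpoint is not set")
	}
	conn, err := net.DialTimeout("tcp", e.Address(), timeout)
	if err != nil {
		return fmt.Errorf("dial %s: %w", e.Address(), err)
	}
	return conn.Close()
}
```

Calling `Reachable(APIEndpoint{Host: "10.0.0.1", Port: 6443}, 3*time.Second)` from a pod on the management cluster would return an error if no public IP, internal route, or VPN tunnel connects the two clusters. The address and port here are placeholder values.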

The correct setup of any control plane load balancer can be part of the infrastructure provider logic, for example when creating a load balancer dynamically. This is not a strict requirement, however: leaving it to the end user is also an option.

For example, let's say we implement zero logic regarding this. Take the scenario where I am an administrator with 10 Elemental machines running, and I want to use them to provision a new cluster with 3 control plane nodes and 7 worker nodes.
Then I could:

  1. Mark 3 ElementalHosts with a control-plane: "true" label.
  2. Create a new CAPI cluster manifest.
  3. Edit the ElementalMachineTemplate associated with the control plane resource to add a control-plane: "true" label selector.
  4. Edit the controlPlaneEndpoint to point to the address of any of the 3 machines (as admin I know this in advance, or I could learn it by looking at the ElementalHost resource).

Now this should work. The endpoint will refer to only one of the 3 control plane nodes, so there will be no load balancing. This could be acceptable for small or single-node clusters, but it is prone to failure: if anything happens to the arbitrarily selected control plane node, the entire cluster will be unreachable (until someone manually updates the controlPlaneEndpoint reference).
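
To make the manual flow concrete, here is a small Go sketch of what the label selection and the static endpoint choice amount to. The `Host` struct is a hypothetical stand-in for an ElementalHost (the real resource schema is not shown in this issue), `SelectByLabel` and `StaticEndpoint` are made-up helpers, and port 6443 in the usage note is an assumption based on the default kube-apiserver secure port.

```go
package main

import (
	"net"
	"sort"
	"strconv"
)

// Host is a hypothetical stand-in for an ElementalHost.
type Host struct {
	Name    string
	Labels  map[string]string
	Address string
}

// SelectByLabel returns the hosts carrying the label key=value, sorted by
// name so the result is stable between calls.
func SelectByLabel(hosts []Host, key, value string) []Host {
	var out []Host
	for _, h := range hosts {
		if h.Labels[key] == value {
			out = append(out, h)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// StaticEndpoint formats host:port for one arbitrarily chosen host.
// There is no load balancing: the endpoint lives and dies with that host.
func StaticEndpoint(h Host, port int) string {
	return net.JoinHostPort(h.Address, strconv.Itoa(port))
}
```

With 10 hosts of which 3 are labelled control-plane: "true", `SelectByLabel(hosts, "control-plane", "true")` returns those 3, and `StaticEndpoint(selected[0], 6443)` yields the single address the admin would write into controlPlaneEndpoint.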

Another option on the user side is to include kube-vip.
This is probably easy in a datacenter scenario, where all machines belong to an internal network and kube-vip can be set up in ARP or BGP mode. It would not be that easy in an edge scenario where, for example, the machine has only one network interface and a single public IP. In that case I would personally use kube-vip in WireGuard mode (and by no means expose my control planes to the public), but this also means additional setup complexity.

That said, to me it is completely viable to leave this task to the end user to figure out; this would be part of their network infrastructure design.

However, the Elemental provider could implement some quality-of-life features. One piece of low-hanging fruit here is to allow the end user to leave the controlPlaneEndpoint blank. If the field is blank, the operator can then:

  1. Get all ElementalMachines belonging to the cluster that carry the cluster.x-k8s.io/control-plane label.
  2. For each ElementalMachine, check if the underlying ElementalHost is bootstrapped, then update the Cluster controlPlaneEndpoint with the ElementalHost (public? which interface?) address.

Similarly, on ElementalMachine deletion, the operator could figure out whether the controlPlaneEndpoint of the linked cluster needs an update (to a different control plane address).
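
A minimal Go sketch of that idea, covering both the blank-endpoint case and the deletion case, could look like the following. Everything here is an assumption made for illustration: the `Machine`, `Host` and `Cluster` structs are simplified stand-ins rather than the real Elemental or CAPI types, `cluster.x-k8s.io/control-plane` is treated as a label key, and `EndpointManaged` is a hypothetical marker I added so the operator never overwrites an endpoint the user set themselves (for example a kube-vip address). It does not answer the "public? which interface?" question, and it does not address whether an already provisioned cluster tolerates an endpoint change.

```go
package main

import "sort"

const controlPlaneLabel = "cluster.x-k8s.io/control-plane"

type APIEndpoint struct {
	Host string
	Port int32
}

// Host and Machine are simplified stand-ins for ElementalHost and
// ElementalMachine; field names are hypothetical.
type Host struct {
	Address      string
	Bootstrapped bool
}

type Machine struct {
	Name     string
	Labels   map[string]string
	Host     *Host
	Deleting bool
}

type Cluster struct {
	Endpoint        APIEndpoint
	EndpointManaged bool // hypothetical: true only if the operator chose the endpoint
}

// candidates returns control plane machines that are not being deleted and
// whose host is bootstrapped with a known address, sorted by name.
func candidates(machines []Machine) []Machine {
	var out []Machine
	for _, m := range machines {
		_, isControlPlane := m.Labels[controlPlaneLabel]
		if !isControlPlane || m.Deleting || m.Host == nil {
			continue
		}
		if m.Host.Bootstrapped && m.Host.Address != "" {
			out = append(out, m)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// ReconcileEndpoint fills a blank endpoint, or replaces an operator-chosen
// one whose machine is gone. It reports whether the cluster was changed.
func ReconcileEndpoint(c *Cluster, machines []Machine, port int32) bool {
	if c.Endpoint.Host != "" && !c.EndpointManaged {
		return false // set by the user: leave it alone
	}
	cands := candidates(machines)
	for _, m := range cands {
		if m.Host.Address == c.Endpoint.Host {
			return false // current endpoint is still backed by a live machine
		}
	}
	if len(cands) == 0 {
		return false // nothing bootstrapped yet, try again later
	}
	c.Endpoint = APIEndpoint{Host: cands[0].Host.Address, Port: port}
	c.EndpointManaged = true
	return true
}
```

Calling `ReconcileEndpoint` on every ElementalMachine create, update and delete event would be enough to keep the endpoint pointing at some live control plane node. It still gives no load balancing, only automatic failover of the reference.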

TL;DR Can we simply expect the user to take care of control plane load balancing (through kube-vip or other means), or does the Elemental provider need to implement any logic to improve user experience?

anmazzotti commented 1 year ago

Linking parent epic: #968