Hey @vitali-ipquants,
right now it is not possible to run multiple HCCM instances at the same time. Both would try to process all nodes.
Would something like a node label selector that you can configure per HCCM work for you, so only a subset of nodes are processed by each CCM?
> but we still need the routes for the POD networks which CCM creates.
What sort of routes would you expect the HCCM to create in this case? Does this only include the pod networks for nodes in the same network or would you expect routes to be created also for pod networks in other private networks?
Hey Julian,
Thank you for looking at this question!
I'm not ready with reasonable answers to your questions yet; I need more time to experiment and understand better what we need/want (and what would sound reasonable). When I'm ready I'll continue the discussion.
Vitali
Hi @apricote,
It took us some time to clarify what we think could work and makes sense.
The text below is quite long, but we wanted to be as clear as possible.
One clarification: if Hetzner increases the limit for servers in a network, then we could probably close this topic. Are you aware of any such plans?
Hetzner Cloud Controller Manager (HCCM) supports a k8s cluster which runs inside a single Hetzner network. This limits the size of the cluster because of Hetzner restrictions such as the maximum of 100 servers attached to a network and the maximum of 100 routes per network.
Our experiments are with Talos (talos.dev) and Flannel as the CNI plugin.
The setup briefly is:
- All control plane nodes are put in the `control-plane` network. This network is configured in the `HCLOUD_NETWORK` env var of HCCM.
- We have two worker networks. The idea here is that we'll scale the cluster up by adding worker networks; each new worker network will increase the capacity by about 100 servers.
Connectivity between these networks is ensured by so-called Interconnect Gateway (ICGW) machines. In each worker network we'll add a pair of ICGW machines; they will also be attached to the `control-plane` network.
There are two ICGW machines for redundancy/HA reasons. At any given moment only one of them is active, and it holds an AliasIP for its worker network and an AliasIP for the `control-plane` network. (The two ICGW machines run keepalived, which decides which one of them is active at the moment.)
In each worker network we add a route with destination 0.0.0.0/0 and the ICGW AliasIP as gateway. The ICGW machine routes all worker networks to the control plane network gateway (.1). We'll have to find a way to automate the creation of these routes (see the sketch below), but this is out of the scope of HCCM.
In the `control-plane` network we add a route with the given worker network as destination and the ICGW AliasIP as gateway.
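To make the automation idea concrete, here is a minimal sketch using the hcloud-go client; the network name and the AliasIP are illustrative values from our setup, not anything HCCM provides:

```go
package main

import (
	"context"
	"log"
	"net"
	"os"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

func main() {
	client := hcloud.NewClient(hcloud.WithToken(os.Getenv("HCLOUD_TOKEN")))
	ctx := context.Background()

	// Illustrative values: a worker network and the AliasIP currently
	// held by the active ICGW machine of that network.
	network, _, err := client.Network.GetByName(ctx, "worker-nodes-1")
	if err != nil || network == nil {
		log.Fatalf("looking up network: %v", err)
	}

	// Default route sending all traffic from the worker network to the ICGW.
	_, dest, _ := net.ParseCIDR("0.0.0.0/0")
	_, _, err = client.Network.AddRoute(ctx, network, hcloud.NetworkAddRouteOpts{
		Route: hcloud.NetworkRoute{
			Destination: dest,
			Gateway:     net.ParseIP("10.1.0.100"), // ICGW AliasIP (example)
		},
	})
	if err != nil {
		log.Fatalf("adding route: %v", err)
	}
}
```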
This configuration ensures that all cluster nodes can reach each other no matter which network they are in: traffic leaving a worker network follows the 0.0.0.0/0 route to the active ICGW, which forwards it toward the control plane network, while the per-worker-network routes in the `control-plane` network carry return traffic back through the same ICGW.
Internet connectivity
The control plane and worker nodes have no public network; they are attached only to their Hetzner private networks.
The ICGW machines in the worker networks also act as Internet gateways; we do SNAT via iptables.
For the control plane network we have a pair of VMs which act only as Internet gateways.
I attached a diagram which visualizes the idea, it could be helpful.
The HCCM creates routes for each node's pod network. We tested running HCCM with the env var HCLOUD_NETWORK_ROUTES_ENABLED=false, so it no longer creates the nodes' pod network routes.
If we use HCCM with the routes feature disabled, then we'll never exhaust the limit of 100 routes per network.
We made a really ugly patch of HCCM in order to test if what we want is possible.
We changed the code here to not filter by network.ID. With this change HCCM successfully initializes all worker nodes, even when they are not attached to the network configured in the `HCLOUD_NETWORK` env var.
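For reference, the essence of the patch is dropping the network-ID check when collecting private-network addresses. A paraphrased sketch (not the actual diff; the helper name is ours):

```go
import (
	"github.com/hetznercloud/hcloud-go/hcloud"
	v1 "k8s.io/api/core/v1"
)

// privateAddresses paraphrases the address-collection loop in HCCM's
// nodeAddresses(). Upstream keeps only the private network whose ID matches
// HCLOUD_NETWORK; our "ugly patch" effectively sets filter to false.
func privateAddresses(server *hcloud.Server, networkID int, filter bool) []v1.NodeAddress {
	var addresses []v1.NodeAddress
	for _, privNet := range server.PrivateNet {
		if filter && privNet.Network.ID != networkID {
			continue // the check our patch removed
		}
		addresses = append(addresses, v1.NodeAddress{
			Type:    v1.NodeInternalIP,
			Address: privNet.IP.String(),
		})
	}
	return addresses
}
```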
So of course we don't want an ugly patch like the one above; we'd like a more reliable solution.
For backward-compatibility reasons, HCCM will continue to support single-network mode; the network will be set in HCLOUD_NETWORK.
NB: At this point it is unclear what we'll do with the routes for the nodes' pod networks, because the limit for routes on Hetzner networks can be exhausted if the number of nodes is greater than 100. I know that I said we'd need these routes in the initial comment of the ticket, but we realized that we don't need them.
For multiple network support we consider the following variables:
- `HCLOUD_CONTROL_PLANE_NETWORK` - the network of the control plane nodes (checked via `serverIsAttachedToNetwork`).
- `HCLOUD_WORKER_NETWORK_REGEX` (or `_PREFIX`) - a regular expression which will match the worker networks by their name. This variable will be used together with `HCLOUD_CONTROL_PLANE_NETWORK` in the `instances.go` function `nodeAddresses`: if a given node network matches either the regex or `HCLOUD_CONTROL_PLANE_NETWORK`, then this address will be part of the result of the `nodeAddresses` function (sketched below).
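A minimal sketch of the matching we have in mind (the helper is hypothetical, not existing HCCM code):

```go
import (
	"os"
	"regexp"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

// networkMatches reports whether a node's private network should contribute
// an address to the nodeAddresses result: either it is the control plane
// network, or its name matches HCLOUD_WORKER_NETWORK_REGEX.
func networkMatches(privNet hcloud.ServerPrivateNet) bool {
	if privNet.Network.Name == os.Getenv("HCLOUD_CONTROL_PLANE_NETWORK") {
		return true
	}
	re, err := regexp.Compile(os.Getenv("HCLOUD_WORKER_NETWORK_REGEX"))
	if err != nil {
		return false // invalid regex: match nothing
	}
	return re.MatchString(privNet.Network.Name)
}
```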
LoadBalancer support - at the moment the implementation of the LoadBalancer interface uses the sole network specified in the HCLOUD_NETWORK env var.
There are two possible ways to overcome this limitation.
Annotation approach: in the k8s spec for the load balancer Service there will be an annotation specifying the network name to be used. This network name should be either the control plane network or one of the worker networks. This approach would address several issues:
The Hetzner LoadBalancer will use as targets only the nodes inside the specified network (whereas at the moment it tries to add all nodes, which fails if they are not in the same network).
This way we'll have an effective way to control which nodes sit behind a given LoadBalancer; we could also control their size, etc. We might have different worker networks for different services, each of them with its own load balancer.
A dedicated variable HCLOUD_LOAD_BALANCER_NETWORK - this is a sub-case of the above suggestion. It is not as flexible as the annotation approach: just one of the worker networks will be used for exposing k8s services, and all load balancers will be attached to this network. But in this case we'll again have better control over which nodes become targets of the load balancers.
Probably both approaches could be implemented: if there's no annotation, the LoadBalancer controller will fall back to the dedicated HCLOUD_LOAD_BALANCER_NETWORK variable.
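The combined selection could look roughly like this (the annotation key is made up for this sketch; HCCM has no such annotation today):

```go
import (
	"os"

	v1 "k8s.io/api/core/v1"
)

// Hypothetical annotation key, invented for this sketch.
const lbNetworkAnnotation = "load-balancer.hetzner.cloud/network-name"

// lbNetworkName picks the network for a LoadBalancer Service: a per-Service
// annotation wins; otherwise fall back to HCLOUD_LOAD_BALANCER_NETWORK.
func lbNetworkName(svc *v1.Service) string {
	if name, ok := svc.Annotations[lbNetworkAnnotation]; ok {
		return name
	}
	return os.Getenv("HCLOUD_LOAD_BALANCER_NETWORK")
}
```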
Hey @vitali-ipquants,
sorry for the very late response.
> If Hetzner increases the limit for servers in a network, then we could probably close this topic. Are you aware of any such plans?
This is not planned.
> `HCLOUD_CONTROL_PLANE_NETWORK`, `HCLOUD_WORKER_NETWORK_REGEX`, `HCLOUD_LOAD_BALANCER_NETWORK`
This is a very specialized interface that touches a lot of our code to support one (niche) use case.
I counter-propose the following interface that should still work for your use case:
We add a new config option `HCLOUD_SERVER_LABEL_SELECTOR`. Based on this label selector, HCCM decides if it wants to "own" the server/node or not.
You can then deploy multiple instances of hcloud-cloud-controller-manager to handle the different tasks:
For Control Plane:
```
HCLOUD_NETWORK_ROUTES_ENABLED: "false"
HCLOUD_LOAD_BALANCERS_ENABLED: "false"
HCLOUD_NETWORK: cluster-control-plane
HCLOUD_SERVER_LABEL_SELECTOR: node-group=control-plane
```
For Service Nodes (LB) - this might work in tandem with #373:
```
HCLOUD_NETWORK_ROUTES_ENABLED: "false"
HCLOUD_LOAD_BALANCERS_ENABLED: "true"
HCLOUD_NETWORK: cluster-service-nodes
HCLOUD_SERVER_LABEL_SELECTOR: node-group=service-nodes
```
For Worker K Nodes:
```
HCLOUD_NETWORK_ROUTES_ENABLED: "false"
HCLOUD_LOAD_BALANCERS_ENABLED: "false"
HCLOUD_NETWORK: cluster-worker-k
HCLOUD_SERVER_LABEL_SELECTOR: node-group=worker-k
```
Hi Julian,
My team will need some time to think about your idea, I'll reply soon. Thank you for looking into this!
Vitali
Hi again @apricote,
What you suggest seems a lot easier and simpler. We'll try to implement the HCLOUD_SERVER_LABEL_SELECTOR config option and test whether it works in our PoC env.
If it works, we might open a PR.
I'll update you when we have news. Thanks again, Vitali
Hi @apricote,
I am currently trying out your proposal for label selector-based filtering of servers in HCCM.
I've hit a problem with the implementation, and I need some input from your side, as it imposes some design decisions to be made.
Please have a look at the problem and the solution proposal below and let us know if you find this acceptable.
It'd also be great to hear if you think there's a clearer and simpler solution to this, or that it's not actually a problem and we have missed/overlooked something.
Thanks!
The node controller (as far as I understand it) runs in the HCCM process and calls the syncNode() function, in an infinite loop, for each node it finds in the Kubernetes API server. syncNode() calls into HCCM's InstancesV2 implementation, passing in the Node object. HCCM's InstancesV2 looks up the corresponding server either by providerID (if it exists) or by name.
Presumably the first call looks up the server by name, as the providerID is not yet set in the Node's definition. This lookup calls into the List() method, which accepts ListOpts where we can specify the label selector that had been set through the HCLOUD_SERVER_LABEL_SELECTOR environment variable as per your proposal.
Subsequent lookups, however, will be done by providerID, since it has already been set from the result of the first lookup. The GetByID() API call accepts just an ID, and no label selector filtering can be done there.
Add yet another environment variable (e.g., HCLOUD_CCM_INSTANCE_ID) whose value will need to be set to something unique per HCCM instance. This value would then become a part of the providerID that is returned by HCCM's InstancesV2 to the node controller in the first lookup (the one made by the Node's name). Subsequent calls will parse the providerID, compare the HCCM instance ID part to their own instance ID (set in HCLOUD_CCM_INSTANCE_ID), and return data about the Node only if it matches.
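To make the proposal concrete, the encoding/parsing could look like this; the hcloud://&lt;instance-id&gt;/&lt;server-id&gt; format is our invention (as far as we can tell, HCCM's current format is just hcloud://&lt;server-id&gt;):

```go
import (
	"fmt"
	"strconv"
	"strings"
)

// buildProviderID embeds the per-HCCM instance ID into the providerID.
// Hypothetical format: hcloud://<instance-id>/<server-id>.
func buildProviderID(instanceID string, serverID int) string {
	return fmt.Sprintf("hcloud://%s/%d", instanceID, serverID)
}

// parseProviderID recovers both parts; an HCCM instance would only answer
// for nodes whose instanceID matches its own HCLOUD_CCM_INSTANCE_ID.
func parseProviderID(providerID string) (instanceID string, serverID int, err error) {
	rest, ok := strings.CutPrefix(providerID, "hcloud://")
	if !ok {
		return "", 0, fmt.Errorf("unexpected providerID format: %q", providerID)
	}
	parts := strings.SplitN(rest, "/", 2)
	if len(parts) != 2 {
		return "", 0, fmt.Errorf("no instance ID in providerID: %q", providerID)
	}
	serverID, err = strconv.Atoi(parts[1])
	return parts[0], serverID, err
}
```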
Updating labels won't be honored
The solution proposed above implies that switching the HCCM instance "owning" a server by changing its labels will not be supported, as the label selector-based filtering is done only on the server's first lookup.
providerID contract
The solution proposed above changes the format of the providerID field. This means that migrating an existing cluster from multi-HCCM instances to a single one, or vice versa, will not be possible.
Both points will need to be clearly documented.
Hey Alex & Vitali,
I took some hours today to evaluate this again and deep dive into the code.
> Presumably the first call looks up the server by name, as the providerID is not yet set in the Node's definition. This lookup calls into the List() method, which accepts ListOpts where we can specify the label selector that had been set through the HCLOUD_SERVER_LABEL_SELECTOR environment variable as per your proposal.
I think there was a misunderstanding: my suggestion of HCLOUD_SERVER_LABEL_SELECTOR was supposed to match Kubernetes Node labels. I have to admit that my naming was not very good for this.
Whenever k/cloud-provider calls into our methods, we get the full Node objects, and filtering by a label on those should be easy (see the sketch below).
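For illustration, matching a selector against Node labels could be as simple as this sketch using the apimachinery labels package (the helper name is made up):

```go
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// ownsNode reports whether this HCCM instance is responsible for the node,
// based on a selector string such as "node-group=worker-k" taken from the
// proposed HCLOUD_SERVER_LABEL_SELECTOR option.
func ownsNode(selector string, node *v1.Node) (bool, error) {
	sel, err := labels.Parse(selector)
	if err != nil {
		return false, err
	}
	return sel.Matches(labels.Set(node.Labels)), nil
}
```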
I do not believe that we need to change the format of the provider ID to make this work for you, but adding this functionality is also not a priority for us, as this is outside of what we want to support.
I took a look at how the different controllers in k/c-p react if we were to deploy multiple instances with different configurations for them.
Node controller
Purpose: Removes the external-cloud-provider taint, updates node.status.addresses and well-known labels (topology, region, instance type).
How: Calls instancev2.InstanceMetadata(node).
If we return nil from instancev2.InstanceMetadata(node), the node initializer does nothing; node.status.addresses, however, is updated in updateNodeAddress() even if we return no metadata from instancev2.InstanceMetadata(node).
Insights:
=> Node Initializer would work well with a label selector
=> Node Status Updater might crash when trying to update addresses for unselected Nodes
=> Adding a check for `metadata == nil { return }` to the status update would be easy, and would be similar to what the initializer is doing.
=> Deploying multiple CCMs right now would cause conflicts between the node.status.addresses updates coming from them (containing different/no InternalIPs)
Instance lifecycle controller
Purpose: Periodically checks with the cloud to see if any nodes are shutdown/deleted -> if deleted, delete the Node object; if shutdown, add a taint.
How: Calls instancev2.InstanceExists(node) & instancev2.InstanceShutdown(node).
Insights:
=> While we could check the labels of the node in InstanceExists & InstanceShutdown, returning no-info here would lead to the instance being deleted.
=> These methods still need to handle responses for all nodes.
=> Nothing in here depends on the network that the servers are connected to.
=> This loop runs on a fixed interval; running x CCMs will cause x times the number of API requests.
Route controller
Cannot work in this setup, as there will be more than 100 Nodes/Pod CIDRs for which we have to set up routes.
Service controller
Purpose: Provide Cloud LoadBalancers for Services with type: LoadBalancer and add all nodes as targets.
How: Calling Service.EnsureLoadBalancer()
Insights:
=> Filtering for nodes by label selector (or any other criteria) in our implementation of Service.EnsureLoadBalancer() should be easy enough
General insights:
=> It's possible to disable specific controllers using the --controllers flag.
Deploying multiple CCMs with different configurations and purposes is theoretically possible.
In practice this fails because UpdateNodeStatus in the node controller does not handle a nil metadata response properly. If this were fixed, I believe it's possible to deploy as you intend.
The patch might look something like this:
```diff
diff --git a/controllers/node/node_controller.go b/controllers/node/node_controller.go
index ace70dd..640b351 100644
--- a/controllers/node/node_controller.go
+++ b/controllers/node/node_controller.go
@@ -277,6 +277,11 @@ func (cnc *CloudNodeController) UpdateNodeStatus(ctx context.Context) error {
 			klog.Errorf("Error getting instance metadata for node addresses: %v", err)
 			return
 		}
+		if instanceMetadata == nil {
+			// do nothing when external cloud providers provide nil instanceMetadata
+			klog.Infof("Skip update node %s because cloud provided nil metadata", node.Name)
+			return
+		}
 		cnc.updateNodeAddress(ctx, node, instanceMetadata)
```
Hi Julian,
First of all - thank you very much for the time you have spent on this ticket. The insights you pointed out are really helpful...
In the end we gave up; we will not pursue the idea of bringing up a 100+ node k8s cluster in Hetzner Cloud.
Thank you again for the help!
Cheers :beers: Vitali and Alex
Very well then, it was an interesting ride and I learned a lot, so thanks for the opportunity! :)
Hello,
My team is exploring the option to run a k8s cluster which contains more than 100 nodes, so we're wondering whether there is a way to do this with hcloud CCM.
We'll ensure connectivity between the different private networks, but we still need the routes for the POD networks which CCM creates.
Will it be possible to run multiple CCM instances, one instance per private network?
If there's no way to do it at the moment, could you please share if you have plans to implement such support?
Thank you!