hetznercloud / hcloud-cloud-controller-manager

Kubernetes cloud-controller-manager for Hetzner Cloud
Apache License 2.0

Support for multiple networks #342

Closed vitali-ipquants closed 1 year ago

vitali-ipquants commented 1 year ago

Hello,

My team is exploring the option to run a k8s cluster with more than 100 nodes, so we're wondering whether there is a way to do this with the hcloud CCM.

We'll ensure connectivity between the different private networks, but we still need the routes for the POD networks which CCM creates.

Would it be possible to run multiple CCM instances, one instance per private network?

If there's no way to do it at the moment, could you please share if you have plans to implement such support?

Thank you!

apricote commented 1 year ago

Hey @vitali-ipquants,

right now it is not possible to run multiple HCCM instances at the same time; each would try to process all nodes.

Would something like a node label selector that you can configure per HCCM work for you, so only a subset of nodes are processed by each CCM?

> but we still need the routes for the POD networks which CCM creates.

What sort of routes would you expect the HCCM to create in this case? Does this only include the pod networks for nodes in the same network or would you expect routes to be created also for pod networks in other private networks?

vitali-ipquants commented 1 year ago

Hey Julian,

Thank you for looking at this question!

I'm not yet ready with reasonable answers to your questions; I need more time to experiment and better understand what we need and want (and what would also sound reasonable). When I'm ready I'll continue the discussion.

Vitali

vitali-ipquants commented 1 year ago

Hi @apricote,

It took us some time to clarify what we think could work and make sense.

The text below is quite long, but we wanted to be as clear as possible.

One clarification: if Hetzner increases the limit for servers in a network, then we could probably close this topic. Are you aware of such plans?

Current state

The Hetzner Cloud Controller Manager (HCCM) supports a k8s cluster which runs inside a single Hetzner network. This puts a limit on the size of the cluster because of Hetzner limitations, like the limit of 100 servers per network and 100 routes per network.

Our experiments are with Talos (talos.dev) and Flannel as the CNI plugin.

The setup briefly is:

I attached a diagram which visualizes the idea; it could be helpful.

How we made this work

We disabled route creation

The HCCM creates routes for each node's POD network. We tested running the HCCM with the env var HCLOUD_NETWORK_ROUTES_ENABLED=false, so it no longer creates the nodes' POD network routes.

If we use the HCCM with the routes feature disabled, then we'll never exhaust the limit of 100 routes per network.

Patched version of HCCM

We made a really ugly patch of the HCCM in order to test whether what we want is possible. We changed the code here to not filter by network.ID.

With this change HCCM successfully initializes all worker nodes, even when they are not attached to the network configured in HCLOUD_NETWORK env var.
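For illustration only, a minimal sketch of the kind of network.ID filter our patch effectively removed (this is not the actual HCCM source; privateIPForNetwork and the values are made up, but it shows the shape of the check we dropped):

```go
package main

import (
	"fmt"
	"net"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

// privateIPForNetwork returns the server's IP in the configured network.
// Our patched build effectively skipped the networkID comparison and used
// whichever private network the server is attached to.
func privateIPForNetwork(server *hcloud.Server, networkID int) (net.IP, bool) {
	for _, attachment := range server.PrivateNet {
		if attachment.Network != nil && attachment.Network.ID == networkID {
			return attachment.IP, true
		}
	}
	return nil, false
}

func main() {
	server := &hcloud.Server{
		Name: "worker-k-1",
		PrivateNet: []hcloud.ServerPrivateNet{
			{Network: &hcloud.Network{ID: 42}, IP: net.ParseIP("10.1.0.2")},
		},
	}
	if ip, ok := privateIPForNetwork(server, 42); ok {
		fmt.Println("internal IP of", server.Name, "is", ip)
	}
}
```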

Proposal

So of course we don't want an ugly patch like the one above; we'd like a more reliable solution.

For backward compatibility reasons, the HCCM will continue to support single-network mode, with the network set in HCLOUD_NETWORK.

NB: At this point it is unclear what we'll do with routes for node's POD networks, because the limit for routes on Hetzner networks can be exhausted if the number of nodes is greater than 100. I know that I said we'll need these routes in the initial comment of the ticket, but we realized that we don't need them.

For multiple network support we consider the following variables: HCLOUD_CONTROL_PLANE_NETWORK, HCLOUD_WORKER_NETWORK_REGEX, and HCLOUD_LOAD_BALANCER_NETWORK.

apricote commented 1 year ago

Hey @vitali-ipquants,

sorry for the very late response.

> If Hetzner increases the limit for servers in a network, then we could probably close this topic. Are you aware of such plans?

This is not planned.

> HCLOUD_CONTROL_PLANE_NETWORK, HCLOUD_WORKER_NETWORK_REGEX, HCLOUD_LOAD_BALANCER_NETWORK

This is a very specialized interface that touches a lot of our code to support one (niche) use case.

I counter-propose the following interface that should still work for your use case:

We add a new config option HCLOUD_SERVER_LABEL_SELECTOR. Based on this label selector HCCM decides if it wants to "own" the server/node or not.

You can then deploy multiple instances of hcloud-cloud-controller-manager to handle the different tasks:

  1. For Control Plane

    HCLOUD_NETWORK_ROUTES_ENABLED: "false"
    HCLOUD_LOAD_BALANCERS_ENABLED: "false"
    
    HCLOUD_NETWORK: cluster-control-plane
    HCLOUD_SERVER_LABEL_SELECTOR: node-group=control-plane
  2. For Service Nodes (LB). This might work in tandem with #373

    HCLOUD_NETWORK_ROUTES_ENABLED: "false"
    HCLOUD_LOAD_BALANCERS_ENABLED: "true"
    
    HCLOUD_NETWORK: cluster-service-nodes
    HCLOUD_SERVER_LABEL_SELECTOR: node-group=service-nodes
  3. For Worker K Nodes

    HCLOUD_NETWORK_ROUTES_ENABLED: "false"
    HCLOUD_LOAD_BALANCERS_ENABLED: "false"
    
    HCLOUD_NETWORK: cluster-worker-k
    HCLOUD_SERVER_LABEL_SELECTOR: node-group=worker-k

vitali-ipquants commented 1 year ago

Hi Julian,

My team will need some time to think about your idea, I'll reply soon. Thank you for looking into this!

Vitali

vitali-ipquants commented 1 year ago

Hi again @apricote,

What you suggest seems a lot easier and simpler. We'll try to implement the HCLOUD_SERVER_LABEL_SELECTOR config option and test whether it works in our PoC environment.

If it works, we might open a PR.

I'll update you when we have news. Thanks again, Vitali

sasoiliev commented 1 year ago

Hi @apricote,

I am currently trying out your proposal for label selector-based filtering of servers in HCCM.

I've hit a problem with the implementation, and I need some input from your side, as it imposes some design decisions to be made.

Please have a look at the problem and the solution proposal below and let us know if you find this acceptable.

It'd also be great to hear if you think there's a clearer and simpler solution to this, or that it's not actually a problem and we have missed/overlooked something.

Thanks!

Problem statement

The node controller (as far as I understand it) runs in the HCCM process and, in an infinite loop, calls the syncNode() function for each node it finds in the Kubernetes API server.

syncNode() calls into HCCM's InstancesV2 implementation, passing in the Node object. HCCM's InstancesV2 looks up the corresponding server either by providerID (if it exists) or by name.
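For reference, this is roughly the InstancesV2 interface from k8s.io/cloud-provider that HCCM implements; the stub below is only a placeholder to show the three methods the node controller calls, not real HCCM code:

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	cloudprovider "k8s.io/cloud-provider"
)

// stubInstances is a placeholder implementation of the InstancesV2 interface,
// just to show the three methods the node controller calls on HCCM.
type stubInstances struct{}

func (stubInstances) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
	return true, nil
}

func (stubInstances) InstanceShutdown(ctx context.Context, node *v1.Node) (bool, error) {
	return false, nil
}

func (stubInstances) InstanceMetadata(ctx context.Context, node *v1.Node) (*cloudprovider.InstanceMetadata, error) {
	// HCCM fills this from the Hetzner API: providerID, instance type,
	// node addresses, zone and region.
	return &cloudprovider.InstanceMetadata{ProviderID: "hcloud://12345"}, nil
}

// Compile-time check that the stub satisfies the interface.
var _ cloudprovider.InstancesV2 = stubInstances{}

func main() {}
```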

Presumably the first call looks up the server by name, as the providerID is not yet set in the Node's definition. This lookup calls into the List() method, which accepts ListOpts where we can specify the label selector that has been set through the HCLOUD_SERVER_LABEL_SELECTOR environment variable as per your proposal.

Subsequent lookups, however, will be done by providerID, since it has already been set by the result of the first lookup. The GetByID() API call accepts just an ID, and no label selector filtering can be done there.
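To make the asymmetry concrete, here is a minimal sketch against the public hcloud-go client (not the actual HCCM lookup code; the selector value and server ID are just examples):

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

func main() {
	client := hcloud.NewClient(hcloud.WithToken(os.Getenv("HCLOUD_TOKEN")))
	ctx := context.Background()

	// Lookup by name goes through a list call, where a label selector can be
	// applied - this is where HCLOUD_SERVER_LABEL_SELECTOR could be honored.
	servers, err := client.Server.AllWithOpts(ctx, hcloud.ServerListOpts{
		ListOpts: hcloud.ListOpts{LabelSelector: "node-group=worker-k"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("servers matching selector:", len(servers))

	// Lookup by ID (derived from the providerID) takes only the numeric ID,
	// so there is no place to apply a label selector here.
	server, _, err := client.Server.GetByID(ctx, 12345)
	if err != nil {
		panic(err)
	}
	if server != nil {
		fmt.Println("found by ID:", server.Name)
	}
}
```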

Proposed solution

Add yet another environment variable (e.g., HCLOUD_CCM_INSTANCE_ID) the value of which will need to be set to something unique per HCCM instance. This value would then become a part of the providerID that is returned by HCCM's InstancesV2 to the node controller in the first lookup (the one made by the Node's name).

Subsequent calls will parse the providerID and compare the HCCM instance ID part to their own instance ID (set in HCLOUD_CCM_INSTANCE_ID) and will return data about the Node only if it matches.
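A rough sketch of the proposed scheme; the hcloud://<ccm-instance-id>/<server-id> format and the helper names are purely hypothetical (today's providerID is just hcloud://<server-id>):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildProviderID embeds the per-HCCM instance ID into the providerID,
// e.g. "hcloud://worker-k/12345". Hypothetical format, not what HCCM
// produces today.
func buildProviderID(ccmInstanceID string, serverID int) string {
	return fmt.Sprintf("hcloud://%s/%d", ccmInstanceID, serverID)
}

// parseProviderID splits a providerID built by buildProviderID back into
// its HCCM instance ID and server ID parts.
func parseProviderID(providerID string) (string, int, error) {
	rest, ok := strings.CutPrefix(providerID, "hcloud://")
	if !ok {
		return "", 0, fmt.Errorf("unexpected providerID %q", providerID)
	}
	parts := strings.SplitN(rest, "/", 2)
	if len(parts) != 2 {
		return "", 0, fmt.Errorf("providerID %q has no instance ID part", providerID)
	}
	serverID, err := strconv.Atoi(parts[1])
	return parts[0], serverID, err
}

func main() {
	providerID := buildProviderID("worker-k", 12345)
	instanceID, serverID, err := parseProviderID(providerID)
	if err != nil {
		panic(err)
	}
	// An HCCM instance would only act on Nodes whose providerID carries its
	// own HCLOUD_CCM_INSTANCE_ID value.
	if instanceID == "worker-k" {
		fmt.Println("this HCCM owns server", serverID)
	}
}
```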

Implications

Both points will need to be clearly documented.

apricote commented 1 year ago

Hey Alex & Vitali,

I took some hours today to evaluate this again and deep dive into the code.

> Presumably the first call looks up the server by name, as the providerID is not yet set in the Node's definition. This lookup calls into the List() method, which accepts ListOpts where we can specify the label selector that has been set through the HCLOUD_SERVER_LABEL_SELECTOR environment variable as per your proposal.

I think there was a misunderstanding: my suggestion of HCLOUD_SERVER_LABEL_SELECTOR was supposed to match Kubernetes Node labels. I have to admit that my naming was not very good for this.

Whenever k/cloud-provider calls into our methods, we get the full Node objects, and filtering by a label on those should be easy.
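For illustration, a minimal sketch of such a check; the nodeIsOwned helper and the way the selector is read from HCLOUD_SERVER_LABEL_SELECTOR are only illustrative, not existing HCCM code:

```go
package main

import (
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// nodeIsOwned reports whether this HCCM instance should handle the Node,
// based on a Kubernetes label selector read from the environment.
func nodeIsOwned(node *corev1.Node) (bool, error) {
	raw := os.Getenv("HCLOUD_SERVER_LABEL_SELECTOR")
	if raw == "" {
		return true, nil // no selector configured: own every node
	}
	sel, err := labels.Parse(raw)
	if err != nil {
		return false, fmt.Errorf("invalid label selector %q: %w", raw, err)
	}
	return sel.Matches(labels.Set(node.Labels)), nil
}

func main() {
	node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{
		Name:   "worker-k-1",
		Labels: map[string]string{"node-group": "worker-k"},
	}}
	owned, err := nodeIsOwned(node)
	if err != nil {
		panic(err)
	}
	fmt.Printf("own %s: %v\n", node.Name, owned)
}
```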

I do not believe that we need to change the format of the providerID to make this work for you, but adding this functionality is also not a priority for us, as this is outside of what we want to support.

Behaviour of k/c-p if we deploy multiple ccms with label selectors

I took a look at how the different controllers in k/c-p react if we were to deploy multiple instances with different configurations for them.

Node

Purpose: Removes the external-cloud-provider taint, updates node.status.addresses and well-known labels (topology, region, instance type)

How: Calls instancev2.InstanceMetadata(node)

  1. Initialize Nodes
    • Only handles nodes with taint
    • Remove taint
    • Adds labels (topology, region, instance type)
    • Set Node Addresses
    • => If no metadata is returned from instancev2.InstanceMetadata(node), it does nothing
  2. Periodically Update Node Status
    • Does not handle nodes with taint
    • Reconcile node.status.addresses
    • Syncs well-known labels between stable->beta
    • => Possible crashes in updateNodeAddress() if we return no metadata from instancev2.InstanceMetadata(node)

Insights:
  • Node Initializer would work well with a label selector
  • Node Status Updater might crash when trying to update addresses for unselected Nodes
  • Adding a check for metadata == nil { return } to the status update would be easy, and would be similar to what the initializer is doing
  • Deploying multiple CCMs right now would cause conflicts between the node.status.addresses updates coming from them (containing different/no InternalIPs)

NodeLifecycle

Purpose: Periodically checks with cloud to see if any nodes are shutdown/deleted -> If deleted, delete the Node object -> If shutdown, add taint

How: Calls instancev2.InstanceExists(node) & instancev2.InstanceShutdown(node)

Insights:
  • While we could check the labels of the node in InstanceExists & InstanceShutdown, returning no-info here would lead to the instance being deleted
  • These methods still need to handle responses for all nodes
  • Nothing in here depends on the Network that the servers are connected to
  • This loop runs on a fixed interval, so running x CCMs will cause x times the number of API requests

Route

Cannot work in this setup, as there will be more than 100 Nodes/Pod CIDRs for which we have to set up routes.

Service (LB)

Purpose: Provide Cloud LoadBalancers for Services of type: LoadBalancer and add all nodes as targets

How: Calling Service.EnsureLoadBalancer()

Insights:
  • Filtering nodes by label selector (or any other criterion) in our implementation of Service.EnsureLoadBalancer() should be easy enough
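A minimal sketch of that filtering; filterOwnedNodes is a hypothetical helper (not HCCM code) that narrows the node list before load balancer targets are registered:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// filterOwnedNodes keeps only the Nodes whose labels match the selector, so
// a load balancer managed by this HCCM instance only targets its own node
// group.
func filterOwnedNodes(sel labels.Selector, nodes []*corev1.Node) []*corev1.Node {
	owned := make([]*corev1.Node, 0, len(nodes))
	for _, node := range nodes {
		if sel.Matches(labels.Set(node.Labels)) {
			owned = append(owned, node)
		}
	}
	return owned
}

func main() {
	sel, err := labels.Parse("node-group=service-nodes")
	if err != nil {
		panic(err)
	}
	nodes := []*corev1.Node{
		{ObjectMeta: metav1.ObjectMeta{Name: "svc-1", Labels: map[string]string{"node-group": "service-nodes"}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-1", Labels: map[string]string{"node-group": "worker-k"}}},
	}
	fmt.Println("LB targets:", len(filterOwnedNodes(sel, nodes)))
}
```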

General

Insights:
  • It's possible to disable specific controllers using the --controllers flag

Summary

Deploying multiple CCMs with different configurations and purposes is theoretically possible.

In practice this fails because UpdateNodeStatus in the Node controller does not handle a nil metadata response properly. If this were fixed, I believe it's possible to deploy as you intend.

The patch might look something like this:

diff --git a/controllers/node/node_controller.go b/controllers/node/node_controller.go
index ace70dd..640b351 100644
--- a/controllers/node/node_controller.go
+++ b/controllers/node/node_controller.go
@@ -277,6 +277,11 @@ func (cnc *CloudNodeController) UpdateNodeStatus(ctx context.Context) error {
                        klog.Errorf("Error getting instance metadata for node addresses: %v", err)
                        return
                }
+               if instanceMetadata == nil {
+                       // do nothing when external cloud providers provide nil instanceMetadata
+                       klog.Infof("Skip update node %s because cloud provided nil metadata", node.Name)
+                       return
+               }

                cnc.updateNodeAddress(ctx, node, instanceMetadata)

vitali-ipquants commented 1 year ago

Hi Julian,

First of all - thank you very much for the time you have spent on this ticket. The insights you pointed out are really helpful...

In the end we gave up, and we will not work on the idea of bringing up a 100+ node k8s cluster in Hetzner Cloud.

Thank you again for the help!

Cheers :beers: Vitali and Alex

apricote commented 1 year ago

Very well then, it was an interesting ride and I learned a lot, so thanks for the opportunity! :)