Hey @vitali-ipquants,
right now it is not possible to run multiple HCCM instances at the same time. Both would try to process all nodes.
Would something like a node label selector that you can configure per HCCM work for you, so only a subset of nodes are processed by each CCM?
> but we still need the routes for the POD networks which CCM creates.
What sort of routes would you expect the HCCM to create in this case? Does this only include the pod networks for nodes in the same network or would you expect routes to be created also for pod networks in other private networks?
Hey Julian,
Thank you for looking at this question!
I'm not ready with reasonable answers to your questions yet; I need more time to experiment and understand better what we need/want (and what would sound reasonable). When I'm ready I'll continue the discussion.
Vitali
Hi @apricote,
It took us some time to clarify what we think could work and makes sense.
The text below is quite long, but we wanted to be as clear as possible.
One clarification: if Hetzner increases the limit for servers in a network, then we could probably close this topic. Are you aware of any such plans?
Hetzner Cloud Controller Manager (HCCM) supports a k8s cluster which runs inside a single Hetzner network. This limits the size of the cluster because of Hetzner restrictions such as the maximum of 100 servers attached to a network and the maximum of 100 routes per network.
Our experiments are with Talos (talos.dev) and Flannel as the CNI plugin.
The setup briefly is:
- All control plane nodes are put in the `control-plane` network. This network is configured in the `HCLOUD_NETWORK` env var of HCCM.
- We have two worker networks. The idea here is that we'll scale the cluster up by adding worker networks; each new worker network will increase the capacity by about 100 servers.
Connectivity between these networks is ensured by so-called Interconnect Gateway (ICGW) machines. In each worker network we'll add a pair of ICGW machines; they will also be attached to the `control-plane` network.
There are two ICGW machines for redundancy/HA reasons. At any given moment only one of them is active, and it holds an AliasIP for its worker network and an AliasIP for the `control-plane` network. (The two ICGW machines run keepalived, which decides which one of them is active at the moment.)
In each worker network we add a route with destination 0.0.0.0/0 and the ICGW AliasIP as gateway. The ICGW machine routes all worker networks to the control plane network gateway (.1). We'll have to find a way to automate the creation of these routes (see the sketch below), but this is out of the scope of HCCM.
In the `control-plane` network we add a route with the given worker network as destination and the ICGW AliasIP as gateway.
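To make the automation idea concrete, here is a minimal sketch using the hcloud-go client; the network name and the AliasIP are illustrative values from our setup, not anything HCCM provides:

```go
package main

import (
	"context"
	"log"
	"net"
	"os"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

func main() {
	client := hcloud.NewClient(hcloud.WithToken(os.Getenv("HCLOUD_TOKEN")))
	ctx := context.Background()

	// Illustrative values: a worker network and the AliasIP currently
	// held by the active ICGW machine of that network.
	network, _, err := client.Network.GetByName(ctx, "worker-nodes-1")
	if err != nil || network == nil {
		log.Fatalf("looking up network: %v", err)
	}

	// Default route sending all traffic from the worker network to the ICGW.
	_, dest, _ := net.ParseCIDR("0.0.0.0/0")
	_, _, err = client.Network.AddRoute(ctx, network, hcloud.NetworkAddRouteOpts{
		Route: hcloud.NetworkRoute{
			Destination: dest,
			Gateway:     net.ParseIP("10.1.0.100"), // ICGW AliasIP (example)
		},
	})
	if err != nil {
		log.Fatalf("adding route: %v", err)
	}
}
```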
This configuration ensures that all cluster nodes can reach each other no matter which network they are in: traffic leaving a worker network follows the 0.0.0.0/0 route to the active ICGW, which forwards it toward the control plane network, while the per-worker-network routes in the `control-plane` network carry return traffic back through the same ICGW.
Internet connectivity
The control plane and worker nodes have no public network; they are attached only to their Hetzner private networks.
The ICGW machines in the worker networks also act as Internet gateways; we do SNAT via iptables.
For the control plane network we have a pair of VMs which act only as Internet gateways.
I attached a diagram which visualizes the idea, it could be helpful.
The HCCM creates routes for each node's pod network. We tested running HCCM with the env var HCLOUD_NETWORK_ROUTES_ENABLED=false, so it no longer creates the nodes' pod network routes.
If we use HCCM with the routes feature disabled, then we'll never exhaust the limit of 100 routes per network.
We made a really ugly patch of HCCM in order to test if what we want is possible.
We changed the code here to not filter by network.ID. With this change HCCM successfully initializes all worker nodes, even when they are not attached to the network configured in the `HCLOUD_NETWORK` env var.
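For reference, the essence of the patch is dropping the network-ID check when collecting private-network addresses. A paraphrased sketch (not the actual diff; the helper name is ours):

```go
import (
	"github.com/hetznercloud/hcloud-go/hcloud"
	v1 "k8s.io/api/core/v1"
)

// privateAddresses paraphrases the address-collection loop in HCCM's
// nodeAddresses(). Upstream keeps only the private network whose ID matches
// HCLOUD_NETWORK; our "ugly patch" effectively sets filter to false.
func privateAddresses(server *hcloud.Server, networkID int, filter bool) []v1.NodeAddress {
	var addresses []v1.NodeAddress
	for _, privNet := range server.PrivateNet {
		if filter && privNet.Network.ID != networkID {
			continue // the check our patch removed
		}
		addresses = append(addresses, v1.NodeAddress{
			Type:    v1.NodeInternalIP,
			Address: privNet.IP.String(),
		})
	}
	return addresses
}
```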
So of course we don't want an ugly patch like the one above; we'd like a more reliable solution.
For backward-compatibility reasons, HCCM will continue to support single-network mode; the network will be set in HCLOUD_NETWORK.
NB: At this point it is unclear what we'll do with the routes for the nodes' pod networks, because the limit for routes on Hetzner networks can be exhausted if the number of nodes is greater than 100. I know that I said we'd need these routes in the initial comment of the ticket, but we realized that we don't need them.
For multiple network support we consider the following variables:
- `HCLOUD_CONTROL_PLANE_NETWORK` - the network of the control plane nodes (checked via `serverIsAttachedToNetwork`).
- `HCLOUD_WORKER_NETWORK_REGEX` (or `_PREFIX`) - a regular expression which will match the worker networks by their name. This variable will be used together with `HCLOUD_CONTROL_PLANE_NETWORK` in the `instances.go` function `nodeAddresses`: if a given node network matches either the regex or `HCLOUD_CONTROL_PLANE_NETWORK`, then this address will be part of the result of the `nodeAddresses` function (sketched below).
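A minimal sketch of the matching we have in mind (the helper is hypothetical, not existing HCCM code):

```go
import (
	"os"
	"regexp"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

// networkMatches reports whether a node's private network should contribute
// an address to the nodeAddresses result: either it is the control plane
// network, or its name matches HCLOUD_WORKER_NETWORK_REGEX.
func networkMatches(privNet hcloud.ServerPrivateNet) bool {
	if privNet.Network.Name == os.Getenv("HCLOUD_CONTROL_PLANE_NETWORK") {
		return true
	}
	re, err := regexp.Compile(os.Getenv("HCLOUD_WORKER_NETWORK_REGEX"))
	if err != nil {
		return false // invalid regex: match nothing
	}
	return re.MatchString(privNet.Network.Name)
}
```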
LoadBalancer support - at the moment the implementation of the LoadBalancer interface uses the sole network specified in the HCLOUD_NETWORK env var.
There are two possible ways to overcome this limitation.
Annotation approach: in the k8s spec for the load balancer Service there will be an annotation specifying the network name to be used. This network name should be either the control plane network or one of the worker networks. This approach would address several issues:
The Hetzner LoadBalancer will use as targets only the nodes inside the specified network (whereas at the moment it tries to add all nodes, which fails if they are not in the same network).
This way we'll have an effective way to control which nodes sit behind a given LoadBalancer; we could also control their size, etc. We might have different worker networks for different services, each of them with its own load balancer.
A dedicated variable HCLOUD_LOAD_BALANCER_NETWORK - this is a sub-case of the above suggestion. It is not as flexible as the annotation approach: just one of the worker networks will be used for exposing k8s services, and all load balancers will be attached to this network. But in this case we'll again have better control over which nodes become targets of the load balancers.
Probably both approaches could be implemented: if there's no annotation, the LoadBalancer controller will fall back to the dedicated HCLOUD_LOAD_BALANCER_NETWORK variable.
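The combined selection could look roughly like this (the annotation key is made up for this sketch; HCCM has no such annotation today):

```go
import (
	"os"

	v1 "k8s.io/api/core/v1"
)

// Hypothetical annotation key, invented for this sketch.
const lbNetworkAnnotation = "load-balancer.hetzner.cloud/network-name"

// lbNetworkName picks the network for a LoadBalancer Service: a per-Service
// annotation wins; otherwise fall back to HCLOUD_LOAD_BALANCER_NETWORK.
func lbNetworkName(svc *v1.Service) string {
	if name, ok := svc.Annotations[lbNetworkAnnotation]; ok {
		return name
	}
	return os.Getenv("HCLOUD_LOAD_BALANCER_NETWORK")
}
```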
Hey @vitali-ipquants,
sorry for the very late response.
> If Hetzner increases the limit for servers in a network, then we could probably close this topic. Are you aware of any such plans?
This is not planned.
> `HCLOUD_CONTROL_PLANE_NETWORK`, `HCLOUD_WORKER_NETWORK_REGEX`, `HCLOUD_LOAD_BALANCER_NETWORK`
This is a very specialized interface that touches a lot of our code to support one (niche) use case.
I counter-propose the following interface that should still work for your use case:
We add a new config option `HCLOUD_SERVER_LABEL_SELECTOR`. Based on this label selector, HCCM decides if it wants to "own" the server/node or not.
You can then deploy multiple instances of hcloud-cloud-controller-manager to handle the different tasks:
For Control Plane:
```
HCLOUD_NETWORK_ROUTES_ENABLED: "false"
HCLOUD_LOAD_BALANCERS_ENABLED: "false"
HCLOUD_NETWORK: cluster-control-plane
HCLOUD_SERVER_LABEL_SELECTOR: node-group=control-plane
```
For Service Nodes (LB) - this might work in tandem with #373:
```
HCLOUD_NETWORK_ROUTES_ENABLED: "false"
HCLOUD_LOAD_BALANCERS_ENABLED: "true"
HCLOUD_NETWORK: cluster-service-nodes
HCLOUD_SERVER_LABEL_SELECTOR: node-group=service-nodes
```
For Worker K Nodes:
```
HCLOUD_NETWORK_ROUTES_ENABLED: "false"
HCLOUD_LOAD_BALANCERS_ENABLED: "false"
HCLOUD_NETWORK: cluster-worker-k
HCLOUD_SERVER_LABEL_SELECTOR: node-group=worker-k
```
Hi Julian,
My team will need some time to think about your idea, I'll reply soon. Thank you for looking into this!
Vitali
Hi again @apricote,
What you suggest seems a lot easier and simpler. We'll try to implement the HCLOUD_SERVER_LABEL_SELECTOR config option and test whether it works in our PoC env.
If it works, we might open a PR.
I'll update you when we have news. Thanks again, Vitali
Hi @apricote,
I am currently trying out your proposal for label selector-based filtering of servers in HCCM.
I've hit a problem with the implementation, and I need some input from your side, as it imposes some design decisions to be made.
Please have a look at the problem and the solution proposal below and let us know if you find this acceptable.
It'd also be great to hear if you think there's a clearer and simpler solution to this, or that it's not actually a problem and we have missed/overlooked something.
Thanks!
The node controller (as far as I understand it) runs in the HCCM process and calls the syncNode() function, in an infinite loop, for each node it finds in the Kubernetes API server. syncNode() calls into HCCM's InstancesV2 implementation, passing in the Node object. HCCM's InstancesV2 looks up the corresponding server either by providerID (if it exists) or by name.
Presumably the first call looks up the server by name, as the providerID is not yet set in the Node's definition. This lookup calls into the List() method, which accepts ListOpts where we can specify the label selector that had been set through the HCLOUD_SERVER_LABEL_SELECTOR environment variable as per your proposal.
Subsequent lookups, however, will be done by providerID, since it has already been set from the result of the first lookup. The GetByID() API call accepts just an ID, and no label selector filtering can be done there.
Add yet another environment variable (e.g., HCLOUD_CCM_INSTANCE_ID) whose value will need to be set to something unique per HCCM instance. This value would then become a part of the providerID that is returned by HCCM's InstancesV2 to the node controller in the first lookup (the one made by the Node's name). Subsequent calls will parse the providerID, compare the HCCM instance ID part to their own instance ID (set in HCLOUD_CCM_INSTANCE_ID), and return data about the Node only if it matches.
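To make the proposal concrete, the encoding/parsing could look like this; the hcloud://&lt;instance-id&gt;/&lt;server-id&gt; format is our invention (as far as we can tell, HCCM's current format is just hcloud://&lt;server-id&gt;):

```go
import (
	"fmt"
	"strconv"
	"strings"
)

// buildProviderID embeds the per-HCCM instance ID into the providerID.
// Hypothetical format: hcloud://<instance-id>/<server-id>.
func buildProviderID(instanceID string, serverID int) string {
	return fmt.Sprintf("hcloud://%s/%d", instanceID, serverID)
}

// parseProviderID recovers both parts; an HCCM instance would only answer
// for nodes whose instanceID matches its own HCLOUD_CCM_INSTANCE_ID.
func parseProviderID(providerID string) (instanceID string, serverID int, err error) {
	rest, ok := strings.CutPrefix(providerID, "hcloud://")
	if !ok {
		return "", 0, fmt.Errorf("unexpected providerID format: %q", providerID)
	}
	parts := strings.SplitN(rest, "/", 2)
	if len(parts) != 2 {
		return "", 0, fmt.Errorf("no instance ID in providerID: %q", providerID)
	}
	serverID, err = strconv.Atoi(parts[1])
	return parts[0], serverID, err
}
```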
Updating labels won't be honored
The solution proposed above implies that switching the HCCM instance "owning" a server by changing its labels will not be supported, as the label selector-based filtering is done only on the server's first lookup.
providerID contract
The solution proposed above changes the format of the providerID field. This means that migrating an existing cluster from multi-HCCM instances to a single one, or vice versa, will not be possible.
Both points will need to be clearly documented.
Hey Alex & Vitali,
I took some hours today to evaluate this again and deep dive into the code.
> Presumably the first call looks up the server by name, as the providerID is not yet set in the Node's definition. This lookup calls into the List() method, which accepts ListOpts where we can specify the label selector that had been set through the HCLOUD_SERVER_LABEL_SELECTOR environment variable as per your proposal.
I think there was a misunderstanding: my suggestion of HCLOUD_SERVER_LABEL_SELECTOR was supposed to match Kubernetes Node labels. I have to admit that my naming was not very good for this.
Whenever k/cloud-provider calls into our methods, we get the full Node objects, and filtering by a label on those should be easy (see the sketch below).
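For illustration, matching a selector against Node labels could be as simple as this sketch using the apimachinery labels package (the helper name is made up):

```go
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// ownsNode reports whether this HCCM instance is responsible for the node,
// based on a selector string such as "node-group=worker-k" taken from the
// proposed HCLOUD_SERVER_LABEL_SELECTOR option.
func ownsNode(selector string, node *v1.Node) (bool, error) {
	sel, err := labels.Parse(selector)
	if err != nil {
		return false, err
	}
	return sel.Matches(labels.Set(node.Labels)), nil
}
```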
I do not believe that we need to change the format of the provider ID to make this work for you, but adding this functionality is also not a priority for us, as this is outside of what we want to support.
I took a look at how the different controllers in k/c-p react if we were to deploy multiple instances with different configurations for them.
Node controller
Purpose: Removes the external-cloud-provider taint, updates node.status.addresses and well-known labels (topology, region, instance type).
How: Calls instancev2.InstanceMetadata(node).
If we return nil from instancev2.InstanceMetadata(node), the node initializer does nothing; node.status.addresses, however, is updated in updateNodeAddress() even if we return no metadata from instancev2.InstanceMetadata(node).
Insights:
=> Node Initializer would work well with a label selector
=> Node Status Updater might crash when trying to update addresses for unselected Nodes
=> Adding a check for `metadata == nil { return }` to the status update would be easy, and would be similar to what the initializer is doing.
=> Deploying multiple CCMs right now would cause conflicts between the node.status.addresses updates coming from them (containing different/no InternalIPs)
Instance lifecycle controller
Purpose: Periodically checks with the cloud to see if any nodes are shutdown/deleted -> if deleted, delete the Node object; if shutdown, add a taint.
How: Calls instancev2.InstanceExists(node) & instancev2.InstanceShutdown(node).
Insights:
=> While we could check the labels of the node in InstanceExists & InstanceShutdown, returning no-info here would lead to the instance being deleted.
=> These methods still need to handle responses for all nodes.
=> Nothing in here depends on the network that the servers are connected to.
=> This loop runs on a fixed interval; running x CCMs will cause x times the number of API requests.
Route controller
Cannot work in this setup, as there will be more than 100 Nodes/Pod CIDRs for which we have to set up routes.
Service controller
Purpose: Provide Cloud LoadBalancers for Services with type: LoadBalancer and add all nodes as targets.
How: Calling Service.EnsureLoadBalancer()
Insights:
=> Filtering for nodes by label selector (or any other criteria) in our implementation of Service.EnsureLoadBalancer() should be easy enough
General insights:
=> It's possible to disable specific controllers using the --controllers flag.
Deploying multiple CCMs with different configurations and purposes is theoretically possible.
In practice this fails because UpdateNodeStatus in the node controller does not handle a nil metadata response properly. If this were fixed, I believe it's possible to deploy as you intend.
The patch might look something like this:
```diff
diff --git a/controllers/node/node_controller.go b/controllers/node/node_controller.go
index ace70dd..640b351 100644
--- a/controllers/node/node_controller.go
+++ b/controllers/node/node_controller.go
@@ -277,6 +277,11 @@ func (cnc *CloudNodeController) UpdateNodeStatus(ctx context.Context) error {
 			klog.Errorf("Error getting instance metadata for node addresses: %v", err)
 			return
 		}
+		if instanceMetadata == nil {
+			// do nothing when external cloud providers provide nil instanceMetadata
+			klog.Infof("Skip update node %s because cloud provided nil metadata", node.Name)
+			return
+		}
 		cnc.updateNodeAddress(ctx, node, instanceMetadata)
```
Hi Julian,
First of all - thank you very much for the time you have spent on this ticket. The insights you pointed out are really helpful...
In the end we gave up; we will not pursue the idea of bringing up a 100+ node k8s cluster in Hetzner Cloud.
Thank you again for the help!
Cheers :beers: Vitali and Alex
Very well then, it was an interesting ride and I learned a lot, so thanks for the opportunity! :)
Hello,
My team is exploring the option to run a k8s cluster which contains more than 100 nodes, so we're wondering whether there is a way to do this with hcloud CCM.
We'll ensure connectivity between the different private networks, but we still need the routes for the POD networks which CCM creates.
Will it be possible to run multiple CCM instances, one instance per private network?
If there's no way to do it at the moment, could you please share if you have plans to implement such support?
Thank you!