liqotech / liqo

Enable dynamic and seamless Kubernetes multi-cluster topologies
https://liqo.io
Apache License 2.0

[Feature] offloading to nodes with specific labels #1494

Open DevSusu opened 1 year ago

DevSusu commented 1 year ago

Is your feature request related to a problem? Please describe.
I've walked through the examples and documentation, and I would like to use Liqo for our team's multi-cluster management.

Our use case and situation can be summarized as follows:

Describe the solution you'd like
Currently, the created virtual node appears to aggregate all nodes in the remote cluster. I suggest supporting nodeSelector labels when offloading, so that the virtual node reflects only the nodes matching the selector. Injecting a nodeSelector term into offloaded pods would also be useful (though that could be done outside of Liqo).

Describe alternatives you've considered
I could write a mutating webhook in the EKS cluster to inject the nodeSelector term, but then the virtual node would still contain too much unnecessary (or even confusing) information: it would report enough resources (CPU, memory), yet pods would not get scheduled.
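For illustration, a minimal sketch of that mutating-webhook alternative, assuming a hypothetical `pool=liqo-offloading` label on the target nodes (the label, port, and certificate paths are placeholders, not anything Liqo-specific):

    // Hypothetical mutating webhook: injects a nodeSelector into every admitted pod.
    // The label key/value and certificate paths are placeholders.
    package main

    import (
        "encoding/json"
        "io"
        "log"
        "net/http"

        admissionv1 "k8s.io/api/admission/v1"
        corev1 "k8s.io/api/core/v1"
    )

    func mutatePod(w http.ResponseWriter, r *http.Request) {
        body, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }

        var review admissionv1.AdmissionReview
        if err := json.Unmarshal(body, &review); err != nil || review.Request == nil {
            http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
            return
        }

        // Decode just to validate that the admitted object is actually a pod.
        var pod corev1.Pod
        if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }

        // JSON patch adding the nodeSelector (placeholder label).
        patch := []map[string]interface{}{{
            "op":    "add",
            "path":  "/spec/nodeSelector",
            "value": map[string]string{"pool": "liqo-offloading"},
        }}
        patchBytes, _ := json.Marshal(patch)

        patchType := admissionv1.PatchTypeJSONPatch
        review.Response = &admissionv1.AdmissionResponse{
            UID:       review.Request.UID,
            Allowed:   true,
            Patch:     patchBytes,
            PatchType: &patchType,
        }

        resp, _ := json.Marshal(review)
        w.Header().Set("Content-Type", "application/json")
        w.Write(resp)
    }

    func main() {
        http.HandleFunc("/mutate", mutatePod)
        // The API server requires TLS for admission webhooks; cert provisioning omitted here.
        log.Fatal(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil))
    }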

Additional context
This feature can also help in multi-tenant scenarios, where you might not want to dedicate a whole cluster to every offloaded namespace.

In https://github.com/liqotech/liqo/issues/1249, @giorio94 suggested creating a local shadow node for each remote node. A nodeSelector feature would help in that scenario as well.

DevSusu commented 1 year ago

Can this be done here?

https://github.com/liqotech/liqo/blob/master/cmd/virtual-kubelet/root/root.go#L145

    nodeRunner, err := node.NewNodeController(
        nodeProvider, nodeProvider.GetNode(),
        localClient.CoreV1().Nodes(), // add nodeselector label here
        node.WithNodeEnableLeaseV1(localClient.CoordinationV1().Leases(corev1.NamespaceNodeLease), int32(c.NodeLeaseDuration.Seconds())),
        ...
giorio94 commented 1 year ago

Hi @DevSusu,

If I understand correctly, you would like to specify a node selector to offer only a subset of the resources available in the provider cluster (i.e., those associated with the nodes matching the selector). This feature makes sense to me (and it also relates to excluding the tainted control-plane nodes from the computation); it would require some modifications to the resource computation logic and to the shadow pod controller, to inject the given node selector into offloaded pods. I cannot give you any timeline for this right now, but I'll add it to our roadmap. If you would like to contribute, I can also give you more information about where to extend the logic to introduce it.
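As a rough sketch of what the provider-side computation could look like with a selector (the function name, the label, and the structure are illustrative, not Liqo's actual code):

    // Illustrative sketch: sum the allocatable resources of only the nodes
    // matching a configurable label selector. Not Liqo's actual computation code.
    package main

    import (
        "context"
        "fmt"
        "log"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    // filteredClusterResources returns the total allocatable CPU and memory of
    // the nodes matching the given label selector (e.g. "pool=liqo-offloading").
    func filteredClusterResources(ctx context.Context, client kubernetes.Interface,
        selector string) (corev1.ResourceList, error) {
        nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{LabelSelector: selector})
        if err != nil {
            return nil, err
        }
        total := corev1.ResourceList{
            corev1.ResourceCPU:    *resource.NewQuantity(0, resource.DecimalSI),
            corev1.ResourceMemory: *resource.NewQuantity(0, resource.BinarySI),
        }
        for i := range nodes.Items {
            alloc := nodes.Items[i].Status.Allocatable
            cpu := total[corev1.ResourceCPU]
            cpu.Add(*alloc.Cpu())
            total[corev1.ResourceCPU] = cpu
            mem := total[corev1.ResourceMemory]
            mem.Add(*alloc.Memory())
            total[corev1.ResourceMemory] = mem
        }
        return total, nil
    }

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }
        total, err := filteredClusterResources(context.Background(), client, "pool=liqo-offloading")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("offerable: cpu=%s memory=%s\n", total.Cpu(), total.Memory())
    }

Presumably the resources already consumed by pods running on those nodes would also need to be subtracted, which is where the caching/informer discussion further down comes in.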

As for the piece of code you mentioned, that is the controller which deals with the creation of the virtual node. However, the amount of resources associated with that node is taken from the ResourceOffer, which is created by the provider cluster through the above-mentioned computation logic and then propagated to the consumer cluster. Hence, you cannot use that controller to tune the amount of resources.

DevSusu commented 1 year ago

@giorio94 thanks for the prompt response, I really appreciate it

you would like to specify a node selector to offer only a subset of the resources available in the provider cluster

thanks for the summary 😄

I would like to contribute; it would be great if you could give me some starting points!

giorio94 commented 1 year ago

Nice to hear that! Below you can find some additional pointers:

Feel free to ask for any further information.

DevSusu commented 1 year ago

@giorio94 , thanks for the pointers

I've skimmed through them, and I have a question/suggestion.

Instead of caching the pod info, what about managing one pod informer per node? When the node informer reports that a new node has been added, register a pod informer with a nodeName field selector (and, when a node is deleted, tear that informer down). Thus: one node informer with the label selector, and one pod informer per node.

This way, we don't need to worry about the timing issue you mentioned. Caching the pod info requires guessing how long to wait until the node information comes in, and that delay would also affect the virtual node's resource update period.
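A rough sketch of that design with plain client-go informers (the label, resync period, and type names are illustrative, not tied to Liqo's actual code):

    // Sketch of the proposed design: one node informer filtered by a label
    // selector, plus one pod informer per node filtered on spec.nodeName,
    // started when the node appears and stopped when it disappears.
    package main

    import (
        "log"
        "time"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/fields"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    type nodeWatcher struct {
        client   kubernetes.Interface
        resync   time.Duration
        podStops map[string]chan struct{} // per-node stop channels
    }

    // onNodeAdd starts a pod informer restricted to the pods scheduled on the new node.
    // Handlers of a single informer are invoked sequentially, so no lock is needed here.
    func (w *nodeWatcher) onNodeAdd(obj interface{}) {
        node, ok := obj.(*corev1.Node)
        if !ok {
            return
        }
        lw := cache.NewListWatchFromClient(
            w.client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll,
            fields.OneTermEqualSelector("spec.nodeName", node.Name))
        informer := cache.NewSharedIndexInformer(lw, &corev1.Pod{}, w.resync, cache.Indexers{})
        // Pod event handlers updating the per-node accounting would be registered here.

        stop := make(chan struct{})
        w.podStops[node.Name] = stop
        go informer.Run(stop)
    }

    // onNodeDelete stops the per-node pod informer.
    func (w *nodeWatcher) onNodeDelete(obj interface{}) {
        // Deletions may arrive wrapped in a tombstone.
        if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
            obj = tombstone.Obj
        }
        node, ok := obj.(*corev1.Node)
        if !ok {
            return
        }
        if stop, found := w.podStops[node.Name]; found {
            close(stop)
            delete(w.podStops, node.Name)
        }
    }

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }
        w := &nodeWatcher{client: client, resync: 30 * time.Second, podStops: map[string]chan struct{}{}}

        // Single node informer restricted by the configurable label selector.
        factory := informers.NewSharedInformerFactoryWithOptions(client, w.resync,
            informers.WithTweakListOptions(func(o *metav1.ListOptions) {
                o.LabelSelector = "pool=liqo-offloading" // placeholder selector
            }))
        nodeInformer := factory.Core().V1().Nodes().Informer()
        nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc:    w.onNodeAdd,
            DeleteFunc: w.onNodeDelete,
        })

        stop := make(chan struct{})
        defer close(stop)
        factory.Start(stop)
        factory.WaitForCacheSync(stop)
        select {} // run forever (sketch only)
    }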

giorio94 commented 1 year ago

Instead of caching the pod info, what about managing one pod informer per node? When the node informer reports that a new node has been added, register a pod informer with a nodeName field selector (and, when a node is deleted, tear that informer down). Thus: one node informer with the label selector, and one pod informer per node.

To me, the approach you propose definitely makes sense, and it also reduces the number of pods observed by the informers when most nodes are excluded by the label selector.

As for the caching one, you could avoid the guessing by refactoring the data structure towards a more node-oriented approach (i.e., storing the resources used by each peered cluster per physical node, rather than as a whole), and then marking whether a given node shall be included or excluded. This would also cover the case in which node labels are modified, changing whether the node matches the selector.
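A minimal sketch of that node-oriented bookkeeping (type and field names are illustrative, and the accounting in offerable is just one possible policy, not Liqo's actual formula):

    // Illustrative sketch of node-oriented bookkeeping: per physical node, the
    // allocatable resources, whether the node currently matches the offloading
    // label selector, and the resources used by each peered cluster on it.
    package main

    import (
        "fmt"
        "sync"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    type nodeInfo struct {
        allocatable corev1.ResourceList            // from node.Status.Allocatable
        matching    bool                           // does the node match the label selector?
        usedBy      map[string]corev1.ResourceList // per peered-cluster ID
    }

    type clusterResources struct {
        mu    sync.RWMutex
        nodes map[string]*nodeInfo // keyed by node name
    }

    // setNode records or updates a node, e.g. when its labels change.
    func (c *clusterResources) setNode(name string, allocatable corev1.ResourceList, matching bool) {
        c.mu.Lock()
        defer c.mu.Unlock()
        info, ok := c.nodes[name]
        if !ok {
            info = &nodeInfo{usedBy: map[string]corev1.ResourceList{}}
            c.nodes[name] = info
        }
        info.allocatable = allocatable
        info.matching = matching
    }

    // deleteNode cleans up a removed node together with its per-cluster usage.
    func (c *clusterResources) deleteNode(name string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        delete(c.nodes, name)
    }

    // offerable sums the allocatable resources of the matching nodes only, minus
    // what the peered clusters already consume on them (one possible policy).
    func (c *clusterResources) offerable() corev1.ResourceList {
        c.mu.RLock()
        defer c.mu.RUnlock()
        total := corev1.ResourceList{}
        for _, info := range c.nodes {
            if !info.matching {
                continue
            }
            for name, q := range info.allocatable {
                acc, ok := total[name]
                if !ok {
                    acc = q.DeepCopy() // preserves the quantity's format (e.g. Gi)
                } else {
                    acc.Add(q)
                }
                for _, used := range info.usedBy {
                    if u, found := used[name]; found {
                        acc.Sub(u)
                    }
                }
                total[name] = acc
            }
        }
        return total
    }

    func main() {
        c := &clusterResources{nodes: map[string]*nodeInfo{}}
        c.setNode("worker-1", corev1.ResourceList{
            corev1.ResourceCPU:    resource.MustParse("4"),
            corev1.ResourceMemory: resource.MustParse("16Gi"),
        }, true)
        c.setNode("worker-2", corev1.ResourceList{
            corev1.ResourceCPU:    resource.MustParse("8"),
            corev1.ResourceMemory: resource.MustParse("32Gi"),
        }, false) // excluded: does not match the selector
        offer := c.offerable()
        fmt.Printf("offerable: cpu=%s memory=%s\n", offer.Cpu(), offer.Memory())
    }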

I personally have no particular preference. I feel your proposal is a bit cleaner, although it also requires some more work/refactoring to integrate it with the existing code (the data structure changes are probably needed anyway to account for cleanup when a node is removed).