liqotech / liqo

Enable dynamic and seamless Kubernetes multi-cluster topologies
https://liqo.io
Apache License 2.0

[Feature] offloading to nodes with specific labels #1494

Open DevSusu opened 1 year ago

DevSusu commented 1 year ago

Is your feature request related to a problem? Please describe.
I've walked through the examples and documentation, and I would like to use Liqo for our team's multi-cluster management.

Our use case and situation can be summarized as follows:

Describe the solution you'd like
Currently, the created virtual node appears to aggregate all nodes in the remote cluster. I suggest supporting nodeSelector labels when offloading, so that the virtual node reflects only the nodes matching the selector. Injecting a nodeSelector term into offloaded pods would also be useful (though that could be done outside of Liqo).

Describe alternatives you've considered
I could write a mutating webhook in the EKS cluster to inject the nodeSelector term, but then the virtual node would still contain too much unnecessary (or even confusing) information: it would report enough resources (CPU, memory), yet pods would not get scheduled.
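For illustration, a minimal sketch of that mutating-webhook alternative, assuming a hypothetical `pool=liqo-offloading` label on the target nodes (the label, port, and certificate paths are placeholders, not anything Liqo-specific):

    // Hypothetical mutating webhook: injects a nodeSelector into every admitted pod.
    // The label key/value and certificate paths are placeholders.
    package main

    import (
        "encoding/json"
        "io"
        "log"
        "net/http"

        admissionv1 "k8s.io/api/admission/v1"
        corev1 "k8s.io/api/core/v1"
    )

    func mutatePod(w http.ResponseWriter, r *http.Request) {
        body, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }

        var review admissionv1.AdmissionReview
        if err := json.Unmarshal(body, &review); err != nil || review.Request == nil {
            http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
            return
        }

        // Decode just to validate that the admitted object is actually a pod.
        var pod corev1.Pod
        if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }

        // JSON patch adding the nodeSelector (placeholder label).
        patch := []map[string]interface{}{{
            "op":    "add",
            "path":  "/spec/nodeSelector",
            "value": map[string]string{"pool": "liqo-offloading"},
        }}
        patchBytes, _ := json.Marshal(patch)

        patchType := admissionv1.PatchTypeJSONPatch
        review.Response = &admissionv1.AdmissionResponse{
            UID:       review.Request.UID,
            Allowed:   true,
            Patch:     patchBytes,
            PatchType: &patchType,
        }

        resp, _ := json.Marshal(review)
        w.Header().Set("Content-Type", "application/json")
        w.Write(resp)
    }

    func main() {
        http.HandleFunc("/mutate", mutatePod)
        // The API server requires TLS for admission webhooks; cert provisioning omitted here.
        log.Fatal(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil))
    }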

Additional context
This feature can also help in multi-tenant scenarios, where you might not want to dedicate a whole cluster to every offloaded namespace.

In https://github.com/liqotech/liqo/issues/1249, @giorio94 suggested creating a local shadow node for each remote node. A nodeSelector feature would help in that scenario as well.

DevSusu commented 1 year ago

Can this be done here?

https://github.com/liqotech/liqo/blob/master/cmd/virtual-kubelet/root/root.go#L145

    nodeRunner, err := node.NewNodeController(
        nodeProvider, nodeProvider.GetNode(),
        localClient.CoreV1().Nodes(), // add nodeselector label here
        node.WithNodeEnableLeaseV1(localClient.CoordinationV1().Leases(corev1.NamespaceNodeLease), int32(c.NodeLeaseDuration.Seconds())),
        ...
giorio94 commented 1 year ago

Hi @DevSusu,

If I understand correctly, you would like to specify a node selector to offer only a subset of the resources available in the provider cluster (i.e., those associated with the nodes matching the selector). This feature makes sense to me (and it also relates to excluding the tainted control-plane nodes from the computation); it would require some modifications to the resource computation logic and to the shadow pod controller, to inject the given node selector into offloaded pods. I cannot give you any timeline for this right now, but I'll add it to our roadmap. If you would like to contribute, I can also give you more information about where to extend the logic to introduce it.
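As a rough sketch of what the provider-side computation could look like with a selector (the function name, the label, and the structure are illustrative, not Liqo's actual code):

    // Illustrative sketch: sum the allocatable resources of only the nodes
    // matching a configurable label selector. Not Liqo's actual computation code.
    package main

    import (
        "context"
        "fmt"
        "log"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    // filteredClusterResources returns the total allocatable CPU and memory of
    // the nodes matching the given label selector (e.g. "pool=liqo-offloading").
    func filteredClusterResources(ctx context.Context, client kubernetes.Interface,
        selector string) (corev1.ResourceList, error) {
        nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{LabelSelector: selector})
        if err != nil {
            return nil, err
        }
        total := corev1.ResourceList{
            corev1.ResourceCPU:    *resource.NewQuantity(0, resource.DecimalSI),
            corev1.ResourceMemory: *resource.NewQuantity(0, resource.BinarySI),
        }
        for i := range nodes.Items {
            alloc := nodes.Items[i].Status.Allocatable
            cpu := total[corev1.ResourceCPU]
            cpu.Add(*alloc.Cpu())
            total[corev1.ResourceCPU] = cpu
            mem := total[corev1.ResourceMemory]
            mem.Add(*alloc.Memory())
            total[corev1.ResourceMemory] = mem
        }
        return total, nil
    }

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }
        total, err := filteredClusterResources(context.Background(), client, "pool=liqo-offloading")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("offerable: cpu=%s memory=%s\n", total.Cpu(), total.Memory())
    }

Presumably the resources already consumed by pods running on those nodes would also need to be subtracted, which is where the caching/informer discussion further down comes in.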

As for the piece of code you mentioned, that is the controller which deals with the creation of the virtual node. However, the amount of resources associated with that node is taken from the ResourceOffer, which is created by the provider cluster through the above-mentioned computation logic and then propagated to the consumer cluster. Hence, you cannot use that controller to tune the amount of resources.

DevSusu commented 1 year ago

@giorio94 thanks for the prompt response, I really appreciate it

you would like to specify a node selector to offer only a subset of the resources available in the provider cluster

thanks for the summary 😄

I would like to contribute; it would be great if you could give me some starting points!

giorio94 commented 1 year ago

Nice to hear that! Below you can find some additional pointers:

Feel free to ask for any further information.

DevSusu commented 1 year ago

@giorio94 , thanks for the pointers

I've skimmed through them, and I have a question/suggestion.

Instead of caching the pod info, what about managing one pod informer per node? When the node informer reports that a new node has been added, register a pod informer with a nodeName field selector (and, when a node is deleted, tear that informer down). Thus: one node informer with the label selector, and one pod informer per node.

This way, we don't need to worry about the timing issue you mentioned. Caching the pod info requires guessing how long to wait until the node information comes in, and that delay would also affect the virtual node's resource update period.
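A rough sketch of that design with plain client-go informers (the label, resync period, and type names are illustrative, not tied to Liqo's actual code):

    // Sketch of the proposed design: one node informer filtered by a label
    // selector, plus one pod informer per node filtered on spec.nodeName,
    // started when the node appears and stopped when it disappears.
    package main

    import (
        "log"
        "time"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/fields"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    type nodeWatcher struct {
        client   kubernetes.Interface
        resync   time.Duration
        podStops map[string]chan struct{} // per-node stop channels
    }

    // onNodeAdd starts a pod informer restricted to the pods scheduled on the new node.
    // Handlers of a single informer are invoked sequentially, so no lock is needed here.
    func (w *nodeWatcher) onNodeAdd(obj interface{}) {
        node, ok := obj.(*corev1.Node)
        if !ok {
            return
        }
        lw := cache.NewListWatchFromClient(
            w.client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll,
            fields.OneTermEqualSelector("spec.nodeName", node.Name))
        informer := cache.NewSharedIndexInformer(lw, &corev1.Pod{}, w.resync, cache.Indexers{})
        // Pod event handlers updating the per-node accounting would be registered here.

        stop := make(chan struct{})
        w.podStops[node.Name] = stop
        go informer.Run(stop)
    }

    // onNodeDelete stops the per-node pod informer.
    func (w *nodeWatcher) onNodeDelete(obj interface{}) {
        // Deletions may arrive wrapped in a tombstone.
        if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
            obj = tombstone.Obj
        }
        node, ok := obj.(*corev1.Node)
        if !ok {
            return
        }
        if stop, found := w.podStops[node.Name]; found {
            close(stop)
            delete(w.podStops, node.Name)
        }
    }

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }
        w := &nodeWatcher{client: client, resync: 30 * time.Second, podStops: map[string]chan struct{}{}}

        // Single node informer restricted by the configurable label selector.
        factory := informers.NewSharedInformerFactoryWithOptions(client, w.resync,
            informers.WithTweakListOptions(func(o *metav1.ListOptions) {
                o.LabelSelector = "pool=liqo-offloading" // placeholder selector
            }))
        nodeInformer := factory.Core().V1().Nodes().Informer()
        nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc:    w.onNodeAdd,
            DeleteFunc: w.onNodeDelete,
        })

        stop := make(chan struct{})
        defer close(stop)
        factory.Start(stop)
        factory.WaitForCacheSync(stop)
        select {} // run forever (sketch only)
    }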

giorio94 commented 1 year ago

Instead of caching the pod info, what about managing one pod informer per node? When the node informer reports that a new node has been added, register a pod informer with a nodeName field selector (and, when a node is deleted, tear that informer down). Thus: one node informer with the label selector, and one pod informer per node.

To me, the approach you propose definitely makes sense, and it also reduces the number of pods observed by the informers when most nodes are excluded by the label selector.

As for the caching one, you could avoid the guessing by refactoring the data structure towards a more node-oriented approach (i.e., storing the resources used by each peered cluster per physical node, rather than as a whole), and then marking whether a given node shall be included or excluded. This would also cover the case in which node labels are modified, changing whether the node matches the selector.
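A minimal sketch of that node-oriented bookkeeping (type and field names are illustrative, and the accounting in offerable is just one possible policy, not Liqo's actual formula):

    // Illustrative sketch of node-oriented bookkeeping: per physical node, the
    // allocatable resources, whether the node currently matches the offloading
    // label selector, and the resources used by each peered cluster on it.
    package main

    import (
        "fmt"
        "sync"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    type nodeInfo struct {
        allocatable corev1.ResourceList            // from node.Status.Allocatable
        matching    bool                           // does the node match the label selector?
        usedBy      map[string]corev1.ResourceList // per peered-cluster ID
    }

    type clusterResources struct {
        mu    sync.RWMutex
        nodes map[string]*nodeInfo // keyed by node name
    }

    // setNode records or updates a node, e.g. when its labels change.
    func (c *clusterResources) setNode(name string, allocatable corev1.ResourceList, matching bool) {
        c.mu.Lock()
        defer c.mu.Unlock()
        info, ok := c.nodes[name]
        if !ok {
            info = &nodeInfo{usedBy: map[string]corev1.ResourceList{}}
            c.nodes[name] = info
        }
        info.allocatable = allocatable
        info.matching = matching
    }

    // deleteNode cleans up a removed node together with its per-cluster usage.
    func (c *clusterResources) deleteNode(name string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        delete(c.nodes, name)
    }

    // offerable sums the allocatable resources of the matching nodes only, minus
    // what the peered clusters already consume on them (one possible policy).
    func (c *clusterResources) offerable() corev1.ResourceList {
        c.mu.RLock()
        defer c.mu.RUnlock()
        total := corev1.ResourceList{}
        for _, info := range c.nodes {
            if !info.matching {
                continue
            }
            for name, q := range info.allocatable {
                acc, ok := total[name]
                if !ok {
                    acc = q.DeepCopy() // preserves the quantity's format (e.g. Gi)
                } else {
                    acc.Add(q)
                }
                for _, used := range info.usedBy {
                    if u, found := used[name]; found {
                        acc.Sub(u)
                    }
                }
                total[name] = acc
            }
        }
        return total
    }

    func main() {
        c := &clusterResources{nodes: map[string]*nodeInfo{}}
        c.setNode("worker-1", corev1.ResourceList{
            corev1.ResourceCPU:    resource.MustParse("4"),
            corev1.ResourceMemory: resource.MustParse("16Gi"),
        }, true)
        c.setNode("worker-2", corev1.ResourceList{
            corev1.ResourceCPU:    resource.MustParse("8"),
            corev1.ResourceMemory: resource.MustParse("32Gi"),
        }, false) // excluded: does not match the selector
        offer := c.offerable()
        fmt.Printf("offerable: cpu=%s memory=%s\n", offer.Cpu(), offer.Memory())
    }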

I personally have no particular preference. I feel your proposal is a bit cleaner, although it also requires some more work/refactoring to integrate it with the existing code (the data structure changes are probably needed anyway to account for cleanup when a node is removed).