Sharathmk99 opened 2 months ago
Hi @Sharathmk99!
I understand your issue, but I see some problems implementing it:
A possible solution could be a descheduler that evicts pods that are scheduled on a virtual node but have been pending for more than n minutes. When the big pod is evicted, the smaller ones should be scheduled and executed on the virtual node. But I still see an issue here: if the small pod's scheduling is in exponential backoff, we cannot guarantee it will be scheduled before the "new" big pod.
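For reference, a minimal sketch of such an eviction policy, assuming the kubernetes-sigs/descheduler PodLifeTime plugin; the 10-minute threshold and the pod label used for scoping are illustrative:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: evict-stuck-pending
    pluginConfig:
      - name: "PodLifeTime"
        args:
          # Evict pods that have been Pending for more than 10 minutes.
          maxPodLifeTimeSeconds: 600
          states:
            - "Pending"
          # Hypothetical label to scope eviction to pods offloaded to the
          # virtual node; adapt to whatever labeling scheme you use.
          labelSelector:
            matchLabels:
              offloaded-to-virtual-node: "true"
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"
```

As noted above, though, eviction alone does not guarantee the small pod wins the race against the re-created big pod.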
We need to think about this more to find a good solution.
Additionally, a potential workaround is to use the new multiple virtual nodes feature, with node affinities to schedule big pods on one virtual node and small ones on the other.
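A minimal sketch of that workaround, assuming the two virtual nodes carry a hypothetical custom label `liqo-pool` with values `big` and `small`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: big-job
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Hypothetical label: one virtual node is labeled
              # liqo-pool=big, the other liqo-pool=small.
              - key: liqo-pool
                operator: In
                values: ["big"]
  containers:
    - name: main
      image: busybox:1.36
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "20"
          memory: 200Gi
```

Small pods would use the same structure with `values: ["small"]`.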
@aleoli Thank you for taking a look at it.
A descheduler will help for sure, but when used with orchestration frameworks like Argo Workflows, killing the pod and scheduling it again will fail the Workflow.
Multiple virtual nodes will help to a certain extent, but we can't partition scheduling by pod size, as that would leave a lot of resources unused.
One option is to implement a custom resource plugin that expands the capacity to accommodate new pods, but that requires a lot of logic to be handled.
Is your feature request related to a problem? Please describe.
We have a huge remote cluster with 7500 CPUs, 70 TB of RAM, and 500 GPUs. Using Liqo we have peered the remote cluster to the local cluster, which creates one virtual node with the above resources.
We want to schedule one pod requesting 20 CPU and 200 GB RAM. Assume the remote cluster is full, so this pod can't run on any node; the virtual node still accepts it, the pod is created in the remote cluster, and its status stays Pending.
Similar pods were created multiple times, say 50 pods, and all 50 are in Pending state in the remote cluster, which made the Liqo virtual node 100% allocated.
Now a new small pod requesting 2 CPU and 20 GB RAM wants to be scheduled. As the Liqo virtual node is full it will not accept the pod, even though the remote cluster has space to run a 2 CPU, 20 GB RAM pod. This happens because the bigger jobs fill up the virtual node while the remote cluster still has room for smaller jobs.
How can we resolve this type of issue? We have almost 40% of the GPUs free, but because of the Liqo virtual node we are unable to allocate smaller GPU jobs to the remote cluster.
One option is to accept the pod in the virtual kubelet only if it is actually schedulable in the remote cluster, using tools like https://github.com/kubernetes-sigs/cluster-capacity.
Describe the solution you'd like
Accept the pod in the virtual kubelet if it is possible to schedule it in the remote cluster (see the sketch below).
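A sketch of how that check could look with cluster-capacity: simulate the incoming pod's spec against the remote cluster and only accept the pod if at least one replica fits. The pod below mirrors the small job from the example above; names are illustrative.

```yaml
# small-pod.yaml -- spec of the incoming small pod to simulate
# against the remote cluster (name and image are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: small-job
spec:
  containers:
    - name: main
      image: busybox:1.36
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "2"
          memory: 20Gi
```

Running `cluster-capacity --kubeconfig <remote-kubeconfig> --podspec=small-pod.yaml` reports how many instances of this pod the remote cluster could still schedule; a result of zero would mean the virtual kubelet should reject the pod instead of letting it sit Pending.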