LeaderWorkerSet doesn't support gang-scheduling

xgchena commented 2 months ago

What happened:

It seems that LeaderWorkerSet doesn't support gang-scheduling of a group of pods. If several replicas are scheduled at the same time and there is not enough capacity to host them all, the scheduler may prioritize the leader pods and leave their worker pods pending forever.

What you expected to happen:

LeaderWorkerSet should support gang-scheduling, i.e. the pods of a group are either all scheduled together or not scheduled at all.

How to reproduce it (as minimally and precisely as possible):

I tried the vllm example on an EKS cluster with 4 nodes; each node has 1 GPU and enough resources to meet the pods' requests. The example manifest uses size 2 and replicas 2, for a total of 4 pods.

It works fine for the initial deployment, as each node can host one pod.

$ kubectl apply -f lws.yaml
$ kubectl get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          10m
vllm-0-1   1/1     Running   0          10m
vllm-1     1/1     Running   0          3m43s
vllm-1-1   1/1     Running   0          3m43s

But a scheduling problem shows up if the lws is scaled in and then scaled out to more replicas than the available nodes can host. See below: the first group was fine, but the remaining nodes were used to schedule two leader pods whose workers were pending forever due to lack of capacity. Eventually only one group was working.

$ kubectl scale --replicas=0 lws/vllm

$ kubectl get pods
No resources found in default namespace.

$ kubectl scale --replicas=4 lws/vllm
$ kubectl get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     0/1     Running   0          7s
vllm-0-1   1/1     Running   0          7s
vllm-1     0/1     Running   0          7s
vllm-1-1   0/1     Pending   0          7s
vllm-2     0/1     Running   0          7s
vllm-2-1   0/1     Pending   0          7s
vllm-3     0/1     Pending   0          7s
vllm-3-1   0/1     Pending   0          7s

The problem is more obvious in the following example when the scheduler prioritized scheduling the first 4 leader pods. Eventually no group was working.

$ kubectl scale --replicas=0 lws/vllm

$ kubectl get pods
No resources found in default namespace.

$ kubectl scale --replicas=10 lws/vllm
$ kubectl get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     0/1     Running   0          5s
vllm-0-1   0/1     Pending   0          4s
vllm-1     0/1     Running   0          5s
vllm-1-1   0/1     Pending   0          4s
vllm-2     0/1     Running   0          5s
vllm-2-1   0/1     Pending   0          4s
vllm-3     0/1     Running   0          5s
vllm-3-1   0/1     Pending   0          4s
vllm-4     0/1     Pending   0          4s
vllm-4-1   0/1     Pending   0          4s
vllm-5     0/1     Pending   0          4s
vllm-5-1   0/1     Pending   0          4s
vllm-6     0/1     Pending   0          4s
vllm-6-1   0/1     Pending   0          4s
vllm-7     0/1     Pending   0          4s
vllm-7-1   0/1     Pending   0          3s
vllm-8     0/1     Pending   0          4s
vllm-8-1   0/1     Pending   0          3s
vllm-9     0/1     Pending   0          4s
vllm-9-1   0/1     Pending   0          4s

The expected behavior is that the first two groups should be scheduled (pods vllm-0, vllm-0-1, vllm-1, and vllm-1-1).
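
To make the difference concrete, here is a minimal sketch that compares the two behaviors for the setup above (4 nodes, one pod per node, groups of size 2). It is a toy model, not the Kubernetes scheduler: the function names are hypothetical, and `schedule_leader_first` assumes the worst case seen in the second example, where all leaders are placed before any worker.

```python
from typing import List


def schedule_leader_first(replicas: int, size: int, nodes: int) -> List[int]:
    """Worst-case pod-by-pod placement: all leaders first, then workers.

    Returns the number of pods placed per group. Assumes one pod per node.
    """
    placed = [0] * replicas
    free = nodes
    # Pass 1: place one leader per group while nodes remain.
    for g in range(replicas):
        if free == 0:
            break
        placed[g] = 1
        free -= 1
    # Pass 2: place workers with whatever capacity is left.
    for g in range(replicas):
        while placed[g] < size and free > 0:
            placed[g] += 1
            free -= 1
    return placed


def schedule_gang(replicas: int, size: int, nodes: int) -> List[int]:
    """All-or-nothing placement: a group is placed only if it fits entirely."""
    placed = [0] * replicas
    free = nodes
    for g in range(replicas):
        if free >= size:
            placed[g] = size
            free -= size
    return placed


def complete_groups(placed: List[int], size: int) -> int:
    """Count groups whose pods were all placed."""
    return sum(1 for p in placed if p == size)


if __name__ == "__main__":
    for name, fn in [("leader-first", schedule_leader_first), ("gang", schedule_gang)]:
        placed = fn(replicas=10, size=2, nodes=4)
        print(name, placed, "complete groups:", complete_groups(placed, 2))
```

Under these assumptions, leader-first placement consumes all 4 nodes with leaders and completes no group, while all-or-nothing placement completes groups 0 and 1, which matches the expected behavior described above.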

Anything else we need to know?:

Also tried with the co-scheduling plugin, but grouping all the Pods by the same static pod-group label is the same as no PodGroup.

Environment:

Kubernetes version (use kubectl version): v1.29.3
LWS version (use git describe --tags --dirty --always): v0.3.0-8-ga4c468e
Cloud provider or hardware configuration: AWS EKS (server version v1.29.4-eks-036c24b), node instance type g4dn.2xlarge
OS (e.g: cat /etc/os-release): Amazon Linux 2
Kernel (e.g. uname -a): 5.10.218
Install tools: N/A
Others: N/A

liurupeng commented 1 month ago

@xgchena sorry for the late reply; I was on vacation last week. It's expected that when you have 2 replicas with size 4 and only 4 nodes, two leaders can be scheduled, leaving some workers unscheduled. In this case, it's recommended to scale the pod groups based on the number of available nodes, or to use the cluster autoscaler to provision new nodes automatically when there are unscheduled pods. We could improve scheduling to accommodate as many pod groups as possible and avoid the case you described, but it wouldn't be simple, and we need a real use case that requires gang-scheduling support. Since there is a workaround for this, I will add a feature label and wait for a use case before prioritizing it.

kerthcet commented 1 month ago

Generally, gang scheduling needs support from the scheduler. Upstream has the co-scheduling plugin and an ongoing proposal about gang scheduling, https://github.com/kubernetes/enhancements/issues/4671, which I will try to push in the next release.

xgchena commented 1 month ago

Thank you both for the responses.

xgchena commented 1 month ago

Hi Rupeng, regarding your comments,

It's expected that when you have 2 replicas with size 4 and only 4 nodes, two leaders can be scheduled, leaving some workers unscheduled. In this case, it's recommended to scale the pod groups based on the number of available nodes, or to use the cluster autoscaler to provision new nodes automatically when there are unscheduled pods. We could improve scheduling to accommodate as many pod groups as possible and avoid the case you described, but it wouldn't be simple, and we need a real use case that requires gang-scheduling support. Since there is a workaround for this, I will add a feature label and wait for a use case before prioritizing it.

Multi-host inference is often used when a model is too large to be deployed to a single instance, even when using the most advanced instance types (like those with 8 GPUs). In the real world, there are capacity constraints on advanced instance types.

Example user scenario 1: As a user, I have created a cluster using an advanced instance type. To save cost, I chose the on-demand pool (shared by all users). I have deployed a large model using LWS, and set up the Horizontal Autoscaler and Cluster Autoscaler. Initially the cluster has 2 nodes (which host one replica of the model). Then, due to a traffic increase, the Horizontal Autoscaler sets the replicas to 3, and in turn the Cluster Autoscaler requests 4 more nodes. However, the on-demand pool only has 2 nodes available. If the controller schedules the 2 new leader pods onto the 2 new nodes, then the 2 worker pods will be pending until other users release 2 nodes to the on-demand pool, which may happen late and impact availability.
Example user scenario 2: As a user, I have reserved hosts to get a high level of assurance in obtaining capacity, and the reservation pool is shared by multiple clusters, i.e. I have created my own "on-demand pool" for my clusters. The same issue can happen when the pool is fully utilized.

Both are real use cases.

xgchena commented 1 month ago

Hi Kante, thank you for sharing; I'm glad to know there is already a solution on the way.

Regarding the co-scheduling plugin, I have actually tried it. Copying from the issue description:

Anything else we need to know?: Also tried with the co-scheduling plugin, but grouping all the Pods by the same static pod-group label is the same as no PodGroup.

Based on the vllm example, see the screenshot below. The problem with this approach is that only one PodGroup can be defined and used to group all the pods.

By "next release" I guess you mean the next release of Kubernetes. Before it is available, I'm wondering if it is doable to fork the lws controller and create a PodGroup for each replica at runtime, as a short-term workaround.
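
As a rough illustration of the per-replica PodGroup idea, the sketch below builds one PodGroup manifest per replica, plus the label each pod of that replica would need. It is not LWS controller code. The API version `scheduling.x-k8s.io/v1alpha1`, the `minMember` field, and the `scheduling.x-k8s.io/pod-group` label key are assumptions based on the co-scheduling plugin and should be checked against the installed plugin version; the helper names and the `<lws>-<index>` naming scheme are hypothetical.

```python
import json
from typing import Dict, List

# Assumed values for the co-scheduling plugin; verify against your installation.
POD_GROUP_API_VERSION = "scheduling.x-k8s.io/v1alpha1"
POD_GROUP_LABEL_KEY = "scheduling.x-k8s.io/pod-group"


def pod_group_name(lws_name: str, group_index: int) -> str:
    """Hypothetical naming scheme: one PodGroup per LWS replica."""
    return f"{lws_name}-{group_index}"


def pod_groups_for(lws_name: str, replicas: int, size: int) -> List[Dict]:
    """Build one PodGroup manifest per replica, with minMember = group size."""
    return [
        {
            "apiVersion": POD_GROUP_API_VERSION,
            "kind": "PodGroup",
            "metadata": {"name": pod_group_name(lws_name, i)},
            "spec": {"minMember": size},
        }
        for i in range(replicas)
    ]


def pod_group_label(lws_name: str, group_index: int) -> Dict[str, str]:
    """Label that the leader and workers of one replica would all carry."""
    return {POD_GROUP_LABEL_KEY: pod_group_name(lws_name, group_index)}


if __name__ == "__main__":
    for manifest in pod_groups_for("vllm", replicas=3, size=2):
        print(json.dumps(manifest))
    print(pod_group_label("vllm", 1))
```

The point of the sketch is only that each replica gets its own group identity, in contrast to the single static label tried earlier, which put every pod into the same group.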

kerthcet commented 1 month ago

Thanks for your feedback, it is very valuable.

By "next release" I guess you mean the next release of Kubernetes.

Yes, there are still some gaps that need to be fixed.

I'm wondering if it is doable to fork the lws controller and create a PodGroup for each replica at runtime, as a short-term workaround.

I guess this is one available approach, because under the co-scheduling design the PodGroup needs to be created manually.

However, LWS still does not work smoothly with the co-scheduling plugin. We have some features, like startup policy and exclusive placement, that require the worker pods to be created only once the leader pod is ready. This leads to a deadlock with gang scheduling, because the leader pod will not be scheduled if minMember is not met. This is a valid use case for the gang scheduling design.
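
The deadlock can be shown with a small toy model of one group. It is a sketch under simplifying assumptions, not the behavior of any real controller or scheduler: the function and parameter names are hypothetical, "leader ready" is approximated by "leader scheduled", and the gang rule is reduced to "no pod is scheduled until at least `min_member` pods of the group exist".

```python
def simulate_group(size: int, min_member: int, workers_wait_for_leader: bool,
                   max_steps: int = 10) -> str:
    """Return "scheduled" if the whole group gets scheduled, else "deadlock".

    Toy model of one leader plus (size - 1) workers.
    """
    workers_created = 0
    leader_scheduled = False
    for _ in range(max_steps):
        progressed = False

        # Controller step: create the workers, possibly only after the leader is up.
        may_create = (not workers_wait_for_leader) or leader_scheduled
        if workers_created < size - 1 and may_create:
            workers_created = size - 1
            progressed = True

        # Scheduler step: the gang rule admits pods only once enough members exist.
        existing = 1 + workers_created
        if not leader_scheduled and existing >= min_member:
            leader_scheduled = True
            progressed = True

        if leader_scheduled and workers_created == size - 1:
            return "scheduled"
        if not progressed:
            # Neither side can move: each is waiting for the other.
            return "deadlock"
    return "deadlock"


if __name__ == "__main__":
    # Workers created up front: the gang rule is satisfied immediately.
    print(simulate_group(size=2, min_member=2, workers_wait_for_leader=False))
    # Workers created only after the leader: leader waits for minMember,
    # controller waits for the leader.
    print(simulate_group(size=2, min_member=2, workers_wait_for_leader=True))
```

In this model the second call ends in a deadlock: the scheduler waits for the group to reach `min_member` pods before scheduling the leader, while the controller waits for the leader before creating the workers.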

kubernetes-sigs / lws

LeaderWorkerSet doesn't support gang-scheduling #167