kubernetes-sigs / lws

LeaderWorkerSet: An API for deploying a group of pods as a unit of replication
Apache License 2.0

LeaderWorkerSet doesn't support gang-scheduling #167

Open xgchena opened 2 months ago

xgchena commented 2 months ago

What happened:

It seems that LeaderWorkerSet doesn't support gang-scheduling of a group of pods. If multiple replicas are scheduled at the same time and there is not enough capacity to host them all, the scheduler may schedule the leader pods first and leave their worker pods pending forever.

What you expected to happen:

LeaderWorkerSet should support gang-scheduling, i.e. the pods of a group are either all scheduled together or not scheduled at all.

How to reproduce it (as minimally and precisely as possible):

I tried the vllm example on an EKS cluster with 4 nodes; each node has 1 GPU and enough resources to meet the pods' requests. The example manifest uses size 2 and replicas 2, for a total of 4 pods.

$ kubectl apply -f lws.yaml
$ kubectl get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          10m
vllm-0-1   1/1     Running   0          10m
vllm-1     1/1     Running   0          3m43s
vllm-1-1   1/1     Running   0          3m43s
$ kubectl scale --replicas=0 lws/vllm

$ kubectl get pods
No resources found in default namespace.

$ kubectl scale --replicas=4 lws/vllm
$ k get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     0/1     Running   0          7s
vllm-0-1   1/1     Running   0          7s
vllm-1     0/1     Running   0          7s
vllm-1-1   0/1     Pending   0          7s
vllm-2     0/1     Running   0          7s
vllm-2-1   0/1     Pending   0          7s
vllm-3     0/1     Pending   0          7s
vllm-3-1   0/1     Pending   0          7s
$ kubectl scale --replicas=0 lws/vllm

$ kubectl get pods
No resources found in default namespace.

$ kubectl scale --replicas=10 lws/vllm
$ k get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     0/1     Running   0          5s
vllm-0-1   0/1     Pending   0          4s
vllm-1     0/1     Running   0          5s
vllm-1-1   0/1     Pending   0          4s
vllm-2     0/1     Running   0          5s
vllm-2-1   0/1     Pending   0          4s
vllm-3     0/1     Running   0          5s
vllm-3-1   0/1     Pending   0          4s
vllm-4     0/1     Pending   0          4s
vllm-4-1   0/1     Pending   0          4s
vllm-5     0/1     Pending   0          4s
vllm-5-1   0/1     Pending   0          4s
vllm-6     0/1     Pending   0          4s
vllm-6-1   0/1     Pending   0          4s
vllm-7     0/1     Pending   0          4s
vllm-7-1   0/1     Pending   0          3s
vllm-8     0/1     Pending   0          4s
vllm-8-1   0/1     Pending   0          3s
vllm-9     0/1     Pending   0          4s
vllm-9-1   0/1     Pending   0          4s

The expected behavior is that the first two groups should be scheduled (pods vllm-0, vllm-0-1, vllm-1, and vllm-1-1).
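
To make the difference concrete, here is a minimal sketch in Python that compares per-pod scheduling with all-or-nothing (gang) scheduling for the setup above. It is a toy model, not the real kube-scheduler: it assumes one pod per node and that all leader pods are queued ahead of all worker pods. The function names are hypothetical.

```python
def schedule_per_pod(num_nodes, replicas, size):
    """Place pods one at a time. Assumption: leaders are queued before workers."""
    free = num_nodes  # assumption: one GPU per node, so one pod per node
    queue = [(g, 0) for g in range(replicas)]  # leaders (worker index 0)
    queue += [(g, w) for g in range(replicas) for w in range(1, size)]
    running = set()
    for pod in queue:
        if free == 0:
            break  # remaining pods stay Pending
        running.add(pod)
        free -= 1
    return running


def schedule_gang(num_nodes, replicas, size):
    """Place a group only if every pod of that group fits."""
    free = num_nodes
    running = set()
    for g in range(replicas):
        if free < size:
            break  # this group and all later groups stay Pending as a whole
        running.update((g, w) for w in range(size))
        free -= size
    return running


def complete_groups(running, replicas, size):
    """Return the indices of groups whose leader and workers are all running."""
    return [g for g in range(replicas)
            if all((g, w) in running for w in range(size))]


if __name__ == "__main__":
    nodes, replicas, size = 4, 10, 2
    per_pod = schedule_per_pod(nodes, replicas, size)
    gang = schedule_gang(nodes, replicas, size)
    print("per-pod complete groups:", complete_groups(per_pod, replicas, size))
    print("gang complete groups:", complete_groups(gang, replicas, size))
```

Under these assumptions, per-pod scheduling fills all four nodes with leaders and completes no group, while gang scheduling completes groups 0 and 1.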

Anything else we need to know?:

Also tried with the co-scheduling plugin, but grouping all the Pods by the same static pod-group label is the same as no PodGroup.

Environment:

liurupeng commented 1 month ago

@xgchena sorry for the late reply, I was on vacation last week. It's expected that when you have 2 replicas with size 4 and only 4 nodes, two leaders can be scheduled, causing some workers not to be scheduled. In this case, it's recommended to scale the pod group based on the number of available nodes, or to use the cluster autoscaler to provision new nodes automatically when there are unscheduled pods. We could improve scheduling to accommodate as many pod groups as possible to avoid the case you described, but it wouldn't be simple, and we need a real use case that requires gang-scheduling support. Since there is a workaround for this, I will add a feature label and wait for a use case before prioritizing it.
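
As a rough illustration of the "scale based on the number of available nodes" workaround, the replica count that fits can be estimated with integer division. This is only a sketch: it assumes each pod occupies one whole node (as in the 1-GPU-per-node cluster above), and the function name is hypothetical.

```python
def max_full_groups(available_nodes, group_size):
    """Largest replica count for which every group gets all of its pods.

    Assumption: each pod needs one whole node.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return available_nodes // group_size


if __name__ == "__main__":
    # 4 nodes and groups of size 2 leave room for 2 complete groups.
    print(max_full_groups(available_nodes=4, group_size=2))
```

The result could then be passed to `kubectl scale --replicas=<n> lws/vllm`, as in the reproduction steps.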

kerthcet commented 1 month ago

Generally, gang scheduling needs support from the scheduler. Upstream has the co-scheduling plugin and an ongoing proposal about gang scheduling, https://github.com/kubernetes/enhancements/issues/4671, which I will try to push in the next release.

xgchena commented 1 month ago

Thank you both for the responses.

xgchena commented 1 month ago

Hi Rupeng, regarding your comments,

It's expected that when you have 2 replicas with size 4 and only 4 nodes, two leaders can be scheduled, causing some workers not to be scheduled. In this case, it's recommended to scale the pod group based on the number of available nodes, or to use the cluster autoscaler to provision new nodes automatically when there are unscheduled pods. We could improve scheduling to accommodate as many pod groups as possible to avoid the case you described, but it wouldn't be simple, and we need a real use case that requires gang-scheduling support. Since there is a workaround for this, I will add a feature label and wait for a use case before prioritizing it.

Multi-host inference is often used when a model is too large to be deployed on a single instance, even on the most advanced instance types (such as those with 8 GPUs). In the real world, there are capacity constraints on advanced instance types.

Both are real use cases.

xgchena commented 1 month ago

Hi Kante, thank you for sharing, and glad to know there is already a solution on the way.

Regarding the co-scheduling plugin, I have actually tried it. Copying from the issue description:

Anything else we need to know?: Also tried with the co-scheduling plugin, but grouping all the Pods by the same static pod-group label is the same as no PodGroup.

Based on the vllm example, see the screenshot below. The problem with this approach is that only one PodGroup can be defined and used to group all the pods.

[Screenshot: lws-podgroup]

By "next release" I guess you mean the next release of Kubernetes. Before it is available, I'm wondering if it is doable to fork the lws controller and create a PodGroup for each replica at runtime, as a short-term workaround.
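
A minimal sketch of what "one PodGroup per replica" could look like, written as Python that builds the manifests as plain dictionaries. The API version `scheduling.x-k8s.io/v1alpha1`, the `minMember` field, and the pod label key `scheduling.x-k8s.io/pod-group` are assumptions based on the scheduler-plugins co-scheduling plugin and may differ between plugin versions. The naming scheme and helper functions are hypothetical, and this is not how the lws controller works today.

```python
import json

# Assumption: label key the co-scheduling plugin uses to map a pod to its PodGroup.
POD_GROUP_LABEL = "scheduling.x-k8s.io/pod-group"


def pod_group_name(lws_name, group_index):
    """Hypothetical naming scheme: one PodGroup per LeaderWorkerSet replica."""
    return f"{lws_name}-{group_index}"


def build_pod_group(lws_name, group_index, size, namespace="default"):
    """Build a PodGroup manifest that requires all `size` pods of one replica."""
    return {
        "apiVersion": "scheduling.x-k8s.io/v1alpha1",  # assumed API version
        "kind": "PodGroup",
        "metadata": {
            "name": pod_group_name(lws_name, group_index),
            "namespace": namespace,
        },
        "spec": {"minMember": size},
    }


def pod_group_labels(lws_name, group_index):
    """Labels a forked controller would have to add to the leader and worker pods."""
    return {POD_GROUP_LABEL: pod_group_name(lws_name, group_index)}


if __name__ == "__main__":
    # Two replicas of size 2, as in the vllm example manifest.
    for group_index in range(2):
        print(json.dumps(build_pod_group("vllm", group_index, size=2), indent=2))
        print(pod_group_labels("vllm", group_index))
```

The key difference from a static pod-group label in the pod template is that the label value varies with the group index, which is why it would have to be set by the controller at runtime.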

kerthcet commented 1 month ago

Thanks for your feedback, it is very valuable.

By "next release" I guess you mean the next release of Kubernetes.

Yes, there are still some gaps that need to be fixed.

I'm wondering if it is doable to fork the lws controller and create a PodGroup for each replica at runtime, as a short-term workaround.

I guess this is one available approach because, based on the co-scheduling design, the PodGroup needs to be created manually.

However, we still don't quite work smoothly with the co-scheduling plugin. We have some features, like startup policy and exclusive placement, which require creating the worker pods only once the leader pod is ready. This will lead to a deadlock with gang scheduling, because the leader pod will not be scheduled if minMember is not met. This is a valid use case for the gang scheduling design.
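
The deadlock described above can be sketched with a toy model in Python. It is not the real controller or scheduler logic: it only assumes that a gang is admitted when at least minMember pods exist and fit, and that with a leader-ready startup policy the workers are created only after the leader is ready. All names are hypothetical.

```python
def simulate(size, free_nodes, workers_wait_for_leader, rounds=3):
    """Return "scheduled" if the gang is admitted, "deadlock" otherwise.

    Assumption: minMember equals the group size (leader plus workers).
    """
    min_member = size
    # With a leader-ready startup policy only the leader pod exists at first.
    created = 1 if workers_wait_for_leader else size
    leader_ready = False
    for _ in range(rounds):
        if created >= min_member and free_nodes >= min_member:
            return "scheduled"  # the whole gang is admitted together
        if leader_ready:
            created = size  # the controller would create the workers now
        # leader_ready never becomes True here: the leader belongs to a gang
        # that has not been admitted, so it is never scheduled or started.
    return "deadlock"


if __name__ == "__main__":
    print(simulate(size=2, free_nodes=4, workers_wait_for_leader=True))
    print(simulate(size=2, free_nodes=4, workers_wait_for_leader=False))
```

In this model the first call never makes progress even though enough nodes are free, which is the circular wait between "leader ready" and "minMember met".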