kubernetes-sigs / lws

LeaderWorkerSet: An API for deploying a group of pods as a unit of replication
Apache License 2.0

Enhance scheduling capabilities for a group of pods #162

Open vie-serendipity opened 1 month ago

vie-serendipity commented 1 month ago

What would you like to be added:
Add a field like ScheduleMode. Scheduling a group of pods may need stricter control. When deploying a distributed inference service, a group of pods (a head plus several workers) should be scheduled to neighboring nodes to reduce the communication cost between them.

Why is this needed:
There are many types of resources in a k8s cluster, and if placement on nodes is constrained only by the requests field, the resulting distributed inference may perform poorly. An example is that a group of pods should be dispatched to nodes that have a special way of connecting to them, such as nvlink. Another example is that nodes in a cluster (a non-standard k8s cluster) may be across data centers, and a group of pods should be dispatched into the same data center.

Completion requirements:
I'm not sure what the eventual API changes would be; this is just an enhancement request. I'm also not sure whether this is a reasonable enhancement request, and I'd be happy to contribute if it is.

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

googs1025 commented 1 month ago

I'm a bit curious, can such scenarios generally be resolved using node selectors or affinity? What are the shortcomings if this method is used?

vie-serendipity commented 1 month ago

@googs1025

An example is that a group of pods should be dispatched to nodes that have a special way of connecting to them, such as nvlink. Another example is that nodes in a cluster (a non-standard k8s cluster) may be across data centers, and a group of pods should be dispatched into the same data center.

In this scenario, each group of pods uses the same template, so all pods belonging to the lws would be dispatched to the same data center. However, because the GPUs are distributed across different data centers, different groups of pods under the lws should be scheduled to different data centers. So I think affinity or a selector can't fulfill this requirement.

googs1025 commented 1 month ago

Do the different data centers belong to the same cluster, or to different clusters? If they are within the same cluster, it might still be feasible to utilize node selectors or affinity methods. However, if they are distributed across different clusters, there might be concerns regarding communication overhead.

vie-serendipity commented 1 month ago

If they are within the same cluster, it might still be feasible to utilize node selectors or affinity methods.

They belong to the same cluster, but how can affinity be used to schedule the different groups of pods of an lws to multiple datacenters while keeping each group of pods within a single datacenter? I can't figure out a way to do this.

googs1025 commented 1 month ago

So, if I understand correctly, you want to schedule multiple pods under the workerTemplate to nodes in different data centers, is that correct? I cannot determine whether it is possible to achieve the desired functionality using node selectors or affinity. Additionally, I cannot assess whether it is a reasonable design for the higher-level workload to make scheduling decisions.

vie-serendipity commented 1 month ago

This higher-level workload doesn't need to make any scheduling decisions; it just needs to make sure the workers are scheduled in the same data center as the head. The head's scheduling is decided entirely by the scheduler.

One possible implementation could be to wait until the head is scheduled to a node, then retrieve a label from that node (the label's key is predefined in the lws YAML). Afterwards, the workers would get a matching affinity on that label, ensuring they are scheduled to nodes in the same domain as the head.
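
The idea above can be sketched roughly as follows. This is only an illustration of the proposal, not how lws works today: the function name `worker_node_affinity` and the label key `example.com/datacenter` are hypothetical, and the head node's labels are passed in as a plain dict rather than fetched from the API server. The returned structure uses the standard Kubernetes `nodeAffinity` field names.

```python
def worker_node_affinity(head_node_labels, topology_key):
    """Build a nodeAffinity stanza that pins workers to the head's domain.

    head_node_labels: labels of the node the head pod was scheduled to.
    topology_key: label key predefined by the user (hypothetical here).
    """
    if topology_key not in head_node_labels:
        raise KeyError(f"head node has no label {topology_key!r}")
    value = head_node_labels[topology_key]
    # Workers must land on nodes whose label value matches the head's node.
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": topology_key,
                                "operator": "In",
                                "values": [value],
                            }
                        ]
                    }
                ]
            }
        }
    }


# Hypothetical label key and value, for illustration only.
affinity = worker_node_affinity(
    {"example.com/datacenter": "dc-user"}, "example.com/datacenter"
)
```

With this approach the head is placed freely by the scheduler, and only the workers carry the extra constraint.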

kerthcet commented 1 month ago

Thanks for bringing this to the community. Is this what you need? https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/sample/README.md#exclusive-placement

kerthcet commented 1 month ago

But this is exclusive, which means two groups can not be located at the same topology.

vie-serendipity commented 1 month ago

Thanks, that's what I'm looking for.

But this is exclusive, which means two groups can not be located at the same topology.

LeaderWorkerSet supports exclusive placement through pod affinity/anti-affinity where pods in the same group will be scheduled on the same accelerator island (such as a TPU slice or a GPU clique), but on different nodes. This ensures 1:1 LWS replica to accelerator island placement.

But I have a question, in the example, there are GPU and TPU accelerator islands. Does that mean I can only have two groups of pods? I feel like it should allow for multiple groups of pods within the same topology. Is it more reasonable to just ensure that the head and workers have the same topology key?

kerthcet commented 1 month ago

in the example, there are GPU and TPU accelerator islands. Does that mean I can only have two groups of pods?

Accelerator is only used for integrations with cloud providers, like TPU with Google Cloud.

So what you need is slightly different from exclusive placement. Do you have a real use case on your side?

vie-serendipity commented 1 month ago

My usage scenario is a k8s cluster with many nodes, some of which belong to the user and some to a cloud vendor. To keep it simple, assume there are two datacenters, one for the user and one for the cloud provider. (This is not a standard k8s cluster, but I wonder whether there is also a need to schedule a group of pods to nodes of the same GPU type when a cluster has many GPU resource types.)

I want to use lws to deploy inference services to both the user's datacenter and the datacenter on the cloud. Although the networks of the two datacenters are interconnected, communication between them is more expensive and may have to go over a public network, which is unstable.

So I want to make sure that each group of pods is dispatched to a single data center, so its pods can communicate easily. Multiple groups of pods should also be supported in one data center, because business peaks require scaling.
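
To make the requirement concrete, here is a small sketch of the property being asked for. The function name and the `(group, datacenter)` input format are made up for illustration: every group must sit in exactly one data center, while one data center may host any number of groups.

```python
from collections import defaultdict


def groups_are_colocated(placements):
    """Return True if every group's pods sit in a single data center.

    placements: iterable of (group, datacenter) pairs, one per pod.
    """
    seen = defaultdict(set)
    for group, datacenter in placements:
        seen[group].add(datacenter)
    # A group is fine when all of its pods report the same data center.
    return all(len(dcs) == 1 for dcs in seen.values())


# Two groups share the "user" data center; a third runs in "cloud".
ok = groups_are_colocated(
    [("g0", "user"), ("g0", "user"), ("g1", "user"), ("g2", "cloud")]
)
# A group split across data centers violates the requirement.
bad = groups_are_colocated([("g0", "user"), ("g0", "cloud")])
```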

kerthcet commented 1 month ago

Makes sense to me: a group of Pods should be located in the same topology, and the number of groups should not be limited.

cc @ahg-g @liurupeng thoughts?

ahg-g commented 1 month ago

yes, we can support that by simply removing the exclusive anti-affinity term that is currently getting added. But we need to come up with a proper API first, similar to https://github.com/kubernetes-sigs/jobset/issues/75
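
As a rough sketch of what that could look like (an assumption for illustration, not the lws controller's actual code): the affinity term keeps a group together within one topology domain, and the anti-affinity term is what makes the placement exclusive. The label key `example.com/group-key`, the function name, and the `exclusive` flag are all hypothetical; the dict layout follows the standard Kubernetes `podAffinity` / `podAntiAffinity` fields.

```python
GROUP_LABEL = "example.com/group-key"  # hypothetical per-group pod label


def group_affinity(group_key, topology_key, exclusive):
    """Build pod (anti-)affinity for one group of pods."""
    required = "requiredDuringSchedulingIgnoredDuringExecution"
    # Co-location: follow pods of the same group into their topology domain.
    affinity = {
        "podAffinity": {
            required: [
                {
                    "labelSelector": {
                        "matchExpressions": [
                            {
                                "key": GROUP_LABEL,
                                "operator": "In",
                                "values": [group_key],
                            }
                        ]
                    },
                    "topologyKey": topology_key,
                }
            ]
        }
    }
    if exclusive:
        # Exclusivity: avoid domains that hold pods of any other group.
        affinity["podAntiAffinity"] = {
            required: [
                {
                    "labelSelector": {
                        "matchExpressions": [
                            {"key": GROUP_LABEL, "operator": "Exists"},
                            {
                                "key": GROUP_LABEL,
                                "operator": "NotIn",
                                "values": [group_key],
                            },
                        ]
                    },
                    "topologyKey": topology_key,
                }
            ]
        }
    return affinity


# Non-exclusive: several groups may share one topology domain.
shared = group_affinity("group-0", "example.com/datacenter", exclusive=False)
```

How the choice between the two modes is exposed to users is exactly the API question that still needs to be settled.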

vie-serendipity commented 1 month ago

@ahg-g I would like to contribute to this feature. I can propose a KEP later. Is that OK with you?

liurupeng commented 2 weeks ago

@vie-serendipity could you start the KEP so that we could start the review?

vie-serendipity commented 2 weeks ago

@liurupeng Okay, I will propose a KEP soon.