kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

Support Kserve #1603

Open tenzen-y opened 6 months ago

tenzen-y commented 6 months ago

What would you like to be added: I would like to support the serverless ML inference tool, KServe.

Why is this needed: In a hybrid-workload cluster (one running training jobs, inference servers, and so on), users often want to manage all cluster capacity through Kueue's flavorQuotas. So, as a first step toward supporting inference servers, KServe support in Kueue would be nice to have.
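
For illustration, a minimal sketch of a ClusterQueue expressing such capacity as flavor quotas (the queue name, flavor name, and quantities are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: hybrid-queue            # illustrative name
spec:
  namespaceSelector: {}         # match workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor      # assumes a ResourceFlavor of this name exists
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 256Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
```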

Completion requirements:

This enhancement requires the following artifacts:

- Design doc
- API change
- Docs update

The artifacts should be linked in subsequent comments. We will probably need to implement suspend semantics on the KServe side. Additionally, we need to move #77 forward in parallel to support the inference server's autoscaling semantics.

alculquicondor commented 6 months ago

I think I talked about this with @astefanutti. Also cc @mwielgus @ahg-g

ahg-g commented 6 months ago

How do you envision that working? Can you list a couple of CUJs (critical user journeys)?

kerthcet commented 4 months ago

Online inference services are somewhat latency sensitive and require high scalability, so reclaiming/preempting KServe-managed services doesn't look right. Offline inference is where this might be helpful in my mind, though I guess KServe is not that good at it. cc @terrytangyuan

lizzzcai commented 4 months ago

I would like to see support for this, as I am looking for a unified way of managing resources for both model training and serving, and Kueue looks like it has that capability. In our case, both training and serving run in the same cluster. I am also interested in how it could integrate with the recent MultiKueue feature to schedule workloads to clusters with available GPUs (sometimes there is a shortage of GPUs in certain regions). Since a KServe deployment has min and max replicas, it should be scheduled to a cluster that can meet the max replicas.
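
For reference, replica bounds on a KServe predictor look roughly like this (a minimal sketch; the name, model, and values are illustrative); the cluster chosen by MultiKueue would need quota covering maxReplicas:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model                 # illustrative name
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4               # the target cluster's quota should cover this upper bound
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-bucket/models/my-model   # illustrative URI
```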

tenzen-y commented 4 months ago

> How do you envision that working? Can you list a couple of CUJs (critical user journeys)?

I imagined a similar approach to the one used for RayCluster.

So, I would like to add a `suspend` field to the InferenceService resource.
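
A minimal sketch of what that could look like, assuming a new `spec.suspend` field modeled on RayCluster's suspend semantics (this field does not exist in KServe today; the service name and model are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # admit via this LocalQueue
spec:
  suspend: true    # hypothetical field: created suspended, unsuspended by Kueue on admission
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-bucket/models/my-model
```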

tenzen-y commented 4 months ago

> Online inference services are somewhat latency sensitive and require high scalability, so reclaiming/preempting KServe-managed services doesn't look right. Offline inference is where this might be helpful in my mind, though I guess KServe is not that good at it. cc @terrytangyuan

@kerthcet I believe that the lending limit would allow us to guarantee capacity for latency-sensitive Workloads.
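
A minimal sketch of that idea, assuming Kueue's alpha LendingLimit feature is enabled (the queue, cohort, and quantities are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: inference-queue          # illustrative name
spec:
  cohort: shared-pool            # borrowing/lending happens within a cohort
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8
        lendingLimit: 2          # lend at most 2 GPUs, keeping 6 for latency-sensitive serving
```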

tenzen-y commented 4 months ago

> I would like to see support for this, as I am looking for a unified way of managing resources for both model training and serving, and Kueue looks like it has that capability. In our case, both training and serving run in the same cluster. I am also interested in how it could integrate with the recent MultiKueue feature to schedule workloads to clusters with available GPUs (sometimes there is a shortage of GPUs in certain regions). Since a KServe deployment has min and max replicas, it should be scheduled to a cluster that can meet the max replicas.

Yes, that's right. Actually, I also deploy Jobs and inference servers into a single cluster.

tenzen-y commented 4 months ago

Let me try to design this integration.

/assign

terrytangyuan commented 4 months ago

Thanks! Great to see this. Looking forward to your proposal. @tenzen-y

tenzen-y commented 4 months ago

> Thanks! Great to see this. Looking forward to your proposal. @tenzen-y

I will create a dedicated issue on the KServe side later as well.

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

tenzen-y commented 1 month ago

/remove-lifecycle stale