tenzen-y opened 6 months ago
I think I talked about this with @astefanutti. Also cc @mwielgus @ahg-g
How do you envision that working? Can you list a couple of CUJs?
Online inference services are latency sensitive and require high scalability, so reclaiming/preempting KServe-managed services doesn't look right. I guess KServe is not that strong at offline inference, which is where this might be helpful in my mind. cc @terrytangyuan
I would like to see possible support for this, as I am looking for a unified way of managing resources for both model training and serving, and Kueue looks like it has this capability. In our case, both training and serving run in the same cluster. I am also wondering how this can integrate with the recent MultiKueue feature to schedule workloads to clusters with available GPUs (sometimes there is a GPU shortage in certain regions). Since a KServe deployment has min and max replicas, it should be scheduled to a cluster that can meet the max replicas. A sketch of the wiring I have in mind follows.
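For concreteness, the MultiKueue setup could look something like this. This is a sketch only: the cluster names are hypothetical, each name refers to a MultiKueueCluster object (with its own kubeconfig Secret) that I assume already exists, and the MultiKueue types were alpha at the time of writing, so the group/version may differ by Kueue release:

```yaml
# AdmissionCheck that delegates admission to the MultiKueue controller.
# A ClusterQueue referencing this check dispatches admitted workloads
# to a worker cluster that can accommodate them.
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: multikueue-gpu
spec:
  controllerName: kueue.x-k8s.io/multikueue
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: MultiKueueConfig
    name: gpu-clusters
---
# Worker clusters that may have spare GPU capacity; names are hypothetical
# and must match MultiKueueCluster objects defined elsewhere.
apiVersion: kueue.x-k8s.io/v1alpha1
kind: MultiKueueConfig
metadata:
  name: gpu-clusters
spec:
  clusters:
  - gpu-cluster-us-east
  - gpu-cluster-eu-west
```

Presumably, scheduling to a cluster that can satisfy the max replicas would require the KServe integration to request quota for the max-replica footprint rather than the min.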
> How do you envision that working? Can you list a couple of CUJs?
I imagined a similar approach to the RayCluster integration.
So, I would like to add a `suspend` field to the InferenceService resource.
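For example, something like the following (a sketch only: the `suspend` field is exactly what this issue proposes and does not exist in KServe today; the queue-name label follows Kueue's existing convention for other integrations, and the model/URI are placeholders):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  labels:
    kueue.x-k8s.io/queue-name: serving-queue  # existing Kueue convention
spec:
  # Proposed field, analogous to RayCluster's spec.suspend: the
  # InferenceService would be created suspended, and Kueue would flip
  # this to false once the corresponding Workload is admitted.
  suspend: true
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://example-bucket/models/sklearn/iris  # placeholder
```

Kueue would then admit the Workload and unsuspend the InferenceService once quota is available, mirroring the RayCluster flow.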
> Online inference services are latency sensitive and require high scalability, so reclaiming/preempting KServe-managed services doesn't look right. I guess KServe is not that strong at offline inference, which is where this might be helpful in my mind. cc @terrytangyuan
@kerthcet I believe that the lending limit would allow us to guarantee capacity for latency-sensitive Workloads.
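For example (a sketch: `lendingLimit` is gated behind the alpha LendingLimit feature gate, and all names and quantities here are hypothetical):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: serving-cq
spec:
  namespaceSelector: {}
  cohort: gpu-cohort  # shares unused quota with training queues in the cohort
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-gpu
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8
        # At most 2 of the 8 GPUs may be lent to other ClusterQueues in
        # the cohort, so 6 stay reserved for latency-sensitive serving.
        lendingLimit: 2
```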
> I would like to see possible support for this, as I am looking for a unified way of managing resources for both model training and serving, and Kueue looks like it has this capability. In our case, both training and serving run in the same cluster. I am also wondering how this can integrate with the recent MultiKueue feature to schedule workloads to clusters with available GPUs (sometimes there is a GPU shortage in certain regions). Since a KServe deployment has min and max replicas, it should be scheduled to a cluster that can meet the max replicas.
Yes, that's right. Actually, I also deploy Jobs and inference servers in a single cluster.
Let me try to design this integration.
/assign
Thanks! Great to see this. Looking forward to your proposal. @tenzen-y
> Thanks! Great to see this. Looking forward to your proposal. @tenzen-y
I will create a dedicated issue on the KServe side later as well.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
What would you like to be added: I would like to support the serverless ML inference tool, KServe.
Why is this needed: In a hybrid-workload cluster (one running training jobs, inference servers, and so on), users often want to manage all cluster capacity with Kueue's flavor quotas. So, as a first step toward supporting inference servers, supporting KServe in Kueue would be nice to have.
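As an illustration of that setup with today's Kueue API (all names and quantities are hypothetical), a single ClusterQueue whose flavor quotas cap the resources shared by training and serving might look like:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: hybrid-cq
spec:
  namespaceSelector: {}  # accept LocalQueues from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: cpu
        nominalQuota: 64
      - name: memory
        nominalQuota: 256Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 16
---
# Training Jobs and (with this proposal) InferenceServices would both be
# submitted through LocalQueues pointing at the same ClusterQueue.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-queue
  namespace: ml-team
spec:
  clusterQueue: hybrid-cq
```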
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.

We will probably implement `suspend` semantics on the KServe side. Additionally, we need to move #77 forward together to support the inference server's autoscaling semantics.