Open sathyanarays opened 2 months ago
Hi @sathyanarays, you are probably aware of this, but there is an upstream effort at the K8s Serving workgroup to define such a solution via LeaderWorkerSet (LWS), as you mentioned; see the meeting notes. More efforts are happening in parallel at the moment, e.g. DRA and autoscaling for LLMs, and I suspect they will impose new requirements in the long run or even earlier. Now, a few notes on the multi-pod idea:

a) Changing the spec is not a fast process; Cloud Run, among others, depends on it. It is far easier to extend things via annotations (for good or bad).
b) Service coordination is an open problem that others suffer from as well, e.g. a request that hits several microservices. I guess some mechanism could be reused.
c) Do we need a StatefulSet? In the LWS design doc they mention:
> The advantage of sts is that it offers significant functionality around pod lifecycle management. This resulted in a much simpler implementation backed by a reliable and mature API that already offers a rich set of features related to index management, rollouts, storage templates etc.
Knative does not support StatefulSets; it was designed for stateless apps, but it has implemented its own rolling strategy that still has some gaps we are trying to fill.
> If the community feels this would be a good addition to knative-serving, we can write down more details.
Imho, if we can enable an AI use case with our primitives, it would be great. This influences KServe as well. @sathyanarays have you already done a POC? Do you use Knative already?
cc @dprotaso @houshengbo @ReToCode @dsimansk @terrytangyuan
We are working on multi-host/multi-node support in KServe https://github.com/kserve/kserve/pull/3871
@sathyanarays Are you in the same team as @ArangoGutierrez @supertetelman?
Describe the feature
Some large language models are so big that they do not fit on a single node. As a result, the serving workload has to be represented as multiple pods: one of these pods acts as a head node that handles the incoming requests, and the other pods act as workers. vLLM multi-node serving follows this pattern. There are attempts to model this kind of workload as LeaderWorkerSets.
Supporting this requires the following changes in the `Service` API.

`spec.template.workerSpec`: a new section for defining the containers for the worker pods.

`spec.template.workerSpec.replicas`: defines the number of workers required for one instance of the workload.

`spec.template.workerSpec.containers`: similar to `spec.template.spec.containers`; defines the set of containers for the worker pods.

Following is an example of a `Service` with the new API.
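A minimal sketch of what such a `Service` could look like, assuming the field paths proposed above; the service name, images, and resource values are illustrative, and the `workerSpec` stanza is the proposed addition:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-multi-node             # illustrative name
spec:
  template:
    spec:
      # Leader pod: handles incoming requests and coordinates the workers.
      containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest   # illustrative image
          resources:
            limits:
              nvidia.com/gpu: "8"
    # Proposed new section: pod template for the worker pods.
    workerSpec:
      # Number of worker pods per instance of the workload.
      replicas: 2
      # Same shape as spec.template.spec.containers, applied to workers.
      containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: "8"
```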
Desired behaviors
Scale workload as a unit
The leader and the workers scale together as a unit. Based on the service example given above, the following table gives an idea of how scaling should operate.
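Assuming `workerSpec.replicas: 2` as in the sketch above, each unit is one leader plus two workers, so pod counts grow in multiples of three (the numbers are illustrative):

| Desired scale (units) | Leader pods | Worker pods | Total pods |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | 3 |
| 3 | 3 | 6 | 9 |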
Mark workload unit ready
The workload unit should be marked ready to accept requests only when the leader and all of the workers are in the ready state; see the status sketch after these behaviors.
Fail workload unit
Fail the unit and deallocate all of its resources if any of the workload's pods fail consistently.
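To make the two behaviors concrete, here is a minimal sketch of how a unit's status might look while a worker is still coming up; the `Ready` condition placement and the `WorkersNotReady` reason string are assumptions for illustration, not part of the proposal:

```yaml
# Hypothetical status for one workload unit (leader + 2 workers).
# Ready flips to "True" only once the leader and every worker pod are
# Ready; if any pod crash-loops persistently, the whole unit is failed
# and its resources are deallocated.
status:
  conditions:
    - type: Ready
      status: "False"
      reason: WorkersNotReady          # hypothetical reason
      message: "leader ready, 1/2 worker pods ready"
```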
This feature description is at a very high level. If the community feels this would be a good addition to knative-serving, we can write down more details. Please feel free to provide your feedback.
/area API /area autoscale