Open sathyanarays opened 2 months ago
Hi @sathyanarays, you are probably aware of this, but there is an upstream effort at the K8s Serving workgroup to define such a solution via LeaderWorkerSet (LWS), as you mentioned; see the meeting notes. More efforts are happening in parallel at the moment, e.g. DRA and autoscaling for LLMs, and I suspect they will impose new requirements in the long run or even earlier. Now, a few notes on the multi-pod idea:

a) Changing the spec is not a fast process; Cloud Run, among others, depends on it. It is far easier to extend things via annotations (for good or bad).
b) Service coordination is an open problem that others suffer from as well, e.g. a request that hits several microservices. I guess some mechanism could be reused.
c) Do we need a StatefulSet? In the LWS design doc they mention:
> The advantage of sts is that it offers significant functionality around pod lifecycle management. This resulted in a much simpler implementation backed by a reliable and mature API that already offers a rich set of features related to index management, rollouts, storage templates etc.
Knative does not support StatefulSets; it was designed for stateless apps, but it has implemented its own rolling strategy that still has some gaps we are trying to fill.
> If the community feels this would be a good addition to knative-serving, we can write down more details.
Imho, if we can enable an AI use case with our primitives, it would be great. This influences KServe as well. @sathyanarays have you already done a POC? Do you use Knative already?
cc @dprotaso @houshengbo @ReToCode @dsimansk @terrytangyuan
We are working on multi-host/multi-node support in KServe https://github.com/kserve/kserve/pull/3871
@sathyanarays Are you in the same team as @ArangoGutierrez @supertetelman?
Describe the feature
Some large language models are so big that they do not fit on a single node. As a result, the serving workload has to be represented as multiple pods: one of these pods acts as a head node that handles the incoming requests, and the other pods act as workers. vLLM multi-node serving follows this pattern. There are attempts to model this kind of workload as LeaderWorkerSets.
Supporting this requires the following changes in the `Service` API.

`spec.template.workerSpec`: a new section for defining the containers for the worker pods.

`spec.template.workerSpec.replicas`: defines the number of workers required for one instance of the workload.

`spec.template.workerSpec.containers`: similar to `spec.template.spec.containers`; defines the set of containers for the worker pods.

Following is an example of a `Service` with the new API.
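A minimal sketch of what such a `Service` could look like, assuming the field paths proposed above; the service name, images, and resource values are illustrative, and the `workerSpec` stanza is the proposed addition:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-multi-node             # illustrative name
spec:
  template:
    spec:
      # Leader pod: handles incoming requests and coordinates the workers.
      containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest   # illustrative image
          resources:
            limits:
              nvidia.com/gpu: "8"
    # Proposed new section: pod template for the worker pods.
    workerSpec:
      # Number of worker pods per instance of the workload.
      replicas: 2
      # Same shape as spec.template.spec.containers, applied to workers.
      containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: "8"
```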
Desired behaviors
Scale workload as a unit
The leader and the workers scale together as a unit. Based on the service example given above, the following table gives an idea of how scaling should operate.
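Assuming `workerSpec.replicas: 2` as in the sketch above, each unit is one leader plus two workers, so pod counts grow in multiples of three (the numbers are illustrative):

| Desired scale (units) | Leader pods | Worker pods | Total pods |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | 3 |
| 3 | 3 | 6 | 9 |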
Mark workload unit ready
The workload unit should be marked ready to accept requests only when the leader and all of the workers are in the ready state; see the status sketch after these behaviors.
Fail workload unit
Fail the unit and deallocate all of its resources if any of the workload's pods fail consistently.
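To make the two behaviors concrete, here is a minimal sketch of how a unit's status might look while a worker is still coming up; the `Ready` condition placement and the `WorkersNotReady` reason string are assumptions for illustration, not part of the proposal:

```yaml
# Hypothetical status for one workload unit (leader + 2 workers).
# Ready flips to "True" only once the leader and every worker pod are
# Ready; if any pod crash-loops persistently, the whole unit is failed
# and its resources are deallocated.
status:
  conditions:
    - type: Ready
      status: "False"
      reason: WorkersNotReady          # hypothetical reason
      message: "leader ready, 1/2 worker pods ready"
```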
This feature description is at a very high level. If the community feels this would be a good addition to knative-serving, we can write down more details. Please feel free to provide your feedback.
/area API /area autoscale