kubernetes-sigs / lws

LeaderWorkerSet: An API for deploying a group of pods as a unit of replication
Apache License 2.0

support leader being in its own subgroup #257

Open avrittrohwer opened 1 week ago

avrittrohwer commented 1 week ago

What would you like to be added:

Add the ability for the leader Pod to be in its own affinity group when using the subgroup feature. For example, when deploying a leader Pod that should be scheduled on a CPU-only VM and worker Pods that should be scheduled on multiple TPU slices:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: my-lws
  annotations:
    leaderworkerset.sigs.k8s.io/subgroup-exclusive-topology: cloud.google.com/gke-nodepool
spec:
  replicas: 1
  leaderWorkerTemplate:
    subGroupPolicy:
      subGroupSize: 2
    size: 5
    leaderTemplate:
      spec:
        nodeSelector:
          cloud.google.com/machine-family: n2
          node.kubernetes.io/instance-type: n2-standard-8
        containers:
        - name: leader
        ...
    workerTemplate:
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5p-slice
          cloud.google.com/gke-tpu-topology: 2x2x2
        containers:
        - name: worker
          ...
          resources:
            limits:
              google.com/tpu: "4"

Currently the leader Pod is put in subgroup 0, which gives it the same subgroup affinity key as the workers in subgroup 0: https://github.com/kubernetes-sigs/lws/blob/main/pkg/webhooks/pod_webhook.go#L132. Combined with the subgroup-exclusive-topology annotation, this requires the leader to land in the same node pool as those TPU workers, which its CPU instance-type node selectors rule out, so the leader Pod in my example is unschedulable.
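To make the conflict concrete, here is a simplified Go model of the subgroup assignment for the example above (size 5, subGroupSize 2). It is a sketch under assumptions, not the code in pod_webhook.go: it assumes pod index 0 is the leader, that workers 1..size-1 are grouped in runs of subGroupSize, and that `affinityKey` is a hypothetical helper returning a plain string rather than whatever key the webhook actually computes.

```go
package main

import "fmt"

// subGroupIndex is a simplified model of subgroup assignment.
// Assumption: index 0 is the leader; workers are numbered 1..size-1.
func subGroupIndex(workerIndex, subGroupSize int) int {
	if workerIndex == 0 {
		// The leader is placed in subgroup 0, alongside the first workers.
		return 0
	}
	// Workers are grouped in consecutive runs of subGroupSize.
	return (workerIndex - 1) / subGroupSize
}

// affinityKey is a hypothetical stand-in for the per-subgroup key that the
// exclusive-topology annotation uses to co-locate pods.
func affinityKey(groupName string, subGroup int) string {
	return fmt.Sprintf("%s-sg%d", groupName, subGroup)
}

func main() {
	const size, subGroupSize = 5, 2 // values from the example manifest
	for i := 0; i < size; i++ {
		fmt.Printf("pod %d -> %s\n", i, affinityKey("my-lws-0", subGroupIndex(i, subGroupSize)))
	}
}
```

In this model the leader gets the same key as workers 1 and 2, so under the exclusive-topology annotation all three would have to share one node pool.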

Why is this needed:

To support deploying leader-worker architectures where the leader must be scheduled in a different topology (for example, a CPU-only node pool) from the worker subgroups.

Completion requirements:

An option in subGroupPolicy that causes the leader to have its own affinity key.
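One possible shape for the requested option, sketched in Go: a boolean on the subgroup policy that, when set, gives the leader a dedicated subgroup label. `subGroupPolicy`, `LeaderOwnSubGroup`, and `subGroupLabel` are hypothetical names for illustration, not existing LWS API, and the worker grouping reuses the assumption that workers 1..size-1 are grouped in runs of subGroupSize.

```go
package main

import "fmt"

// subGroupPolicy is a hypothetical model of the requested option.
// LeaderOwnSubGroup is illustrative only, not an existing LWS API field.
type subGroupPolicy struct {
	SubGroupSize      int
	LeaderOwnSubGroup bool
}

// subGroupLabel returns a subgroup identifier for a pod.
// Assumption: index 0 is the leader; workers are numbered 1..size-1.
func subGroupLabel(p subGroupPolicy, workerIndex int) string {
	if workerIndex == 0 {
		if p.LeaderOwnSubGroup {
			// A dedicated label that no worker subgroup can share.
			return "leader"
		}
		// Current behavior: the leader joins subgroup 0.
		return "0"
	}
	return fmt.Sprint((workerIndex - 1) / p.SubGroupSize)
}

func main() {
	p := subGroupPolicy{SubGroupSize: 2, LeaderOwnSubGroup: true}
	for i := 0; i < 5; i++ {
		fmt.Printf("pod %d -> subgroup %s\n", i, subGroupLabel(p, i))
	}
}
```

With the option enabled, the leader gets the label `leader` while workers 1 to 4 get `0`, `0`, `1`, `1`, so the leader no longer shares a key with any TPU worker and can follow its own CPU node selectors.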

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

avrittrohwer commented 1 week ago

@ahg-g

ahg-g commented 1 week ago

@Edwinhr716