kubernetes-sigs / jobset

JobSet: a k8s native API for distributed ML training and HPC workloads
https://jobset.sigs.k8s.io/
Apache License 2.0

Support for JobSet Preemption #682

Open ahg-g opened 2 weeks ago

ahg-g commented 2 weeks ago

What would you like to be added:

Preemption at the whole JobSet level.

The user would like to run a training workload using JobSet on one or more accelerator islands (e.g., TPU slices). To do this, the user creates a JobSet with a replicatedJob of one or more replicas, and uses exclusive placement to ensure that each child Job lands on its own accelerator island.
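As a sketch, such a setup could look like the following. The names, image, and sizes are placeholders; the exclusive-placement annotation is the alpha one documented by JobSet:

      apiVersion: jobset.x-k8s.io/v1alpha2
      kind: JobSet
      metadata:
        name: training
        annotations:
          # Exclusive placement: one child Job per topology domain.
          alpha.jobset.sigs.k8s.io/exclusive-topology: rack
      spec:
        replicatedJobs:
        - name: workers
          replicas: 2           # one child Job per accelerator island
          template:
            spec:
              parallelism: 4
              completions: 4
              template:
                spec:
                  containers:
                  - name: trainer
                    image: trainer-image:latest  # placeholder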

Consider the case where the user would like to run multiple training workloads like the one described above with different priorities, and would like to ensure that the high-priority workload preempts the low-priority ones when there is not enough capacity to run both.
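The different priorities would be expressed with the standard PriorityClass API, which pods reference via spec.priorityClassName; a minimal example (the name and value are illustrative):

      apiVersion: scheduling.k8s.io/v1
      kind: PriorityClass
      metadata:
        name: high-priority-training
      value: 1000000        # higher values win during scheduling and preemption
      globalDefault: false
      description: "For production training workloads."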

Currently this doesn't work because of the anti-affinity rules that implement exclusive placement:

1) A low-priority workload is running.
2) A high-priority workload comes in; its leader pod is created first.
3) The leader pod of the high-priority workload can't schedule because it has anti-affinity against all the pods on the island, and cross-node preemption is not supported: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#cross-node-preemption

Exclusivity is currently enforced against any pod created by a Job that doesn't belong to the same JobSet; specifically, the following anti-affinity constraint is injected:

      podAntiAffinity: # ensures only this job lands on the rack
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: job-name
                operator: NotIn
                values:
                - my-job
              - key: job-name
                operator: Exists
            namespaceSelector: {}
            topologyKey: rack

The solution to the above problem is to limit exclusivity to the same priority level, and let pod preemption address race conditions if two jobs from different priority levels race to the same slice.

If exclusivity is limited to the same priority level, then in the above example the leader pod of the higher-priority workload will be able to preempt any of the pods of the lower-priority workload. Once it does, the worker pods of the higher-priority workload will be created, assigned to the same slice, and will preempt the rest of the lower-priority workers. (Those workers may no longer exist by then if the lower-priority workload is already restarting because of the initial preemption caused by the leader pod.)

To do this, we need to allow injecting an additional priority-based constraint into the anti-affinity term that JobSet adds automatically:

      podAntiAffinity: # ensures only this job lands on the rack
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: job-name
                operator: NotIn
                values:
                - my-job
              - key: job-name
                operator: Exists
            mismatchLabelKeys:
            - priority
            namespaceSelector: {}
            topologyKey: rack

There are two approaches to do this:

Option 1: Update the exclusivity API to allow tweaking the anti-affinity rules as discussed in https://github.com/kubernetes-sigs/jobset/issues/75; the user then explicitly sets the priority label on the jobs and tweaks the anti-affinity rule as discussed above.

Option 2: JobSet does all of that automatically and we make it part of the API: JobSet sets the priority label on the pods and injects the mismatchLabelKeys entry into the anti-affinity term.
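For illustration only, a hypothetical shape for such an API surface (the field name below is invented here, not an agreed design):

      apiVersion: jobset.x-k8s.io/v1alpha2
      kind: JobSet
      spec:
        # Hypothetical field: when enabled, JobSet labels each pod with its
        # priority and adds a mismatchLabelKeys entry for that label to the
        # injected anti-affinity term, so exclusivity only applies within
        # the same priority level.
        exclusivePerPriority: true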

I prefer option 2.

Why is this needed:

Better utilization of infrastructure and faster restart of training workloads: low-priority workloads can use spare capacity whenever the high-priority ones don't need it, and the capacity is quickly returned to the high-priority workloads when they do.

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

danielvegamyhre commented 2 weeks ago

This seems like a useful feature. Quick question: historically, for customers doing TPU multislice training with JobSet, we've recommended using Kueue to handle workload priorities and preemption (link). Is the idea here to support this natively in JobSet for customers/users who don't want to use Kueue for whatever reason (additional complexity, etc.)?

Seems like we'll also need to think about the interaction between JobSet's support for default-scheduler preemption and Kueue's priority classes and workload preemption. I'm not familiar with how Kueue implements workload preemption under the hood; would these changes interfere with Kueue's current preemption implementation?

Also, I prefer option 2 as well, since the implementation will be much more straightforward and does not rely on the new placement policy API, the scope of which has been a point of contention within WG Batch (as far as I know we haven't yet achieved alignment with the Kueue folks on this).

ahg-g commented 6 days ago

Kueue doesn't monitor the status of already dispatched workloads. So if a slice of a multi-slice high priority job fails, there is no mechanism to preempt a lower priority job.

What we are proposing here is traditional kube-scheduler preemption, so the semantics are compatible with Kueue.