volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Why isn't there job-level preemption when gang and priority are enabled #3595

Open ad2001 opened 2 months ago

ad2001 commented 2 months ago

Please provide an in-depth description of the question you have: When resources are constrained, I would like high-priority jobs to preempt all pods of one or more low-priority jobs when the gang and priority plugins are enabled together with the preempt action (see config below).

For example, I have 5 CPUs and 1 running low-priority MPIJob that requires 1 launcher (1 CPU) and 2 workers (1 CPU each) with minAvailable set to 3. At this point, only 2 CPUs are left. When I submit the same MPIJob with high priority, I would expect all 3 pods created for the low-priority job to be evicted. However, Volcano seems to evict only enough pods to fulfill the minAvailable of the high-priority job: 1 of the pods from the low-priority job is evicted while the other pods keep running. This happens when enablePreemptable is set to false for the gang plugin.

If enablePreemptable is set to true for gang, then none of the pods from the low-priority job are evicted, because of the minAvailable check in the gang plugin's preemptableFn here (a config variant with that flag is shown after the job spec below).

Is this the expected behavior?

apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, preempt"
    tiers:
    - plugins:
      - name: priority
        enablePreemptable: true
      - name: gang
        enablePreemptable: false
    - plugins:
      - name: predicates

with a job like

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  generateName: test-high-priority-job-
spec:
  runPolicy:
    cleanPodPolicy: Running
    priorityClassName: high-priority # or low-priority
    schedulingPolicy:
      minAvailable: 2
      scheduleTimeoutSeconds: 18000
    ttlSecondsAfterFinished: 500
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: sleep-launcher
            image: ubuntu:latest
            command:
            - /bin/sh
            - -c
            args:
            - sleep infinity
            resources:
              limits:
                cpu: 1
          restartPolicy: Never
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - name: sleep-worker
            image: ubuntu:latest
            command:
            - /bin/sh
            - -c
            args:
            - sleep infinity
            resources:
              limits:
                cpu: 1
          restartPolicy: Never
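For reference, the second behavior described above (no pods evicted at all) was observed with the same setup, changing only the gang plugin's flag; a sketch of the scheduler config with just that one change:

    actions: "enqueue, allocate, preempt"
    tiers:
    - plugins:
      - name: priority
        enablePreemptable: true
      - name: gang
        enablePreemptable: true   # flipped from false; everything else unchanged
    - plugins:
      - name: predicates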

What do you think about this question?: I would like to be able to preempt the entire low-priority job when a high-priority job is waiting in the queue in a resource-constrained cluster. Without such preemption behavior, jobs that need more resources can be starved by smaller, lower-priority jobs that need only a little.

Environment:

lowang-bh commented 2 months ago

First of all, gang means "all or nothing".

To achieve preemption when enablePreemptable is true, the low-priority job's minAvailable should be less than its total task number. This ensures that there are tasks available to be preempted: the tasks beyond minAvailable can be considered "elastic" tasks, which can be preempted when needed.

The "elastic" tasks can be preempted by a single job at once or preempted by multiple jobs as required. This flexibility allows for efficient resource management and allocation based on demand.

Preempting the entire low-priority job is not supported now, but that has nothing to do with your situation. If you want all of the low-priority job's pods to be preemptable when enablePreemptable is true, setting its minAvailable to zero is OK.
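A minimal sketch of that last suggestion, applied to the low-priority job (only the scheduling policy shown; the rest of the spec stays unchanged):

spec:
  runPolicy:
    priorityClassName: low-priority
    schedulingPolicy:
      minAvailable: 0   # no gang floor: every task is elastic, so the whole job can be preempted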