volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Gang scheduling job with high-priority not preempting lower priority jobs #2337

Closed: talcoh2x closed this issue 1 year ago

talcoh2x commented 2 years ago

We run a gang-scheduled job with high priority, but the default-priority jobs do not release their resources once there aren't enough resources for the high-priority job.

Expected: in such cases, the lower-priority jobs should be evicted.

Volcano version: 1.6.0. K8s version: 1.22/1.21.
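
For context, the Worker template in the YAML further down sets priorityClassName: high. A minimal sketch of such a PriorityClass (the name matches the spec; the value and description are illustrative assumptions, not taken from the reporter's cluster):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high          # referenced by priorityClassName in the MPIJob Worker spec below
value: 1000000        # illustrative; only needs to exceed the default-priority class
globalDefault: false
description: "High-priority class expected to preempt default-priority jobs"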

william-wang commented 2 years ago

Thanks for the report. It seems to be the same issue as #2034; we are working on it.

william-wang commented 2 years ago

@talcoh2x Would you like to provide the job yaml?

snirkop89 commented 2 years ago
# High priority job
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
  kind: MPIJob
  metadata:
    creationTimestamp: "2022-07-07T10:33:13Z"
    generation: 1
    name: high-priority-mpijob
    namespace: app
    resourceVersion: "5940716"
    uid: 67724b4f-84d0-473d-bc62-317bf686fa90
  spec:
    mpiReplicaSpecs:
      Launcher:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: high-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - args:
              - sleep 1d
              # command shortened
              command:
              - mpirun
              - ...
              image: goodimage
              name: high-priority-mpijob
              resources:
                limits:
                  cpu: "2"
                  memory: 2Gi
                requests:
                  cpu: "2"
                  memory: 2Gi
              volumeMounts:
              - mountPath: /software
                name: software
            initContainers:
            - args:
              - mkdir -p /root/logs/launcher && ./dnswaiter
              command:
              - /bin/bash
              - -c
              image: goodimage
              imagePullPolicy: Always
              name: wait-dns
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: 100m
                  memory: 500Mi
              volumeMounts:
              - mountPath: /etc/mpi
                name: mpi-job-config
              - mountPath: /root/logs
                name: logs
              workingDir: /root
            restartPolicy: Never
      Worker:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: high-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - image: goodimage
              name: high-priority-mpijob
              resources:
                limits:
                  cpu: "50"
                  habana.ai/gaudi: "4" # gpu
                  hugepages-2Mi: 100000Mi
              securityContext:
                privileged: true
              volumeMounts:
                # some mounts
              workingDir: /root
            initContainers:
            - args:
              - mkdir -p /root/logs/$HOSTNAME
              command:
              - /bin/bash
              - -c
              env:
              - name: DRIVER_WITH_NETWORK
                value: "false"
              image: goodimage
              imagePullPolicy: IfNotPresent
              name: prepare-node
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: "5"
                  memory: 5Gi
              securityContext:
                privileged: false
              volumeMounts:
              # mounts
              workingDir: /root
            priorityClassName: high
            schedulerName: volcano
            volumes:
            # some volumes
    runPolicy:
      backoffLimit: 0
      cleanPodPolicy: All
      ttlSecondsAfterFinished: 300
    slotsPerWorker: 8
  status:
    conditions:
    - lastTransitionTime: "2022-07-07T10:33:13Z"
      lastUpdateTime: "2022-07-07T10:33:13Z"
      message: MPIJob xxxx is created.
      reason: MPIJobCreated
      status: "True"
      type: Created
    replicaStatuses:
      Launcher: {}
      Worker: {}
    startTime: "2022-07-07T10:33:13Z"

# No priority jobs
- apiVersion: kubeflow.org/v1
  kind: MPIJob
  metadata:
    creationTimestamp: "2022-07-07T10:32:17Z"
    generation: 1
    name: no-priority-mpijob
    namespace: app
    resourceVersion: "5940444"
    uid: 9359f6ef-1bde-427d-b4bf-74a86fe3467a
  spec:
    mpiReplicaSpecs:
      Launcher:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: no-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - args:
              - sleep 1d
              command:
              - mpirun
              - --allow-run-as-root
              # ....
              image: goodimage
              name: no-priority-mpijob
              resources:
                limits:
                  cpu: "2"
                  memory: 2Gi
                requests:
                  cpu: "2"
                  memory: 2Gi
              volumeMounts:
              - mountPath: /software
                name: software
            initContainers:
            - args:
              - mkdir -p /root/logs/launcher && ./dnswaiter
              command:
              - /bin/bash
              - -c
              image: goodimage
              imagePullPolicy: Always
              name: wait-dns
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: 100m
                  memory: 500Mi
              volumeMounts:
              #mounts
              workingDir: /root
            restartPolicy: Never
            volumes:
            # some volumes
      Worker:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: no-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - image: goodimage
              name: no-priority-mpijob
              resources:
                limits:
                  cpu: "50"
                  habana.ai/gaudi: "4" # gpu
                  hugepages-2Mi: 100000Mi
              securityContext:
                privileged: true
              volumeMounts:
              #mounts
              workingDir: /root
            initContainers:
            - args:
              - mkdir -p /root/logs/$HOSTNAME
              command:
              - /bin/bash
              - -c
              env:
              - name: DRIVER_WITH_NETWORK
                value: "false"
              image: goodimage
              imagePullPolicy: IfNotPresent
              name: prepare-node
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: "5"
                  memory: 5Gi
              securityContext:
                privileged: false
              volumeMounts:
              # mounts
              workingDir: /root
            schedulerName: volcano
            volumes:
            # volumes
    runPolicy:
      backoffLimit: 0
      cleanPodPolicy: All
      ttlSecondsAfterFinished: 300
    slotsPerWorker: 8
  status:
    conditions:
    - lastTransitionTime: "2022-07-07T10:32:17Z"
      lastUpdateTime: "2022-07-07T10:32:17Z"
      message: MPIJob xxx is created.
      reason: MPIJobCreated
      status: "True"
      type: Created
    replicaStatuses:
      Launcher: {}
      Worker:
        active: 1
    startTime: "2022-07-07T10:32:17Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
william-wang commented 2 years ago

/assign @waiterQ

volcano-sh-bot commented 2 years ago

@william-wang: GitHub didn't allow me to assign the following users: waiterQ.

Note that only volcano-sh members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/volcano-sh/volcano/issues/2337#issuecomment-1180014296):

> /assign @waiterQ

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

Thor-wl commented 2 years ago

@talcoh2x @snirkop89 I'm also running a test for this bug. Can you provide your scheduler configuration?

snirkop89 commented 2 years ago

Sure. I tried a few variations (found in the issues here); both yielded the same result: preemption does not occur:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  creationTimestamp: "2022-06-29T11:24:44Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "5116689"
  uid: a60fff7a-6da2-4f7b-922c-da3447fae82f
---
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  creationTimestamp: "2022-06-29T11:24:44Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "5116689"
  uid: a60fff7a-6da2-4f7b-922c-da3447fae82f
Thor-wl commented 2 years ago

@snirkop89 Hi, Snir. I've taken a look at the bug, and preemption was indeed broken. There are several reasons for that. First, the PodGroup for the high-priority job cannot transition from Pending to Inqueue, so the job has no chance to get resources. You can configure the scheduler as follows to disable the jobEnqueued functions:

    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
        enableJobEnqueued: false  ## disable jobEnqueued function for overcommit plugin 
      - name: drf
      - name: predicates
      - name: proportion
        enableJobEnqueued: false  ## disable jobEnqueued function for proportion plugin
      - name: nodeorder
      - name: binpack

From what I tested locally, this makes the high-priority PodGroup enter the Inqueue status, but preemption is still not working. I'll give more feedback as soon as the root cause is found.

zhypku commented 2 years ago

@Thor-wl Hi, I'm also studying the preemption behavior of Volcano and found the same problem. It seems that the JobStarvingFn of the gang plugin forbids preemption on behalf of a job for which ji.CheckTaskMinAvailablePipelined() is false. I did find this line in the scheduler's log (in my test, the job default/priority-job has a higher priority but is waiting):

I0721 03:40:52.353591 1 job_info.go:773] Job default/priority-job Task default-nginx occupied 0 less than task min avaliable

Then I disabled the JobStarvingFn of the gang plugin by setting enableJobStarving: false, and preemption worked. So is this a by-design feature or a bug? Why does a false return value of CheckTaskMinAvailablePipelined prohibit preemption?
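
For reference, a sketch of how the two workarounds reported in this thread (Thor-wl's enableJobEnqueued: false and zhypku's enableJobStarving: false) would slot into the ConfigMap shared earlier. This is a reader's aggregation of the two comments, not a configuration the maintainers posted:

    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enableJobStarving: false  ## disable the gang plugin's JobStarvingFn, per the comment above
      - name: conformance
    - plugins:
      - name: overcommit
        enableJobEnqueued: false  ## disable jobEnqueued function for overcommit plugin
      - name: drf
      - name: predicates
      - name: proportion
        enableJobEnqueued: false  ## disable jobEnqueued function for proportion plugin
      - name: nodeorder
      - name: binpack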

Thor-wl commented 2 years ago

> @Thor-wl Hi, I'm also studying the preemption behavior of Volcano and found the same problem. It seems that the JobStarvingFn of the gang plugin forbids preemption on behalf of a job for which ji.CheckTaskMinAvailablePipelined() is false. [...] So is this a by-design feature or a bug? Why does a false return value of CheckTaskMinAvailablePipelined prohibit preemption?

Thanks for the feedback. That's what I also found yesterday. IMO, it's not something as expected. I'm tracking which commit and when this behavior is introduced.

HecarimV commented 2 years ago

https://github.com/volcano-sh/volcano/blob/1b2630605b3f6669b5db9a77abd12d08f923e6d3/pkg/scheduler/actions/preempt/preempt.go#L124-L126

Pods with Preemptable = false will not be preempted, but it seems that task.Preemptable is false by default if we don't set the annotation or label: https://github.com/volcano-sh/volcano/blob/1b2630605b3f6669b5db9a77abd12d08f923e6d3/pkg/scheduler/api/pod_info.go#L101

@Thor-wl I don't know if this could be the problem. The similar reclaim action may have the same problem as well (#2340): https://github.com/volcano-sh/volcano/blob/1b2630605b3f6669b5db9a77abd12d08f923e6d3/pkg/scheduler/actions/reclaim/reclaim.go#L135-L137

Thor-wl commented 2 years ago

> https://github.com/volcano-sh/volcano/blob/1b2630605b3f6669b5db9a77abd12d08f923e6d3/pkg/scheduler/actions/preempt/preempt.go#L124-L126
>
> Pods with Preemptable = false will not be preempted, but it seems that task.Preemptable is false by default if we don't set the annotation or label: https://github.com/volcano-sh/volcano/blob/1b2630605b3f6669b5db9a77abd12d08f923e6d3/pkg/scheduler/api/pod_info.go#L101
>
> @Thor-wl I don't know if this could be the problem. The similar reclaim action may have the same problem as well (#2340): https://github.com/volcano-sh/volcano/blob/1b2630605b3f6669b5db9a77abd12d08f923e6d3/pkg/scheduler/actions/reclaim/reclaim.go#L135-L137

To stay compatible with earlier versions, task.Preemptable should be true by default. I've tracked the commits, and the default value of false was introduced here: https://github.com/volcano-sh/volcano/blob/2bb5ac7a7c593da6475e51118cf7a69e117ceafa/pkg/scheduler/api/pod_info.go#L76

@wpeng102 It seems that the TDM plugin introduced this behavior. Let's take a look. Thanks!
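
Until a fix lands, a possible workaround sketch suggested by the code paths linked above: explicitly mark the low-priority pods as preemptable via the volcano.sh/preemptable annotation that pod_info.go reads. The annotation key comes from the Volcano scheduler API; applying it to the MPIJob pod template like this is a reader's assumption, not a maintainer-confirmed fix:

# Fragment of the low-priority job's Worker pod template (hypothetical usage)
metadata:
  annotations:
    volcano.sh/preemptable: "true"   # overrides the default of false discussed above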

snirkop89 commented 2 years ago

That's great to hear. Thank you for the fast response and feedback.

Thor-wl commented 2 years ago

> That's great to hear. Thank you for the fast response and feedback.

No worries. The fix is under discussion.

snirkop89 commented 2 years ago

Hi, is there an update about this?

talcoh2x commented 2 years ago

@william-wang @Thor-wl
Hi guys, do you have any updates? We are really stuck and need help.

talcoh2x commented 2 years ago

Hi, is there anything new to report?

talcoh2x commented 2 years ago

@zhypku Hi, can you share the Volcano configuration that worked for you? I mean for the preemption flow.

stale[bot] commented 1 year ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] commented 1 year ago

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗