volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Jobs in two queues reclaim each other's tasks in an endless loop #3729

Closed lowang-bh closed 1 week ago

lowang-bh commented 2 weeks ago

Description

In a cluster with 11 CPU cores (11C), queue-a has deserved=5C and capability=10C, and queue-b is configured identically. First, create job-a with replicas=5 and minAvailable=2; job-a takes 10C. Then create job-b, identical to job-a but submitted to queue-b; it reclaims resources and evicts two of job-a's tasks. Queue-a's usage is now below its deserved share, so queue-a reclaims from queue-b in turn, and the cycle repeats indefinitely.
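The ping-pong described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Volcano's reclaim action or capacity plugin: the names `reclaim` and `simulate`, the rule "a queue reclaims while its usage is below deserved", and the optional `guard` check (refuse an eviction that would push the victim queue below its own deserved share) are all hypothetical, memory is ignored, and the guard is not claimed to be what any particular fix does. The real scheduler may evict a different number of tasks per cycle.

```python
# Toy model of two queues reclaiming from each other (CPU only).
# Numbers come from the report; the reclaim rule itself is an assumption.
CLUSTER_CPU = 11  # total CPU in the cluster
DESERVED = 5      # deserved CPU of each queue
TASK_CPU = 2      # CPU request of each task


def reclaim(used, claimer, victim, guard):
    """Let `claimer` grow until it reaches DESERVED, evicting `victim` tasks
    when the cluster has no room. Returns the number of evicted tasks."""
    evicted = 0
    while used[claimer] < DESERVED:
        free = CLUSTER_CPU - sum(used.values())
        if free >= TASK_CPU:
            used[claimer] += TASK_CPU  # schedule one pending task
            continue
        if used[victim] < TASK_CPU:
            break  # nothing left to evict
        # Hypothetical guard: never push the victim below its deserved share.
        if guard and used[victim] - TASK_CPU < DESERVED:
            break
        used[victim] -= TASK_CPU  # evict one task from the victim queue
        evicted += 1
    return evicted


def simulate(rounds, guard):
    """Alternate reclaim attempts, starting with queue-b reclaiming from queue-a."""
    used = {"queue-a": 10, "queue-b": 0}  # job-a already runs 5 tasks
    claimer, victim = "queue-b", "queue-a"
    history = []
    for _ in range(rounds):
        history.append(reclaim(used, claimer, victim, guard))
        claimer, victim = victim, claimer
    return history, used


print(simulate(6, guard=False)[0])  # -> [3, 1, 1, 1, 1, 1]
print(simulate(6, guard=True)[0])   # -> [2, 0, 0, 0, 0, 0]
```

Without the guard, every round evicts a task because 5C is not a multiple of the 2C task size, so one queue always sits at 4C (below deserved) while the other sits at 6C. With the guard, the model settles at 6C and 4C after two evictions.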


Steps to reproduce the issue

Use the following scheduler ConfigMap:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, reclaim, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: capacity
      - name: nodeorder
      - name: binpack
kind: ConfigMap
  1. apply queue-a, queue-b with yaml
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: queue-a
    spec:
      reclaimable: true
      deserved:
        cpu: 5
        memory: 2Gi
      capability:
        cpu: 10
        memory: 5Gi
    ---
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: queue-b
    spec:
      reclaimable: true
      deserved:
        cpu: 5
        memory: 2Gi
      capability:
        cpu: 10
        memory: 5Gi
  2. apply job-a.yaml
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: job-a
    spec:
      schedulerName: volcano
      queue: queue-a
      tasks:
        - replicas: 5
          minAvailable: 2
          name: "master"
          template:
            metadata:
              annotations:
                volcano.sh/preemptable: "true"
            spec:
              containers:
                - image: nginx:1.14.2
                  name: nginx
                  resources:
                    requests:
                      cpu: "2"
                      memory: "50Mi"
              restartPolicy: OnFailure
  3. apply job-b.yaml
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: job-b
    spec:
      schedulerName: volcano
      queue: queue-b
      tasks:
        - replicas: 5
          minAvailable: 2
          name: "worker"
          template:
            metadata:
              annotations:
                volcano.sh/preemptable: "true"
            spec:
              containers:
                - image: nginx:1.14.2
                  name: nginx
                  resources:
                    requests:
                      cpu: "2"
                      memory: "50Mi"
              restartPolicy: OnFailure

Describe the results you received and expected

After upgrading the scheduler image to a build of https://github.com/volcano-sh/volcano/pull/3696, scheduling stays stable.

What version of Volcano are you using?

master

Any other relevant information

master branch at 95d5a923056b9833bf27f0ccdfb28b59cba28c2d

Monokaix commented 1 week ago

Good catch.