volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Preemption between jobs in the same queue does not work well when the gang plugin is enabled #3567

Open zhifanggao opened 3 months ago

zhifanggao commented 3 months ago

What happened: The expected preemption did not happen.

What you expected to happen: Higher-priority jobs can preempt lower-priority jobs when the gang plugin is enabled.

How to reproduce it (as minimally and precisely as possible):

  1. volcano-scheduler.conf (screenshot WechatIMG1460, image not preserved)

  2. Create a queue test-kyuubi with a capacity of 4 CPU and 512 memory.

  3. Install the lower-priority vcjob1 from a Helm chart, with a 2 CPU request.

  4. Install the higher-priority vcjob2 from a Helm chart, with a 3 CPU request.

Anything else we need to know?: Logs (screenshots WechatIMG19 and WechatIMG18, images not preserved)

Environment:

zhifanggao commented 3 months ago

Preemption in this case works without the gang plugin.

lowang-bh commented 2 months ago

You have disabled preemption in the gang plugin in your config.
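For readers following along: the reporter's config was attached only as a screenshot, so the fragment below is a hypothetical sketch and not their actual file. It shows where the `enablePreemptable` option sits on the gang plugin in a `volcano-scheduler.conf`. The action list and the surrounding plugin names are assumptions modelled on a typical Volcano setup.

```yaml
# Hypothetical volcano-scheduler.conf sketch (not the reporter's real config).
# The "preempt" action must be listed for preemption to run at all.
actions: "enqueue, allocate, preempt, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
    # With this set to false, the gang plugin does not take part in
    # the preemptable decision.
    enablePreemptable: false
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
```

Removing the `enablePreemptable: false` line, or setting it to `true`, would let the gang plugin take part in victim selection again.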

Monokaix commented 2 months ago

Hi, please adjust log level to 5 and paste the logs.
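As a rough illustration of this request, the scheduler's verbosity is usually raised through the klog `-v` flag on the scheduler container. The fragment below is a sketch that assumes a default-style install; the container name and the other arguments are assumptions and may differ in a given deployment.

```yaml
# Fragment of the volcano-scheduler Deployment pod spec (assumed layout).
containers:
- name: volcano-scheduler
  args:
  - --logtostderr
  - --scheduler-conf=/volcano.scheduler/volcano-scheduler.conf
  # klog verbosity; 5 is the level requested in this thread.
  - -v=5
```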

zhifanggao commented 2 months ago

The log level is already 5, and all the logs that refer to vcjob2 are here. I think the gang scheduler returns "all node are unavailable" and then the scheduler session is closed.

Monokaix commented 2 months ago

Can you paste your vcjobs and queue yaml?

zhifanggao commented 2 months ago

@lowang-bh It does not matter whether enablePreemptable is true or false. When I put the gang plugin on the last line of the plugins list, preemption works well.
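The reordering described in this comment might look like the sketch below. Only the placement of `gang` as the last plugin entry comes from the comment; the tier layout, the action list, and the neighbouring plugins are assumptions, and the comment does not say whether "last line" means last in the first tier or last overall.

```yaml
# Hypothetical reordering: gang moved to the end of the plugin list.
actions: "enqueue, allocate, preempt, backfill"
tiers:
- plugins:
  - name: priority
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  # gang is now the final plugin entry
  - name: gang
```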

zhifanggao commented 2 months ago

@Monokaix

vcjob1

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    meta.helm.sh/release-name: low
    meta.helm.sh/release-namespace: preempt
    volcano.sh/preemptable: "true"
  creationTimestamp: "2024-07-09T06:50:30Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: vc-job1
  namespace: preempt
  resourceVersion: "87091459"
  uid: 8ec099ae-470c-48b4-afdd-af4c4371049b
spec:
  maxRetry: 3
  minAvailable: 1
  policies:
  - action: RestartJob
    event: PodEvicted
  queue: test-kyuubi
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: job1
    policies:
    - action: CompleteJob
      event: TaskCompleted
    replicas: 1
    template:
      metadata:
        annotations:
          volcano.sh/preemptable: "true"
      spec:
        containers:
        - command:
          - sleep
          - 10m
          image: nginx:latest
          imagePullPolicy: IfNotPresent
          name: nginx
          resources:
            limits:
              cpu: "32"
            requests:
              cpu: "32"
        restartPolicy: OnFailure

vcjob3

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    meta.helm.sh/release-name: high
    meta.helm.sh/release-namespace: preempt
    volcano.sh/Preemptable: "true"
  creationTimestamp: "2024-07-09T06:50:41Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: vc-job3
  namespace: preempt
  resourceVersion: "87088972"
  uid: cc16806a-14b5-429d-97f3-136c82c0ba5e
spec:
  maxRetry: 3
  minAvailable: 1
  policies:
  - action: RestartJob
    event: PodEvicted
  priorityClassName: system-cluster-critical
  queue: test-kyuubi
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: job3
    policies:
    - action: CompleteJob
      event: TaskCompleted
    replicas: 1
    template:
      metadata:
        annotations:
          volcano.sh/preemptable: "true"
      spec:
        containers:
        - command:
          - sleep
          - 10m
          image: nginx:latest
          name: nginx
          resources:
            limits:
              cpu: "32"
            requests:
              cpu: "32"
        priorityClassName: system-cluster-critical
        restartPolicy: OnFailure

queue

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  creationTimestamp: "2024-06-05T03:28:14Z"
  generation: 2
  name: test-kyuubi
  resourceVersion: "87091460"
  uid: a764d42f-f8fb-49f5-b281-9b9076bb6973
spec:
  capability:
    cpu: "32"
    memory: 40960000Mi
  reclaimable: true
  weight: 1