kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Custom Volcano Queues not working with MPIJob V1 #2325

Open ameya-parab opened 1 week ago

ameya-parab commented 1 week ago

What happened?

I am unable to use any custom queues created for the Volcano scheduler with Kubeflow MPIJobs. When Volcano creates a PodGroup, it is automatically assigned to the default queue rather than the custom queue specified in the runPolicy.schedulingPolicy.queue field.

The following MPIJob should use the custom queue production, but it instead uses the default queue.

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: horovod-mnist-high
  namespace: my-namespace
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            Owner: my-namespace
            pipelines.kubeflow.org/pipeline-sdk-type: kfp
            training.kubeflow.org/type: mpijobs
        spec:
          schedulerName: volcano
          containers:
          - args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /horovod/examples/pytorch/pytorch_mnist.py
            - --epochs
            - "100"
            command:
            - mpirun
            env:
            - name: OMPI_JOB_NAME
              value: horovod-mnist-high
            - name: KUBERNETES_NAMESPACE
              value: my-namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            image: horovod/horovod:0.28.1
            imagePullPolicy: Always
            name: launcher
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
              requests:
                cpu: "1"
                memory: 2Gi
            securityContext:
              privileged: true
          hostNetwork: false
          serviceAccountName: default-editor
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            Owner: my-namespace
            pipelines.kubeflow.org/pipeline-sdk-type: kfp
            training.kubeflow.org/type: mpijobs
        spec:
          schedulerName: volcano
          containers:
          - env:
            - name: OMPI_JOB_NAME
              value: horovod-mnist-high
            - name: KUBERNETES_NAMESPACE
              value: my-namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            image: horovod/horovod:0.28.1
            imagePullPolicy: Always
            name: worker
            resources:
              limits:
                memory: 10Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: "2"
                memory: 5Gi
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /usr/local/bin/kubectl
              name: mpi-job-kubectl
              subPath: kubectl
            - mountPath: /dev/shm
              name: dshm
          hostNetwork: false
          initContainers:
          - command:
            - cp
            - /opt/bitnami/kubectl/bin/kubectl
            - /shared/kubectl
            env:
            - name: OMPI_JOB_NAME
              value: horovod-mnist-high
            - name: KUBERNETES_NAMESPACE
              value: my-namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            image: bitnami/kubectl:1.30.6
            imagePullPolicy: Always
            name: kubectl-delivery
            resources:
              requests:
                cpu: 100m
                memory: 128Mi
            volumeMounts:
            - mountPath: /shared
              name: mpi-job-kubectl
          serviceAccountName: default-editor
          volumes:
          - emptyDir:
              medium: Memory
            name: dshm
          - emptyDir: {}
            name: mpi-job-kubectl
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      queue: production
      minAvailable: 3
  slotsPerWorker: 1

Resultant PodGroup:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: podgroup-1d7357bc-880f-4334-97e7-d3e3c06e0f47
  namespace: my-namespace
status:
  conditions:
  - lastTransitionTime: '2024-11-09T19:55:01Z'
    reason: tasks in gang are ready to be scheduled
    status: 'True'
    transitionID: 01d05369-859e-45c4-b20c-5801e577552e
    type: Scheduled
  phase: Running
  running: 1
spec:
  minMember: 1
  minResources:
    count/pods: '1'
    cpu: '1'
    limits.cpu: '2'
    limits.memory: 4Gi
    memory: 2Gi
    pods: '1'
    requests.cpu: '1'
    requests.memory: 2Gi
  queue: default

What did you expect to happen?

If runPolicy.schedulingPolicy.queue specifies a custom queue, the Volcano PodGroup should be assigned to that queue, not the default Volcano queue.
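For illustration, a sketch of the PodGroup the MPIJob above would be expected to produce if the schedulingPolicy were honored (field values mirror the MPIJob spec; the name is the one Volcano generated above):

```yaml
# Expected PodGroup (illustrative sketch, not actual controller output):
# both queue and minMember should come from runPolicy.schedulingPolicy.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: podgroup-1d7357bc-880f-4334-97e7-d3e3c06e0f47
  namespace: my-namespace
spec:
  minMember: 3       # from runPolicy.schedulingPolicy.minAvailable, not 1
  queue: production  # from runPolicy.schedulingPolicy.queue, not "default"
```

Note that the actual PodGroup above also has minMember: 1 rather than the requested minAvailable: 3, consistent with the whole schedulingPolicy being ignored.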

Environment

Kubernetes version: 1.25
Training Operator version: kubeflow/training-operator:v1-855e096
Training Operator Python SDK version: NA
Volcano version: 1.10.0

Impacted by this bug?

Give it a 👍 We prioritize the issues with the most 👍

Electronic-Waste commented 2 days ago

Did you install the training-operator with the gang-scheduler-name argument set to volcano? If not, you can check this as a reference: https://www.kubeflow.org/docs/components/training/user-guides/job-scheduling/#volcano-scheduler
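As a concrete example, a minimal sketch of passing that flag to the controller Deployment (assuming the default install, where the Deployment and container are both named training-operator in the kubeflow namespace; adjust names to your setup):

```yaml
# Sketch: enable Volcano gang scheduling on the training-operator controller.
# Deployment/container/namespace names assume a default install.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: training-operator
        image: kubeflow/training-operator:v1-855e096
        args:
        - --gang-scheduler-name=volcano  # make the operator create Volcano PodGroups
```

After applying this (or the equivalent kustomize patch) and restarting the controller, the operator should create PodGroups that honor runPolicy.schedulingPolicy.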

By default, we use kueue as the gang scheduler. So even if you specify schedulerName in the pod template, the training-operator will still interpret your runPolicy field as kueue config.

cc👀 @kubeflow/wg-training-leads