kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

specifying minAvailable for volcano doesn't work #468

Closed snirkop89 closed 2 years ago

snirkop89 commented 2 years ago

I've built a CLI tool to create an MPI job based on our company's needs.

We want to use the volcano scheduler and pass it the minAvailable parameter. The package "github.com/kubeflow/mpi-operator/v2/pkg/apis/kubeflow/v2beta1" exposes 'schedulingPolicy' under 'runPolicy', but when I create the job, the field does not exist when I describe the job.

If I create the YAML manually, I get the error: error validating data: ValidationError(MPIJob.spec.runPolicy): unknown field "schedulingPolicy" in org.kubeflow.v2beta1.MPIJob.spec.runPolicy; if you choose to ignore these errors, turn validation off with --validate=false

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: demo-resnet-keras
spec:
  slotsPerWorker: 8
  runPolicy:
    backoffLimit: 1
    schedulingPolicy:
      minAvailable: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: some-docker-image-with-mpi-run
              name: demo-launcher
              resources:
                limits:
                  cpu: 500m
                  memory: 200Mi
              command:
                [
                  "mpirun",
                  "--allow-run-as-root",
                  "--bind-to",
                  "core",
                  "-np",
                  "8",
                  "--map-by",
                  "socket:PE=4",
                  "bash",
                  "-c",
                ]
              args:
                - 'echo hello there && sleep 100'
    Worker:
      replicas: 5
      template:
        spec:
          containers:
            - image: some-docker-image-with-mpi-run
              name: demo-worker
              resources:
                limits:
                  cpu: 1
                  memory: 5Gi

MPI operator deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: mpi-operator
    app.kubernetes.io/component: mpijob
    app.kubernetes.io/name: mpi-operator
    kustomize.component: mpi-operator
  name: mpi-operator
  namespace: mpi-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mpi-operator
      app.kubernetes.io/component: mpijob
      app.kubernetes.io/name: mpi-operator
      kustomize.component: mpi-operator
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      labels:
        app: mpi-operator
        app.kubernetes.io/component: mpijob
        app.kubernetes.io/name: mpi-operator
        kustomize.component: mpi-operator
    spec:
      containers:
      - args:
        - -alsologtostderr
        - --lock-namespace
        - mpi-operator
        - --gang-scheduling=volcano
        image: mpioperator/mpi-operator:latest
        name: mpi-operator
      serviceAccountName: mpi-operator

There are no errors in the MPI operator logs, only success messages:

 1 mpi_job_controller.go:454] Finished syncing job "default/demo-resnet-keras" (29.487853ms)
 1 mpi_job_controller.go:436] Successfully synced 'default/demo-resnet-keras'

How do we use the volcano scheduler correctly, and how do we pass the scheduling policy parameters?

alculquicondor commented 2 years ago

schedulingPolicy was never supported. It's not listed in the CRD definition: https://github.com/kubeflow/mpi-operator/blob/master/manifests/base/crd.yaml

But you can still set up volcano to apply to all your jobs: https://github.com/kubeflow/mpi-operator/blob/master/v2/cmd/mpi-operator/app/options/options.go#L65
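One way to check whether a field is listed in a CRD is to walk the `properties` of its OpenAPI schema along the field path. The following is a minimal sketch of that idea, not the operator's code: `TOY_SCHEMA` is a hypothetical, heavily trimmed stand-in for the real schema in crd.yaml, and `field_in_schema` is a helper name made up for illustration.

```python
# Hypothetical, heavily trimmed stand-in for the openAPIV3Schema of the
# MPIJob CRD. The real schema has many more fields; this is an assumption
# made only to illustrate the lookup.
TOY_SCHEMA = {
    "type": "object",
    "properties": {
        "spec": {
            "type": "object",
            "properties": {
                "slotsPerWorker": {"type": "integer"},
                "runPolicy": {
                    "type": "object",
                    "properties": {
                        "backoffLimit": {"type": "integer"},
                    },
                },
            },
        },
    },
}


def field_in_schema(schema, dotted_path):
    """Return True if every segment of dotted_path is a declared property."""
    node = schema
    for segment in dotted_path.split("."):
        properties = node.get("properties", {})
        if segment not in properties:
            return False
        node = properties[segment]
    return True


print(field_in_schema(TOY_SCHEMA, "spec.runPolicy.backoffLimit"))      # True
print(field_in_schema(TOY_SCHEMA, "spec.runPolicy.schedulingPolicy"))  # False
```

Against a real cluster, the same lookup could be run on the schema loaded from the installed CRD instead of the toy dictionary.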

snirkop89 commented 2 years ago

Ok, Thanks. I looked under v2/crd and saw it - that was my mistake. Is there a plan to support it in the next release?

Just to clarify, we can use volcano just as it is, without sending it any options, right?

alculquicondor commented 2 years ago

There are no plans for further support. But you are always welcome to submit contributions.

The v2/crd was automatically generated. But even if the type is accepted, there is nothing in the code that pays attention to that field.
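The symptom reported at the top of the thread (the field is set by the client but is missing when the job is described) is consistent with how Kubernetes prunes fields that a CRD's structural schema does not declare. The sketch below imitates that pruning in a simplified form; the schema and the `prune_unknown_fields` helper are hypothetical, and real pruning handles more cases (arrays, metadata, preserve-unknown-fields markers).

```python
import copy

# Hypothetical trimmed schema: runPolicy declares backoffLimit only.
TOY_SCHEMA = {
    "properties": {
        "spec": {
            "properties": {
                "slotsPerWorker": {"type": "integer"},
                "runPolicy": {
                    "properties": {"backoffLimit": {"type": "integer"}},
                },
            },
        },
    },
}


def prune_unknown_fields(obj, schema):
    """Return a copy of obj keeping only keys declared in schema['properties']."""
    properties = schema.get("properties")
    if not isinstance(obj, dict) or properties is None:
        # Leaf value, or a schema node without declared properties: keep as-is.
        return copy.deepcopy(obj)
    return {
        key: prune_unknown_fields(value, properties[key])
        for key, value in obj.items()
        if key in properties
    }


submitted = {
    "spec": {
        "slotsPerWorker": 8,
        "runPolicy": {
            "backoffLimit": 1,
            "schedulingPolicy": {"minAvailable": 1},  # not declared in the schema
        },
    },
}

stored = prune_unknown_fields(submitted, TOY_SCHEMA)
print(stored)
# {'spec': {'slotsPerWorker': 8, 'runPolicy': {'backoffLimit': 1}}}
```

The input object is left unchanged; only the returned copy lacks the undeclared field, which mirrors a client sending a field that never shows up in the stored resource.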

snirkop89 commented 2 years ago

Understood. Thank you for your help.