kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Can't get MPIJob status when pod template is invalid #604

Open congpeiqing opened 7 months ago

congpeiqing commented 7 months ago

I created an MPIJob with an invalid pod template, and I can never get the MPIJob status (I think the status should be Failed).
As a result, I cannot distinguish the MPIJobs that are simply too new to have a status from the MPIJobs with an invalid pod template.

My MPIJob is shown below. Running kubectl get mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod -oyaml gives the following output:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  creationTimestamp: "2023-11-15T02:01:44Z"
  generation: 1
  labels:
    deadline: 2023-11-15_02-06-44
  name: ai62da0dbe-6406-4252-85d6-51ef87eab10d
  namespace: cpod
  resourceVersion: "2787007"
  uid: e5703c73-f27e-45ef-9049-fd40c152d4d6
spec:
  launcherCreationPolicy: WaitForWorkersReady
  mpiImplementation: OpenMPI
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: "111"
            imagePullPolicy: IfNotPresent
            name: launcher
          hostIPC: true
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - image: "111"
            imagePullPolicy: IfNotPresent
            name: worker
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - mountPath: "111"
              name: ckpt-pv
            - mountPath: "111"
              name: saved-model-pv
          hostIPC: true
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-GeForce-RTX-3090
          volumes:
          - name: ckpt-pv
            persistentVolumeClaim:
              claimName: ai62da0dbe-6406-4252-85d6-51ef87eab10d-ckpt
              readOnly: false
          - name: saved-model-pv
            persistentVolumeClaim:
              claimName: ai62da0dbe-6406-4252-85d6-51ef87eab10d-modelsave
              readOnly: false
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      minAvailable: 1
    suspend: false
  slotsPerWorker: 1
  sshAuthMountPath: /root/.ssh

When I describe the MPIJob with

kubectl describe mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod

the output is:

Name:         ai62da0dbe-6406-4252-85d6-51ef87eab10d
Namespace:    cpod
Labels:       deadline=2023-11-15_02-06-44
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2023-11-15T02:01:44Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v2beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:deadline:
      f:spec:
        .:
        f:launcherCreationPolicy:
        f:mpiImplementation:
        f:mpiReplicaSpecs:
          .:
          f:Launcher:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:hostIPC:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:hostIPC:
                f:nodeSelector:
                f:volumes:
        f:runPolicy:
          .:
          f:cleanPodPolicy:
          f:schedulingPolicy:
            .:
            f:minAvailable:
          f:suspend:
        f:slotsPerWorker:
        f:sshAuthMountPath:
    Manager:         cpodmanager
    Operation:       Update
    Time:            2023-11-15T02:01:44Z
  Resource Version:  2787007
  UID:               e5703c73-f27e-45ef-9049-fd40c152d4d6
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Image:              111
            Image Pull Policy:  IfNotPresent
            Name:               launcher
          Host IPC:             true
    Worker:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Image:              111
            Image Pull Policy:  IfNotPresent
            Name:               worker
            Resources:
              Limits:
                nvidia.com/gpu:  1
            Volume Mounts:
              Mount Path:  111
              Name:        ckpt-pv
              Mount Path:  111
              Name:        saved-model-pv
          Host IPC:        true
          Node Selector:
            nvidia.com/gpu.product:  NVIDIA-GeForce-RTX-3090
          Volumes:
            Name:  ckpt-pv
            Persistent Volume Claim:
              Claim Name:  ai62da0dbe-6406-4252-85d6-51ef87eab10d-ckpt
              Read Only:   false
            Name:          saved-model-pv
            Persistent Volume Claim:
              Claim Name:  ai62da0dbe-6406-4252-85d6-51ef87eab10d-modelsave
              Read Only:   false
  Run Policy:
    Clean Pod Policy:  Running
    Scheduling Policy:
      Min Available:    1
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Events:
  Type     Reason         Age                   From                Message
  ----     ------         ----                  ----                -------
  Normal   MPIJobCreated  5m48s (x12 over 27m)  mpi-job-controller  MPIJob cpod/ai62da0dbe-6406-4252-85d6-51ef87eab10d is created.
  Warning  MPIJobFailed   5m48s (x12 over 27m)  mpi-job-controller  worker pod created failed: Pod "ai62da0dbe-6406-4252-85d6-51ef87eab10d-worker-0" is invalid: spec.containers[0].volumeMounts[1].mountPath: Invalid value: "111": must be unique
terrytangyuan commented 7 months ago

The controller currently requeues the item when there are errors during worker pod creation. It might be problematic to requeue regardless of what the error is. If there are pod spec errors, then it should just fail instead of requeueing repeatedly.

https://github.com/kubeflow/mpi-operator/blob/master/pkg/controller/mpi_job_controller.go#L964
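
A minimal sketch of what that could look like, assuming the worker-creation path can tell permanent validation failures apart from transient API errors via apierrors.IsInvalid; the helper name createWorkerOrFail and its requeue-bool convention are hypothetical, not the operator's existing API:

```go
package controller

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/record"
)

// createWorkerOrFail creates one worker pod. It reports requeue=false for
// validation errors so the caller can mark the MPIJob Failed instead of
// retrying a pod template that can never become valid on its own.
func createWorkerOrFail(ctx context.Context, client kubernetes.Interface,
	recorder record.EventRecorder, mpiJob runtime.Object, pod *corev1.Pod) (requeue bool, err error) {
	_, err = client.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
	switch {
	case err == nil:
		return false, nil
	case apierrors.IsInvalid(err):
		// Permanent failure: the API server rejected the pod spec itself.
		recorder.Eventf(mpiJob, corev1.EventTypeWarning, "MPIJobFailed",
			"worker pod creation failed: %v", err)
		return false, fmt.Errorf("invalid worker pod spec: %w", err)
	default:
		// Transient errors (conflicts, throttling, ...) stay on the workqueue.
		return true, err
	}
}
```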

alculquicondor commented 7 months ago

Ideally, we should have a webhook, but this was never prioritized.

Alternatively, we can add a CEL validator https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#validation-expression

Happy to review a PR if you are interested in working on it.

tenzen-y commented 7 months ago

Previously, I tried to introduce CEL validation to the training-operator:

https://github.com/kubeflow/training-operator/issues/1708

However, I gave up on introducing it, since it is hard to validate the podTemplate within the cost budget of CEL validations.

https://github.com/kubeflow/training-operator/issues/1708#issuecomment-1661876525

Hence, we must introduce webhooks if we want to validate the podTemplates.
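
For illustration, a minimal sketch of the kind of per-container check such a webhook could run over each replica's pod template before the controller ever tries to create the worker pod; validateMountPaths is a hypothetical helper, not an existing mpi-operator webhook, and it only covers the duplicate-mountPath case from this issue:

```go
package webhook

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// validateMountPaths rejects duplicate volumeMount paths up front, so the
// controller never enters the create-and-requeue loop described above.
func validateMountPaths(spec *corev1.PodSpec) error {
	for i, ctr := range spec.Containers {
		seen := map[string]bool{}
		for j, vm := range ctr.VolumeMounts {
			if seen[vm.MountPath] {
				return fmt.Errorf("spec.containers[%d].volumeMounts[%d].mountPath: %q must be unique",
					i, j, vm.MountPath)
			}
			seen[vm.MountPath] = true
		}
	}
	return nil
}
```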

alculquicondor commented 7 months ago

You mean that CEL was too slow or what exactly?

tenzen-y commented 7 months ago

You mean that CEL was too slow or what exactly?

No, I meant that CEL validation cannot work due to the following error:

Forbidden: contributed to estimated rule cost total exceeding cost limit for entire OpenAPIv3 schema, spec.validation.openAPIV3Schema: Forbidden: x-kubernetes-validations estimated rule cost total for entire OpenAPIv3 schema exceeds budget by factor of more than 100x (try simplifying the rule, or adding maxItems, maxProperties, and maxLength where arrays, maps, and strings are declared)]

This was caused by the cost budget.

alculquicondor commented 7 months ago

Oh, so too many validation rules :)

tenzen-y commented 7 months ago

Oh, so too many validation rules :)

I guess these overruns happen because the replicaSpecs are defined as a map: we cannot set a limit on the number of replica specs, so the estimated search depth is unbounded :(

alculquicondor commented 7 months ago

Ah, we shot ourselves in the foot by using a map instead of explicit fields.
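
Roughly what that trade-off looks like at the type level; these are illustrative stand-ins, not the real kubeflow API types:

```go
package v2beta1

// Map-keyed replica specs (close to today's shape): the key space is open,
// so a CEL cost estimator must assume unbounded entries and depth.
type MapStyleSpec struct {
	MPIReplicaSpecs map[string]*ReplicaSpec `json:"mpiReplicaSpecs"`
}

// Explicit fields: the schema has a fixed, small shape that constraints like
// maxProperties/maxItems can bound cheaply.
type ExplicitSpec struct {
	Launcher *ReplicaSpec `json:"launcher,omitempty"`
	Worker   *ReplicaSpec `json:"worker,omitempty"`
}

// ReplicaSpec is a placeholder for the real kubeflow common ReplicaSpec.
type ReplicaSpec struct {
	Replicas *int32 `json:"replicas,omitempty"`
}
```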

congpeiqing commented 7 months ago

The controller currently requeues the item when there are errors during worker pod creation. It might be problematic to requeue regardless of what the error is. If there are pod spec errors, then it should just fail instead of requeueing repeatedly.

https://github.com/kubeflow/mpi-operator/blob/master/pkg/controller/mpi_job_controller.go#L964

@terrytangyuan PR submitted: https://github.com/kubeflow/mpi-operator/pull/606. It works in our environment.