polyaxon / polyaxon

MLOps Tools For Managing & Orchestrating The Machine Learning LifeCycle
https://polyaxon.com
Apache License 2.0

How does Polyaxon track and report errors when MPIJob workers cannot spawn? #1219

Open asahalyft opened 3 years ago

asahalyft commented 3 years ago

Hello Team,

I have used the Kubeflow MPI Job Operator before and am now evaluating the Polyaxon operators. One issue I ran into in the past: I applied an MPIJob YAML similar to the one below, in which I mistakenly set slotsPerWorker: 4 while each worker was limited to only 2 CPUs. The launcher and all the worker Pods came up, but the workers could not actually start running the Python program because the requested slotsPerWorker could not be satisfied. Even so, the launcher and worker Pods kept running although the actual mpirun never kicked off.

Does Polyaxon help monitor the MPIJob status and errors and then take preventive actions? How does it track worker errors and terminate the job? I could not find this discussed at https://polyaxon.com/integrations/mpijob/.

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tf2-keras-mnist-mpi-cpu
spec:
  slotsPerWorker: 4
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: docker.io/horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
            name: keras-mnist-mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /examples/tensorflow2_keras_mnist.py
            resources:
              limits:
                cpu: 1
                memory: 2Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: docker.io/horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
            name: keras-mnist-mpi-worker
            resources:
              limits:
                cpu: 2
                memory: 4Gi
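
The mismatch described above (slotsPerWorker larger than the worker CPU limit) could in principle be caught before the job is submitted. Below is a minimal pre-flight sketch, not something Polyaxon or the Kubeflow operator is known to do. It assumes that one slot is meant to map to one CPU, which is the premise of this report rather than a rule enforced by the operator. The helper names `parse_cpu` and `check_slots` are hypothetical, and the manifest is written as a plain dict (trimmed to the relevant fields) so that no YAML library is needed.

```python
from typing import Any, Dict, List


def parse_cpu(quantity: Any) -> float:
    """Convert a Kubernetes CPU quantity (e.g. 2, "2", "500m") to cores."""
    text = str(quantity)
    if text.endswith("m"):  # millicores, e.g. "500m" is 0.5 cores
        return int(text[:-1]) / 1000.0
    return float(text)


def check_slots(manifest: Dict[str, Any]) -> List[str]:
    """Return one message per worker container whose CPU limit is below
    slotsPerWorker. Assumption: one MPI slot should map to one CPU."""
    spec = manifest["spec"]
    # Assumed default of 1 slot per worker when the field is not set.
    slots = spec.get("slotsPerWorker", 1)
    worker = spec["mpiReplicaSpecs"]["Worker"]
    problems = []
    for container in worker["template"]["spec"]["containers"]:
        limits = container.get("resources", {}).get("limits", {})
        if "cpu" not in limits:
            continue  # no CPU limit declared, nothing to compare against
        cpus = parse_cpu(limits["cpu"])
        if slots > cpus:
            problems.append(
                f"{container['name']}: slotsPerWorker={slots} "
                f"exceeds cpu limit {cpus:g}"
            )
    return problems


# The manifest from this issue, reduced to the fields the check reads.
manifest = {
    "spec": {
        "slotsPerWorker": 4,
        "mpiReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "keras-mnist-mpi-worker",
                                "resources": {
                                    "limits": {"cpu": 2, "memory": "4Gi"}
                                },
                            }
                        ]
                    }
                },
            }
        },
    }
}

for problem in check_slots(manifest):
    print(problem)
```

A check like this only flags the configuration mistake up front. It does not answer the question of how a running job with stalled workers is detected and terminated.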

github-actions[bot] commented 3 years ago

This issue has not seen any recent activity.