Hello Team,
I have used the Kubeflow MPI Job Operator before, and I am now evaluating the Polyaxon operators. One issue I ran into in the past: I applied an MPIJob YAML similar to the one below in which I mistakenly set `slotsPerWorker: 4` while each worker was allocated only 2 CPUs. The launcher and all the worker pods came up, but the workers could not actually start running the Python program because each worker had fewer CPUs than the configured `slotsPerWorker`. Even so, the launcher and worker pods kept running although the actual `mpirun` never kicked off.

Does Polyaxon monitor the MPIJob status and errors and then take preventive action? How does it track worker errors and terminate the job? I could not find a discussion of this at https://polyaxon.com/integrations/mpijob/.
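To make the mismatch concrete, here is a minimal sketch of a check that could be run on a manifest before applying it. It is not part of Polyaxon or the Kubeflow operator. The helper names `cpu_to_float` and `check_slots` are hypothetical, the manifest is assumed to be already loaded into a plain dict with the Kubeflow MPIJob layout (`spec.slotsPerWorker`, `spec.mpiReplicaSpecs.Worker`), and it assumes one slot per CPU is the intended rule.

```python
def cpu_to_float(quantity):
    """Convert a Kubernetes CPU quantity such as 2, "2", or "500m" to cores."""
    text = str(quantity).strip()
    if text.endswith("m"):
        # Millicore notation: "500m" means half a core.
        return int(text[:-1]) / 1000.0
    return float(text)


def check_slots(mpijob):
    """Return warnings for worker containers whose CPU limit is below slotsPerWorker.

    Hypothetical helper: assumes the Kubeflow MPIJob field layout and that
    one slot per CPU is the desired rule.
    """
    spec = mpijob.get("spec", {})
    # Assumption: treat a missing slotsPerWorker as 1.
    slots = spec.get("slotsPerWorker", 1)
    worker = spec.get("mpiReplicaSpecs", {}).get("Worker", {})
    containers = worker.get("template", {}).get("spec", {}).get("containers", [])

    warnings = []
    for container in containers:
        cpu = container.get("resources", {}).get("limits", {}).get("cpu")
        if cpu is None:
            # No CPU limit set, so there is nothing to compare against.
            continue
        if slots > cpu_to_float(cpu):
            warnings.append(
                "container %r: slotsPerWorker=%s exceeds cpu limit %s"
                % (container.get("name"), slots, cpu)
            )
    return warnings


def make_job(slots, cpu):
    """Build a minimal MPIJob-shaped dict for illustration only."""
    return {
        "spec": {
            "slotsPerWorker": slots,
            "mpiReplicaSpecs": {
                "Worker": {
                    "template": {
                        "spec": {
                            "containers": [
                                {"name": "worker", "resources": {"limits": {"cpu": cpu}}}
                            ]
                        }
                    }
                }
            },
        }
    }


if __name__ == "__main__":
    # The case described above: 4 slots requested, 2 CPUs per worker.
    for message in check_slots(make_job(4, 2)):
        print(message)
```

A check like this only catches the misconfiguration before submission. It does not answer whether the operator detects the stalled job and terminates it at runtime, which is the actual question.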