Closed darthsuogles closed 4 years ago
CC: @kiukchung
Thanks for the question. This is something that can be improved in our current k8s controller. Since we don't rely on any type of gang scheduling but instead deploy each container as a Pod, there isn't a scheduler-provided "done" signal to use when an elastic job is part of a workflow. You can use certain heuristics: for instance, if you are in "non-elastic" mode (min == max), you could use the number of succeeded workers. Alternatively, have your job touch a "COMPLETE" file on S3 (or the like) and kick off the downstream dependency based on that.
FWIW, we will integrate elastic into the existing pt operator in kubeflow (https://github.com/pytorch/elastic/issues/117).
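The succeeded-worker heuristic for "non-elastic" mode could be sketched roughly as below. This is only an illustration, not the controller's API: the `status.replicaStatuses.Worker.succeeded` path comes from the question, while the `spec.minReplicas` / `spec.maxReplicas` field names, the `elasticjob` resource kind, and the helper names `job_succeeded` and `fetch_job` are assumptions that should be checked against the CRD actually installed in your cluster.

```python
import json
import subprocess


def job_succeeded(job: dict) -> bool:
    """Heuristic: in non-elastic mode (min == max), the job is treated as
    done once the number of succeeded workers reaches the replica count."""
    spec = job.get("spec", {})
    min_replicas = spec.get("minReplicas")  # assumed field name
    max_replicas = spec.get("maxReplicas")  # assumed field name
    if min_replicas is None or min_replicas != max_replicas:
        # With min != max the worker count can change, so the count of
        # succeeded pods is not a reliable "done" signal.
        raise ValueError("heuristic only applies when min == max")
    worker = job.get("status", {}).get("replicaStatuses", {}).get("Worker", {})
    return worker.get("succeeded", 0) >= max_replicas


def fetch_job(name: str, namespace: str = "default", kind: str = "elasticjob") -> dict:
    """Fetch the job object as JSON via kubectl (the kind name is an assumption)."""
    result = subprocess.run(
        ["kubectl", "get", kind, name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

A workflow step (argo, airflow, or similar) could call `fetch_job` in a polling loop and proceed once `job_succeeded` returns `True`. For truly elastic jobs (min != max), the "COMPLETE" marker file approach is probably the safer signal.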
❓ Questions and Help
How to programmatically determine if a training job has finished using kubectl? The field status.replicaStatuses.Worker.succeeded seems to indicate the number of succeeded pods. How does one determine if the whole job has succeeded? This is useful when the training job is part of a workflow (e.g. orchestrated by argo or airflow).