pytorch / elastic

PyTorch elastic training
BSD 3-Clause "New" or "Revised" License

How to programmatically determine if a training job has finished using `kubectl`? #130

Closed darthsuogles closed 4 years ago

darthsuogles commented 4 years ago

❓ Questions and Help

How can I programmatically determine whether a training job has finished using `kubectl`? The field `status.replicaStatuses.Worker.succeeded` seems to indicate the number of succeeded pods. How does one determine whether the whole job has succeeded? This is useful when the training job is part of a workflow (e.g. orchestrated by Argo or Airflow).

darthsuogles commented 4 years ago

CC: @kiukchung

kiukchung commented 4 years ago

Thanks for the question. This is actually something that can be improved in our current k8s controller. Since we don't rely on any type of gang scheduling but rather deploy each container as a Pod, there isn't a scheduler-provided "done" signal that you can use when an elastic job is part of a workflow. You can use certain heuristics: for instance, if you are in "non-elastic" mode (min == max), you could check whether the number of succeeded workers has reached that size. Or have your job touch a "COMPLETE" file on S3 (or the like) and kick off the downstream dependency based on that.
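A minimal sketch of the succeeded-workers heuristic, for a workflow step that wants a yes/no answer. It assumes the job runs in non-elastic mode (min == max), that the custom resource can be fetched with `kubectl get elasticjob`, and that `status.replicaStatuses.Worker.succeeded` is populated as described in the question. The function names, and the resource name passed to `kubectl`, are illustrative assumptions and not part of the controller's API.

```python
import json
import subprocess


def succeeded_workers(job: dict) -> int:
    """Read status.replicaStatuses.Worker.succeeded, treating missing fields as 0."""
    status = job.get("status") or {}
    worker = (status.get("replicaStatuses") or {}).get("Worker") or {}
    return int(worker.get("succeeded") or 0)


def job_finished(job: dict, expected_workers: int) -> bool:
    """Heuristic for min == max: finished once every expected worker has succeeded."""
    return succeeded_workers(job) >= expected_workers


def fetch_job(name: str, namespace: str) -> dict:
    """Fetch the job object as JSON via kubectl.

    Assumption: the resource is addressable as `elasticjob`; adjust the
    resource name to match the CRD installed in your cluster.
    """
    result = subprocess.run(
        ["kubectl", "get", "elasticjob", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

A workflow step could poll `job_finished(fetch_job(name, namespace), expected_workers)` until it returns true. The check only counts successes, so it will never return true for a job whose workers fail; a real step would also want a timeout or a look at the failed count.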

FWIW, we will integrate elastic into the existing PyTorch operator in Kubeflow (https://github.com/pytorch/elastic/issues/117).