kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

[feature] Rethink distributed Pytorch backoff retry #270

Open czheng94 opened 4 years ago

czheng94 commented 4 years ago

Currently, the backoff retries for each replica are controlled separately: pods are either restarted directly by the pod controller (restartPolicy is propagated from the ReplicaSpec to the PodTemplateSpec) https://github.com/kubeflow/pytorch-operator/blob/037cd1b18eb77f657f2a4bc8a8334f2a06324b57/pkg/controller.v1/pytorch/pod.go#L283-L289 or restarted by the PyTorchJob controller https://github.com/kubeflow/pytorch-operator/blob/037cd1b18eb77f657f2a4bc8a8334f2a06324b57/pkg/controller.v1/pytorch/pod.go#L91-L109
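For context, here is a minimal Go sketch of what the first path amounts to. `ReplicaSpec` and `setRestartPolicy` are simplified stand-ins, not the operator's actual types: the replica-level restartPolicy is simply copied onto the generated pod template, so the kubelet restarts failed containers of each replica independently.

```go
// Minimal sketch (assumed, simplified types): propagate a replica-level
// restart policy into the generated pod template.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// ReplicaSpec is a simplified stand-in for a PyTorchJob replica spec.
type ReplicaSpec struct {
	Replicas      int32
	RestartPolicy corev1.RestartPolicy
	Template      corev1.PodTemplateSpec
}

// setRestartPolicy copies the replica-level policy onto the pod template,
// so each pod is restarted in place by the kubelet when it fails.
func setRestartPolicy(podTemplate *corev1.PodTemplateSpec, spec *ReplicaSpec) {
	podTemplate.Spec.RestartPolicy = spec.RestartPolicy
}

func main() {
	spec := &ReplicaSpec{Replicas: 3, RestartPolicy: corev1.RestartPolicyOnFailure}
	setRestartPolicy(&spec.Template, spec)
	fmt.Println(spec.Template.Spec.RestartPolicy) // OnFailure
}
```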

The backOffLimit then caps the sum of all failures across the replicas that are eligible for restart. https://github.com/kubeflow/pytorch-operator/blob/037cd1b18eb77f657f2a4bc8a8334f2a06324b57/pkg/controller.v1/pytorch/controller.go#L518-L556
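As a rough illustration of that semantics (hypothetical names like `pastBackoffLimit` and `replicaStatus`, not the controller's actual code): restart counts and failed pods are summed across every replica type and compared against a single job-level backOffLimit.

```go
// Rough sketch (assumed, not the controller's real code) of a summed
// backoff check across all replica types of a job.
package main

import "fmt"

// replicaStatus is a hypothetical, simplified view of one replica type's pods.
type replicaStatus struct {
	restartCounts []int32 // container restart counts, one per pod
	failed        int32   // pods that terminated with a failure
}

// pastBackoffLimit reports whether the summed restarts and failures of all
// replica types have reached the job's backOffLimit.
func pastBackoffLimit(backOffLimit int32, replicas map[string]replicaStatus) bool {
	var total int32
	for _, st := range replicas {
		for _, c := range st.restartCounts {
			total += c
		}
		total += st.failed
	}
	return total >= backOffLimit
}

func main() {
	replicas := map[string]replicaStatus{
		"Master": {restartCounts: []int32{1}},
		"Worker": {failed: 3},
	}
	fmt.Println(pastBackoffLimit(4, replicas)) // true: 1 + 3 >= 4
}
```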

I'm not sure how this per-replica backoff retry helps with retrying distributed PyTorch. Usually people just do AllReduce across all replicas, so if the master or any worker fails, the other pods fail together as well: the master process loses its connection to the failed worker and exits, and then all workers lose their connection to the master. A backOffLimit that counts the sum of failures across replicas becomes awkward here: with 1 master and n workers, backOffLimit has to be set to (1+n) * k if we want the distributed PyTorch job to restart k times (e.g. 1 master and 3 workers, allowing k = 2 restarts, requires backOffLimit = (1+3) * 2 = 8).

I believe we should restart all replicas together, and use backOffLimit to control the number of whole-job restarts.
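To make the proposed semantics concrete, a hypothetical sketch (`job` and `restartWholeJob` are made-up names, not an existing API): on any replica failure the controller tears down every pod so all replicas come back together, and that counts as one retry against backOffLimit regardless of how many pods died.

```go
// Hypothetical sketch of the proposal: restart the whole job as a unit and
// count whole-job restarts against backOffLimit.
package main

import "fmt"

// job is a toy stand-in for a PyTorchJob's state.
type job struct {
	backOffLimit int32
	jobRestarts  int32 // whole-job restarts performed so far
	podNames     []string
}

// restartWholeJob deletes (here: just lists) every pod of every replica and
// charges exactly one retry to the job-level counter.
func restartWholeJob(j *job) bool {
	if j.jobRestarts >= j.backOffLimit {
		return false // limit reached: the job should be marked failed instead
	}
	for _, p := range j.podNames {
		fmt.Println("deleting pod", p) // real code would call the pod client
	}
	j.jobRestarts++
	return true
}

func main() {
	j := &job{backOffLimit: 3, podNames: []string{"master-0", "worker-0", "worker-1"}}
	// Simulate one worker failure: all replicas are restarted together and
	// only one retry is consumed, no matter how many pods went down.
	if restartWholeJob(j) {
		fmt.Println("restarts used:", j.jobRestarts, "of", j.backOffLimit) // 1 of 3
	}
}
```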

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label: feature
Probability: 0.63


gaocegege commented 4 years ago

/cc @johnugeorge

gaocegege commented 4 years ago

@czheng94 I think elastic training is a hot topic this year. Ref https://github.com/pytorch/elastic

Thus I am not sure if we should do this here.

johnugeorge commented 4 years ago

@gaocegege Do you mean change the distributed launch to use the new "elastic" way?

czheng94 commented 4 years ago

@gaocegege elastic training is really cool!

I'm not very familiar with torch elastic (but interested to learn more). Are we thinking about supporting a separate operator for pytorch elastic like this one https://github.com/pytorch/elastic/blob/master/kubernetes/api/v1alpha1/elasticjob_types.go? Or are we going to integrate this into pytorch-operator?

In addition, I'm not sure whether torch elastic will eventually replace the current distributed PyTorch setup. Torch elastic seems to come with more requirements (etcd, users needing to change their code to always preserve the training state, users needing to tolerate a varying batch size, etc.).

Jeffwan commented 4 years ago

I wrote the pytorch elastic controller in the pytorch/elastic repo. Overall, it's very similar to the existing operators. However, we cannot use it seamlessly here because it requires a strongly consistent store and the way the cluster environment is set up is a little different.

I suggest using it separately for the moment. Meanwhile, we should consider elasticity for all frameworks and find a general way to handle this that hides the framework details.