kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

Support Torch Elastic in pytorch operator #296

Open Jeffwan opened 4 years ago

Jeffwan commented 4 years ago

TorchElastic enables distributed PyTorch training jobs to be executed in a fault-tolerant and elastic manner.

Use cases:

We want to bring this feature to pytorch-operator. I have been working on https://github.com/pytorch/elastic/tree/master/kubernetes and created a dedicated operator for it. I think we discussed this feature in https://github.com/pytorch/elastic/issues/117. This issue tracks the engineering work to add elastic support.

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/feature 0.98

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

Jeffwan commented 3 years ago

This is blocked on the testing infra right now. If it cannot be resolved within one week, I will pause tests on this repo and move forward with the development work.