Open Jeffwan opened 4 years ago
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
kind/feature | 0.98 |
This is currently blocked on the testing infra. If it cannot be resolved within one week, I will pause tests on this repo and move forward with the development work.
TorchElastic enables distributed PyTorch training jobs to be executed in a fault-tolerant and elastic manner.
Use cases:
Fault Tolerance: jobs that run on infrastructure where nodes get replaced frequently, either due to flaky hardware or by design, and mission-critical, production-grade jobs that need to be resilient to failures.
Dynamic Capacity Management: jobs that run on leased capacity that can be taken away at any time (e.g. AWS spot instances) or shared pools where the pool size can change dynamically based on demand.
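To make the fault-tolerance use case concrete, here is a minimal sketch of the checkpoint-and-resume pattern that restartable jobs generally rely on. It is plain Python, not the TorchElastic API: the names `train`, `run_with_restarts`, and the `fail_at` parameter are hypothetical and exist only for this illustration, and the "training" step is a stand-in that halves a number.

```python
import json
import os


def train(checkpoint_path, total_epochs, fail_at=None):
    """Toy training loop that resumes from checkpoint_path if it exists.

    fail_at is a hypothetical knob that simulates a node being taken away
    just before the given epoch starts.
    """
    state = {"epoch": 0, "loss": 100.0}
    if os.path.exists(checkpoint_path):
        # A restarted worker picks up where the previous one stopped.
        with open(checkpoint_path) as f:
            state = json.load(f)

    while state["epoch"] < total_epochs:
        if fail_at is not None and state["epoch"] == fail_at:
            raise RuntimeError("simulated node failure")

        state["loss"] *= 0.5  # stand-in for one epoch of real work
        state["epoch"] += 1

        # Write to a temp file and rename so a crash never leaves a
        # half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state


def run_with_restarts(checkpoint_path, total_epochs, fail_at):
    """Run once with a simulated failure, then restart and finish the job."""
    try:
        return train(checkpoint_path, total_epochs, fail_at=fail_at)
    except RuntimeError:
        # In a real cluster the operator would reschedule the worker;
        # here we simply call train again.
        return train(checkpoint_path, total_epochs)
```

In this sketch the restarted run only repeats work done after the last saved epoch, which is the property an operator with elastic support would depend on when it replaces a worker.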
We want to bring this feature to pytorch-operator. I have been working on https://github.com/pytorch/elastic/tree/master/kubernetes, where I created a dedicated operator for this. I think we discussed this feature in https://github.com/pytorch/elastic/issues/117. This issue tracks the engineering work to add elastic support.