microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[QUESTION] Does DS automatically scale across different archs #2484

Open pablogranolabar opened 2 years ago

pablogranolabar commented 2 years ago

Are there any redundancy options, backup nodes, etc., in the event of a node failure?

TIA

tjruwase commented 1 year ago

@pablogranolabar, thanks for your question. Please see my responses below:

  1. DeepSpeed was designed with homogeneous GPUs in mind and has only been tested in that environment.
  2. I think DeepSpeed should launch even if the nodes in the hostfile are different. Please share your experience with this.
  3. DeepSpeed currently does not provide fault tolerance.
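
As a rough illustration of point 2, the sketch below parses a hostfile in the `hostname slots=N` format used by the DeepSpeed launcher and reports whether every node declares the same number of slots. The helper names (`parse_hostfile`, `slot_counts_are_uniform`), the host names, and the skipping of `#` lines are assumptions made for this example and are not part of DeepSpeed. Matching slot counts do not prove the GPUs are the same model, so this is only a first sanity check.

```python
def parse_hostfile(text):
    """Parse 'hostname slots=N' lines into a {hostname: slots} dict.

    Hypothetical helper for illustration; not a DeepSpeed API.
    """
    hosts = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and '#' lines (a convention assumed by this sketch).
        if not line or line.startswith("#"):
            continue
        name, slot_field = line.split()
        key, value = slot_field.split("=")
        if key != "slots":
            raise ValueError(f"expected 'slots=N', got {slot_field!r}")
        hosts[name] = int(value)
    return hosts


def slot_counts_are_uniform(hosts):
    """Return True if all hosts declare the same slot (GPU) count."""
    return len(set(hosts.values())) <= 1


# Made-up hostfile contents with one node that differs from the others.
EXAMPLE = """\
worker-1 slots=4
worker-2 slots=4
worker-3 slots=2
"""

hosts = parse_hostfile(EXAMPLE)
print(hosts)
print(slot_counts_are_uniform(hosts))
```

A check like this only flags differing slot counts before launch. It says nothing about whether training on mixed hardware will behave well, which, per the answer above, has not been tested.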

pablogranolabar commented 1 year ago

Hi again @tjruwase! I am curious: what are the architectural differences between DeepSpeed and DeepSpeed-MII, or are they the same platform? Or does MII only support homogeneous clusters on the same hardware host?

tjruwase commented 1 year ago

@pablogranolabar, thanks for your question. However, I am not the best person to answer it. Also, it is best to create a new question since it is unrelated to the original question. I will link the right people to the new question.