microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[Question] How to train on multiple nodes with specific GPUs on each node #3625

Open sxqqslf opened 1 year ago

sxqqslf commented 1 year ago

Hi, I am trying to train BingBertSquad on 2 nodes (each with 4 GPUs), using only specific GPUs (2 GPUs per node; the other 2 are occupied by others). My hostfile's content is:

work1 slots=2
work2 slots=2

and part of my launch script on one node is:

deepspeed --hostfile hostfile --include=work1:2,3 training.py ...

but it reports the error: No slot 2 specified on host work1. What is the correct way to train in this situation?

tjruwase commented 1 year ago

Your hostfile should reflect the actual hardware (all 4 GPUs per node); then use the --include flag to restrict which GPUs are used.

 work1 slots=4
 work2 slots=4
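To make this concrete, here is a sketch of a launch command that pairs the corrected hostfile above with an --include filter. DeepSpeed's resource filter syntax uses @ to separate nodes and : to list GPU indices; the assumption here is that GPUs 2 and 3 are the free ones on both work1 and work2 (adjust the indices to whichever GPUs are actually available).

```shell
# Hostfile lists the full hardware: 4 GPU slots per node.
# The --include filter then restricts training to GPUs 2 and 3
# on each node (assumed free; change indices to match your setup).
deepspeed --hostfile=hostfile \
  --include="work1:2,3@work2:2,3" \
  training.py ...
```

With slots=2 in the hostfile, only GPU indices 0 and 1 exist as far as the launcher is concerned, which is why requesting index 2 produced the "No slot 2" error; declaring slots=4 makes indices 2 and 3 valid.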