Hi, I'm trying to train BingBertSquad on 2 nodes (each with 4 GPUs) using specific GPUs (2 GPUs per node; the other 2 GPUs are occupied by other users). My hostfile's content is:
work1 slots=2
work2 slots=2
and part of my launch script on one node is:
deepspeed --hostfile hostfile --include=work1:2,3 training.py ...
but it reports the error:
No slot 2 specified on host work1
What is the correct way to train in this situation?
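From the error message, my guess (an assumption on my part, not verified against DeepSpeed's source) is that the launcher validates each slot index in `--include` against the `slots=` count from the hostfile, so `slots=2` only exposes slot indices 0 and 1, and `work1:2,3` falls out of range. A minimal sketch of that validation logic as I understand it:

```python
def parse_hostfile(text):
    """Parse hostfile lines like 'work1 slots=2' into {host: slot_count}."""
    hosts = {}
    for line in text.strip().splitlines():
        name, slots = line.split()
        hosts[name] = int(slots.split("=")[1])
    return hosts

def check_include(hosts, include):
    """Check an --include spec like 'work1:2,3' against the hostfile.

    Returns error strings for slot indices >= the declared slot count,
    mirroring the 'No slot N specified on host H' message I'm seeing.
    """
    errors = []
    host, _, slot_str = include.partition(":")
    for slot in (int(s) for s in slot_str.split(",")):
        if slot >= hosts.get(host, 0):
            errors.append(f"No slot {slot} specified on host {host}")
    return errors

hostfile = "work1 slots=2\nwork2 slots=2"
print(check_include(parse_hostfile(hostfile), "work1:2,3"))
# With slots=2, only indices 0 and 1 exist, so 2 and 3 are rejected.
```

If that reading is right, it seems the hostfile would need to declare all physical GPUs (slots=4) even when only some are selected via `--include`, but I'd like confirmation of the intended usage.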