gajagajago opened this issue 1 year ago (Open)
I used the deepspeed launcher with a hostfile; maybe this can solve your problem.
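Roughly what I mean is something like this (hostnames, slot counts, and script arguments are placeholders for your cluster, not values from my setup):

```bash
# hostfile: one line per node, "slots" = number of GPUs on that node
#   node-1 slots=8
#   node-2 slots=8
# Launch with the deepspeed launcher instead of srun; it will spawn the
# processes on every node listed in the hostfile over ssh.
deepspeed --hostfile=hostfile train.py
```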
Have you already solved this problem? If so, please let me know your solution. Thank you so much! @gajagajago
@CChBen Sorry, no solution under the DeepSpeed implementation. PP seems to be only naively supported in DeepSpeed, since their main focus is ZeRO. However, I am currently developing a pipeline parallel project that includes the feature you want! I will let you know when it is released.
@gajagajago Any update on your project?
@puppet101 Coming up in a few weeks now. I will post the link here soon.
Hello DeepSpeed :)
I am trying to use the Pipeline module to train a pipeline-parallel model on multiple nodes. I am using Slurm as the cluster scheduler, so I initialized the distributed environment variables from the Slurm configuration as below, and observed that the model layers are partitioned correctly and each partition is placed on the right device.
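The mapping I use is roughly the following sketch; the exact variable names, port, and helper are from my launch script, not something prescribed by DeepSpeed:

```python
import os
import subprocess

def set_dist_env_from_slurm(master_port="29500"):
    """Map Slurm job variables to the env vars torch.distributed / DeepSpeed read."""
    node_list = os.environ["SLURM_NODELIST"]
    # Use the first node in the allocation as the rendezvous host.
    master_addr = subprocess.check_output(
        ["scontrol", "show", "hostnames", node_list], text=True
    ).splitlines()[0]
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port
    os.environ["RANK"] = os.environ["SLURM_PROCID"]        # global rank
    os.environ["WORLD_SIZE"] = os.environ["SLURM_NTASKS"]  # total number of processes
    os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"] # rank within the node
```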
However, when I call `deepspeed.initialize`, the processes on the first node hang waiting for the processes on the second node. I suspect it is because of L152 in PipelineEngine, which initializes p2p communication among the group. So I am wondering whether the DeepSpeed pipeline module supports pipeline-parallel training on multiple nodes.
If it does, please give me advice on what I might have overlooked. Thanks!
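For reference, my setup is roughly the following sketch (layer sizes, stage count, and the config path are placeholders, and the `config=` keyword may differ across DeepSpeed versions); the hang happens inside the `deepspeed.initialize` call:

```python
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# Process group is set up from the env vars above before building the pipeline.
deepspeed.init_distributed(dist_backend="nccl")

# Toy stand-in for the real model: 8 layers split into 4 pipeline stages.
layers = [nn.Linear(1024, 1024) for _ in range(8)]
model = PipelineModule(layers=layers, num_stages=4)

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config="ds_config.json",  # placeholder path to my DeepSpeed config
)
```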