microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Pipeline parallel support for multi-node training? #2846

Open gajagajago opened 1 year ago

gajagajago commented 1 year ago

Hello DeepSpeed :)

I am trying to use the PipelineModule to train a pipeline-parallel model on multiple nodes. I am using Slurm as the cluster scheduler, so I initialized the following environment variables according to the Slurm configuration, as shown below, and observed that the model layers get partitioned correctly and each partition is placed on the correct device.

# Initializing distributed process group 

os.environ['MASTER_ADDR'] = f'{slurm_handler.master_addr}' # host address of root process
os.environ['MASTER_PORT'] = f'{slurm_handler.master_port}' # free master port of the above host 
os.environ['RANK'] = os.environ['SLURM_PROCID'] # global rank
os.environ['LOCAL_RANK'] = '0' # since Slurm assigns one device per process, each process recognizes its assigned device as local rank 0

deepspeed.init_distributed(dist_backend=args.backend)
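
For completeness, the environment-variable rendezvous also needs the total world size; under Slurm this can come from SLURM_NTASKS (the line below is an illustrative sketch, not copied verbatim from my script).

os.environ['WORLD_SIZE'] = os.environ['SLURM_NTASKS'] # total number of processes across all nodes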

However, when I call deepspeed.initialize, the processes on the first node hang waiting for the processes on the second node.

net = PipelineModule(layers=model_ds.to_layers(),
                     loss_fn=model_ds.loss_fn, num_stages=pp_stage)

### Entrypoint for training w/ DeepSpeed
# TODO: Hangs at p2p.init_process_groups (https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/engine.py)
engine, _, _, _ = deepspeed.initialize(
    args=args,
    model=net,
    model_parameters=[p for p in net.parameters() if p.requires_grad],
    optimizer=optimizer_ds)

I suspect it is because of L152 in PipelineEngine, which initializes p2p communication among the group. So I am wondering whether the DeepSpeed pipeline module supports pipeline-parallel training across multiple nodes.

#initialize peer-2-peer communication and allreduce groups
if self.is_pipe_parallel:
    p2p.init_process_groups(self.grid)
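
As a hypothetical sanity check (not from the snippet above), one could verify right after deepspeed.init_distributed that every rank from both nodes has joined the global group before the p2p groups are built:

import torch.distributed as dist

# Every rank should print the same world size (equal to the total number of
# Slurm tasks), and this barrier should complete without hanging.
print(f'rank {dist.get_rank()} / world size {dist.get_world_size()}', flush=True)
dist.barrier()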

If it does, please give me advice on what I might have overlooked. Thanks!

sharlec commented 1 year ago

I used the deepspeed launcher with a hostfile; maybe this can solve your problem.
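
For example, under Slurm you could generate the hostfile from the job's node list and then start training with the deepspeed launcher; the sketch below is illustrative (file names and slot counts are placeholders):

import os
import subprocess

# Expand the compact Slurm node list (e.g. "node[01-02]") into one hostname per entry.
nodes = subprocess.run(
    ['scontrol', 'show', 'hostnames', os.environ['SLURM_JOB_NODELIST']],
    capture_output=True, text=True, check=True,
).stdout.split()

# DeepSpeed hostfile format: "<hostname> slots=<num_gpus_on_that_host>"
with open('hostfile', 'w') as f:
    for node in nodes:
        f.write(f'{node} slots=1\n')  # one GPU per node here is just a placeholder

# Then launch from the first node, e.g.:
#   deepspeed --hostfile=hostfile train.py --deepspeed --deepspeed_config ds_config.json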

BastianChen commented 10 months ago

Have you solved this problem yet? If so, please let me know your solution. Thank you so much! @gajagajago

gajagajago commented 10 months ago

@CChBen Sorry, no solution within the DeepSpeed implementation. PP seems to be only a naively supported feature in DeepSpeed, since their main functionality is ZeRO. However, I am currently developing a pipeline-parallel project that includes the feature you want! I will let you know when it is released.

puppet101 commented 6 months ago

@gajagajago Any update on your project?

gajagajago commented 6 months ago

@puppet101 Coming up in a few weeks now. I will post the link here soon.