stas00 opened this issue 3 years ago
I guess the discussion continued at https://github.com/huggingface/blog/pull/71#pullrequestreview-569704590, but it's best to avoid discussing important things that others may benefit from in code suggestions, because as soon as a suggestion is resolved github hides those comments.
So I will at least re-paste them here for posterity:
@jeffra wrote:
Maybe let me describe each feature in the parallelism space separately to hopefully clear up some confusion.
- ZeRO is the technique we developed to partition optimizer, gradient, and/or parameter states across data parallel ranks.
- Pipeline parallelism partitions layers of a model into stages that can then be processed in parallel. It can be used along side traditional data parallelism or can be also be combined with ZeRO.
- Tensor slicing/op sharding as done in Megatron-LM allows for specific ops in a model to be split across multiple ranks, thus reducing the memory footprint of specific components of a model (e.g., RowParallelLinear).
- 3D parallelism is the concept we introduced in the press release you quote above. This concept is the combination of all 3 of the above types of parallelism.
I believe the focus on the integration so far has been primarily in enabling number 1 from above, correct?
And then @samyam expanded:
Adding to @jeffra, one way I find it helpful to think about ZeRO is simply as data-parallel training, but with distributed data storage. The computation on each GPU is exactly the same as in data-parallel training, but the parameters, gradients, and optimizer states are stored in a distributed/partitioned fashion across all the GPUs and fetched only when needed. It doesn't really matter if the partition is horizontal or vertical. This is quite different from model-parallel techniques such as tensor slicing or pipeline parallelism, which divide not just the parameters, gradients, and optimizer states, but also the computation, vertically and horizontally respectively. From that perspective, I find it more helpful to think of ZeRO as data-parallel training than model-parallel training.
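To make the "distributed storage, full computation" idea concrete, here is a toy single-process sketch (all names and the explicit gather step are illustrative, not DeepSpeed's API):

```python
import numpy as np

def zero_style_step(full_params, data_shards):
    """Toy sketch of ZeRO-style partitioned storage: each 'rank' owns
    only a slice of the parameters, but the computation still uses the
    full parameter vector, gathered on demand."""
    n_ranks = len(data_shards)
    # Parameter storage is partitioned across ranks...
    shards = np.array_split(full_params, n_ranks)
    # ...and "all-gathered" back into the full vector before compute.
    gathered = np.concatenate(shards)
    # Ordinary data-parallel compute: each rank processes its own data shard.
    return [x @ gathered for x in data_shards]

params = np.arange(4.0)                         # pretend weight vector
data = [np.ones((2, 4)), 2 * np.ones((2, 4))]   # one batch shard per rank
outs = zero_style_step(params, data)
```

Note that the matmul itself is unchanged; only where the parameters live between steps differs from plain data parallelism.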
Thank you for these clarifications, @jeffra and @samyam!
Yes, we have ZeRO stage 1+2 and some of its other parts integrated (e.g. CPU offload). So we have ZeRO - check!
So you're saying that PP isn't used by default? So `train_micro_batch_size_per_gpu` is only relevant if we explicitly use PP, right? Which I don't think can be used with ZeRO, or am I wrong?
Tensor slicing/op sharding - I guess this is where I'm most confused - isn't param/optim/gradient sharding the same as tensor slicing/op sharding? In both cases you partition tensors and spread them out to different GPUs - so there must be a nuance that I'm missing here. I probably will understand it once I study Megatron-LM.
I think DeepSpeed would benefit a lot from having a sort of a visual map showing how the different components fit together.
The blog posts have a lot of diagrams, but they only convey a partial picture.
Hi @stas00, thanks for the great suggestions :-). A detailed map of DeepSpeed would be great.
Pipeline parallelism is not used unless you provide a `PipelineModule` to `deepspeed.initialize()`. The `train_micro_batch_size_per_gpu` configuration tells DeepSpeed how many samples are in each micro-batch provided to `forward()`. The total global batch size is therefore `train_micro_batch_size_per_gpu * gradient_accumulation_steps * data parallelism degree`. This is true with any DeepSpeed training, regardless of PP/ZeRO/etc.
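The batch-size relationship above can be restated as a one-liner (the helper name is mine; DeepSpeed derives this internally from its JSON config rather than exposing such a function):

```python
# Hypothetical helper restating the formula from the reply above.
def global_batch_size(micro_batch_per_gpu, grad_accum_steps, dp_degree):
    return micro_batch_per_gpu * grad_accum_steps * dp_degree

# e.g. 4 samples per micro-batch, 2 accumulation steps, 8 DP ranks:
assert global_batch_size(4, 2, 8) == 64
```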
> Tensor slicing/op sharding - I guess this is where I'm most confused - isn't param/optim/gradient sharding the same as tensor slicing/op sharding? In both cases you partition tensors and spread them out to different GPUs - so there must be a nuance that I'm missing here. I probably will understand it once I study Megatron-LM.
This is a great question. The difference comes down to where the computation is performed. With ZeRO's sharding, we save memory by partitioning various tensors before and after the computations, but the actual `forward`/`backward` number crunching still happens in its full form. That means you still need enough memory for the activations when computing a large layer.
In contrast, tensor slicing as in Megatron-LM actually modifies the computations to work in a distributed manner. A rough example: instead of gathering the full data, doing a matrix multiplication, and then partitioning again, the module itself is modified to compute in a distributed manner. This has several key advantages, including reducing the size of activations per layer and reducing the pressure on global batch size by splitting individual samples across model-parallel groups.
Tensor slicing is a model-specific strategy, which is why DeepSpeed supports it but does not provide it. @samyam phrased it in a great way: we like to think of sharding as an optimization for data-parallel training, because we don't need the specifics of the model's computation.
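The distinction can be sketched with a toy single-process NumPy example of column-wise slicing (illustration only, not the actual Megatron-LM code): here the weight matrix and the output activations are partitioned during the matmul itself, rather than only between steps as in ZeRO.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))   # activations
W = rng.standard_normal((16, 32))  # full weight matrix

# Megatron-style column slicing (toy, single-process): each "rank" holds
# only a column shard of W and computes only its slice of the output, so
# the full W never needs to exist on one device during the matmul.
n_ranks = 4
W_shards = np.array_split(W, n_ranks, axis=1)
Y_sliced = np.concatenate([X @ w for w in W_shards], axis=1)

assert np.allclose(Y_sliced, X @ W)  # same math, computation distributed
```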
> Pipeline parallelism is not used unless you provide a `PipelineModule` to `deepspeed.initialize()`.
I understood this part, thank you. I was misled by the 3D-parallelism discussion in the Microsoft blog post and assumed that DS does this out of the box. Thanks to your clear answers I now have a pretty good understanding of the different parts.
> The `train_micro_batch_size_per_gpu` configuration tells DeepSpeed how many samples are in each micro-batch provided to `forward()`. The total global batch size is therefore `train_micro_batch_size_per_gpu * gradient_accumulation_steps * data parallelism degree`. This is true with any DeepSpeed training, regardless of PP/ZeRO/etc.
Yes, of course, this is the DP micro-batch. Which is different from PP micro-batch.
Let me try to summarize my understanding so far:
DP splits the batch into micro-batches across multiple GPUs, so each GPU only ever sees one micro-batch of the batch. Plus there is `gradient_accumulation_steps`, so:
`total_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * data parallelism degree`.
PP splits the batch into chunks, which you also call micro-batches. The split here is not based on the number of GPUs in the pipeline, but on the number of stages that leads to the best utilization of the GPUs, while overcoming the naive model-parallelization problem.
Unlike DP, PP feeds all micro-batches to the first GPU (in the simple case), and the other GPUs in the pipeline get the same micro-batch in their turn. Here each GPU gets to see each and every micro-batch.
So PP micro-batch is different from DP micro-batch.
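The "every stage sees every micro-batch" property can be sketched with a toy GPipe-style forward schedule (illustration only, not DeepSpeed's actual scheduler):

```python
def pipeline_schedule(n_stages, n_microbatches):
    """Toy GPipe-style forward schedule: for each clock tick, list the
    (stage, micro_batch) pairs that are active."""
    ticks = []
    for t in range(n_stages + n_microbatches - 1):
        ticks.append([(s, t - s) for s in range(n_stages)
                      if 0 <= t - s < n_microbatches])
    return ticks

sched = pipeline_schedule(n_stages=3, n_microbatches=4)
# Unlike a DP micro-batch, every stage eventually processes every PP micro-batch:
seen = {s: {mb for tick in sched for (st, mb) in tick if st == s}
        for s in range(3)}
```

With 3 stages and 4 micro-batches this takes 3 + 4 - 1 = 6 ticks, and `seen` shows all 4 micro-batches flowing through all 3 stages.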
Then we have 2D parallelism, where we stack DP and PP - how do you name things there? Since now we need to split each DP micro-batch into nano-batches for PP? Perhaps the first-level split should be called mini, and the second micro?
Let's look at an example to clearly see which is which.
Say you have 4 GPUs - 2 for DP and 2 for PP - and we want to run with BS=24; let's ignore gradient accumulation for now.

a. DP gets first dibs on splitting the batch, and since we have 2 stacks of GPUs visible to DP, each stack gets a micro-batch of 12.
b. Now within each PP stack, let's say we found that the pipeline should be of size 3, which would lead to 12/3=4, so each stage of PP will be fed a sequence of 4, 4, 4.
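The arithmetic of the example above, spelled out (variable names are mine, not DeepSpeed's configuration keys):

```python
# Toy arithmetic for the 2D-parallelism example above.
total_batch = 24
dp_degree = 2                                  # two DP replicas
dp_micro_batch = total_batch // dp_degree      # 12 samples per replica
pp_chunks = 3                                  # chosen number of PP chunks
pp_micro_batch = dp_micro_batch // pp_chunks   # 4 samples per PP micro-batch

assert dp_micro_batch == 12
assert pp_micro_batch == 4
```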
Am I doing well so far?
Perhaps the way this is done now is that each "P" calls its splits "micro-batches", and when they stack, the micro-micro-batches just work, because the contexts don't overlap.
Or `dp-micro-batch`, `pp-micro-batch`?
> Tensor slicing/op sharding - I guess this is where I'm most confused - isn't param/optim/gradient sharding the same as tensor slicing/op sharding? In both cases you partition tensors and spread them out to different GPUs - so there must be a nuance that I'm missing here. I probably will understand it once I study Megatron-LM.
>
> This is a great question. The difference comes down to where the computation is performed. With ZeRO's sharding, we save memory by partitioning various tensors before and after the computations, but the actual `forward`/`backward` number crunching still happens in its full form. [...]
Your explanation is perfect, @ShadenSmith. Thank you!
I will try to summarize all your various answers together to try to draw a complete picture.
Let's start by saying that, based on my reading of various papers, Model Parallelism (MP) is a very inconsistent term. One can slice vertically or horizontally; one can implement a naive slow version or speed it up with pipelining; and almost none of them is really parallel. I tried to summarize and demo a few of the basic options here: https://github.com/huggingface/transformers/issues/8771#issuecomment-758250421
So then DeepSpeed talks a lot about 3D parallelism - blog posts like this one state multiple times that DeepSpeed uses 3D parallelism.
Then @samyam kindly reviewed the draft of the upcoming blog post about DeepSpeed integration in transformers, where he suggests that no, DeepSpeed doesn't do 3D parallelism.
To me it does look like DeepSpeed implements all 3:
So please correct me if I'm wrong in thinking that DeepSpeed is already doing 3D.
Quotes from the blog post: https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
and then later: