stas00 opened this issue 3 years ago
I guess the discussion continued at https://github.com/huggingface/blog/pull/71#pullrequestreview-569704590, but it's best to avoid discussing important things that others may benefit from in code suggestions, because as soon as a suggestion is resolved github hides those comments.
So I will at least re-paste them here for posterity:
@jeffra wrote:
Maybe let me describe each feature in the parallelism space separately to hopefully clear up some confusion.
- ZeRO is the technique we developed to partition optimizer, gradient, and/or parameter states across data parallel ranks.
- Pipeline parallelism partitions layers of a model into stages that can then be processed in parallel. It can be used along side traditional data parallelism or can be also be combined with ZeRO.
- Tensor slicing/op sharding as done in Megatron-LM allows for specific ops in a model to be split across multiple ranks, thus reducing the memory footprint of specific components of a model (e.g., RowParallelLinear).
- 3D parallelism is the concept we introduced in the press release you quote above. This concept is the combination of all 3 of the above types of parallelism.
I believe the focus on the integration so far has been primarily in enabling number 1 from above, correct?
And then @samyam expanded:
Adding to @jeffra, one way I find it helpful to think about ZeRO is simply as data-parallel training, but with distributed data storage. The computation on each GPU is exactly the same as in data-parallel training, but the parameters, gradients, and optimizer states are stored in a distributed/partitioned fashion across all the GPUs and fetched only when needed. It doesn't really matter if the partition is horizontal or vertical. This is quite different from model-parallel techniques such as tensor slicing or pipeline parallelism, which divide not just the parameters, gradients, and optimizer states, but also the computation, vertically and horizontally respectively. From that perspective, I find it more helpful to think of ZeRO as data-parallel training than model-parallel training.
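To make the "distributed storage, full computation" idea concrete, here is a toy single-process sketch (all names and the explicit gather step are illustrative, not DeepSpeed's API):

```python
import numpy as np

def zero_style_step(full_params, data_shards):
    """Toy sketch of ZeRO-style partitioned storage: each 'rank' owns
    only a slice of the parameters, but the computation still uses the
    full parameter vector, gathered on demand."""
    n_ranks = len(data_shards)
    # Parameter storage is partitioned across ranks...
    shards = np.array_split(full_params, n_ranks)
    # ...and "all-gathered" back into the full vector before compute.
    gathered = np.concatenate(shards)
    # Ordinary data-parallel compute: each rank processes its own data shard.
    return [x @ gathered for x in data_shards]

params = np.arange(4.0)                         # pretend weight vector
data = [np.ones((2, 4)), 2 * np.ones((2, 4))]   # one batch shard per rank
outs = zero_style_step(params, data)
```

Note that the matmul itself is unchanged; only where the parameters live between steps differs from plain data parallelism.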
Thank you for these clarifications, @jeffra and @samyam!
Yes, we have ZeRO stage 1+2 and some of its other parts integrated (e.g. CPU offload). So we have ZeRO - check!
So you're saying that PP isn't used by default? So `train_micro_batch_size_per_gpu` is only relevant if we explicitly use PP, right? Which I don't think can be used with ZeRO, or am I wrong?
Tensor slicing/op sharding - I guess this is where I'm most confused - isn't param/optim/gradient sharding the same as tensor slicing/op sharding? In both cases you partition tensors and spread them out to different GPUs - so there must be a nuance that I'm missing here. I probably will understand it once I study Megatron-LM.
I think DeepSpeed would benefit a lot from having a sort of a visual map showing how the different components fit together.
The blog posts have a lot of diagrams, but they only convey a partial picture.
Hi @stas00, thanks for the great suggestions :-). A detailed map of DeepSpeed would be great.
Pipeline parallelism is not used unless you provide a `PipelineModule` to `deepspeed.initialize()`. The `train_micro_batch_size_per_gpu` configuration tells DeepSpeed how many samples are in each micro-batch provided to `forward()`. The total global batch size is therefore `train_micro_batch_size_per_gpu * gradient_accumulation_steps * data parallelism degree`. This is true with any DeepSpeed training, regardless of PP/ZeRO/etc.
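The batch-size relationship above can be restated as a one-liner (the helper name is mine; DeepSpeed derives this internally from its JSON config rather than exposing such a function):

```python
# Hypothetical helper restating the formula from the reply above.
def global_batch_size(micro_batch_per_gpu, grad_accum_steps, dp_degree):
    return micro_batch_per_gpu * grad_accum_steps * dp_degree

# e.g. 4 samples per micro-batch, 2 accumulation steps, 8 DP ranks:
assert global_batch_size(4, 2, 8) == 64
```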
> Tensor slicing/op sharding - I guess this is where I'm most confused - isn't param/optim/gradient sharding the same as tensor slicing/op sharding? In both cases you partition tensors and spread them out to different GPUs - so there must be a nuance that I'm missing here. I probably will understand it once I study Megatron-LM.
This is a great question. The difference comes down to where the computation is performed. With ZeRO's sharding, we save memory by partitioning various tensors before and after the computations, but the actual `forward`/`backward` number crunching still happens in its full form. That means you still need enough memory for the activations when computing a large layer.
In contrast, tensor slicing as in Megatron-LM actually modifies the computations to work in a distributed manner. A rough example: instead of gathering the full data, doing a matrix multiplication, and then partitioning again, the module itself is modified to compute in a distributed manner. This has several key advantages, including reducing the size of activations per layer and reducing the pressure on global batch size by splitting individual samples across model-parallel groups.
Tensor slicing is a model-specific strategy, which is why DeepSpeed supports it but does not provide it. @samyam phrased it in a great way: we like to think of sharding as an optimization for data-parallel training, because we don't need the specifics of the model's computation.
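The distinction can be sketched with a toy single-process NumPy example of column-wise slicing (illustration only, not the actual Megatron-LM code): here the weight matrix and the output activations are partitioned during the matmul itself, rather than only between steps as in ZeRO.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))   # activations
W = rng.standard_normal((16, 32))  # full weight matrix

# Megatron-style column slicing (toy, single-process): each "rank" holds
# only a column shard of W and computes only its slice of the output, so
# the full W never needs to exist on one device during the matmul.
n_ranks = 4
W_shards = np.array_split(W, n_ranks, axis=1)
Y_sliced = np.concatenate([X @ w for w in W_shards], axis=1)

assert np.allclose(Y_sliced, X @ W)  # same math, computation distributed
```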
> Pipeline parallelism is not used unless you provide a `PipelineModule` to `deepspeed.initialize()`.
I understood this part, thank you. I was misled by the 3D-parallelism discussion in the Microsoft blog post and assumed that DS does this out of the box. Thanks to your clear answers I now have a pretty good understanding of the different parts.
> The `train_micro_batch_size_per_gpu` configuration tells DeepSpeed how many samples are in each micro-batch provided to `forward()`. The total global batch size is therefore `train_micro_batch_size_per_gpu * gradient_accumulation_steps * data parallelism degree`. This is true with any DeepSpeed training, regardless of PP/ZeRO/etc.
Yes, of course, this is the DP micro-batch. Which is different from PP micro-batch.
Let me try to summarize my understanding so far:
DP splits the batch into micro-batches across multiple GPUs, so each GPU only ever sees one micro-batch of the batch. Plus there is `gradient_accumulation_steps`, so:
`total_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * data parallelism degree`.
PP splits the batch into chunks, which you also call micro-batches. The split here is not based on the number of GPUs in the pipeline, but on the number of stages that leads to the best utilization of the GPUs, while overcoming the naive model-parallelization problem.
Unlike DP, PP feeds all micro-batches to the first GPU (in the simple case), and the other GPUs in the pipeline get the same micro-batch in their turn. Here each GPU gets to see each and every micro-batch.
So PP micro-batch is different from DP micro-batch.
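The "every stage sees every micro-batch" property can be sketched with a toy GPipe-style forward schedule (illustration only, not DeepSpeed's actual scheduler):

```python
def pipeline_schedule(n_stages, n_microbatches):
    """Toy GPipe-style forward schedule: for each clock tick, list the
    (stage, micro_batch) pairs that are active."""
    ticks = []
    for t in range(n_stages + n_microbatches - 1):
        ticks.append([(s, t - s) for s in range(n_stages)
                      if 0 <= t - s < n_microbatches])
    return ticks

sched = pipeline_schedule(n_stages=3, n_microbatches=4)
# Unlike a DP micro-batch, every stage eventually processes every PP micro-batch:
seen = {s: {mb for tick in sched for (st, mb) in tick if st == s}
        for s in range(3)}
```

With 3 stages and 4 micro-batches this takes 3 + 4 - 1 = 6 ticks, and `seen` shows all 4 micro-batches flowing through all 3 stages.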
Then we have 2D parallelism, where we stack DP and PP - how do you name things there? Since now we need to split each DP micro-batch into nano-batches for PP? Perhaps the first-level split should be called mini, and the second micro?
Let's look at an example to clearly see which is which.
Say you have 4 GPUs - 2 for DP and 2 for PP - and we want to run with BS=24; let's ignore gradient accumulation for now.

a. DP gets first dibs on splitting the batch, and since we have 2 stacks of GPUs visible to DP, each stack gets a micro-batch of 12.
b. Now within each PP stack, let's say we found that the pipeline should be of size 3, which would lead to 12/3=4, so each stage of PP will be fed a sequence of 4, 4, 4.
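The arithmetic of the example above, spelled out (variable names are mine, not DeepSpeed's configuration keys):

```python
# Toy arithmetic for the 2D-parallelism example above.
total_batch = 24
dp_degree = 2                                  # two DP replicas
dp_micro_batch = total_batch // dp_degree      # 12 samples per replica
pp_chunks = 3                                  # chosen number of PP chunks
pp_micro_batch = dp_micro_batch // pp_chunks   # 4 samples per PP micro-batch

assert dp_micro_batch == 12
assert pp_micro_batch == 4
```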
Am I doing well so far?
Perhaps the way this is done now is that each "P" calls its splits "micro-batches", and when they stack, the micro-micro-batches just work, because the contexts don't overlap.
Or `dp-micro-batch`, `pp-micro-batch`?
> Tensor slicing/op sharding - I guess this is where I'm most confused - isn't param/optim/gradient sharding the same as tensor slicing/op sharding? In both cases you partition tensors and spread them out to different GPUs - so there must be a nuance that I'm missing here. I probably will understand it once I study Megatron-LM.
>
> This is a great question. The difference comes down to where the computation is performed. With ZeRO's sharding, we save memory by partitioning various tensors before and after the computations, but the actual `forward`/`backward` number crunching still happens in its full form. [...]
Your explanation is perfect, @ShadenSmith. Thank you!
I will try to summarize all your various answers together to try to draw a complete picture.
Let's start by saying that, based on my reading of various papers, Model Parallelism (MP) is a very inconsistent term. One can slice vertically or horizontally; one can implement a naive slow version or speed it up with pipelining; and almost none of them is really parallel. I tried to summarize and demo a few of the basic options here: https://github.com/huggingface/transformers/issues/8771#issuecomment-758250421
So then DeepSpeed talks a lot about 3D parallelism - blog posts like this one state multiple times that DeepSpeed uses 3D parallelism.
Then @samyam kindly reviewed the draft of the upcoming blog post about DeepSpeed integration in transformers, where he suggests that no, DeepSpeed doesn't do 3D parallelism.
To me it does look like DeepSpeed implements all 3:
So please correct me if I'm wrong in thinking that DeepSpeed is already doing 3D.
Quotes from the blog post: https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
and then later: