microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] What's the difference in pipeline parallelism between DeepSpeed and Megatron? #4801

Open mollon650 opened 9 months ago

mollon650 commented 9 months ago

What's the difference in pipeline parallelism between DeepSpeed and Megatron?

siddharth9820 commented 9 months ago

They are mostly identical. The Megatron implementation is tightly wedded to Megatron-LM, so you cannot easily use it elsewhere. DeepSpeed's implementation is modular, so you can also use it to parallelize workloads outside of Megatron-DeepSpeed.

One difference is that Megatron offers an additional optimization called interleaved (virtual) pipelining, which can be enabled with this argument: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/arguments.py#L1097.
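To make the modularity point concrete, here is a minimal sketch of DeepSpeed's pipeline API, which partitions a plain list of `nn.Module` layers rather than being tied to Megatron-LM. The layer sizes, the two-stage split, and the `ds_config.json` path are assumptions for illustration only; the script is meant to be launched with the `deepspeed` launcher.

```python
# Minimal sketch of DeepSpeed pipeline parallelism on a toy MLP (illustrative only).
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

# The model is expressed as a flat list of layers; DeepSpeed partitions this
# list across pipeline stages, independent of any Megatron-LM code.
layers = [
    LayerSpec(nn.Linear, 1024, 4096),
    LayerSpec(nn.ReLU),
    LayerSpec(nn.Linear, 4096, 1024),
]

model = PipelineModule(
    layers=layers,
    num_stages=2,                   # number of pipeline stages
    loss_fn=nn.MSELoss(),           # loss computed on the last stage
    partition_method="parameters",  # balance stages by parameter count
)

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",        # hypothetical DeepSpeed config path
)

# engine.train_batch(data_iter=...) would then run a full pipelined
# forward/backward/step over the configured number of micro-batches;
# the data iterator is omitted here.
```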

mollon650 commented 9 months ago

@siddharth9820 thanks for your reply. I have another question about this code:

```python
if not fp16_master_weights_and_gradients:
    self.single_partition_of_fp32_groups.append(
        self.parallel_partitioned_bit16_groups[i][partition_id].to(self.device).clone().float().detach())
else:
    self.single_partition_of_fp32_groups.append(
        self.parallel_partitioned_bit16_groups[i][partition_id].to(self.device).clone().half().detach())

self.single_partition_of_fp32_groups[i].requires_grad = True  # keep this in case internal optimizer uses it
```

`single_partition_of_fp32_groups` is detached, and then `requires_grad` is enabled on it again. I am confused by this: why is the tensor detached and then `requires_grad` set back to True?
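For context, here is a standalone sketch of the pattern being asked about, in plain PyTorch rather than the DeepSpeed source: `detach()` makes the fp32 master copy a leaf tensor with no autograd edge back to the bit16 parameter, and setting `requires_grad = True` afterwards does not rebuild that edge; it only marks the leaf as trainable so its `.grad` field can be filled manually and consumed by the wrapped optimizer. All names below are illustrative.

```python
# Standalone illustration of detach() followed by requires_grad = True (not DeepSpeed code).
import torch

# Low-precision "model" parameter, as it might live on the GPU.
bf16_param = torch.randn(4, dtype=torch.bfloat16, requires_grad=True)

# clone().float().detach(): an fp32 copy that is a leaf tensor with no autograd
# history back to bf16_param. Re-enabling requires_grad does not reconnect it to
# the graph; it only flags the copy as a trainable leaf, which is what the
# source comment "keep this in case internal optimizer uses it" refers to.
fp32_master = bf16_param.clone().float().detach()
fp32_master.requires_grad = True

opt = torch.optim.SGD([fp32_master], lr=0.1)

# In ZeRO, .grad is filled manually from the reduced low-precision gradients
# rather than produced by backprop through fp32_master itself.
fp32_master.grad = torch.ones_like(fp32_master)
opt.step()

# The updated fp32 master values would then be copied back into bf16_param.
print(fp32_master.is_leaf, fp32_master.grad_fn)  # True None
```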