Open mollon650 opened 9 months ago
They are mostly identical. The megatron implementation is tightly wedded to Megatron-LM, so you cannot use it elsewhere easily. DS's implementation is modular, so you could parallelize other workloads outside of Megatron-DeepSpeed as well.
One difference is that Megatron offers another optimization called 'interleaved/virtual pipelining', which can be enabled via this argument: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/arguments.py#L1097.
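For reference, interleaved pipelining is typically enabled alongside regular pipeline parallelism. A hypothetical launch fragment might look like this (flag names taken from Megatron-LM's `arguments.py`; the script name, world size, and other arguments are placeholders, so check your Megatron-LM version):

```
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --pipeline-model-parallel-size 4 \
    --num-layers-per-virtual-pipeline-stage 2 \
    ...
```

With this, each pipeline rank holds several non-contiguous "virtual" stages of 2 layers each, which shrinks the pipeline bubble at the cost of extra communication.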
@siddharth9820 Thanks for your reply. I have another question about the code:
```python
if not fp16_master_weights_and_gradients:
    self.single_partition_of_fp32_groups.append(
        self.parallel_partitioned_bit16_groups[i][partition_id].to(
            self.device).clone().float().detach())
else:
    self.single_partition_of_fp32_groups.append(
        self.parallel_partitioned_bit16_groups[i][partition_id].to(
            self.device).clone().half().detach())
self.single_partition_of_fp32_groups[
    i].requires_grad = True  # keep this in case internal optimizer uses it
```
`single_partition_of_fp32_groups` is detached, and then `requires_grad` is set back to `True` on it.
I'm confused by this code: why is the tensor detached, only to have gradients enabled again?
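To illustrate what `detach()` followed by `requires_grad = True` accomplishes, here is a minimal PyTorch sketch of the master-weights pattern (the tensor names here are hypothetical; the real code operates on partitioned parameter groups):

```python
import torch

# A stand-in for one fp16 model weight partition.
bit16_weight = torch.randn(4, dtype=torch.float16, requires_grad=True)

# clone().float() produces a non-leaf tensor still attached to
# bit16_weight's autograd graph; detach() cuts that history and
# yields a fresh fp32 leaf tensor with requires_grad=False.
master = bit16_weight.clone().float().detach()
assert master.is_leaf and not master.requires_grad

# Re-enabling requires_grad on the fresh leaf means its .grad field can
# be populated (ZeRO fills it manually from the reduced fp16 gradients),
# so the wrapped optimizer can step on the fp32 master copy -- while
# updates to `master` never backpropagate into `bit16_weight`.
master.requires_grad = True
assert master.is_leaf and master.requires_grad
```

In short: `detach()` makes the fp32 copy an independent leaf (otherwise the optimizer could not own its `.grad`), and `requires_grad = True` re-arms that leaf so gradient storage works on it.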
What's the difference in pipeline parallelism between DeepSpeed and Megatron?