This is because the gradients are synchronized across the DP group, so they are identical on every DP rank. Meanwhile, the squared norms of a parameter's shards have to be summed across the whole MP group to obtain the true total norm.
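Roughly, a minimal sketch of that logic in PyTorch (assuming `params` is this rank's parameter list and `mp_group` is the model-parallel process group; Megatron's extra handling of shared/duplicated parameters is omitted):

```python
import torch
import torch.distributed as dist

def clip_grad_norm_sketch(params, mp_group, max_norm):
    grads = [p.grad for p in params if p.grad is not None]
    # Gradients are already identical on every DP rank after the DP
    # all-reduce, so no reduction over the DP group is needed.
    total_sq = torch.zeros(1, device=grads[0].device)
    for g in grads:
        total_sq += g.float().norm() ** 2
    # Each MP rank only holds a shard of the model, so the squared norms
    # must be summed across the MP group before taking the square root.
    dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=mp_group)
    total_norm = total_sq.sqrt().item()
    # Scale all gradients down if the global norm exceeds max_norm.
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for g in grads:
            g.mul_(clip_coef)
    return total_norm
```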
Thanks for your reply. But Megatron already reduces the total norm across the MP group; see: https://github.com/NVIDIA/Megatron-LM/blob/8aa4619f2b2a57b5725026a50ebd2b15e8121482/megatron/optimizer/clip_grads.py#L105
Why do we do that again on the MoE gradients individually? Won't this cause double counting?
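For example (a back-of-the-envelope check, assuming an expert is replicated on both ranks of an MP group of size 2 and its local squared grad norm is `e`): the patch's MP all-reduce turns its contribution into `2e`, and Megatron's subsequent MP all-reduce of the total squared norm doubles it again to `4e`, whereas the correct contribution should be just `e`.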
The key point is that the experts are different across a DP group of Megatron-LM (and also across the MP group in previous versions of FastMoE), so we have to reduce their norms explicitly. So I suppose the group should be the DP group instead of the MP group here.
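A hedged sketch of that grouping (the names are assumptions: `dp_group`/`mp_group` are the data-/model-parallel process groups, and the `dp_comm == 'none'` check is a placeholder for however FastMoE actually tags expert parameters):

```python
import torch
import torch.distributed as dist

def moe_grad_norm_sketch(params, dp_group, mp_group):
    device = next(p.grad for p in params if p.grad is not None).device
    expert_sq = torch.zeros(1, device=device)
    dense_sq = torch.zeros(1, device=device)
    for p in params:
        if p.grad is None:
            continue
        sq = p.grad.float().norm() ** 2
        if getattr(p, 'dp_comm', None) == 'none':  # assumed expert marker
            expert_sq += sq
        else:
            dense_sq += sq
    # Experts are distinct on every DP rank (and replicated within the MP
    # group), so their squared norms are summed across the DP group once.
    dist.all_reduce(expert_sq, op=dist.ReduceOp.SUM, group=dp_group)
    # Dense parameters are sharded across MP ranks and replicated across
    # DP ranks, so their squared norms are summed across the MP group only.
    dist.all_reduce(dense_sq, op=dist.ReduceOp.SUM, group=mp_group)
    return (expert_sq + dense_sq).sqrt().item()
```

This way each parameter's squared norm enters the total exactly once, so nothing is double counted.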
The blame shows that @zms1999 changed this code from the world comm to the MP comm. Could you please recall our intuition for doing so when you made the change 9 months ago?
I notice that the L2 norm of the expert gradients is reduced twice in the model-parallel group; please see: https://github.com/laekov/fastmoe/blob/cd8372b3a8a5e73d46d2b463ec30995631cfc181/examples/megatron/clip-grad-v2.2.patch#L44C2-L44C2. It is a good idea to add up the squared gradients of all experts. But why reduce in the model-parallel group here instead of the data-parallel group? What are the considerations?
Thanks.