laekov / fastmoe

A fast MoE impl for PyTorch
https://fastmoe.ai
Apache License 2.0

MoE L2 norm reduce in Megatron #169

Closed · blankde closed this 3 months ago

blankde commented 1 year ago

I noticed that the L2 norm for the experts ends up being reduced twice in the model parallel group; please see: https://github.com/laekov/fastmoe/blob/cd8372b3a8a5e73d46d2b463ec30995631cfc181/examples/megatron/clip-grad-v2.2.patch#L44C2-L44C2. Adding up the squared gradients of all experts is a good idea, but why reduce them in the model parallel group here instead of the data parallel group? What are the considerations?
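For reference, here is a minimal sketch of the reduction pattern I mean (a hypothetical helper, not the actual patch code): the squared L2 norms of the expert gradients are summed locally, and that scalar is all-reduced over a chosen process group before taking the square root.

```python
import torch
import torch.distributed as dist

def expert_grad_norm(expert_params, group):
    # Sum squared expert-gradient norms on this rank.
    local_sq = torch.zeros(1, device=torch.cuda.current_device())
    for p in expert_params:
        if p.grad is not None:
            local_sq += p.grad.detach().float().norm(2) ** 2
    # All-reduce the scalar over `group`: the MP group in the current
    # patch, or the DP group as asked about here.
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM, group=group)
    return local_sq.sqrt()
```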

Thanks.

laekov commented 1 year ago

This is because the gradients are synchronized across the DP group, so they are identical there. Meanwhile, a parameter tensor is sharded across the MP group, so its sum has to be collected from the whole MP group.
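In other words (a toy illustration with made-up numbers):

```python
dp_size, mp_size = 4, 2

# Replicated across DP: every DP rank holds the identical squared norm,
# so summing over the DP group would overcount it dp_size times.
replicated_sq = 9.0
overcounted = dp_size * replicated_sq     # 36.0 if reduced over DP

# Sharded across MP: each MP rank only holds its shard's squared norm,
# so summing over the MP group is what recovers the full value.
shard_sq = [5.0, 4.0]
full_sq = sum(shard_sq)                   # 9.0
```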

blankde commented 1 year ago

Thanks for your reply. But Megatron already reduces the total norm across the MP group; see: https://github.com/NVIDIA/Megatron-LM/blob/8aa4619f2b2a57b5725026a50ebd2b15e8121482/megatron/optimizer/clip_grads.py#L105

Why do we reduce the MoE gradients individually here as well? Won't this cause double counting?
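To spell out the concern with made-up numbers: if the expert squared norm is already all-reduced over the MP group in the patch, every MP rank then contributes that same full value when Megatron all-reduces the total squared norm over the MP group again.

```python
mp_size = 2

# After the patch's MP all-reduce, each MP rank holds the *full*
# expert squared norm.
expert_sq_on_each_rank = 4.0

# Megatron's clip_grads then sums total_norm**2 over the MP group
# once more, so the expert term becomes
expert_sq_in_total = mp_size * expert_sq_on_each_rank   # 8.0 instead of 4.0
```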

laekov commented 1 year ago

The key point is that the experts are different across a DP group in Megatron-LM (and also across the MP group in previous versions of FastMoE), so we have to reduce them. So I suppose the group should be the DP group instead of the MP group here.
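A hedged sketch of what reducing over the DP group instead could look like (hypothetical helper; assuming Megatron's `mpu.get_data_parallel_group()` is used to obtain that group):

```python
import torch.distributed as dist
from megatron import mpu  # assumption: Megatron's parallel-state utilities

def reduce_expert_sq_norm_over_dp(expert_sq_norm):
    # expert_sq_norm: a 1-element CUDA tensor holding this rank's sum of
    # squared expert-gradient norms. Each DP rank owns different experts,
    # so the sum is collected over the data parallel group, not the MP group.
    dist.all_reduce(expert_sq_norm,
                    op=dist.ReduceOp.SUM,
                    group=mpu.get_data_parallel_group())
    return expert_sq_norm
```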

The blame shows that @zms1999 changed this code from the world comm to the MP comm. Can you please recall our intuition for doing so when you made the change 9 months ago?