Is your feature request related to a problem? Please describe.
From what I have noticed when looking at the MoE implementation, the expert_group_name is not configurable: https://github.com/microsoft/DeepSpeed/blob/6de31de73fdf0a5e0f90c92e10cff4e72e91cf65/deepspeed/moe/layer.py#L58

For one, this means every MoE submodule's parameters get the same group name, something like ep_size_1, which is not very useful from a logging/tracking perspective.
It also means that if multiple MoE submodules are split from the same initial optimizer parameter group by split_params_into_different_moe_groups_for_optimizer, parameters from two different MoE submodules can end up intermixed in the same group once the size limit is hit, when the parameters of each submodule could have been kept together had each submodule carried a unique expert_group_name.
Additionally, since the expert number is not tracked, parameters from different experts can also land in the same group; if they were keyed by expert_number, parameters belonging to the same expert could be kept together.
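To make the naming issue concrete, here is a toy sketch (plain Python, not DeepSpeed code; the layer names are made up) of what the current scheme produces when two MoE submodules happen to use the same ep_size:

# Toy illustration of the current naming scheme from deepspeed/moe/layer.py,
# where the group name is derived only from ep_size.
moe_submodules = {"block_0_moe": 1, "block_1_moe": 1}  # hypothetical layers, both with ep_size=1

group_names = {layer: f"ep_size_{ep_size}" for layer, ep_size in moe_submodules.items()}
print(group_names)  # {'block_0_moe': 'ep_size_1', 'block_1_moe': 'ep_size_1'}

# Both submodules end up with the same expert_group_name, so anything keyed on
# that name (logging, optimizer group splitting) cannot tell them apart.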
Based on the answer I received to an earlier question (What are the benefits to limiting param_group size?), "we split the groups to save memory because only one of these groups will be on the GPU in full precision at a time." Is this right?
My question is: would it benefit us to keep parameter groups homogeneous, either per MoE submodule or per expert number, rather than having them all intermixed? Would that help locality or anything along those lines?
Describe the solution you'd like
from typing import Optional

import torch.nn as nn


class MoE(nn.Module):
    """Initialize an MoE layer.

    Arguments:
        ...
        expert_group_name (str, optional): default=None, the name of the expert group.
    """

    def __init__(self,
                 ...
                 expert_group_name: Optional[str] = None) -> None:
        ...
        # Fall back to the current naming scheme when no name is given.
        self.expert_group_name = expert_group_name or f"ep_size_{self.ep_size}"


class Experts(nn.Module):
    def __init__(self, expert: nn.Module, num_local_experts: int = 1, expert_group_name: Optional[str] = None) -> None:
        ...
        # Tag every parameter with the index of the expert it belongs to.
        for i, expert in enumerate(self.deepspeed_experts):
            for param in expert.parameters():
                param.expert_number = i
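For illustration, a hypothetical caller could then give each MoE submodule its own name. The expert_group_name argument is the proposed addition; the remaining keyword arguments follow the existing MoE constructor, and a properly initialized distributed/expert-parallel setup is assumed:

import torch.nn as nn
from deepspeed.moe.layer import MoE

expert = nn.Linear(512, 512)  # any expert module

# Hypothetical usage of the proposed argument: each MoE submodule gets a
# distinct, human-readable group name instead of the shared "ep_size_1".
moe_block_0 = MoE(hidden_size=512, expert=expert, num_experts=8, expert_group_name="block_0_moe")
moe_block_1 = MoE(hidden_size=512, expert=expert, num_experts=8, expert_group_name="block_1_moe")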
And then in split_params_into_different_moe_groups_for_optimizer, we can add another layer of dicts for the expert_number so that generated param groups will always contain parameters from the same submodule and expert.
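A minimal sketch of that nesting, assuming each MoE parameter carries a group_name attribute (as the existing Experts class appears to set) plus the proposed expert_number attribute; the helper name, max_group_size handling, and group-dict keys are illustrative rather than the exact DeepSpeed implementation:

from collections import defaultdict

def split_moe_params_by_submodule_and_expert(moe_params, max_group_size):
    """Illustrative sketch: bucket parameters by (expert_group_name, expert_number)
    first, then apply the size limit within each bucket, so a generated param
    group never mixes different MoE submodules or different experts."""
    buckets = defaultdict(lambda: defaultdict(list))  # group_name -> expert_number -> [params]
    for param in moe_params:
        buckets[param.group_name][getattr(param, "expert_number", 0)].append(param)

    param_groups = []
    for group_name, by_expert in buckets.items():
        for expert_number, params in by_expert.items():
            current, current_size = [], 0
            for p in params:
                # Start a new group once the size limit would be exceeded.
                if current and current_size + p.numel() > max_group_size:
                    param_groups.append({"params": current, "moe": True, "name": group_name})
                    current, current_size = [], 0
                current.append(p)
                current_size += p.numel()
            if current:
                param_groups.append({"params": current, "moe": True, "name": group_name})
    return param_groups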