microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] Should MoE parameter groups be partitionable by expert_group_name or expert number? #5064

Open ringohoffman opened 9 months ago

ringohoffman commented 9 months ago

Is your feature request related to a problem? Please describe. From what I can see in the MoE implementation, expert_group_name is not configurable: https://github.com/microsoft/DeepSpeed/blob/6de31de73fdf0a5e0f90c92e10cff4e72e91cf65/deepspeed/moe/layer.py#L58

For one, this means every MoE submodule's parameters get the same group name, something like ep_size_1, which is not very useful from a logging/tracking perspective.
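For illustration, here is a minimal sketch of the collision (not tested here; it assumes MoE can be constructed eagerly on a single process and that expert parameters are tagged with allreduce and group_name attributes, as in the current Experts implementation; the toy nn.Linear expert is just for the example):

import torch.nn as nn
from deepspeed.moe.layer import MoE

# Two unrelated MoE submodules, built with the same ep_size.
moe_a = MoE(hidden_size=16, expert=nn.Linear(16, 16), num_experts=2, ep_size=1)
moe_b = MoE(hidden_size=16, expert=nn.Linear(16, 16), num_experts=2, ep_size=1)

# Both derive their group name from ep_size alone, so the names collide.
print(moe_a.expert_group_name)  # ep_size_1
print(moe_b.expert_group_name)  # ep_size_1

# Every expert parameter of both submodules carries the same group_name tag.
for name, param in list(moe_a.named_parameters()) + list(moe_b.named_parameters()):
    if not getattr(param, "allreduce", True):
        print(name, param.group_name)  # ep_size_1 for all of them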

The shared name also means that if multiple MoE submodules are split from the same initial optimizer parameter group by split_params_into_different_moe_groups_for_optimizer, parameters from two different MoE submodules can end up intermixed in the same group once the size limit is hit, whereas if each submodule were tracked by a unique expert_group_name, each group would contain parameters from only one submodule.

Additionally, since the expert number is not tracked, parameters from different experts can also land in the same group; if parameters were keyed by expert_number, all of an expert's parameters could be kept together in one group.
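To make the intermixing concrete, here is a toy, self-contained sketch of the splitting behaviour (a deliberate simplification of split_params_into_different_moe_groups_for_optimizer, not the actual implementation; the submodule/expert labels and size cap are made up): parameters are bucketed only by their shared group name and then chunked by size, so a chunk can straddle two submodules and two experts.

from collections import defaultdict

# Toy stand-ins for expert parameters: (submodule, expert_number, numel).
params = [
    ("moe_a", 0, 100), ("moe_a", 1, 100),
    ("moe_b", 0, 100), ("moe_b", 1, 100),
]

MAX_GROUP_SIZE = 300  # hypothetical cap, analogous to the real size limit

# Today every expert parameter carries the same group name, so there is one bucket.
buckets = defaultdict(list)
for p in params:
    buckets["ep_size_1"].append(p)

# Chunk the single bucket by the size cap.
groups, current, current_size = [], [], 0
for p in buckets["ep_size_1"]:
    if current and current_size + p[2] > MAX_GROUP_SIZE:
        groups.append(current)
        current, current_size = [], 0
    current.append(p)
    current_size += p[2]
groups.append(current)

print(groups[0])  # [('moe_a', 0, 100), ('moe_a', 1, 100), ('moe_b', 0, 100)] -- two submodules mixed
print(groups[1])  # [('moe_b', 1, 100)]

With a unique expert_group_name per submodule (and expert_number tracked), the bucketing step alone would already keep moe_a and moe_b apart.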

Based on the answer I received to a question here (What are the benefits to limiting param_group size?): "we split the groups to save memory because only one of these groups will be on the GPU in full precision at a time." Is this right?

My question is: would it benefit us to keep parameter groups homogeneous with respect to a specific MoE submodule or expert number, rather than having them all intermixed? Would it help locality or anything like that to keep the groups more homogeneous in this way?

Describe the solution you'd like

from typing import Optional

import torch.nn as nn


class MoE(nn.Module):
    """Initialize an MoE layer.

    Arguments:
        ...
        expert_group_name (str, optional): default=None, the name of the expert group.
    """

    def __init__(self,
                 ...
                 expert_group_name: Optional[str] = None) -> None:
        ...
        # Fall back to the current naming scheme when no explicit name is given.
        self.expert_group_name = expert_group_name or f"ep_size_{self.ep_size}"


class Experts(nn.Module):

    def __init__(self, expert: nn.Module, num_local_experts: int = 1, expert_group_name: Optional[str] = None) -> None:
        ...
        # Tag each local expert's parameters with its expert number
        # (loop variable renamed so it does not shadow the expert argument).
        for i, local_expert in enumerate(self.deepspeed_experts):
            for param in local_expert.parameters():
                param.expert_number = i

Then, in split_params_into_different_moe_groups_for_optimizer, we can add another layer of dicts keyed by expert_number, so that the generated param groups always contain parameters from the same submodule and the same expert.
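A rough sketch of what that extra level of grouping could look like (a hypothetical helper, simplified; it ignores the size cap and any other keys the real util copies from the base group, and it relies on the group_name and expert_number attributes proposed above):

from collections import defaultdict
from typing import Dict, List

import torch


def group_moe_params_by_name_and_expert(moe_params: List[torch.nn.Parameter]) -> List[Dict]:
    # Two-level bucketing: first by expert_group_name, then by expert_number.
    nested: Dict[str, Dict[int, List[torch.nn.Parameter]]] = defaultdict(lambda: defaultdict(list))
    for param in moe_params:
        nested[param.group_name][param.expert_number].append(param)

    # Each (submodule, expert) pair becomes its own optimizer param group.
    param_groups = []
    for group_name, by_expert in nested.items():
        for expert_number, params in by_expert.items():
            param_groups.append({
                "params": params,
                "moe": True,
                "name": f"{group_name}_expert_{expert_number}",
            })
    return param_groups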

ringohoffman commented 9 months ago

@mrwyattii