microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

How to create a custom expert with tutel? #226

Open zws98 opened 4 months ago

zws98 commented 4 months ago

Code:

self._moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': top_value, 'fp32_gate': args.fp32_gate},
    experts={'type': 'ffn', 'count_per_node': num_local_experts, 'hidden_size_per_expert': hidden_size, 'activation_fn': lambda x: F.relu(x)},
    model_dim=model_dim,
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
    seeds=(1, dist_rank + 1, 1),
    a2a_ffn_overlap_degree=a2a_ffn_overlap_degree,
)

How can I define a custom expert, e.g., one with only a single MLP layer?

ghostplant commented 4 months ago

You can follow this example: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_demo.py, which can be executed with: python3 -m tutel.examples.helloworld_demo --batch_size=16
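
For the single-MLP case, a minimal sketch in the spirit of that demo could look like the code below. The constructor arguments and the forward(x, ctx) interface follow the demo's custom-expert pattern; the experts-dict keys used to register it (e.g. {'type': 'custom', 'module': ...}) are assumptions here and should be checked against helloworld_demo.py.

import torch

class SingleMLPExpert(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config):
        super().__init__()
        # One weight matrix and bias per local expert, stacked on dim 0.
        self.W = torch.nn.Parameter(torch.empty(local_experts, model_dim, model_dim))
        self.b = torch.nn.Parameter(torch.zeros(local_experts, 1, model_dim))
        torch.nn.init.xavier_uniform_(self.W)

    def forward(self, x, ctx):
        # x is dispatched per expert: (local_experts, tokens_per_expert, model_dim)
        if ctx.sharded_count > 1:
            raise Exception("`sharded_count > 1` is not implemented within this expert.")
        return torch.matmul(x, self.W) + self.b

It would then replace the built-in 'ffn' entry in the experts argument of tutel_moe.moe_layer, registered the same way the demo registers CustomExpertDemo.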

zws98 commented 4 months ago

thanks a lot!

zws98 commented 4 months ago

What if I want to feed another parameter into "class CustomExpertDemo(torch.nn.Module):"? How can I revise the code in tutel?

zws98 commented 4 months ago

e.g., def forward(self, x, ctx, anew_param):

ghostplant commented 4 months ago

Is that a static parameter that can be set just in the __init__ function of CustomExpertDemo?

zws98 commented 4 months ago

Nope, it is a learnable parameter initialized outside the class "CustomExpertDemo".

ghostplant commented 4 months ago

That still needs a few API upgrades to meet your requirement.

zws98 commented 4 months ago

Thanks. Is there a way to modify it after installing tutel (e.g., revising xx.py in the installed package)?

ghostplant commented 4 months ago

You need to feed the extra argument data here: https://github.com/microsoft/tutel/blob/main/tutel/impls/moe_layer.py#L238, where self.experts is the layer object created from your custom CustomExpertDemo.

You also need to extend the corresponding argument list in the forward function to match the data you feed: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_demo.py#L101

If you cannot clone tutel, apply the changes above in the source, and install from source, then you have to find the location of the installed file (maybe at /usr/..../tutel/impls/moe_layer.py) and apply the changes there.
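
As a rough sketch of those two edits (the name extra_bias is hypothetical, and the exact call at moe_layer.py#L238 is not reproduced here, so the library-side change is only indicated schematically in a comment):

import torch

class CustomExpertDemo(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config):
        super().__init__()
        self.W = torch.nn.Parameter(torch.empty(local_experts, model_dim, model_dim))
        torch.nn.init.xavier_uniform_(self.W)

    def forward(self, x, ctx, extra_bias):
        # `extra_bias` is the learnable tensor created outside this class; it is
        # simply threaded through from the call site in moe_layer.py.
        return torch.matmul(x, self.W) + extra_bias

# Library side (schematic): at moe_layer.py#L238, wherever the dispatched input is
# passed to self.experts(...), the same extra tensor has to be appended to that call,
# e.g. self.experts(dispatched_input, self, extra_bias). The exact expression at that
# line may differ; check the installed source.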

zws98 commented 4 months ago

Thanks a lot.

zws98 commented 3 months ago

When I use the custom expert (CustomExpert_lora below), it stops here: if ctx.sharded_count > 1: raise Exception("sharded_count > 1 is not implemented within this expert, Model parallel is disabled.")

import math

import torch
import torch.nn as nn
from torch.nn import init


class CustomExpert_lora(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config, act_layer=nn.GELU):
        super().__init__()
        self.r = 8
        self.scale = 1 / math.sqrt(self.r)
        # Low-rank factors, one pair per local expert.
        self.lora_A1 = torch.nn.Parameter(torch.empty(local_experts, self.r, model_dim))
        self.lora_B1 = torch.nn.Parameter(torch.empty(local_experts, model_dim, self.r))
        self.act = act_layer()
        self.lora_A2 = torch.nn.Parameter(torch.empty(local_experts, self.r, model_dim))
        self.lora_B2 = torch.nn.Parameter(torch.empty(local_experts, model_dim, self.r))
        self.reset_parameters()

    def reset_parameters(self):
        init.kaiming_normal_(self.lora_A1)
        self.lora_A1.data *= self.scale
        init.constant_(self.lora_B1, 0)
        init.kaiming_normal_(self.lora_A2)
        self.lora_A2.data *= self.scale
        init.constant_(self.lora_B2, 0)

    def forward(self, x, ctx):
        if ctx.sharded_count > 1:
            raise Exception("`sharded_count > 1` is not implemented within this expert, Model parallel is disabled.")

        # Compose each low-rank pair into a (local_experts, model_dim, model_dim) update;
        # B @ A keeps the shapes consistent with x of shape (local_experts, tokens, model_dim).
        t1 = torch.matmul(self.lora_B1, self.lora_A1)
        t2 = torch.matmul(self.lora_B2, self.lora_A2)
        y = torch.matmul(x, t1)
        y = self.act(y)
        y = torch.matmul(y, t2)
        return y
ghostplant commented 3 months ago

(quoting the code from the previous comment)

What is the value of adaptive_r in your moe forward setting?

zws98 commented 3 months ago

Where can I find "adaptive_r"?

zws98 commented 3 months ago

I have not changed the value of adaptive_r. I directly replaced the above-mentioned custom MLP with the default FFN and the program is working fine.

ghostplant commented 3 months ago

So it looks like num_global_experts is smaller than the number of GPUs, right?

zws98 commented 3 months ago

num_global_experts=2, self.world_size=8

ghostplant commented 3 months ago

Yes. When the execution setting has num_global_experts < self.world_size, you have to handle the sharded_count > 1 case, which describes how each expert's parameters are partitioned across more than one GPU. Typically, you can implement expert-data parallelism to enable this setting, which requires creating sharded parameters in initialization and then all_gather-ing the sharded parameters in the forward function. The built-in FFN layer already includes this implementation, but I'll share a simpler example with you.
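
To make that concrete before the examples shared below: with num_global_experts=2 and world_size=8, each expert's parameters end up shared by 8 // 2 = 4 ranks, i.e. sharded_count = 4. A schematic sketch of such an expert follows, written with plain torch.distributed rather than tutel's own communication helpers, with shapes and process-group handling simplified; the linked examples are the authoritative reference.

import torch
import torch.distributed as dist

class ShardedExpertSketch(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config):
        super().__init__()
        assert model_dim % sharded_count == 0
        self.sharded_count = sharded_count
        # Each of the sharded_count ranks that share this expert owns only a
        # 1/sharded_count slice of the weight (sliced along the output dim here).
        self.W_shard = torch.nn.Parameter(
            torch.empty(local_experts, model_dim, model_dim // sharded_count))
        torch.nn.init.xavier_uniform_(self.W_shard)

    def forward(self, x, ctx):
        if self.sharded_count > 1:
            # Reassemble the full weight from the ranks sharing this expert. A real
            # implementation would (a) all_gather only within the shard group rather
            # than the default world group, and (b) use an autograd-aware gather so
            # gradients flow back to the local shard, as the built-in FFN does.
            shards = [torch.zeros_like(self.W_shard) for _ in range(self.sharded_count)]
            dist.all_gather(shards, self.W_shard)
            full_W = torch.cat(shards, dim=-1)  # (local_experts, model_dim, model_dim)
        else:
            full_W = self.W_shard
        return torch.matmul(x, full_W)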

zws98 commented 3 months ago

thanks a lot!

ghostplant commented 3 months ago

Please follow this example for handling sharded_count: https://github.com/microsoft/tutel/blob/main/tutel/experts/llama_ffn.py
And another end-to-end example: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_custom_expert_sharded.py