huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

Add DynamicMoE support for Mixtral #1511

Open kwisniewski98 opened 5 days ago

kwisniewski98 commented 5 days ago

Add DynamicMoE support for Mixtral

astachowiczhabana commented 4 days ago

@libinta this commit is required for the next OH release

Wei-Lin-Intel commented 13 hours ago

Please consider not using this API with combined w1 and w3. Dynamic MoE may also support training in the future, and this method would change the weight ordering when saving checkpoints. Dynamic MoE also supports passing w1, w2 and w3 separately; it is not necessary to align with vLLM's Mixtral:

    final_hidden_states = torch.ops.hpu.mixture_of_experts(
        hidden_states=hidden_states,
        expert_routing_table=selected_experts,
        router_weights=routing_weights,
        w1=w1_list,
        w2=w3_list,
        w3=w2_list,
        permuted_weights=True,
        activation=act_fn,
        experts_min=0,
        experts_max=self.num_experts - 1,
    )
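
For orientation only, a minimal sketch of how such a separate-weights call could sit inside a Mixtral sparse-MoE forward. The mixture_of_experts call is copied verbatim from the comment above; everything around it (self.gate, self.experts with w1/w2/w3 linear sub-modules, self.top_k, self.num_experts, and the "silu" activation value) is assumed from the standard transformers Mixtral layout and is not necessarily what this PR implements.

    import torch
    import torch.nn.functional as F

    def forward(self, hidden_states):
        batch_size, seq_len, hidden_dim = hidden_states.shape
        hidden_states = hidden_states.view(-1, hidden_dim)

        # Standard Mixtral routing: softmax over the gate logits, then top-k experts per token.
        router_logits = self.gate(hidden_states)
        routing_weights = F.softmax(router_logits, dim=-1, dtype=torch.float)
        routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
        routing_weights = routing_weights.to(hidden_states.dtype)

        # Pass each expert's projections as separate lists instead of fusing w1 and w3,
        # so the checkpoint weight ordering is left untouched (attribute names assumed).
        w1_list = [expert.w1.weight for expert in self.experts]
        w2_list = [expert.w2.weight for expert in self.experts]
        w3_list = [expert.w3.weight for expert in self.experts]

        act_fn = "silu"  # assumption: Mixtral's hidden activation; the exact form expected here may differ

        # Call copied verbatim from the suggestion above, including its w2/w3 argument ordering.
        final_hidden_states = torch.ops.hpu.mixture_of_experts(
            hidden_states=hidden_states,
            expert_routing_table=selected_experts,
            router_weights=routing_weights,
            w1=w1_list,
            w2=w3_list,
            w3=w2_list,
            permuted_weights=True,
            activation=act_fn,
            experts_min=0,
            experts_max=self.num_experts - 1,
        )
        return final_hidden_states.view(batch_size, seq_len, hidden_dim)

Keeping the three weight lists separate is what leaves the saved checkpoint's weight ordering unchanged, which is the concern raised in the comment above.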

regisss commented 3 hours ago

What's the difference between this PR and #1518?

astachowiczhabana commented 3 hours ago

This is a duplicate, sorry, my mistake.

kwisniewski98 commented 1 hour ago

> Please consider not using this API with combined w1 and w3. Dynamic MoE may also support training in the future, and this method would change the weight ordering when saving checkpoints. Dynamic MoE also supports passing w1, w2 and w3 separately; it is not necessary to align with vLLM's Mixtral: final_hidden_states = torch.ops.hpu.mixture_of_experts(hidden_states=hidden_states, expert_routing_table=selected_experts, router_weights=routing_weights, w1=w1_list, w2=w3_list, w3=w2_list, permuted_weights=True, activation=act_fn, experts_min=0, experts_max=self.num_experts - 1)

Done, I've used the op you suggested.

Wei-Lin-Intel commented 36 minutes ago

LGTM, thanks