Can we implement the expert parallel strategy for MoE to fully exploit the sparse activation property? Ideally, MoE should only use compute on the order of its active parameters, but the current implementation uses the same compute as a dense model.
Expert parallelism is very similar to data parallelism across multiple GPUs; the only difference is that the experts live on separate GPUs and the tokens are permuted during the MoE layer's forward pass, as shown in the figure below.
I can help implement the MoE layer, but I'm curious how to implement data parallelism with vLLM?
(Diagram from FastMoE)
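To make the permute / un-permute pattern concrete, here is a minimal single-process sketch of what the MoE forward pass does with routed tokens. It assumes top-1 routing and stands in for the real cross-GPU all-to-all with a local sort; all names here are illustrative, not vLLM or FastMoE APIs.

```python
import numpy as np

def moe_forward_ep(tokens, expert_ids, experts):
    """Simulate the permute -> expert compute -> un-permute pattern of
    expert parallelism on a single host (top-1 routing assumed)."""
    # Permute: sort token indices so tokens routed to the same expert are
    # contiguous. In real expert parallelism this step is an all-to-all
    # that ships each token to the GPU holding its expert.
    order = np.argsort(expert_ids, kind="stable")
    permuted = tokens[order]
    sorted_ids = expert_ids[order]

    # Each expert processes only its own slice of tokens, so total compute
    # scales with the number of routed (active) tokens per expert.
    out = np.empty_like(permuted)
    for e, expert in enumerate(experts):
        mask = sorted_ids == e
        out[mask] = expert(permuted[mask])

    # Un-permute: scatter results back to the original token order
    # (the reverse all-to-all in the distributed setting).
    result = np.empty_like(out)
    result[order] = out
    return result

# Toy usage: 4 tokens of dim 2, two "experts" (simple elementwise maps).
tokens = np.arange(8, dtype=np.float64).reshape(4, 2)
expert_ids = np.array([1, 0, 1, 0])
experts = [lambda x: x * 2.0, lambda x: x + 1.0]
result = moe_forward_ep(tokens, expert_ids, experts)
```

In the distributed version, the stable sort plus gather is replaced by an all-to-all collective, but the token order bookkeeping is exactly the same: whatever permutation dispatches tokens to experts must be inverted before the residual connection.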