vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Feature request: Expert parallel for MoE architectures #2405

Open imoneoi opened 9 months ago

imoneoi commented 9 months ago

Can we implement the expert-parallel strategy for MoE to fully exploit the sparse activation property? Ideally, an MoE model should only spend compute on the order of its active parameters, but the current implementation uses the same compute as a dense model.
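To make that concrete, here is a minimal PyTorch sketch (not vLLM's actual implementation) of a single MoE layer computed two ways: the "dense" path multiplies every token by every expert and masks out the unselected ones, so FLOPs scale with the total number of experts E, while the sparse path groups tokens by their selected experts and only runs the top-k experts per token. All names below are illustrative.

```python
import torch

def moe_dense_compute(x, expert_weights, gate_logits, top_k):
    # "Dense" approach: every token goes through every expert, and the
    # non-selected experts are zeroed out by the gate mask afterwards.
    # Compute scales with the total number of experts E.
    gates = torch.softmax(gate_logits, dim=-1)                     # [T, E]
    topk_vals, topk_idx = gates.topk(top_k, dim=-1)                # [T, k]
    mask = torch.zeros_like(gates).scatter_(-1, topk_idx, topk_vals)
    all_outputs = torch.einsum("td,edh->teh", x, expert_weights)   # all experts
    return torch.einsum("te,teh->th", mask, all_outputs)

def moe_sparse_compute(x, expert_weights, gate_logits, top_k):
    # Sparse approach: run each expert only on the tokens routed to it.
    # Compute scales with k (active experts), not E.
    gates = torch.softmax(gate_logits, dim=-1)
    topk_vals, topk_idx = gates.topk(top_k, dim=-1)
    out = torch.zeros(x.shape[0], expert_weights.shape[-1],
                      dtype=x.dtype, device=x.device)
    for e in range(expert_weights.shape[0]):
        token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue
        out[token_ids] += topk_vals[token_ids, slot, None] * (
            x[token_ids] @ expert_weights[e])
    return out
```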

Expert parallelism is very similar to data parallelism across multiple GPUs; the only difference is that the experts live on separate GPUs and the tokens are permuted during the MoE layer's forward pass, as shown in the figure below.

I can help implement the MoE layer, but I'm curious how data parallelism could be implemented in vLLM.

(Figure: expert-parallel token dispatch, diagram from FastMoE)
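A rough sketch of the permute + all-to-all dispatch described above, assuming `torch.distributed` is initialized, one expert per rank, and top-1 routing for simplicity. None of these names are existing vLLM APIs; this only illustrates the communication pattern.

```python
import torch
import torch.distributed as dist

def expert_parallel_moe_forward(hidden, expert_rank, local_expert_fn):
    """Hypothetical expert-parallel MoE forward, one expert per rank.

    hidden:          [num_tokens, hidden_dim] tokens on this rank
    expert_rank:     [num_tokens] rank owning the expert chosen per token
    local_expert_fn: callable applying this rank's expert FFN
    """
    world_size = dist.get_world_size()

    # 1. Permute: group this rank's tokens by destination rank.
    order = torch.argsort(expert_rank)
    send_buf = hidden[order]
    send_counts = torch.bincount(expert_rank, minlength=world_size)

    # 2. Exchange counts so each rank knows how many tokens it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # 3. All-to-all dispatch: every rank receives the tokens that were
    #    routed to its local expert.
    recv_buf = hidden.new_empty(int(recv_counts.sum()), hidden.shape[1])
    dist.all_to_all_single(recv_buf, send_buf,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())

    # 4. Run the local expert on the received tokens.
    expert_out = local_expert_fn(recv_buf)

    # 5. Reverse all-to-all to return results to the token-owning ranks,
    #    then un-permute back to the original token order.
    out_buf = hidden.new_empty(send_buf.shape)
    dist.all_to_all_single(out_buf, expert_out,
                           output_split_sizes=send_counts.tolist(),
                           input_split_sizes=recv_counts.tolist())
    result = torch.empty_like(out_buf)
    result[order] = out_buf
    return result
```

The open question is how to combine this with vLLM's scheduler, since the attention layers would then effectively run data-parallel while the MoE layers shard by expert.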

hmellor commented 6 months ago

@imoneoi have you done any work on this feature?

Shamauk commented 3 months ago

Any updates?