
[Feature] Expert parallelism support #1435

Open · chongli-uw opened this issue 2 months ago

chongli-uw commented 2 months ago

Motivation

Hi team, first of all, thanks so much for such a great project. I am wondering if there is a plan to support Expert Parallelism for MoE models?

Related resources

https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html
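
For readers new to the term, here is a minimal, single-process sketch of the expert-parallel idea (illustrative only, not TensorRT-LLM or SGLang code; every name below is made up): the experts of an MoE layer are partitioned across ranks, and each token only needs to reach the ranks that own its selected experts.

```python
# Illustrative only: expert parallelism partitions the experts of an MoE layer
# across ranks, so each token only needs to reach the ranks owning its experts.
import torch

num_experts, ep_size, top_k, hidden = 8, 4, 2, 16
experts_per_rank = num_experts // ep_size   # each rank owns a slice of the experts

tokens = torch.randn(5, hidden)
router = torch.nn.Linear(hidden, num_experts, bias=False)

# Top-k routing: pick which experts each token should visit.
topk_w, topk_ids = torch.topk(torch.softmax(router(tokens), dim=-1), top_k, dim=-1)

# Map each selected expert to its owning rank; in a real implementation this
# mapping drives an all-to-all dispatch of the token hidden states.
owner_rank = topk_ids // experts_per_rank
for rank in range(ep_size):
    tok, slot = torch.nonzero(owner_rank == rank, as_tuple=True)
    local_ids = (topk_ids[tok, slot] % experts_per_rank).tolist()
    print(f"rank {rank} receives tokens {tok.tolist()} for local experts {local_ids}")
```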

merrymercy commented 2 months ago

https://github.com/sgl-project/sglang/blob/441c22db8cbcb005b5f005b991e8aa1a65d79bb6/python/sglang/srt/models/mixtral_quant.py#L86-L150

This is an early example.
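
For context, a paraphrased, self-contained sketch of the pattern that example uses (this is not the actual SGLang class; class and variable names are illustrative): each rank instantiates only the experts assigned to it, applies them to the tokens routed to those experts, and the partial outputs are summed across ranks, e.g. with a tensor-parallel all-reduce.

```python
# Paraphrased sketch of an unfused, per-expert-MLP MoE where each rank keeps only
# its own experts (illustrative, not the SGLang MixtralMoE implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleExpert(nn.Module):
    def __init__(self, hidden, ffn):
        super().__init__()
        self.w1 = nn.Linear(hidden, ffn, bias=False)
        self.w2 = nn.Linear(ffn, hidden, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)))

class UnfusedMoE(nn.Module):
    def __init__(self, num_experts, top_k, hidden, ffn, rank, world_size):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        # Only the experts owned by this rank are materialized locally.
        self.local_ids = list(range(rank, num_experts, world_size))
        self.experts = nn.ModuleDict({str(i): SimpleExpert(hidden, ffn) for i in self.local_ids})

    def forward(self, x):  # x: [num_tokens, hidden]
        topk_w, topk_ids = torch.topk(torch.softmax(self.gate(x), dim=-1), self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for i in self.local_ids:
            tok, slot = torch.nonzero(topk_ids == i, as_tuple=True)  # tokens routed here
            if tok.numel():
                out[tok] += topk_w[tok, slot, None] * self.experts[str(i)](x[tok])
        # In the real model, an all-reduce across ranks would sum these partial
        # outputs with the contributions of the experts held on other ranks.
        return out

print(UnfusedMoE(8, 2, 16, 32, rank=0, world_size=4)(torch.randn(4, 16)).shape)
```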

liangzelang commented 1 week ago

> https://github.com/sgl-project/sglang/blob/441c22db8cbcb005b5f005b991e8aa1a65d79bb6/python/sglang/srt/models/mixtral_quant.py#L86-L150
>
> This is an early example.

@merrymercy Hi, has any progress been made on this issue? The example you provided earlier uses per-expert MLP modules rather than FusedMoE. How can we enable expert parallelism with the current Mixtral/DeepSeek-V2 models now that they use FusedMoE? Do you have a modified example?
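
For discussion purposes, here is one common way EP is layered onto a fused expert layout, sketched with plain tensors (this is not the FusedMoE API; the weight names, shapes, and rank handling are assumptions): the stacked expert weights are sliced along the expert dimension per EP rank, and token activations are exchanged between ranks so that each rank only runs its local expert slice.

```python
# Illustrative EP sharding of a fused/stacked expert weight layout (assumed
# shapes and names; not the actual FusedMoE kernel or its weight format).
import torch
import torch.nn.functional as F

num_experts, hidden, ffn, top_k = 8, 16, 32, 2
ep_rank, ep_size = 0, 4
n_local = num_experts // ep_size

# "Fused" layout: all experts stacked in one tensor, then sliced per EP rank.
w1_full = torch.randn(num_experts, ffn, hidden)
w2_full = torch.randn(num_experts, hidden, ffn)
w1_local = w1_full[ep_rank * n_local:(ep_rank + 1) * n_local]
w2_local = w2_full[ep_rank * n_local:(ep_rank + 1) * n_local]

x = torch.randn(6, hidden)
topk_w, topk_ids = torch.topk(torch.softmax(torch.randn(6, num_experts), dim=-1), top_k, dim=-1)

# Apply only the experts owned by this rank; tokens routed to remote experts
# would be sent out (and their outputs received back) via an all_to_all.
out = torch.zeros_like(x)
for local_e in range(n_local):
    global_e = ep_rank * n_local + local_e
    tok, slot = torch.nonzero(topk_ids == global_e, as_tuple=True)
    if tok.numel():
        h = F.silu(x[tok] @ w1_local[local_e].T)
        out[tok] += topk_w[tok, slot, None] * (h @ w2_local[local_e].T)
# A second all_to_all (or a reduction) merges these partial results with the
# contributions computed on the other EP ranks.
print(out.shape)
```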

merrymercy commented 1 week ago

Related: #1970

liangzelang commented 1 week ago

> Related: #1970

@merrymercy I see that #1970 is mainly about TP and DP. I noticed that the SGLang Q4 roadmap (#1487) mentions supporting this feature.

zhyncs commented 5 days ago

@liangzelang DP has already been merged (only for DeepSeek right now) in https://github.com/sgl-project/sglang/pull/1970, and EP will be supported soon. cc @ispobock

xiaobochen123 commented 1 day ago

> @liangzelang DP has already been merged (only for DeepSeek right now) in #1970, and EP will be supported soon. cc @ispobock

@zhyncs Is there any support for MoE-EP yet? I have implemented MoE-EP.

ispobock commented 1 day ago

> Is there any support for MoE-EP yet? I have implemented MoE-EP.

@xiaobochen123 We are going to implement it with a DP + EP approach for throughput gains. Currently, DP attention is implemented. Before we start on EP, some updates to the MoE codebase need to be done.
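
To make the above concrete, here is a hedged, single-process illustration of how DP attention and EP MoE could fit together (my reading of the description, not SGLang's implementation; all names are made up): each DP rank runs attention on its own batch, a dispatch step sends every token to the EP rank owning its routed expert, the expert is applied locally, and a combine step returns the outputs.

```python
# Single-process simulation of a DP + EP flow (illustrative only).
import torch

world, hidden, num_experts = 2, 8, 4
experts_per_rank = num_experts // world

# Per-DP-rank batches after data-parallel attention.
batches = [torch.randn(3, hidden) for _ in range(world)]
# One routed expert per token, chosen at random here purely for illustration.
routed = [torch.randint(num_experts, (b.shape[0],)) for b in batches]

# "Dispatch": group each rank's tokens by the EP rank owning their expert;
# this grouping is what an all_to_all collective would actually move.
inbox = [[] for _ in range(world)]
for src in range(world):
    owner = routed[src] // experts_per_rank
    for dst in range(world):
        keep = owner == dst
        inbox[dst].append((src, batches[src][keep], routed[src][keep]))

# Each EP rank applies only its local experts to what it received. (All expert
# weights sit in one list here for simplicity; a real rank stores only its slice.)
experts = [torch.randn(hidden, hidden) for _ in range(num_experts)]
for dst, items in enumerate(inbox):
    for src, toks, ids in items:
        if toks.numel() == 0:
            continue
        out = torch.stack([toks[i] @ experts[int(e)] for i, e in enumerate(ids)])
        # "Combine": in the real system these outputs go back to DP rank `src`
        # via a second all_to_all and are scattered into its batch.
        print(f"EP rank {dst}: {toks.shape[0]} tokens from DP rank {src} -> {tuple(out.shape)}")
```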

I am interested in what kind of MoE-EP you implemented and which codebase you used. How large are the performance gains compared to TP?