sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Develop] Performance Improving Feature #1105

Closed yukavio closed 2 weeks ago

yukavio commented 3 months ago

I want to develop some features for SGLang to improve the performance of SRT.

  1. A new scheduler for ControllerMulti that more accurately identifies the resource utilization of each instance and dispatches incoming requests to processes with low utilization.
  2. SplitFuse, which enables decode tokens and extend (prefill) tokens to be computed in a single batch.
  3. Flexible request swapping. This feature allows a request to be transferred to another process for continued computation when the process it belongs to lacks sufficient resources to continue decoding, preventing the request from being stalled. The transfer would be implemented via KV cache swapping to avoid recomputation.
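
The resource-aware dispatch in (1) could be sketched roughly as below. The `Worker` fields, the load-score weights, and the `dispatch` function are illustrative assumptions, not SGLang internals:

```python
from dataclasses import dataclass

@dataclass
class Worker:
    """Hypothetical per-instance state a resource-aware router might track."""
    name: str
    num_running_reqs: int = 0
    kv_cache_usage: float = 0.0  # fraction of the KV cache pool in use, 0.0-1.0

    def load_score(self) -> float:
        # Weight KV cache pressure more heavily than raw request count,
        # since cache exhaustion is what actually stalls decoding.
        # The 0.7/0.3 split and the 64-request cap are arbitrary choices.
        return 0.7 * self.kv_cache_usage + 0.3 * min(self.num_running_reqs / 64, 1.0)

def dispatch(workers: list[Worker], request: dict) -> str:
    """Send the request to the instance with the lowest load score."""
    target = min(workers, key=lambda w: w.load_score())
    target.num_running_reqs += 1
    return target.name

workers = [Worker("gpu0", 10, 0.9), Worker("gpu1", 3, 0.2)]
print(dispatch(workers, {"prompt": "hello"}))  # gpu1 is least loaded
```

A real implementation would need the workers to report these counters back to the controller; the sketch only shows the selection policy.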

Looking forward to everyone's suggestions.😊

merrymercy commented 2 months ago

1 and 3 are interesting to us. 2 has been implemented here https://github.com/sgl-project/sglang/blob/5ff25cdf5b1310e83d9e595142b39ae4d7b561e9/python/sglang/srt/server_args.py#L426-L430, although there is still room for improvement.

Please join our Slack channel and we can discuss further there: https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw

yukavio commented 2 months ago


I have implemented plan 1 in this PR: https://github.com/sgl-project/sglang/pull/1142. I am considering setting aside plans 2 and 3 for now because I believe speculative decoding is a more crucial feature: it could significantly improve the throughput of the inference server. I am implementing speculative inference based on EAGLE-2 now. I will open a PR later and estimate that the initial version will be completed within two weeks.
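
For context, speculative decoding follows a draft-then-verify loop: a small draft model proposes several tokens, and the target model verifies them in one pass, accepting the longest matching prefix. A toy sketch of that loop, with made-up stand-in functions (`draft_next` and `target_next` are not SGLang or EAGLE APIs):

```python
def draft_next(prefix: list[int], k: int) -> list[int]:
    """Stand-in for the draft model: propose k candidate tokens (toy rule)."""
    return [(prefix[-1] + i + 1) % 100 for i in range(k)]

def target_next(prefix: list[int]) -> int:
    """Stand-in for the target model's greedy next token (toy rule)."""
    return (prefix[-1] + 1) % 100

def speculative_step(prefix: list[int], k: int = 4) -> list[int]:
    """Verify k drafted tokens against the target; keep the accepted run."""
    drafts = draft_next(prefix, k)
    accepted = []
    cur = list(prefix)
    for tok in drafts:
        expected = target_next(cur)
        if tok != expected:
            # First mismatch: discard the rest, take the target's own token.
            accepted.append(expected)
            break
        accepted.append(tok)
        cur.append(tok)
    else:
        # All drafts accepted; the same target pass also yields one bonus token.
        accepted.append(target_next(cur))
    return prefix + accepted

print(speculative_step([1]))  # [1, 2, 3, 4, 5, 6]
```

With these toy rules the draft always agrees with the target, so one step emits k+1 tokens for a single target pass; the speedup in practice depends on the real acceptance rate. EAGLE-2 additionally builds a dynamic draft tree rather than a single chain.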

zhyncs commented 2 months ago

Contributions are very welcome! https://arxiv.org/pdf/2406.16858

zhyncs commented 2 months ago

We very much welcome features that improve performance. Overall, we hope submitted PRs adhere to the following principles:

  1. If possible, provide profiling information (e.g., nsys traces) before and after the optimization.
  2. Provide a benchmark comparison before and after the optimization. If the change touches a lot of code or is very complex but the overall improvement is less than 10%, or even 5%, we might not consider merging it.
  3. Reuse existing components as much as possible.
  4. If you add new components or refactor existing ones, add corresponding unit tests. Thanks!

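
The throughput bar in principle 2 amounts to a simple relative-gain check. A minimal sketch, with hypothetical helper names and made-up numbers:

```python
def improvement(baseline_tps: float, optimized_tps: float) -> float:
    """Relative throughput gain of the optimized run over the baseline."""
    return (optimized_tps - baseline_tps) / baseline_tps

def worth_merging(baseline_tps: float, optimized_tps: float,
                  bar: float = 0.10) -> bool:
    """Apply the >=10% rule of thumb from the review guidelines."""
    return improvement(baseline_tps, optimized_tps) >= bar

# e.g. 4200 -> 4750 tok/s is a ~13% gain, above the 10% bar
print(worth_merging(4200.0, 4750.0))  # True
```

The throughput figures themselves would come from a benchmark run on identical hardware, prompts, and batch settings for both branches.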
github-actions[bot] commented 2 weeks ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.