1 and 3 are interesting to us. 2 has been implemented here: https://github.com/sgl-project/sglang/blob/5ff25cdf5b1310e83d9e595142b39ae4d7b561e9/python/sglang/srt/server_args.py#L426-L430, although there is still room for improvement.
Please join our Slack channel and we can have more discussions there: https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw
I have implemented plan 1 in this PR: https://github.com/sgl-project/sglang/pull/1142. I am considering temporarily setting aside plans 2 and 3, because I believe speculative decoding is a more crucial feature: it could significantly enhance the throughput of the inference server. I am now implementing speculative inference based on EAGLE-2. I will open a PR later; I estimate the initial version will be completed within two weeks.
Contributions are very welcome! EAGLE-2 paper: https://arxiv.org/pdf/2406.16858
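Since speculative decoding is the centerpiece of the plan above, here is a minimal sketch of the greedy draft-and-verify loop it relies on. The helpers `draft_next` and `target_greedy` are hypothetical stand-ins, not sglang or EAGLE-2 APIs, and EAGLE-2 additionally drafts a dynamic token tree, which this linear version omits:

```python
# Minimal sketch of greedy draft-and-verify speculative decoding.
# `draft_next` and `target_greedy` are hypothetical stand-ins, not
# sglang or EAGLE-2 APIs.

def speculative_decode(prompt, draft_next, target_greedy,
                       num_draft=4, max_new=64):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft: the small model proposes a short continuation.
        draft, ctx = [], list(tokens)
        for _ in range(num_draft):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify: one forward pass of the large model scores all
        #    num_draft + 1 positions after `tokens` in parallel.
        #    target_greedy(prefix, draft) -> list of num_draft + 1
        #    greedy tokens (an assumed contract for this sketch).
        verified = target_greedy(tokens, draft)
        # 3. Accept the longest prefix where draft and target agree,
        #    then take the target's own token at the first mismatch,
        #    so each iteration emits at least one target-quality token.
        accepted = 0
        while accepted < num_draft and draft[accepted] == verified[accepted]:
            accepted += 1
        tokens.extend(draft[:accepted])
        tokens.append(verified[accepted])
    return tokens
```

The throughput gain comes from step 2: verifying `num_draft` tokens costs roughly one large-model forward pass, so whenever drafts are accepted the large model effectively produces several tokens per pass instead of one.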
We very much welcome features that improve performance. Overall, we hope that submitted PRs adhere to the following principles:
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
I want to develop some features based on SGLang to improve the performance of srt.
… transfer would be implemented by KV cache swapping to avoid extra computation. Looking forward to everyone's suggestions. 😊
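For context, here is a minimal sketch of the KV cache swapping idea, assuming PyTorch; `KVSwapper` and its methods are hypothetical illustrations, not sglang APIs. Instead of evicting a preempted request's KV cache and paying the prefill cost again, the cache is copied to host memory and restored when the request resumes:

```python
# Hypothetical sketch of KV cache swapping (not sglang's actual API).
import torch

class KVSwapper:
    """Swap a request's KV cache between GPU and pinned host memory
    so a preempted request can resume without re-running prefill."""

    def __init__(self):
        self.cpu_store = {}  # request id -> KV tensor parked on the host

    def swap_out(self, req_id: str, gpu_kv: torch.Tensor) -> None:
        # Pinned host memory enables an async DMA copy off the GPU;
        # a real scheduler would synchronize the copy stream before
        # reusing the freed GPU blocks.
        host = torch.empty(gpu_kv.shape, dtype=gpu_kv.dtype,
                           device="cpu", pin_memory=True)
        host.copy_(gpu_kv, non_blocking=True)
        self.cpu_store[req_id] = host

    def swap_in(self, req_id: str, gpu_kv_out: torch.Tensor) -> None:
        # Restore the parked cache instead of recomputing it.
        gpu_kv_out.copy_(self.cpu_store.pop(req_id), non_blocking=True)
```

Compared with evict-and-recompute, swapping trades PCIe bandwidth for saved prefill FLOPs, which tends to pay off as contexts grow longer.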