stas00 opened this issue 2 days ago
Hi @stas00
First of all, thank you for your issue.
Your description raises several issues, which I will address here. If you have any questions, we can continue the discussion.
Regarding the figure in the README, you can refer to https://github.com/sgl-project/sglang/tree/main/benchmark/blog_v0_2, which gives a detailed description of the versions and reproduction methods. Regarding the performance improvement in vLLM v0.6, we have also run a benchmark, which you can find at https://github.com/sgl-project/sglang/tree/main/benchmark/benchmark_vllm_060. vLLM v0.6 has indeed improved significantly, but there are limitations: whether you enable or disable `--num-scheduler-steps` or `--multi-step-stream-outputs`, TTFT or ITL may worsen compared to the baseline while other metrics improve. You need to understand the tradeoff behind this rather than be drawn in by a single metric.
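For concreteness, a minimal sketch of the two vLLM launch variants being compared; the model name is an illustrative assumption, not taken from this thread:

```bash
# Illustrative only: baseline launch.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

# Multi-step scheduling: throughput typically improves, but TTFT/ITL may
# regress relative to the baseline (--multi-step-stream-outputs involves a
# similar streaming-granularity tradeoff).
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --num-scheduler-steps 8
```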
Meanwhile, there are some issues with the parameters in your SGLang benchmark. For your testing scenario you should use `--disable-radix` and `--enable-torch-compile`. Additionally, for an 8B model tp 1 is sufficient; there is no need to use tp 2, etc.
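As a sketch of such a launch line (the model path is assumed; note that in recent SGLang releases the full flag name is `--disable-radix-cache`):

```bash
# Illustrative only: single-GPU (tp 1) launch for an 8B model with the radix
# cache disabled and torch.compile enabled.
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tp 1 --disable-radix-cache --enable-torch-compile
```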
Overall, there are many aspects to consider when benchmarking. Both the configuration of the benchmark and the configuration of the server itself can significantly impact the results. We need to focus on the overall performance metrics rather than on isolated ones.
Thank you for your reply, Yineng.
Thank you for sharing the vllm==0.6.0 vs. sglang benchmark. This is great and fits right into the OP.
Your front page shows vLLM throughput as far worse than SGLang's, while the benchmark you shared shows vLLM only slightly worse, which is a very different situation. That's why I was suggesting a new visual is needed to show the updated reality.
Please note that the first results table I shared doesn't use `--num-scheduler-steps`, and I was comparing apples to apples: since the setup I had to benchmark was using tp=2, I had to benchmark SGLang with tp=2 as well.
But let's close the vLLM vs. SGLang discussion, as I wasn't seeking to provoke; I was just hoping for a fair representation of vLLM, which currently appears far inferior in the plot you published many months ago.
===============================
If I get the resources, my intention is to support multiple inference backends in our team's inference framework and to switch between them depending on which backend performs best in each particular use case, or offers better stability.
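As a rough sketch of the idea, with all names hypothetical rather than an existing framework: both servers expose OpenAI-compatible endpoints, so a thin registry is enough to pick a backend per use case.

```python
# Hypothetical sketch: route each use case to the backend that benchmarked
# best for it. Both vLLM and SGLang serve an OpenAI-compatible API.
from openai import OpenAI

BACKENDS = {
    "vllm": "http://localhost:8000/v1",     # vLLM's default port
    "sglang": "http://localhost:30000/v1",  # SGLang's default port
}

# Per-use-case preference, e.g. derived from offline benchmark results.
PREFERRED = {"offline_batch": "sglang", "chat": "vllm"}

def client_for(use_case: str) -> OpenAI:
    return OpenAI(base_url=BACKENDS[PREFERRED[use_case]], api_key="EMPTY")

resp = client_for("offline_batch").completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Hello",
    max_tokens=8,
)
print(resp.choices[0].text)
```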
Let's move on to how I can make SGLang shine. Thank you for the tip that I should add `--disable-radix` and `--enable-torch-compile`. Tomorrow is a holiday, so I'm looking forward to re-running the benchmarks on Tuesday.
And it sounds like a very low TTFT is one of the main objectives of SGLang, correct? We currently do mainly offline generation, so TTFT doesn't matter, but it will become hugely important later when we face end users; that's why I was benchmarking throughput. I'm excited to use SGLang for when very low TTFT is crucial.
One other thing I was puzzling over is how to do outlines-style structured JSON generation: with outlines I pass the target JSON schema and everything is taken care of automatically, whereas with SGLang it appears I need to create a regex manually. Is there any reason this can't be automated? I really liked your blog post about switching to prefill during structured generation, when the next few tokens are fixed and require no generation, and I wanted to try it out in practice. Though I'm starting to diverge here and should probably start a different thread on this one.
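For what it's worth, here is the kind of automation I had in mind, a sketch assuming outlines is installed (the import path of `build_regex_from_schema` varies across outlines versions):

```python
# Sketch: derive the regex from a JSON schema with outlines, then pass it to
# SGLang's gen(..., regex=...) instead of writing the regex by hand.
import json

import sglang as sgl
from outlines.fsm.json_schema import build_regex_from_schema  # path varies by version

schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
})

json_regex = build_regex_from_schema(schema)  # schema -> regex, automated

@sgl.function
def extract(s, text):
    s += "Extract the person as JSON: " + text + "\n"
    s += sgl.gen("person", max_tokens=128, regex=json_regex)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = extract.run(text="Alice is 30 years old.")
print(state["person"])
```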
Hi SGLang team,
I have just tried SGLang for the first time, and it was probably one of the easiest projects to set up and launch: it literally took me a few minutes to go from 0 to serving. Awesome!!! Thank you for making it so easy on the user.
I have just benchmarked vllm==0.6.2 vs sglang==0.3.2 on 2x H100 with an 8B Llama 3 model and tp=2, and I get vLLM slightly faster than SGLang, yet your benchmark section shows a very different picture. Would it be possible to re-benchmark, and could you tell me if I am missing some optimization flags needed to see the results you get? I'm just checking the baseline at the moment, so no quantization and such; I will get there a bit later. FWIW, vLLM had a massive throughput speedup in v0.6.2 over v0.5 (https://x.com/StasBekman/status/1844886291378470966), which is probably why the benchmark on your site needs a refresher.
Thank you!
Below are the stats and command lines so that the results are reproducible by others.
vllm==0.6.2 w/ normal
vllm==0.6.2 w/ --num-scheduler-steps 8
sglang==0.3.2
the servers
vllm:
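A plausible reconstruction (model name assumed, not the exact command used):

```bash
# Baseline; add --num-scheduler-steps 8 for the second variant above.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 2
```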
sglang:
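A plausible reconstruction (model name assumed, not the exact command used):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```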
the benchmark client
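A plausible reconstruction (prompt count assumed, not the exact command used); SGLang's serving benchmark can target either server via `--backend`:

```bash
python -m sglang.bench_serving --backend vllm --num-prompts 500
python -m sglang.bench_serving --backend sglang --num-prompts 500
```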