sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sglang.readthedocs.io/en/latest/
Apache License 2.0

is it time to rerun the benchmarks? #1639

Open stas00 opened 2 days ago

stas00 commented 2 days ago

Hi SGLang team,

I have just tried SGLang for the first time, and it was probably one of the easiest projects to set up and launch - it literally took me a few minutes to go from zero to serving. Awesome, and thank you for making it so easy on the user!

I have just benchmarked vllm==0.6.2 vs sglang==0.3.2 on 2x H100 with Llama 3 8B and tp=2, and I get vllm slightly faster than sglang, yet the benchmark section on your site shows a very different picture. Would it be possible to re-benchmark, and could you tell me whether I am missing some optimization flags needed to get the results you show? I'm only checking the baseline at the moment - no quantization and such - and will get there a bit later. FWIW, I have also just measured the massive throughput speedup vllm made in v0.6.2 over v0.5 (https://x.com/StasBekman/status/1844886291378470966), which is probably why the benchmark on your site needs a refresher.

Thank you!

Below are the stats and command lines so that it's reproducible by others.

vllm==0.6.2 (default settings, no --num-scheduler-steps)

============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  5.30
Total input tokens:                      12180
Total generated tokens:                  11255
Request throughput (req/s):              9.43
Output token throughput (tok/s):         2121.93
Total Token throughput (tok/s):          4418.26
---------------Time to First Token----------------
Mean TTFT (ms):                          367.92
Median TTFT (ms):                        375.01
P99 TTFT (ms):                           378.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.80
Median TPOT (ms):                        6.53
P99 TPOT (ms):                           6.75
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.54
Median ITL (ms):                         6.87
P99 ITL (ms):                            8.56
==================================================
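
As a sanity check, the headline throughput figures above are just the raw counters divided by the wall-clock duration; small differences versus the printed values come from rounding of the duration. For example:

# Re-derive the summary numbers of the run above from its raw counters.
requests, duration_s = 50, 5.30
input_tokens, output_tokens = 12_180, 11_255
print(round(requests / duration_s, 2))                         # ~9.43 req/s
print(round(output_tokens / duration_s, 2))                    # ~2123.6 output tok/s (printed: 2121.93)
print(round((input_tokens + output_tokens) / duration_s, 2))   # ~4421.7 total tok/s (printed: 4418.26)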

vllm==0.6.2 w/ --num-scheduler-steps 8

============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  4.44
Total input tokens:                      12180
Total generated tokens:                  11249
Request throughput (req/s):              11.27
Output token throughput (tok/s):         2535.33
Total Token throughput (tok/s):          5280.50
---------------Time to First Token----------------
Mean TTFT (ms):                          242.44
Median TTFT (ms):                        231.79
P99 TTFT (ms):                           279.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6.42
Median TPOT (ms):                        5.82
P99 TPOT (ms):                           12.75
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.78
Median ITL (ms):                         44.70
P99 ITL (ms):                            101.35
==================================================

sglang==0.3.2

============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  5.47
Total input tokens:                      12180
Total generated tokens:                  11514
Request throughput (req/s):              9.14
Output token throughput (tok/s):         2104.62
Total Token throughput (tok/s):          4330.98
---------------Time to First Token----------------
Mean TTFT (ms):                          240.62
Median TTFT (ms):                        242.29
P99 TTFT (ms):                           326.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.84
Median TPOT (ms):                        7.19
P99 TPOT (ms):                           26.74
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.04
Median ITL (ms):                         6.53
P99 ITL (ms):                            10.16
==================================================

the servers

vllm (drop --num-scheduler-steps 8 for the "normal" run):

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 9999 \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
    --dtype=bfloat16 \
    --seed 42 \
    --gpu_memory_utilization 0.8 \
    --num-scheduler-steps 8 \
    -tp 2

sglang:

python -m sglang.launch_server --port 9999 --tp 2  --model-path meta-llama/Meta-Llama-3-8B-Instruct

the benchmark client

git clone https://github.com/vllm-project/vllm
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
mkdir results
python benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --port 9999 \
    --save-result \
    --result-dir results \
    --result-filename test.json \
    --num-prompts 50 \
    --request-rate inf \
    --seed 42
zhyncs commented 2 days ago

Hi @stas00

First of all, thank you for your issue.

Your description reveals several issues, which I will point out here. If you have any questions, we can continue the discussion.

Regarding the figure in the README, you can refer to https://github.com/sgl-project/sglang/tree/main/benchmark/blog_v0_2, which provides a detailed description of the versions used and the reproduction steps. Regarding the performance improvement in vLLM v0.6, we have also run a benchmark, available at https://github.com/sgl-project/sglang/tree/main/benchmark/benchmark_vllm_060. vLLM v0.6 has indeed improved significantly, but there are limitations: whether --num-scheduler-steps or --multi-step-stream-outputs is enabled or disabled, TTFT or ITL may worsen relative to the baseline while other metrics improve. You need to understand the tradeoff behind this rather than being drawn in by a single metric.

Meanwhile, there are some issues with the parameters you used when benchmarking SGLang. For your testing scenario you should use --disable-radix-cache and --enable-torch-compile. Additionally, for an 8B model tp 1 is sufficient; there is no need to use tp 2.
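
For example, something along these lines (a sketch adapting your launch command above; flag names are as of sglang 0.3.x, so double-check against python -m sglang.launch_server --help):

python -m sglang.launch_server --port 9999 --tp 1 \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --disable-radix-cache --enable-torch-compile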

Overall, there are many aspects to consider with benchmarking. Both the benchmark configuration and the server configuration can significantly impact the results. We need to look at the overall performance picture rather than at individual metrics in isolation.

stas00 commented 9 hours ago

Thank you for your reply, Yineng.

Thank you for sharing the vllm==0.6.0 vs sglang benchmark. This is great and fits right into the OP.

Your front page shows vllm throughput as much, much worse than sglang's:

[Figure: Llama 3 8B throughput comparison from the README, https://lmsys.org/images/blog/sglang_llama3/8b_throughput.svg]

The benchmark you have shared shows that vllm is slightly worse, which is a very different situation. That's why I was suggesting a new visual is needed to show the updated reality.

Please note that the first results table I shared doesn't use --num-scheduler-steps, so I was comparing apples to apples. And since the setup I had to benchmark uses tp=2, I had to benchmark sglang with tp=2 as well.

But let's close out the vllm vs sglang comparison, as I wasn't seeking to provoke - I was just hoping for a fair representation of vllm, which currently appears to be very inferior in that plot you published many months ago.

===============================

If I get the resources, my intention is to support multiple inference backends in our team's inference framework and to switch between them depending on which backend performs better in each particular use case, or which offers better stability.

Let's move on to how I can make SGLang shine. Thank you for the tip that I should add --disable-radix-cache and --enable-torch-compile. Tomorrow is a holiday, so I'm looking forward to re-running the benchmarks on Tuesday.

And it sounds like very low TTFT is one of the main objectives of SGLang, correct? We currently do mainly offline generation, so TTFT doesn't matter much, but it will become hugely important later when we are facing end users; that's why I was benchmarking throughput here. I'm excited to use SGLang for the cases where very low TTFT is crucial.
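
To make concrete why I focus on throughput rather than TTFT for offline work, here is a back-of-the-envelope decomposition of mean request latency using the vllm "normal" numbers above (a rough approximation, nothing SGLang-specific):

ttft_ms, tpot_ms = 367.92, 5.80                       # mean TTFT / TPOT from the vllm "normal" run
avg_output_tokens = 11_255 / 50                       # ~225 generated tokens per request
print(ttft_ms + (avg_output_tokens - 1) * tpot_ms)    # ~1668 ms mean end-to-end request latency

For long generations TTFT is a small slice of the total, which is why it matters little offline but a lot once a user is watching tokens stream.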

One other thing I was puzzling over is how I could do outlines-style structured JSON generation: with outlines I pass the target JSON schema and everything is taken care of automatically, whereas with SGLang it appears I need to create the regex manually - is there any reason why this can't be automated? I really liked your blog post about switching to prefill during structured generation, when the next few tokens are fixed and require no generation, and I wanted to try it out in practice. Though I'm starting to diverge here and should probably open a separate issue for this.
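
To illustrate what I mean by "automated", here is a rough, hypothetical sketch - the helper below is something I'd write myself, not an existing SGLang or outlines API, and it only handles a flat schema of string/integer fields:

def json_schema_to_regex(schema: dict) -> str:
    # Hypothetical helper: turn a flat JSON schema (string/integer properties only)
    # into a regex that constrains generation to matching JSON objects.
    pieces = []
    for name, prop in schema["properties"].items():
        if prop["type"] == "string":
            value = r'"[^"]*"'
        elif prop["type"] == "integer":
            value = r"-?[0-9]+"
        else:
            raise NotImplementedError(prop["type"])
        pieces.append(f'"{name}": {value}')
    return r"\{" + r",\s*".join(pieces) + r"\}"

schema = {"type": "object",
          "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
print(json_schema_to_regex(schema))
# -> \{"name": "[^"]*",\s*"age": -?[0-9]+\}
# The resulting regex would then be passed to SGLang's regex-constrained generation
# instead of being written by hand.

Something along these lines is what I was hoping the server could do for me when handed a schema.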