sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

Performance issue comparing sglang to vllm. #169

Closed. findalexli closed this issue 8 months ago.

findalexli commented 9 months ago

Hi there, amazing work on RadixAttention and JSON-constrained decoding. I am running into some unexpected performance issues when comparing sglang and vllm. I use the latest pip release of vllm and a git-cloned sglang as of today.

Here is my command to launch sglang:

```bash
python -m sglang.launch_server --model-path NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --port 30000 --tp 8
```

Here is my command to launch vLLM:

```bash
python -m vllm.entrypoints.openai.api_server --model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --tensor-parallel-size 8
```

Both run in the same conda environment with CUDA 12.1, on 8x A10G on AWS. Here is the OpenAI-compatible curl request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant"},
      {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ]
  }'
```

The sglang server gives me 10 seconds of latency, while vllm gives 0.45 seconds. The numbers are reported after the first run to avoid any cold-start issue.
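For context, a minimal Python sketch of this kind of single-request latency check (not taken from the thread) might look as follows; it assumes the `requests` package, targets the same OpenAI-compatible endpoint, and discards an initial warm-up request:

```python
# Hypothetical latency check against an OpenAI-compatible /v1/chat/completions
# endpoint (adjust the port: 30000 for the sglang launch above, 8000 for vLLM).
import time
import requests

URL = "http://localhost:30000/v1/chat/completions"
payload = {
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
    ],
    "max_tokens": 100,
}

requests.post(URL, json=payload)  # warm-up request, not timed

latencies = []
for _ in range(5):
    start = time.perf_counter()
    requests.post(URL, json=payload)
    latencies.append(time.perf_counter() - start)

print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")
```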

comaniac commented 9 months ago

10 seconds looks weird. Is it consistently 10 seconds if you run the same request multiple times? And what is the log on the server side?

hnyls2002 commented 9 months ago

I also tried your case with 8x A10G on AWS and CUDA 12.2, running NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO on the latest main branch.

The script is

```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ],
    "max_tokens": 100
}'
```

[Screen recording attachment: Feb-09-2024 11-23-11]

It takes me less than one second to get the answer. There may be some unnoticed problem; please provide more details or the server-side output so we can help you better.

findalexli commented 9 months ago

Hi there, I just pulled the latest changes, and it is a lot faster now.

Here are the results.

Running:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant"},
      {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ]
  }'
```

I am getting 0.57 seconds (run 3 times), which is still a bit slower than the same curl command against vLLM, which sits at 0.45 seconds.

I also ran the following Python script:

```python
import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)
```

The result is consistently around 1.3 seconds, which is almost 3x slower than using curl.

All of the numbers above were measured over at least 5 runs.
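A sketch that times only the completion call, which would separate client construction from per-request latency, might look like this (assuming the same local server and the `openai` client used in the script above; it is not from the thread):

```python
import time

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
]

# Time only the completion call so import and client setup are excluded.
for i in range(5):
    start = time.perf_counter()
    client.chat.completions.create(
        model="default", messages=messages, temperature=0, max_tokens=64
    )
    print(f"run {i}: {time.perf_counter() - start:.2f}s")
```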

comaniac commented 8 months ago

Since your prompt is pretty short, this request likely cannot benefit much from RadixAttention. In this case, since vLLM enables CUDA graphs, it might be faster in terms of prefill computation. You can try a longer prompt (e.g., >500 tokens) to see if that is still the case.
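One way to try that (a sketch under the same local-server assumptions as above, not from the thread) is to pad the user message with arbitrary filler text until it is well over 500 tokens and re-run the same timing:

```python
# Hypothetical long-prompt variant of the benchmark: pad the user message
# with filler so prompt length, not decoding, dominates the request.
import time

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

filler = "The quick brown fox jumps over the lazy dog. " * 150  # roughly 1k+ tokens
long_prompt = filler + "Now list 3 countries and their capitals."

start = time.perf_counter()
client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": long_prompt},
    ],
    temperature=0,
    max_tokens=64,
)
print(f"long-prompt latency: {time.perf_counter() - start:.2f}s")
```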

merrymercy commented 8 months ago

@findalexli SGLang is mainly optimized for high-throughput large-batch serving, especially for requests with many shared prefixes. However, in your case, you benchmarked the latency of a single short prompt, which is not what SGLang is optimized for. To obtain more realistic results, you may want to run your own dataset with larger batch sizes.
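As a rough illustration of that kind of workload (a sketch under the same local-server assumptions as above, not the project's benchmark script), the snippet below fires many concurrent requests that share one long system prompt, which is the shared-prefix pattern RadixAttention can reuse:

```python
# Hypothetical shared-prefix, larger-batch benchmark: concurrent requests
# reuse the same long system prompt and are measured as throughput.
import time
from concurrent.futures import ThreadPoolExecutor

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

SHARED_SYSTEM = (
    "You are a helpful AI assistant. "
    + "Follow the instructions carefully and answer concisely. " * 100
)
questions = [f"List {n} countries and their capitals." for n in range(1, 33)]

def ask(question: str) -> None:
    client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": SHARED_SYSTEM},
            {"role": "user", "content": question},
        ],
        temperature=0,
        max_tokens=64,
    )

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(ask, questions))
elapsed = time.perf_counter() - start
print(f"{len(questions)} requests in {elapsed:.2f}s "
      f"({len(questions) / elapsed:.2f} req/s)")
```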

Another factor is that a recent PR in vLLM (https://github.com/vllm-project/vllm/pull/2542) introduced some fused kernels to improve MoE inference. We can bring them into our codebase as well. (https://github.com/sgl-project/sglang/issues/179)