findalexli closed this issue 8 months ago
10 seconds looks weird. Is it consistently 10 seconds if you run the same request multiple times? And what does the log on the server side show?
I also tried your case with 8x A10G on AWS and CUDA 12.2, running NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO on the latest main branch.
The script is:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant"},
{"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
],
"max_tokens": 100
}'
It takes me less than one second to get the answer. There may be some unnoticed problem; please provide more details or the server-side output so we can help you better.
Hi there, I just pulled the latest changes and it is a lot faster now. Here are the results.
Running:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant"},
{"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
]
}'
I am getting 0.57 seconds (run 3 times), which is still a bit slower than the same curl command against vLLM, which sits at 0.45 seconds.
I also ran the following Python script:
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)
The result is consistently around 1.3 seconds, which is almost 3 times slower than using curl.
All of the above numbers were measured over at least 5 runs.
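For reference, here is a minimal timing sketch (not part of the original report) that times both the openai client and a raw HTTP POST against the same endpoint, so client-library overhead can be separated from server latency. The URL and payload simply mirror the script above.

import time
import requests  # assumed to be available; any HTTP client would do
import openai

URL = "http://127.0.0.1:30000/v1"  # SGLang server address used in this thread
payload = {
    "model": "default",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
    ],
    "temperature": 0,
    "max_tokens": 64,
}

client = openai.Client(base_url=URL, api_key="EMPTY")

# Time the request through the openai client library.
t0 = time.perf_counter()
client.chat.completions.create(**payload)
t_client = time.perf_counter() - t0

# Time the same request as a plain POST, closer to what curl does.
t0 = time.perf_counter()
requests.post(f"{URL}/chat/completions", json=payload)
t_raw = time.perf_counter() - t0

print(f"openai client: {t_client:.2f}s, raw POST: {t_raw:.2f}s")

If the raw POST is close to the curl number while the client call is not, the gap is mostly client-side overhead rather than server latency.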
Since your prompt is pretty short, it is likely that this request cannot benefit from RadixAttention very much. In this case, since vLLM enables CUDA graph, it might be faster in terms of prefill computation. You can try a longer prompt (e.g., >500 tokens) to see if that is still the case.
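A rough sketch of that longer-prompt test, assuming the same server address and the "default" model alias used in the Python script above; the filler string is arbitrary padding to push the prompt past roughly 500 tokens.

import time
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Pad the user message so the prefill cost dominates the request.
long_prompt = "List 3 countries and their capitals. " + "Here is some filler context. " * 100

t0 = time.perf_counter()
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": long_prompt},
    ],
    temperature=0,
    max_tokens=64,
)
print(f"latency: {time.perf_counter() - t0:.2f}s")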
@findalexli SGLang is mainly optimized for high-throughput large-batch serving, especially for requests with many shared prefixes. However, in your case, you benchmarked the latency of a single short prompt, which is not what SGLang is optimized for. To obtain more realistic results, you may want to run your own dataset with larger batch sizes.
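As an illustration (not a script from this thread), a small sketch of what such a larger-batch benchmark could look like: many concurrent requests that share a system-prompt prefix, which is the workload RadixAttention targets. The batch size (128) and concurrency (32) are arbitrary choices.

import time
from concurrent.futures import ThreadPoolExecutor
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
SHARED_SYSTEM = "You are a helpful AI assistant"  # shared prefix across all requests

def one_request(i: int) -> float:
    # Send one chat completion and return its end-to-end latency in seconds.
    t0 = time.perf_counter()
    client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": SHARED_SYSTEM},
            {"role": "user", "content": f"List 3 countries and their capitals. (request {i})"},
        ],
        temperature=0,
        max_tokens=64,
    )
    return time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:
    latencies = list(pool.map(one_request, range(128)))
total = time.perf_counter() - t0
print(f"128 requests in {total:.1f}s, mean latency {sum(latencies) / len(latencies):.2f}s")

Throughput (requests per second) under this kind of load is the more meaningful comparison for SGLang than single-request latency.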
Another factor is that a recent PR in vLLM (https://github.com/vllm-project/vllm/pull/2542) introduced some fused kernels that improve MoE inference. We can bring those into our codebase as well. (https://github.com/sgl-project/sglang/issues/179)
Hi there, amazing work on RadixAttention and JSON-constrained decoding. I am running into an unexpected performance issue when comparing SGLang and vLLM. I am using the latest pip release of vLLM and SGLang cloned from git as of today.
Here is my command to launch SGLang:
python -m sglang.launch_server --model-path NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --port 30000 --tp 8
Here is my command to launch vLLM:
python -m vllm.entrypoints.openai.api_server --model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --tensor-parallel-size 8
Both are running in the same Conda environment with CUDA 12.1, on 8x A10G on AWS. Here is the OpenAI-compatible curl request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant"},
{"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
]
}'
The SGLang one is giving me 10 seconds of latency, while vLLM is giving 0.45 seconds. The numbers are reported after the first run to avoid any cold-start issue.