Open stas00 opened 1 month ago
@stas00 could you try also adding --multi-step-stream-outputs=False? This will mean tokens are streamed back in chunks of 8 (as was the case by default in 0.6.2, I think), rather than individually.
Thank you, @njhill. Indeed, adding --multi-step-stream-outputs=False brings it back into the 0.6.2 ballpark - thank you!
vllm==0.6.3.post1 w/ --multi-step-stream-outputs=False
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 --port 9999 \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
--dtype=bfloat16 \
--seed 42 \
--num-scheduler-steps 8 \
--multi-step-stream-outputs=False \
--disable-log-requests \
-tp 2
============ Serving Benchmark Result ============
Successful requests: 2000
Benchmark duration (s): 39.33
Total input tokens: 453502
Total generated tokens: 377187
Request throughput (req/s): 50.85
Output token throughput (tok/s): 9590.81
Total Token throughput (tok/s): 21122.10
---------------Time to First Token----------------
Mean TTFT (ms): 13810.20
Median TTFT (ms): 12929.04
P99 TTFT (ms): 28807.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 20.19
Median TPOT (ms): 19.50
P99 TPOT (ms): 73.49
---------------Inter-token Latency----------------
Mean ITL (ms): 144.42
Median ITL (ms): 145.11
P99 ITL (ms): 533.96
==================================================
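For reference, numbers like the above come from vLLM's serving benchmark script; a client invocation along these lines would produce them (the dataset, host, and request count below are assumptions for illustration, not necessarily the exact arguments used):

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --host localhost --port 9999 \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 2000 \
    --seed 42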
> Thank you, @njhill. Indeed, adding --multi-step-stream-outputs=False brings it back into the 0.6.2 ballpark - thank you!
> - How can users keep track of the default knobs being turned on/off between releases (other than reading each release's notes)?
I guess we should make this much clearer in the release notes; in general, though, I think we try not to change the default behaviour. This particular default change was actually the subject of some debate between myself and @mgoin :)
The hope is that we can make some more follow-on changes to hide/mitigate this additional overhead, but we just didn't manage to get to that in time for the release. The rationale for the change is to match the non-multistep streaming behaviour of returning one response message per output token. What's interesting, however, is that I would have expected the TTFT at least to have improved, yet it seems to have got worse as well. I will look into this more closely.
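To see the difference from the client side, here is a minimal sketch of a streaming completion request against an OpenAI-compatible vLLM server (the port and model follow the command above; the chunk granularity you observe depends on the server flags, and this is purely illustrative):

# Count how many streamed events a single completion produces. With per-token
# streaming (the new 0.6.3 default) you should see roughly one event per
# generated token; with --multi-step-stream-outputs=False events arrive in
# chunks of up to --num-scheduler-steps tokens.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9999/v1", api_key="EMPTY")

stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Explain multi-step scheduling in one sentence.",
    max_tokens=64,
    stream=True,
)

pieces = [chunk.choices[0].text for chunk in stream if chunk.choices]
print(f"received {len(pieces)} stream events")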
> - Do you plan to have a page where you discuss which optimization flags should be turned on/off or set to specific values for which use-cases?
I'm not sure that we have anything clear/comprehensive, partly because all of these optimizations are changing so fast and there are many permutations depending on model / GPU / topology, etc. But we definitely should have such a guide, imo.
Also, this isn't related to the version change, but the serving benchmark always uses response streaming, assuming it's supported by the server. If you are more interested in the non-interactive case, you may be able to improve performance further by disabling that in the test, i.e. changing "stream": True to False here.
Thank you for doing these experiments by the way!
Hmm, I wonder if perhaps at the very least you could maintain a single page with a few recipes for the common use cases? As you create new performance improvements you have to measure them anyway, so why not share that process with the users? Just an idea, of course.
To clarify - rather than asking for an exhaustive performance flags doc - just have something like:
best flags recipe for:
Surely there might be a few more common goals, but those 3 could probably be the best low-hanging fruit?
> Thank you for doing these experiments by the way!
The pleasure is all mine. I enjoy running benchmarks and finding ways to make things faster ;)
> Also, this isn't related to the version change, but the serving benchmark always uses response streaming, assuming it's supported by the server. If you are more interested in the non-interactive case, you may be able to improve performance further by disabling that in the test, i.e. changing "stream": True to False here.
It's probably more than that since changing the flag makes the util fail:
--- a/benchmarks/backend_request_func.py
+++ b/benchmarks/backend_request_func.py
@@ -238,7 +238,7 @@ async def async_request_openai_completions(
"best_of": request_func_input.best_of,
"max_tokens": request_func_input.output_len,
"logprobs": request_func_input.logprobs,
- "stream": True,
+ "stream": False,
"ignore_eos": request_func_input.ignore_eos,
}
headers = {
Traceback (most recent call last):
File "/data/stas/faster-ttft/core/dawn/exp/infer/faster-ttft/vllm/benchmarks/benchmark_serving.py", line 966, in <module>
main(args)
File "/data/stas/faster-ttft/core/dawn/exp/infer/faster-ttft/vllm/benchmarks/benchmark_serving.py", line 667, in main
benchmark_result = asyncio.run(
File "/env/lib/conda/stas-inference/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/env/lib/conda/stas-inference/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/data/stas/faster-ttft/core/dawn/exp/infer/faster-ttft/vllm/benchmarks/benchmark_serving.py", line 426, in benchmark
raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Traceback (most recent call last):
File "/data/stas/faster-ttft/core/dawn/exp/infer/faster-ttft/vllm/benchmarks/backend_request_func.py", line 291, in async_request_openai_completions
output.latency = latency
UnboundLocalError: local variable 'latency' referenced before assignment
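For context, the error is just the non-streaming code path never setting latency before output.latency = latency runs. A rough sketch of the kind of handling needed for "stream": False might look like this (a hypothetical helper, not the actual vLLM patch; it only mirrors the aiohttp style of backend_request_func.py):

# Hypothetical non-streaming request path: the whole completion arrives in one
# JSON body, so latency is measured around the single request/response rather
# than being accumulated per streamed chunk.
import time
import aiohttp

async def non_streaming_completion(session: aiohttp.ClientSession,
                                    api_url: str, payload: dict) -> tuple[str, float]:
    payload = {**payload, "stream": False}
    start = time.perf_counter()
    async with session.post(api_url, json=payload) as response:
        data = await response.json()
    latency = time.perf_counter() - start  # always set, even without chunks
    return data["choices"][0]["text"], latency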
Yes completely agree about the kind of page you're describing. I'll open an issue for this.
> It's probably more than that since changing the flag makes the util fail:
Ah, apologies. I'll check and get back to you; I don't expect much more would need changing. We can hopefully add it as a param to the script.
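(Purely as an illustration of what such a param could look like, with a made-up flag name, not an actual vLLM change:)

# Hypothetical benchmark option for disabling streaming (flag name is made up).
import argparse

parser = argparse.ArgumentParser(description="Serving benchmark options (sketch)")
parser.add_argument(
    "--disable-stream",
    action="store_true",
    help="Send non-streaming completion requests instead of streamed ones.",
)
args = parser.parse_args()
print(f"streaming requests: {not args.disable_stream}")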
Thank you for finally addressing this issue!
We really need a simple guide for achieving optimal speed. vLLM versions iterate quickly, but the updated features and their corresponding parameters are not clearly documented. Here are some parameters that seem related to performance, but as an average user with limited knowledge of model architecture, it's hard to understand how to configure them:
--enable-prefix-caching
--enable-chunked-prefill
--max-num-batched-tokens
--max-num-seqs
--max-seq-len-to-capture
--num-scheduler-steps
--multi-step-stream-outputs
I understand that different model architectures require different configurations and experimentation, but it would be incredibly helpful if there were at least a basic recipe for common models like Llama and Mistral that users could follow.
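Purely to illustrate the kind of recipe entry being asked for, a "high-throughput Llama 8B on 2 GPUs" entry could look like the command below; the values are just the configuration from earlier in this thread plus defaults, not tuned recommendations, and which flags can safely be combined varies by vLLM version:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dtype=bfloat16 \
    --num-scheduler-steps 8 \
    --multi-step-stream-outputs=False \
    --max-num-seqs 256 \
    --disable-log-requests \
    -tp 2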
+1 to having a recipe that makes it easier for users to figure out what settings work best under different types of load. Having this as part of the official vLLM docs would help eliminate new performance issues being raised, like #9383 and #9722.
cc @simon-mo @mgoin
Report of performance regression: using your benchmark, comparing vllm==0.6.2 and vllm==0.6.3.post1 on 2x H100s. Thanks.