Open stas00 opened 1 month ago
@stas00 could you try also adding --multi-step-stream-outputs=False? This will mean tokens are streamed back in chunks of 8 (as was the case by default in 0.6.2, I think), rather than individually.
Thank you, @njhill. Indeed, adding --multi-step-stream-outputs=False brings it back into the 0.6.2 ballpark - thank you!
vllm==0.6.3.post1 w/ --multi-step-stream-outputs=False
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 --port 9999 \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
--dtype=bfloat16 \
--seed 42 \
--num-scheduler-steps 8 \
--multi-step-stream-outputs=False \
--disable-log-requests \
-tp 2
============ Serving Benchmark Result ============
Successful requests: 2000
Benchmark duration (s): 39.33
Total input tokens: 453502
Total generated tokens: 377187
Request throughput (req/s): 50.85
Output token throughput (tok/s): 9590.81
Total Token throughput (tok/s): 21122.10
---------------Time to First Token----------------
Mean TTFT (ms): 13810.20
Median TTFT (ms): 12929.04
P99 TTFT (ms): 28807.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 20.19
Median TPOT (ms): 19.50
P99 TPOT (ms): 73.49
---------------Inter-token Latency----------------
Mean ITL (ms): 144.42
Median ITL (ms): 145.11
P99 ITL (ms): 533.96
==================================================
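For reference, numbers like the above come from vLLM's serving benchmark script; a client invocation along these lines would produce them (the dataset, host, and request count below are assumptions for illustration, not necessarily the exact arguments used):

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --host localhost --port 9999 \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 2000 \
    --seed 42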
> Thank you, @njhill. Indeed, adding --multi-step-stream-outputs=False brings it back into the 0.6.2 ballpark - thank you!
> - How can users keep track of the default knobs being turned on/off between releases (other than reading each release's notes)?
I guess we should make this much clearer in the release notes; in general, though, I think we try not to change the default behaviour. This particular default change was actually the subject of some debate between myself and @mgoin :)
The hope is that we can make some more follow-on changes to hide/mitigate this additional overhead, but we just didn't manage to get to that in time for the release. The rationale for the change is to match the non-multistep streaming behaviour of returning one response message per output token. What's interesting, however, is that I would have expected the TTFT at least to have improved, yet it seems to have got worse as well. I will look into this more closely.
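To see the difference from the client side, here is a minimal sketch of a streaming completion request against an OpenAI-compatible vLLM server (the port and model follow the command above; the chunk granularity you observe depends on the server flags, and this is purely illustrative):

# Count how many streamed events a single completion produces. With per-token
# streaming (the new 0.6.3 default) you should see roughly one event per
# generated token; with --multi-step-stream-outputs=False events arrive in
# chunks of up to --num-scheduler-steps tokens.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9999/v1", api_key="EMPTY")

stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Explain multi-step scheduling in one sentence.",
    max_tokens=64,
    stream=True,
)

pieces = [chunk.choices[0].text for chunk in stream if chunk.choices]
print(f"received {len(pieces)} stream events")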
> - Do you plan to have a page where you discuss which optimization flags should be turned on/off or set to specific values for which use-cases?
I'm not sure that we have anything clear/comprehensive, partly because all of these optimizations are changing so fast and there are many permutations depending on model / GPU / topology, etc. But we definitely should have such a guide, imo.
Also, this isn't related to the version change, but the serving benchmark always uses response streaming, assuming it's supported by the server. If you are more interested in the non-interactive case, you may be able to improve performance further by disabling that in the test, i.e. changing "stream": True to False here.
Thank you for doing these experiments by the way!
Hmm, I wonder if perhaps at the very least you could maintain a single page with a few recipes for the common use cases? As you create new performance improvements you have to measure them anyway, so why not share that process with the users? Just an idea, of course.
To clarify - rather than asking for an exhaustive performance flags doc - just have something like:
best flags recipe for:
Surely there might be a few more common goals, but those 3 could probably be the best low-hanging fruit?
> Thank you for doing these experiments by the way!
The pleasure is all mine. I enjoy running benchmarks and finding ways to make things faster ;)
> Also, this isn't related to the version change, but the serving benchmark always uses response streaming, assuming it's supported by the server. If you are more interested in the non-interactive case, you may be able to improve performance further by disabling that in the test, i.e. changing "stream": True to False here.
It's probably more than that since changing the flag makes the util fail:
--- a/benchmarks/backend_request_func.py
+++ b/benchmarks/backend_request_func.py
@@ -238,7 +238,7 @@ async def async_request_openai_completions(
"best_of": request_func_input.best_of,
"max_tokens": request_func_input.output_len,
"logprobs": request_func_input.logprobs,
- "stream": True,
+ "stream": False,
"ignore_eos": request_func_input.ignore_eos,
}
headers = {
Traceback (most recent call last):
File "/data/stas/faster-ttft/core/dawn/exp/infer/faster-ttft/vllm/benchmarks/benchmark_serving.py", line 966, in <module>
main(args)
File "/data/stas/faster-ttft/core/dawn/exp/infer/faster-ttft/vllm/benchmarks/benchmark_serving.py", line 667, in main
benchmark_result = asyncio.run(
File "/env/lib/conda/stas-inference/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/env/lib/conda/stas-inference/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/data/stas/faster-ttft/core/dawn/exp/infer/faster-ttft/vllm/benchmarks/benchmark_serving.py", line 426, in benchmark
raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Traceback (most recent call last):
File "/data/stas/faster-ttft/core/dawn/exp/infer/faster-ttft/vllm/benchmarks/backend_request_func.py", line 291, in async_request_openai_completions
output.latency = latency
UnboundLocalError: local variable 'latency' referenced before assignment
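For context, the error is just the non-streaming code path never setting latency before output.latency = latency runs. A rough sketch of the kind of handling needed for "stream": False might look like this (a hypothetical helper, not the actual vLLM patch; it only mirrors the aiohttp style of backend_request_func.py):

# Hypothetical non-streaming request path: the whole completion arrives in one
# JSON body, so latency is measured around the single request/response rather
# than being accumulated per streamed chunk.
import time
import aiohttp

async def non_streaming_completion(session: aiohttp.ClientSession,
                                    api_url: str, payload: dict) -> tuple[str, float]:
    payload = {**payload, "stream": False}
    start = time.perf_counter()
    async with session.post(api_url, json=payload) as response:
        data = await response.json()
    latency = time.perf_counter() - start  # always set, even without chunks
    return data["choices"][0]["text"], latency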
Yes completely agree about the kind of page you're describing. I'll open an issue for this.
> It's probably more than that since changing the flag makes the util fail:
Ah, apologies. I'll check and get back to you; I don't expect much more would need changing. We can hopefully add it as a param to the script.
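(Purely as an illustration of what such a param could look like, with a made-up flag name, not an actual vLLM change:)

# Hypothetical benchmark option for disabling streaming (flag name is made up).
import argparse

parser = argparse.ArgumentParser(description="Serving benchmark options (sketch)")
parser.add_argument(
    "--disable-stream",
    action="store_true",
    help="Send non-streaming completion requests instead of streamed ones.",
)
args = parser.parse_args()
print(f"streaming requests: {not args.disable_stream}")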
Thank you for finally addressing this issue!
We really need a simple guide for achieving optimal speed. vLLM versions iterate quickly, but the updated features and their corresponding parameters are not clearly documented. Here are some parameters that seem related to performance, but as an average user with limited knowledge of model architecture, it's hard to understand how to configure them:
--enable-prefix-caching
--enable-chunked-prefill
--max-num-batched-tokens
--max-num-seqs
--max-seq-len-to-capture
--num-scheduler-steps
--multi-step-stream-outputs
I understand that different model architectures require different configurations and experimentation, but it would be incredibly helpful if there were at least a basic recipe for common models like Llama and Mistral that users could follow.
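Purely to illustrate the kind of recipe entry being asked for, a "high-throughput Llama 8B on 2 GPUs" entry could look like the command below; the values are just the configuration from earlier in this thread plus defaults, not tuned recommendations, and which flags can safely be combined varies by vLLM version:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dtype=bfloat16 \
    --num-scheduler-steps 8 \
    --multi-step-stream-outputs=False \
    --max-num-seqs 256 \
    --disable-log-requests \
    -tp 2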
+1 to having a recipe that makes it easier for users to figure out what settings work best under different types of load. Having this as part of the official vLLM docs would help eliminate new performance issues being raised, like #9383 and #9722.
cc @simon-mo @mgoin
Report of performance regression: using your benchmark, comparing vllm==0.6.2 and vllm==0.6.3.post1 on 2x H100s. Thanks.