Jeffwan opened this issue 3 months ago (status: Open)
Observed similar results in my experiments. It seems like TPOT is calculated with the final "[Done]" latency included, whereas ITL does not include that final latency, as shown here. Would like some more explanation on the difference between these metrics.
Can confirm in my experiments, especially for the marlin24 model: ITL is much lower than TPOT, while TPOT for marlin24 is much higher than for the normal GPTQ model with the Marlin kernel.
Check the code here: https://github.com/vllm-project/vllm/blob/a2469127db6144eedb38d0b505287c0044e4ce06/benchmarks/benchmark_serving.py#L271
The output length used in the TPOT calculation is based on the tokenized length of the generated text instead of the real number of output tokens from the model. If the output is wrong, the tokenized length can be significantly smaller than the real output token count; in the normal case they are close. If these two output lengths are the same, then TPOT is equal to ITL.
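For illustration, here's a minimal sketch (tokenizer choice and all numbers are made up, not the actual benchmark code) of how re-tokenizing the generated text can undercount the tokens the server actually produced and inflate TPOT:

```python
# Minimal sketch of the mismatch described above (illustrative only, not the
# actual benchmark code; tokenizer choice and numbers are hypothetical).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical tokenizer

generated_text = "!!!!!!!!"          # degenerate output from a broken quantized model
num_decode_steps = 16                # tokens the server actually generated (hypothetical)
e2e_latency, ttft = 0.50, 0.05       # seconds (hypothetical)

# Re-tokenizing the text can yield far fewer tokens than the decode steps run.
retokenized_len = len(tokenizer(generated_text).input_ids)

tpot_from_retokenized = (e2e_latency - ttft) / retokenized_len   # inflated
tpot_from_real_count = (e2e_latency - ttft) / num_decode_steps   # closer to the ITL gaps

print(retokenized_len, tpot_from_retokenized, tpot_from_real_count)
```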
> inter-token latency takes TTFT
@Jeffwan This is no longer the case and has been fixed by #7372.
The reason why we use a separate calculation for TPOT is that ITL is not always a reliable measure of actual decoding performance: for certain backends/mechanisms, multiple tokens can be bundled into one server-sent event (SSE). Therefore we use TPOT = (end-to-end latency - TTFT) / len(generated output token ids) as a proxy.
One other thing worth noting: TPOT is a per-request metric and ITL is a per-SSE metric.
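To make the distinction concrete, here's a rough sketch (hypothetical timestamps and field names, not the benchmark's actual data structures) of how the two metrics are derived from the same request:

```python
# Rough sketch of the per-request vs. per-SSE distinction (hypothetical
# timestamps and field names, not the benchmark's actual data structures).
from dataclasses import dataclass
from typing import List

@dataclass
class RequestRecord:
    start: float               # time the request was sent
    chunk_times: List[float]   # arrival time of each SSE chunk
    num_output_tokens: int     # total tokens generated for the request

def request_metrics(req: RequestRecord):
    ttft = req.chunk_times[0] - req.start
    e2e = req.chunk_times[-1] - req.start
    # TPOT: a single per-request value averaged over all generated tokens.
    tpot = (e2e - ttft) / req.num_output_tokens
    # ITL: one value per SSE gap; a single gap may cover several bundled tokens.
    itl = [b - a for a, b in zip(req.chunk_times, req.chunk_times[1:])]
    return ttft, tpot, itl

# 4 SSE chunks carrying 8 tokens total: ITL gaps are 0.1 s, TPOT is ~0.0375 s.
req = RequestRecord(start=0.0, chunk_times=[0.05, 0.15, 0.25, 0.35], num_output_tokens=8)
print(request_metrics(req))
```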
IMHO the way we are defining ITL here is not very useful and potentially confusing. I think we should report only TTFT and TPOT (in other cases ITL is a synonym for TPOT).
It's mostly irrelevant if we return n tokens per SSE, since if it takes e.g. 100ms to return 5 tokens you can just make use of one every 20ms after you receive them. The initial cost of waiting for all 5 is already captured in the TTFT time.
> IMHO the way we are defining ITL here is not very useful and potentially confusing. I think we should report only TTFT and TPOT (in other cases ITL is a synonym for TPOT).
> It's mostly irrelevant if we return n tokens per SSE, since if it takes e.g. 100ms to return 5 tokens you can just make use of one every 20ms after you receive them. The initial cost of waiting for all 5 is already captured in the TTFT time.
@njhill That's very true - @tlrmchlsmth and I agreed on adding this metric to the benchmark back when vLLM strictly followed a 1-token-per-SSE protocol. I do think ITL is still worth keeping if we go back to that protocol in the future, just so we can get a sense of what the distribution of all decoding operations in a given setup looks like.
@njhill, @ywang96 what do you think about renaming ITL (inter-packet latency?), and hiding it behind an option?
I think it's still a good QoS metric to keep track of: jittery output generation is going to be a worse user experience than output tokens generated at a constant rate, and the fact that we currently return N tokens per multistep rather than one token at a time is a tradeoff that's worth exposing in our benchmarking scripts. I agree that this is definitely confusing and less important than TPOT.
I understand that the current naming of ITL might be causing some confusion. However, interpreting ITL as the inter-packet latency seems to contradict the problem reported here. If the ITL measured here represented inter-packet latency, then TPOT should always be less than or equal to ITL, with equality occurring only when single-step postprocessing is applied. This issue suggests the opposite, i.e. ITL is reported as smaller than TPOT, which indicates there may be a misunderstanding or an underlying issue in the benchmark script worth investigating further.
> ITL is reported as smaller than TPOT
@hyhuang00 yea that's indeed a good point.
The only possibility I can think of for this is when the model doesn't generate anything except special tokens (EOS, for example), so the generated text is empty but there are still ITL records for it. (Here output[i].itl is a List[float].)
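A contrived example of that edge case (all numbers hypothetical):

```python
# Contrived example of the edge case above (all numbers hypothetical). The
# model emits only special tokens, so the detokenized text is (nearly) empty,
# the re-tokenized output length collapses, and TPOT is averaged over far
# fewer tokens than the number of SSE gaps recorded in output[i].itl.
itl = [0.02, 0.02, 0.02, 0.02]      # per-SSE gaps recorded for the request
e2e_latency, ttft = 0.15, 0.05      # seconds
retokenized_output_len = 1          # near-empty text after stripping special tokens

tpot = (e2e_latency - ttft) / retokenized_output_len   # 0.10 s/token
mean_itl = sum(itl) / len(itl)                          # 0.02 s
print(tpot, mean_itl)   # TPOT comes out much larger than the recorded ITL
```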
I think jitter is reasonable to measure/report somehow. But it's only relevant imo when it's uneven - in the case of multistep where we have exactly N tokens per response, this shouldn't be considered jitter imo, since the extra delay is captured in TTFT, and you could just evenly space these N tokens over the time between responses (that could be done client-side too).
If we have some other metric for it, I think we should call it something completely different, like MTBR "mean time between responses" or MTBOC "mean time between output chunks", to avoid confusion with other ITL/TPOT perf metrics.
Actually, maybe the variance rather than the mean would make more sense for this purpose...
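Something along these lines, perhaps (a sketch only; the mtboc/jitter names are made up here, not existing vLLM metrics):

```python
# Sketch of a "time between output chunks" metric (names are made up here,
# not existing vLLM metrics).
import statistics
from typing import List

def chunk_gap_stats(chunk_times: List[float]):
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    mtboc = statistics.mean(gaps)      # mean time between output chunks
    jitter = statistics.pstdev(gaps)   # spread of the gaps; 0 means perfectly even pacing
    return mtboc, jitter

# Evenly paced chunks vs. bursty chunks with the same mean gap.
print(chunk_gap_stats([0.0, 0.1, 0.2, 0.3, 0.4]))    # (0.1, 0.0)
print(chunk_gap_stats([0.0, 0.05, 0.25, 0.3, 0.4]))  # (0.1, ~0.061)
```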
I see what you meant about it being captured in the TTFT now -- didn't understand that before but I agree that makes sense. BTW, isn't it partially captured in the TPOT as well, since you might wait an extra few steps after your final token?
I can throw up a PR to do the name change, and we can discuss further in there. Sounds good?
Thanks @tlrmchlsmth
> isn't it partially captured in the TPOT as well, since you might wait an extra few steps after your final token?
Yes that's true, but I guess for larger numbers of tokens the amortized difference per token would be quite small.
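A quick back-of-the-envelope check of that (hypothetical numbers):

```python
# Back-of-the-envelope check of the amortization argument above (all numbers
# hypothetical). With N tokens per multistep chunk, the final token may wait
# up to N-1 extra steps before being returned, which inflates e2e latency and
# hence TPOT by at most (N-1) * step_time spread over all generated tokens.
step_time = 0.02     # seconds per decode step
n_per_chunk = 8      # multistep chunk size
num_tokens = 512     # tokens generated by the request

max_extra_wait = (n_per_chunk - 1) * step_time   # 0.14 s worst case
tpot_inflation = max_extra_wait / num_tokens     # ~0.00027 s per token
print(tpot_inflation / step_time)                # ~1.4% of the true step time
```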
Your current environment
v0.5.2. The vLLM env is not relevant to this issue, so I will just skip the collection process.
🐛 Describe the bug
I am running benchmark tests and noticed one potential problem.
It seems the inter-token latency is lower than TPOT. Basically, inter-token latency takes TTFT into consideration and should be higher than TPOT. However, the data shows a different result. I have not looked at the code yet; I will try to figure this out.