Jeffwan opened this issue 3 months ago (status: Open)
Observed similar results in my experiments. It seems like TPOT is calculated with the final "[Done]" latency included, whereas ITL does not include that final latency, as shown here. Would like some more explanation on the difference between these metrics.
Can confirm in my experiments, especially for the marlin24 model: ITL is much lower than TPOT, while TPOT for marlin24 is much higher than for the normal GPTQ model with the Marlin kernel.
Check the code here: https://github.com/vllm-project/vllm/blob/a2469127db6144eedb38d0b505287c0044e4ce06/benchmarks/benchmark_serving.py#L271
The output length used in the TPOT calculation is based on the tokenized length of the generated text instead of the real number of output tokens from the model. If the output is wrong, the tokenized length can be significantly smaller than the real output token count; in the normal case they are close. If these two output lengths are the same, then TPOT is equal to ITL.
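For illustration, here's a minimal sketch (tokenizer choice and all numbers are made up, not the actual benchmark code) of how re-tokenizing the generated text can undercount the tokens the server actually produced and inflate TPOT:

```python
# Minimal sketch of the mismatch described above (illustrative only, not the
# actual benchmark code; tokenizer choice and numbers are hypothetical).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical tokenizer

generated_text = "!!!!!!!!"          # degenerate output from a broken quantized model
num_decode_steps = 16                # tokens the server actually generated (hypothetical)
e2e_latency, ttft = 0.50, 0.05       # seconds (hypothetical)

# Re-tokenizing the text can yield far fewer tokens than the decode steps run.
retokenized_len = len(tokenizer(generated_text).input_ids)

tpot_from_retokenized = (e2e_latency - ttft) / retokenized_len   # inflated
tpot_from_real_count = (e2e_latency - ttft) / num_decode_steps   # closer to the ITL gaps

print(retokenized_len, tpot_from_retokenized, tpot_from_real_count)
```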
> inter-token latency takes TTFT
@Jeffwan This is no longer the case and has been fixed by #7372.
The reason why we use a separate calculation for TPOT is that ITL is not always a reliable measure of actual decoding performance: for certain backends/mechanisms, multiple tokens can be bundled into one server-sent event (SSE). Therefore we use TPOT = (end-to-end latency - TTFT) / len(generated output token ids) as a proxy.
One other thing worth noting: TPOT is a per-request metric and ITL is a per-SSE metric.
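To make the distinction concrete, here's a rough sketch (hypothetical timestamps and field names, not the benchmark's actual data structures) of how the two metrics are derived from the same request:

```python
# Rough sketch of the per-request vs. per-SSE distinction (hypothetical
# timestamps and field names, not the benchmark's actual data structures).
from dataclasses import dataclass
from typing import List

@dataclass
class RequestRecord:
    start: float               # time the request was sent
    chunk_times: List[float]   # arrival time of each SSE chunk
    num_output_tokens: int     # total tokens generated for the request

def request_metrics(req: RequestRecord):
    ttft = req.chunk_times[0] - req.start
    e2e = req.chunk_times[-1] - req.start
    # TPOT: a single per-request value averaged over all generated tokens.
    tpot = (e2e - ttft) / req.num_output_tokens
    # ITL: one value per SSE gap; a single gap may cover several bundled tokens.
    itl = [b - a for a, b in zip(req.chunk_times, req.chunk_times[1:])]
    return ttft, tpot, itl

# 4 SSE chunks carrying 8 tokens total: ITL gaps are 0.1 s, TPOT is ~0.0375 s.
req = RequestRecord(start=0.0, chunk_times=[0.05, 0.15, 0.25, 0.35], num_output_tokens=8)
print(request_metrics(req))
```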
IMHO the way we are defining ITL here is not very useful and potentially confusing. I think we should report only TTFT and TPOT (in other cases ITL is a synonym for TPOT).
It's mostly irrelevant if we return n tokens per SSE, since if it takes e.g. 100ms to return 5 tokens you can just make use of one every 20ms after you receive them. The initial cost of waiting for all 5 is already captured in the TTFT time.
> IMHO the way we are defining ITL here is not very useful and potentially confusing. I think we should report only TTFT and TPOT (in other cases ITL is a synonym for TPOT).
> It's mostly irrelevant if we return n tokens per SSE, since if it takes e.g. 100ms to return 5 tokens you can just make use of one every 20ms after you receive them. The initial cost of waiting for all 5 is already captured in the TTFT time.
@njhill That's very true - @tlrmchlsmth and I agreed on adding this metric to the benchmark back when vLLM strictly followed a 1-token-per-SSE protocol. I do think ITL is still worth keeping if we go back to that protocol in the future, just so we can get a sense of what the distribution of all decoding operations in a given setup looks like.
@njhill, @ywang96 what do you think about renaming ITL (inter-packet latency?), and hiding it behind an option?
I think it's still a good QoS metric to keep track of: jittery output generation is going to be a worse user experience than output tokens generated at a constant rate, and the fact that we currently return N tokens per multistep rather than one token at a time is a tradeoff that's worth exposing in our benchmarking scripts. I agree that this is definitely confusing and less important than TPOT.
I understand that the current naming of ITL might be causing some confusion. However, interpreting ITL as the inter-packet latency seems to contradict the problem reported here. If the ITL measured here represented inter-packet latency, then TPOT should always be less than or equal to ITL, with equality occurring only when single-step postprocessing is applied. This issue suggests the opposite, i.e. ITL is reported as smaller than TPOT, which indicates there may be a misunderstanding or an underlying issue in the benchmark script worth investigating further.
> ITL is reported as smaller than TPOT
@hyhuang00 yea that's indeed a good point.
The only possibility I can think of for this is when the model doesn't generate anything except special tokens (EOS, for example), so the generated text is empty but there are still ITL records for it. (Here output[i].itl is a List[float].)
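A contrived example of that edge case (all numbers hypothetical):

```python
# Contrived example of the edge case above (all numbers hypothetical). The
# model emits only special tokens, so the detokenized text is (nearly) empty,
# the re-tokenized output length collapses, and TPOT is averaged over far
# fewer tokens than the number of SSE gaps recorded in output[i].itl.
itl = [0.02, 0.02, 0.02, 0.02]      # per-SSE gaps recorded for the request
e2e_latency, ttft = 0.15, 0.05      # seconds
retokenized_output_len = 1          # near-empty text after stripping special tokens

tpot = (e2e_latency - ttft) / retokenized_output_len   # 0.10 s/token
mean_itl = sum(itl) / len(itl)                          # 0.02 s
print(tpot, mean_itl)   # TPOT comes out much larger than the recorded ITL
```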
I think jitter is reasonable to measure/report somehow. But it's only relevant imo when it's uneven - in the case of multistep where we have exactly N tokens per response, this shouldn't be considered jitter imo, since the extra delay is captured in TTFT, and you could just evenly space these N tokens over the time between responses (that could be done client-side too).
If we have some other metric for it, I think we should call it something completely different, like MTBR "mean time between responses" or MTBOC "mean time between output chunks", to avoid confusion with other ITL/TPOT perf metrics.
Actually, maybe the variance rather than the mean would make more sense for this purpose...
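Something along these lines, perhaps (a sketch only; the mtboc/jitter names are made up here, not existing vLLM metrics):

```python
# Sketch of a "time between output chunks" metric (names are made up here,
# not existing vLLM metrics).
import statistics
from typing import List

def chunk_gap_stats(chunk_times: List[float]):
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    mtboc = statistics.mean(gaps)      # mean time between output chunks
    jitter = statistics.pstdev(gaps)   # spread of the gaps; 0 means perfectly even pacing
    return mtboc, jitter

# Evenly paced chunks vs. bursty chunks with the same mean gap.
print(chunk_gap_stats([0.0, 0.1, 0.2, 0.3, 0.4]))    # (0.1, 0.0)
print(chunk_gap_stats([0.0, 0.05, 0.25, 0.3, 0.4]))  # (0.1, ~0.061)
```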
I see what you meant about it being captured in the TTFT now -- didn't understand that before but I agree that makes sense. BTW, isn't it partially captured in the TPOT as well, since you might wait an extra few steps after your final token?
I can throw up a PR to do the name change, and we can discuss further in there. Sounds good?
Thanks @tlrmchlsmth
> isn't it partially captured in the TPOT as well, since you might wait an extra few steps after your final token?
Yes that's true, but I guess for larger numbers of tokens the amortized difference per token would be quite small.
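A quick back-of-the-envelope check of that (hypothetical numbers):

```python
# Back-of-the-envelope check of the amortization argument above (all numbers
# hypothetical). With N tokens per multistep chunk, the final token may wait
# up to N-1 extra steps before being returned, which inflates e2e latency and
# hence TPOT by at most (N-1) * step_time spread over all generated tokens.
step_time = 0.02     # seconds per decode step
n_per_chunk = 8      # multistep chunk size
num_tokens = 512     # tokens generated by the request

max_extra_wait = (n_per_chunk - 1) * step_time   # 0.14 s worst case
tpot_inflation = max_extra_wait / num_tokens     # ~0.00027 s per token
print(tpot_inflation / step_time)                # ~1.4% of the true step time
```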
Your current environment
v0.5.2. The vLLM env is not relevant to this issue, so I will just skip the collection process.
🐛 Describe the bug
I am running benchmark tests and noticed one potential problem.
It seems the inter-token latency is lower than TPOT. Basically, inter-token latency takes TTFT into consideration and should be higher than TPOT. However, the data shows a different result. I have not looked at the code yet; I will try to figure this out.