Open Msiavashi opened 1 month ago
So, there is still no embedded mechanism for these measurements/profiling, right?
Stream mode should report each token's latency, so the prefill and decode phases could be measured separately. Since the current benchmark uses sync mode, another workaround to consider is:
- measure latency for input_len=1000, output_len=1, which gives the prefill latency for input_len=1000
- measure latency for input_len=1, output_len=1 to get latency A, then for input_len=1, output_len=1000 to get latency B; (B-A)/999 gives the average per-token decode latency
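The arithmetic behind the two bullets above can be sketched as a small helper. This is not part of vLLM itself; the function name, parameters, and example timings are hypothetical, and the timings would come from running the sync-mode benchmark with the input/output lengths listed:

```python
def estimate_phase_latencies(lat_prefill_heavy: float,
                             lat_a: float,
                             lat_b: float,
                             long_output_len: int = 1000) -> tuple[float, float]:
    """Estimate prefill and decode latency from three sync-mode measurements.

    lat_prefill_heavy: latency for input_len=1000, output_len=1
                       (dominated by the prefill phase)
    lat_a: latency A for input_len=1, output_len=1
    lat_b: latency B for input_len=1, output_len=long_output_len
    """
    # With output_len=1 the run is essentially all prefill.
    prefill_latency = lat_prefill_heavy
    # B and A differ only in the number of decoded tokens, so the
    # difference divided by the extra (long_output_len - 1) tokens
    # approximates the per-token decode latency: (B - A) / 999.
    decode_latency = (lat_b - lat_a) / (long_output_len - 1)
    return prefill_latency, decode_latency


# Example with made-up timings (seconds):
prefill, decode = estimate_phase_latencies(0.12, 0.02, 10.01)
```

With these made-up numbers the decode estimate is (10.01 - 0.02) / 999 = 0.01 s/token; the prefill number is only meaningful for the specific input length it was measured at.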
Hi, now do you have some other ways for profiling prefill and decode phase separately?
There's an ongoing PR related to this: https://github.com/vllm-project/vllm/pull/2809
Your current environment
I'm attempting to independently measure the performance (e.g., latency, throughput, etc.) of the prefill and decode phases. Is there a way to achieve this? I have noticed a few benchmarks that measure end-to-end throughput and latency but do not provide separate metrics for each phase.
I would greatly appreciate any guidance on profiling these two phases separately.
How would you like to use vllm
No response