vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Profiling Prefill and Decode Phases Separately #4900

Open Msiavashi opened 1 month ago

Msiavashi commented 1 month ago

Your current environment

I'm attempting to independently measure the performance (e.g., latency, throughput, etc.) of the prefill and decode phases. Is there a way to achieve this? I have noticed a few benchmarks that measure end-to-end throughput and latency but do not provide separate metrics for each phase.

I would greatly appreciate any guidance on profiling these two phases separately.

How would you like to use vllm

No response

leiwen83 commented 1 month ago

Stream mode gives you each token's latency, so the prefill and decode phases can be measured from the per-token timestamps. Since the current benchmarks use sync mode, another workaround is the following (a sketch follows the list):

  1. Measure latency with input_len=1000, output_len=1; this gives the prefill latency for input_len=1000.
  2. Measure latency with input_len=1, output_len=1 to get average latency A, then with input_len=1, output_len=1000 to get average latency B. (B - A) / 999 then gives the per-token decode latency.
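
A minimal sketch of this workaround, assuming the offline `vllm.LLM` API (the exact `generate` signature may differ between vLLM versions); the model name and dummy token ids below are only placeholders:

```python
import time

from vllm import LLM, SamplingParams

# Placeholder model; any model works the same way for this measurement.
llm = LLM(model="facebook/opt-125m")

def timed_generate(prompt_token_ids, max_tokens):
    """Run one generation and return its wall-clock latency in seconds."""
    params = SamplingParams(max_tokens=max_tokens, ignore_eos=True)
    start = time.perf_counter()
    llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=params)
    return time.perf_counter() - start

# 1. Prefill latency: 1000 input tokens, a single output token.
long_prompt = [[0] * 1000]   # dummy token ids standing in for input_len=1000
prefill_latency = timed_generate(long_prompt, max_tokens=1)

# 2. Per-token decode latency: difference the 1-token and 1000-token runs
#    for input_len=1 and divide by the 999 extra decode steps.
short_prompt = [[0]]
latency_a = timed_generate(short_prompt, max_tokens=1)
latency_b = timed_generate(short_prompt, max_tokens=1000)
decode_latency_per_token = (latency_b - latency_a) / 999

print(f"prefill latency (input_len=1000): {prefill_latency:.3f} s")
print(f"decode latency per token: {decode_latency_per_token * 1000:.2f} ms")
```

Run each measurement several times and average the results; a single run is noisy because of warmup and scheduling effects.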
Msiavashi commented 1 month ago

So there is still no built-in mechanism for these measurements/profiling, right?

KevinZeng08 commented 3 weeks ago

> Stream mode gives you each token's latency, so the prefill and decode phases can be measured from the per-token timestamps. Since the current benchmarks use sync mode, another workaround is:
>
>   1. Measure latency with input_len=1000, output_len=1; this gives the prefill latency for input_len=1000.
>   2. Measure latency with input_len=1, output_len=1 to get average latency A, then with input_len=1, output_len=1000 to get average latency B. (B - A) / 999 then gives the per-token decode latency.

Hi, have you found any other ways to profile the prefill and decode phases separately?

kerthcet commented 2 weeks ago

There's an ongoing related PR: https://github.com/vllm-project/vllm/pull/2809