vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

[Usage]: Get first token latency #8471

Open · khayamgondal opened this issue 2 months ago

khayamgondal commented 2 months ago

Is there a way to get the first token latency? `benchmarks/benchmark_latency.py` reports the latency of processing a single batch of requests, but I am interested in the latency to the first generated token.
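
For reference, one way to measure this by hand is to time a streaming request until its first token arrives. The sketch below is an illustration, not the project's benchmark: it assumes a vLLM OpenAI-compatible server is already running on localhost and that the `openai` Python client is installed; the base URL, API key, and model name are placeholders, and the client-side timing includes network and HTTP overhead.

```python
import time

from openai import OpenAI

# Placeholder connection details: assumes a vLLM OpenAI-compatible server
# was started separately, e.g. `vllm serve <model>`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None

# Stream the response and stop at the first chunk that carries content.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV caching in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        first_token_time = time.perf_counter()
        break
stream.close()

if first_token_time is not None:
    print(f"Approximate first-token latency: {first_token_time - start:.3f} s")
```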


LiuXiaoxuanPKU commented 2 months ago

Feel free to take a look at `benchmark_serving.py`; it includes most mainstream serving metrics, such as TTFT (time to first token) and TPOT (time per output token).
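
For example, running it against a live server with something like `python benchmarks/benchmark_serving.py --backend vllm --model <model-name> --dataset-name sharegpt --dataset-path <sharegpt-json> --num-prompts 100` prints summary statistics that include mean, median, and P99 TTFT. Exact flags can vary between vLLM versions, so check `python benchmarks/benchmark_serving.py --help` for the options available in your checkout.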