vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

[Usage]: Get first token latency #8471

Open · khayamgondal opened this issue 2 months ago

khayamgondal commented 2 months ago

Is there a way to get the first token latency? `benchmarks/benchmark_latency.py` reports the latency of processing a single batch of requests, but I am interested in the latency to the first generated token.
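
For reference, one way to measure this by hand is to time a streaming request until its first token arrives. The sketch below is an illustration, not the project's benchmark: it assumes a vLLM OpenAI-compatible server is already running on localhost and that the `openai` Python client is installed; the base URL, API key, and model name are placeholders, and the client-side timing includes network and HTTP overhead.

```python
import time

from openai import OpenAI

# Placeholder connection details: assumes a vLLM OpenAI-compatible server
# was started separately, e.g. `vllm serve <model>`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None

# Stream the response and stop at the first chunk that carries content.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV caching in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        first_token_time = time.perf_counter()
        break
stream.close()

if first_token_time is not None:
    print(f"Approximate first-token latency: {first_token_time - start:.3f} s")
```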


LiuXiaoxuanPKU commented 2 months ago

Feel free to take a look at `benchmark_serving.py`; it includes most mainstream serving metrics, such as TTFT (time to first token) and TPOT (time per output token).
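
For example, running it against a live server with something like `python benchmarks/benchmark_serving.py --backend vllm --model <model-name> --dataset-name sharegpt --dataset-path <sharegpt-json> --num-prompts 100` prints summary statistics that include mean, median, and P99 TTFT. Exact flags can vary between vLLM versions, so check `python benchmarks/benchmark_serving.py --help` for the options available in your checkout.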