vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: how to understand logs (when speculative decoding) #9539

Open chenchunhui97 opened 23 hours ago

chenchunhui97 commented 23 hours ago

Your current environment

vllm 0.6.3

How would you like to use vllm

When running vLLM v0.6.3, I see log lines like these:

INFO 10-16 19:05:47 metrics.py:367] Speculative metrics: Draft acceptance rate: 0.642, System efficiency: 0.598, Number of speculative tokens: 3, Number of accepted tokens: 8547, Number of draft tokens: 13311, Number of emitted tokens: 10616.
INFO 10-16 19:06:01 metrics.py:361] Prefix cache hit rate: GPU: 94.35%, CPU: 0.00%

I want to know the meaning of "System efficiency" and "Number of emitted tokens".


LiuXiaoxuanPKU commented 4 hours ago

Hi, thanks for asking! The definitions of these metrics can be found here.
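In short, if I read the metrics code correctly: the draft acceptance rate is the number of accepted draft tokens divided by the number of proposed draft tokens, and system efficiency compares the tokens actually emitted against the maximum possible, where each scoring step can emit at most k + 1 tokens (k speculative tokens plus one bonus token sampled by the target model). A minimal Python sanity check against the numbers in the log above (the per-step breakdown is an assumption, not taken verbatim from the source):

# Sanity-check of the speculative decoding metrics printed in the log.
# Assumed definitions:
#   draft acceptance rate = accepted draft tokens / proposed draft tokens
#   system efficiency     = emitted tokens / maximum possible emitted tokens,
#     where each scoring step can emit at most k + 1 tokens
#     (k speculative tokens plus one bonus token from the target model).

k = 3            # "Number of speculative tokens" per step
accepted = 8547  # "Number of accepted tokens"
draft = 13311    # "Number of draft tokens" proposed by the draft model
emitted = 10616  # "Number of emitted tokens" actually produced for the user

acceptance_rate = accepted / draft
num_steps = draft // k                 # 13311 / 3 = 4437 scoring steps
max_emitted = num_steps * (k + 1)      # at most k + 1 tokens emitted per step
system_efficiency = emitted / max_emitted

print(f"Draft acceptance rate: {acceptance_rate:.3f}")   # -> 0.642
print(f"System efficiency:     {system_efficiency:.3f}")  # -> 0.598

Both computed values match the logged 0.642 and 0.598. Note that even when every draft token is rejected, a step still emits the one token from the target model, so under this definition system efficiency is bounded below by 1 / (k + 1).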