vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: how to understand logs (when speculative decoding) #9539

Open chenchunhui97 opened 23 hours ago

chenchunhui97 commented 23 hours ago

Your current environment

vllm 0.6.3

How would you like to use vllm

When running vLLM v0.6.3, I see log lines like these:

INFO 10-16 19:05:47 metrics.py:367] Speculative metrics: Draft acceptance rate: 0.642, System efficiency: 0.598, Number of speculative tokens: 3, Number of accepted tokens: 8547, Number of draft tokens: 13311, Number of emitted tokens: 10616.
INFO 10-16 19:06:01 metrics.py:361] Prefix cache hit rate: GPU: 94.35%, CPU: 0.00%

I want to know the meaning of "System efficiency" and "Number of emitted tokens".


LiuXiaoxuanPKU commented 4 hours ago

Hi, thanks for asking! The definitions of these metrics can be found here.
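In short, if I read the metrics code correctly: the draft acceptance rate is the number of accepted draft tokens divided by the number of proposed draft tokens, and system efficiency compares the tokens actually emitted against the maximum possible, where each scoring step can emit at most k + 1 tokens (k speculative tokens plus one bonus token sampled by the target model). A minimal Python sanity check against the numbers in the log above (the per-step breakdown is an assumption, not taken verbatim from the source):

# Sanity-check of the speculative decoding metrics printed in the log.
# Assumed definitions:
#   draft acceptance rate = accepted draft tokens / proposed draft tokens
#   system efficiency     = emitted tokens / maximum possible emitted tokens,
#     where each scoring step can emit at most k + 1 tokens
#     (k speculative tokens plus one bonus token from the target model).

k = 3            # "Number of speculative tokens" per step
accepted = 8547  # "Number of accepted tokens"
draft = 13311    # "Number of draft tokens" proposed by the draft model
emitted = 10616  # "Number of emitted tokens" actually produced for the user

acceptance_rate = accepted / draft
num_steps = draft // k                 # 13311 / 3 = 4437 scoring steps
max_emitted = num_steps * (k + 1)      # at most k + 1 tokens emitted per step
system_efficiency = emitted / max_emitted

print(f"Draft acceptance rate: {acceptance_rate:.3f}")   # -> 0.642
print(f"System efficiency:     {system_efficiency:.3f}")  # -> 0.598

Both computed values match the logged 0.642 and 0.598. Note that even when every draft token is rejected, a step still emits the one token from the target model, so under this definition system efficiency is bounded below by 1 / (k + 1).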