vllm-project / vllm


[Performance]: The accept rate of typical acceptance sampling #8639

Open hustxiayang opened 6 days ago

hustxiayang commented 6 days ago

Proposal to improve performance

No response

Report of performance regression

I tested the accept length (number of tokens per step) with typical acceptance sampling. The accept length is even smaller than with the default rejection sampling method. Here are my experimental details:

  1. The dataset I used was mt_bench.
  2. Speculative decoding setups (see the configuration sketch after this list):
     - Llama 3.1 8B as the target model with Qwama-0.5B-Instruct as the draft model (number of speculative tokens: 2).
     - Llama 3.1 8B as the target model with an MLP speculator.
  3. Temperature was set to 0.9.
  4. posterior_threshold and posterior_alpha were set to their default values.
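For reference, here is a minimal sketch of how such a setup can be configured in vLLM. The speculative-decoding argument names follow the engine args as I understand them around v0.6.x and may differ in other versions; the draft-model path is a placeholder.

```python
from vllm import LLM, SamplingParams

# Illustrative sketch of the draft-model setup above (not the exact script I ran);
# argument names may vary by vLLM version, and the draft-model path is a placeholder.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",        # target model
    speculative_model="path/to/Qwama-0.5B-Instruct",      # draft model (placeholder path)
    num_speculative_tokens=2,
    spec_decoding_acceptance_method="typical_acceptance_sampler",
    # typical_acceptance_sampler_posterior_threshold=0.09,  # default
    # typical_acceptance_sampler_posterior_alpha=0.3,       # default
)

params = SamplingParams(temperature=0.9, max_tokens=256)
outputs = llm.generate(["Summarize the MT-Bench dataset in one sentence."], params)
print(outputs[0].outputs[0].text)
```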

Do you have some experimental results on this? Or do I need to tune some parameters for typical acceptance sampling? Thanks a lot!

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

LiuXiaoxuanPKU commented 1 day ago

Hi, thanks for the question! I ran a quick benchmark for typical acceptance:

Settings:

  - Model: lmsys/vicuna-7b-v1.3
  - Draft model: abhigoyal/vllm-medusa-vicuna-7b-v1.3
  - Hardware: 1x H100
  - Dataset: ShareGPT
  - vLLM version: v0.6.1.post2
  - Request rate: 1 req/s
  - Sampling method: greedy decoding

| Results | w/o SD | SD with rejection sampling | SD with typical acceptance before #8562 | SD with typical acceptance after #8562 |
|---|---|---|---|---|
| median TTFT (ms) | 13.30 | 12.79 | 13.46 | 12.97 |
| median TPOT (ms) | 6.97 | 5.43 | 7.81 | 5.53 |
| median end-to-end (s) | 1.10 | 0.83 | 1.17 | 0.84 |

I also checked the token acceptance rates:

  - SD with rejection sampling: draft acceptance rate 0.287, system efficiency 0.422
  - SD with typical acceptance before #8562: draft acceptance rate 0.283, system efficiency 0.317
  - SD with typical acceptance after #8562: draft acceptance rate 0.293, system efficiency 0.427

Notice that in the results above, typical acceptance (after #8562) performs similarly to rejection sampling because we are doing greedy decoding.

Some comments here:

  1. After merging this PR (https://github.com/vllm-project/vllm/pull/8562), the acceptance rate should be higher because we accept one more 'recovered' token.
  2. I checked the default values of posterior_threshold = 0.09 and posterior_alpha = 0.3; they are already small. You can reduce them further to see whether you get any benefit, but that might affect generation quality. I have not tested this thoroughly, so feel free to give it a try and share your results here.
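For context on what these two knobs control: as I understand it, typical acceptance follows the Medusa-style criterion, where a draft token is accepted when the target model's probability for it exceeds min(posterior_threshold, posterior_alpha * exp(-entropy of the target distribution)). A rough, self-contained sketch of that criterion (not the actual vLLM implementation) is below.

```python
import torch

def typical_acceptance_mask(target_probs: torch.Tensor,
                            draft_token_ids: torch.Tensor,
                            posterior_threshold: float = 0.09,
                            posterior_alpha: float = 0.3) -> torch.Tensor:
    """Medusa-style typical acceptance sketch: accept a draft token when the
    target model's probability for it exceeds
    min(posterior_threshold, posterior_alpha * exp(-entropy))."""
    # Probability the target model assigns to each proposed draft token.
    candidate_prob = target_probs.gather(-1, draft_token_ids.unsqueeze(-1)).squeeze(-1)
    # Entropy of the target distribution at each position.
    entropy = -torch.sum(target_probs * torch.log(target_probs + 1e-5), dim=-1)
    # Smaller posterior_threshold / posterior_alpha lowers the bar,
    # so more draft tokens get accepted.
    threshold = torch.minimum(
        torch.full_like(entropy, posterior_threshold),
        posterior_alpha * torch.exp(-entropy),
    )
    return candidate_prob > threshold
```

So lowering either value lowers the acceptance threshold and lets more draft tokens through, which is why it can trade generation quality for a higher acceptance rate.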