vllm-project / vllm


[Performance]: The accept rate of typical acceptance sampling #8639

Open hustxiayang opened 6 days ago

hustxiayang commented 6 days ago

Proposal to improve performance

No response

Report of performance regression

I tested the accept length (number of tokens per step) with typical acceptance sampling. The accept length is even smaller than with the default rejection sampling method. Here are my experimental details:

  1. The dataset I used was mt_bench.
  2. Speculative decoding setups (see the configuration sketch after this list):
     - Llama 3.1 8B as the target model with Qwama-0.5B-Instruct as the draft model (number of speculative tokens: 2).
     - Llama 3.1 8B as the target model with an MLP speculator.
  3. Temperature was set to 0.9.
  4. posterior_threshold and posterior_alpha were set to their default values.
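For reference, here is a minimal sketch of how such a setup can be configured in vLLM. The speculative-decoding argument names follow the engine args as I understand them around v0.6.x and may differ in other versions; the draft-model path is a placeholder.

```python
from vllm import LLM, SamplingParams

# Illustrative sketch of the draft-model setup above (not the exact script I ran);
# argument names may vary by vLLM version, and the draft-model path is a placeholder.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",        # target model
    speculative_model="path/to/Qwama-0.5B-Instruct",      # draft model (placeholder path)
    num_speculative_tokens=2,
    spec_decoding_acceptance_method="typical_acceptance_sampler",
    # typical_acceptance_sampler_posterior_threshold=0.09,  # default
    # typical_acceptance_sampler_posterior_alpha=0.3,       # default
)

params = SamplingParams(temperature=0.9, max_tokens=256)
outputs = llm.generate(["Summarize the MT-Bench dataset in one sentence."], params)
print(outputs[0].outputs[0].text)
```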

Do you have some experimental results on this? Or do I need to tune some parameters for typical acceptance sampling? Thanks a lot!

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

LiuXiaoxuanPKU commented 1 day ago

Hi, thanks for the question! I ran a quick benchmark for typical acceptance:

Settings:

  - Model: lmsys/vicuna-7b-v1.3
  - Draft model: abhigoyal/vllm-medusa-vicuna-7b-v1.3
  - Hardware: 1x H100
  - Dataset: ShareGPT
  - vLLM version: v0.6.1.post2
  - Request rate: 1 req/s
  - Sampling method: greedy decoding

| Results | w/o SD | SD with rejection sampling | SD with typical acceptance before #8562 | SD with typical acceptance after #8562 |
|---|---|---|---|---|
| median TTFT (ms) | 13.30 | 12.79 | 13.46 | 12.97 |
| median TPOT (ms) | 6.97 | 5.43 | 7.81 | 5.53 |
| median end-to-end (s) | 1.10 | 0.83 | 1.17 | 0.84 |

I also checked the token acceptance rates:

  - SD with rejection sampling: draft acceptance rate 0.287, system efficiency 0.422
  - SD with typical acceptance before #8562: draft acceptance rate 0.283, system efficiency 0.317
  - SD with typical acceptance after #8562: draft acceptance rate 0.293, system efficiency 0.427

Notice that in the results above, typical acceptance (after #8562) performs similarly to rejection sampling because we are doing greedy decoding.

Some comments here:

  1. After merging this PR (https://github.com/vllm-project/vllm/pull/8562), the acceptance rate should be higher because we accept one more 'recovered' token.
  2. I checked the default values of posterior_threshold = 0.09 and posterior_alpha = 0.3; they are already small. You can reduce them further to see whether you get any benefit, but that might affect generation quality. I have not tested this thoroughly, so feel free to give it a try and share your results here.
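For context on what these two knobs control: as I understand it, typical acceptance follows the Medusa-style criterion, where a draft token is accepted when the target model's probability for it exceeds min(posterior_threshold, posterior_alpha * exp(-entropy of the target distribution)). A rough, self-contained sketch of that criterion (not the actual vLLM implementation) is below.

```python
import torch

def typical_acceptance_mask(target_probs: torch.Tensor,
                            draft_token_ids: torch.Tensor,
                            posterior_threshold: float = 0.09,
                            posterior_alpha: float = 0.3) -> torch.Tensor:
    """Medusa-style typical acceptance sketch: accept a draft token when the
    target model's probability for it exceeds
    min(posterior_threshold, posterior_alpha * exp(-entropy))."""
    # Probability the target model assigns to each proposed draft token.
    candidate_prob = target_probs.gather(-1, draft_token_ids.unsqueeze(-1)).squeeze(-1)
    # Entropy of the target distribution at each position.
    entropy = -torch.sum(target_probs * torch.log(target_probs + 1e-5), dim=-1)
    # Smaller posterior_threshold / posterior_alpha lowers the bar,
    # so more draft tokens get accepted.
    threshold = torch.minimum(
        torch.full_like(entropy, posterior_threshold),
        posterior_alpha * torch.exp(-entropy),
    )
    return candidate_prob > threshold
```

So lowering either value lowers the acceptance threshold and lets more draft tokens through, which is why it can trade generation quality for a higher acceptance rate.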