hustxiayang opened 6 days ago
Hi, thanks for the question! I did some quick benchmark for typical acceptance:
Settings:

- Model: lmsys/vicuna-7b-v1.3
- Draft model: abhigoyal/vllm-medusa-vicuna-7b-v1.3
- Hardware: 1x H100
- Dataset: ShareGPT
- vLLM version: v0.6.1.post2
- Request rate: 1 req/s
- Sampling method: greedy decoding
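For anyone who wants to reproduce this, a server launch along these lines should work. The flag names below are my best recollection of the v0.6.x engine arguments, so treat them as assumptions and double-check against `vllm serve --help`:

```shell
# Serve the target model with a Medusa draft model and typical acceptance.
# Flag names are assumptions from vLLM v0.6.x; verify with `vllm serve --help`.
vllm serve lmsys/vicuna-7b-v1.3 \
  --speculative-model abhigoyal/vllm-medusa-vicuna-7b-v1.3 \
  --num-speculative-tokens 5 \
  --spec-decoding-acceptance-method typical_acceptance_sampler
```

Omitting `--spec-decoding-acceptance-method` (or setting it to `rejection_sampler`) gives the rejection sampling baseline.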
Results:

| | w/o SD | SD with rejection sampling | SD with typical acceptance before #8562 | SD with typical acceptance after #8562 |
|---|---|---|---|---|
| median TTFT (ms) | 13.30 | 12.79 | 13.46 | 12.97 |
| median TPOT (ms) | 6.97 | 5.43 | 7.81 | 5.53 |
| median end-to-end latency (s) | 1.10 | 0.83 | 1.17 | 0.84 |
I also checked the token acceptance rates:

- SD with rejection sampling: draft acceptance rate 0.287, system efficiency 0.422
- SD with typical acceptance before #8562: draft acceptance rate 0.283, system efficiency 0.317
- SD with typical acceptance after #8562: draft acceptance rate 0.293, system efficiency 0.427
Notice that in the results above, typical acceptance (after #8562) performs similarly to rejection sampling because we are doing greedy decoding.
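For reference, here is roughly how the two metrics above relate to tokens emitted per step. This is a simplified sketch assuming vLLM's definitions (acceptance rate = accepted draft tokens / proposed draft tokens; system efficiency = emitted tokens / maximum possible emitted tokens, where each step can emit at most the accepted drafts plus one bonus token); the function name and counts are illustrative, not vLLM code:

```python
def spec_decode_metrics(accepted_tokens: int, draft_tokens: int,
                        emitted_tokens: int, num_steps: int,
                        num_spec_tokens: int) -> tuple[float, float]:
    """Sketch of speculative-decoding metrics; all counts are totals
    over num_steps scheduler steps."""
    draft_acceptance_rate = accepted_tokens / draft_tokens
    # Each step can emit at most num_spec_tokens accepted drafts + 1 bonus token.
    system_efficiency = emitted_tokens / (num_steps * (num_spec_tokens + 1))
    return draft_acceptance_rate, system_efficiency

# Hypothetical example: 5 draft heads, 100 steps, ~29% of drafts accepted.
rate, eff = spec_decode_metrics(accepted_tokens=145, draft_tokens=500,
                                emitted_tokens=245, num_steps=100,
                                num_spec_tokens=5)
print(round(rate, 3), round(eff, 3))  # 0.29 0.408
```

This is why system efficiency is higher than the raw acceptance rate: the bonus token from the target model counts toward emitted tokens even when every draft is rejected.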
Some comments here: the defaults are `posterior_threshold = 0.09` and `posterior_alpha = 0.3`, which are already small. You can reduce them further to see if you get any benefit, but that might affect generation quality. I have not tested this thoroughly, so feel free to give it a try and share your results here.
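For context on what these two parameters do, here is a sketch of the Medusa-style typical acceptance rule: a draft token is accepted when the target model's probability for it clears a bar that is the minimum of a fixed threshold and an entropy-scaled term. The function below is an illustrative NumPy rendition under that assumption, not vLLM's actual implementation:

```python
import numpy as np

def typical_accept(target_probs: np.ndarray, candidate_id: int,
                   posterior_threshold: float = 0.09,
                   posterior_alpha: float = 0.3) -> bool:
    """Accept a draft token if the target model's probability for it
    exceeds min(posterior_threshold, posterior_alpha * exp(-entropy))."""
    p = target_probs
    entropy = -np.sum(p * np.log(p + 1e-10))
    # Flat target distributions (high entropy) shrink the entropy term,
    # making the fixed posterior_threshold the binding constraint.
    threshold = min(posterior_threshold, posterior_alpha * np.exp(-entropy))
    return bool(p[candidate_id] > threshold)

# Peaky target distribution: the argmax token clears the bar easily,
# while a low-probability token is rejected.
peaky = np.array([0.97, 0.01, 0.01, 0.01])
print(typical_accept(peaky, 0))  # True
print(typical_accept(peaky, 1))  # False
```

Lowering `posterior_threshold` and `posterior_alpha` lowers the bar, so more draft tokens are accepted, at the cost of sometimes keeping tokens the target model considers unlikely.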
Proposal to improve performance
No response
Report of performance regression
I tested the accept length (number of tokens per step) with typical acceptance sampling. The accept length is even smaller than with the default rejection sampling method. Here are my experimental details: `posterior_threshold` and `posterior_alpha` were set to their default values. Do you have any experimental results on this? Or do I need to tune some parameters for typical acceptance sampling? Thanks a lot!

Misc discussion on performance
No response
Your current environment (if you think it is necessary)