vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: why is speculative decoding slower than normal decoding? #8439

Open yunll opened 1 month ago

yunll commented 1 month ago

Your current environment

The startup commands are as follows: one launches a standard 7B model and the other an n-gram speculative model. Speed tests show that the speculative setup is slower.

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 9000 --model Qwen2-7B-Instruct -tp 1 --gpu_memory_utilization 0.9

CUDA_VISIBLE_DEVICES=3 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 9002 --model Qwen2-7B-Instruct -tp 1 --speculative_model [gram] --use-v2-block-manager --num_speculative_tokens 5 --ngram-prompt-lookup-max 4 --gpu_memory_utilization 0.9
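For reference, the second server command can also be mirrored with vLLM's offline `LLM` API. This is a rough sketch, assuming a vLLM version from around the time of this thread where these engine arguments are accepted; the prompt is a placeholder, and the baseline run is the same call without the speculative arguments:

```python
from vllm import LLM, SamplingParams

# Ngram speculative decoding, mirroring the second server command above.
# Dropping the four speculative arguments gives the plain 7B baseline.
llm = LLM(
    model="Qwen2-7B-Instruct",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    speculative_model="[ngram]",      # built-in ngram prompt-lookup proposer
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
    use_v2_block_manager=True,
)

params = SamplingParams(temperature=0.0, max_tokens=1000)  # greedy decoding
outputs = llm.generate(["<your test prompt here>"], params)
print(outputs[0].outputs[0].text)
```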

result
7b:
first token:  0.04074668884277344s
decode time:  14.328832149505615s
output token:  1000
decode speed:  69.78935823702163 token/s

spec 7b
first token:  0.02350592613220215s
decode time:  15.324904918670654s
output token:  947
decode speed:  61.794836902788866 token/s
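For clarity, the decode-speed figures above appear to be simply output tokens divided by decode time; a quick check:

```python
# Reproduce the reported decode speeds from the numbers above.
def decode_speed(output_tokens: int, decode_time_s: float) -> float:
    return output_tokens / decode_time_s

print(decode_speed(1000, 14.328832149505615))  # ~69.79 token/s (plain 7B)
print(decode_speed(947, 15.324904918670654))   # ~61.79 token/s (spec 7B)
```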


ShangmingCai commented 1 month ago

Can you share the prompt and sampling params you used for the test?

In case you didn't know, you need to set the sampling params to use greedy decoding, because the spec decode module only supports a Top-1 proposer for now. Also, the performance gain of speculative decoding depends on the draft acceptance rate. Since you are using ngram for the test, your prompt should contain information worth looking up or retrieving (the ngram proposer drafts tokens by matching against the prompt); otherwise, you should use a draft model such as 'Qwen2-0.5B-Instruct', and then you should be able to observe some performance gain.
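For example, a greedy-decoding request against the speculative server started above could look roughly like this (a sketch; the prompt is a placeholder, and top_k/ignore_eos are vLLM-specific extensions to the OpenAI completion schema):

```python
import requests

# Greedy decoding: temperature 0 and top_k 1, so spec decode should not
# change the output, only the speed.
payload = {
    "model": "Qwen2-7B-Instruct",
    "prompt": "<your test prompt here>",
    "max_tokens": 1000,
    "temperature": 0.0,
    "top_k": 1,
    "ignore_eos": True,
}
resp = requests.post("http://localhost:9002/v1/completions", json=payload)
print(resp.json()["choices"][0]["text"])
```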

yunll commented 1 month ago

> Can you share the prompt and sampling params you used for the test?
>
> In case you didn't know, you need to set the sampling params to use greedy decoding, because the spec decode module only supports a Top-1 proposer for now. Also, the performance gain of speculative decoding depends on the draft acceptance rate. Since you are using ngram for the test, your prompt should contain information worth looking up or retrieving (the ngram proposer drafts tokens by matching against the prompt); otherwise, you should use a draft model such as 'Qwen2-0.5B-Instruct', and then you should be able to observe some performance gain.

The above results were incorrect. Below are the correct test data:

7b-rag:
first token: 0.46736812591552734s
decode time: 15.003458261489868s
output token: 1000
decode speed: 66.65130015836084 token/s

spec 7b-rag:
first token: 0.45102858543395996s
decode time: 15.191270112991333s
output token: 830
decode speed: 54.636642876239634 token/s

sampling params: "n": 1, "max_tokens": 1000, "temperature": 0.0, "top_k": -1, "top_p": 1.0, "ignore_eos": True,
"stream": stream

My prompt should already contain most of the needed information: it includes multiple retrieved news articles and asks for a summary based on them.

some vllm logs: INFO 09-13 16:58:22 metrics.py:373] Speculative metrics: Draft acceptance rate: 0.546, System efficiency: 0.142, Number of speculative tokens: 10, Number of accepted tokens: 22170, Number of draft tokens: 40610, Number of emitted tokens: 6355.

ShangmingCai commented 1 month ago

Can you try "top_k": 1 with "temperature": 0? The result is odd: spec decode is not supposed to change the output if you are using greedy decoding. If the output is still different with and without spec decode, try this config: --spec-decoding-acceptance-method=typical_acceptance_sampler

> some vllm logs: INFO 09-13 16:58:22 metrics.py:373] Speculative metrics: Draft acceptance rate: 0.546, System efficiency: 0.142, Number of speculative tokens: 10, Number of accepted tokens: 22170, Number of draft tokens: 40610, Number of emitted tokens: 6355.

Also, why does the logger print 'Number of speculative tokens: 10'? I think you set '--num_speculative_tokens 5'.

This parameter is not a case of 'the higher the better'; you can try lowering it to 5 or 3.

yunll commented 1 month ago

> Can you try "top_k": 1 with "temperature": 0? The result is odd: spec decode is not supposed to change the output if you are using greedy decoding. If the output is still different with and without spec decode, try this config: --spec-decoding-acceptance-method=typical_acceptance_sampler
>
> > some vllm logs: INFO 09-13 16:58:22 metrics.py:373] Speculative metrics: Draft acceptance rate: 0.546, System efficiency: 0.142, Number of speculative tokens: 10, Number of accepted tokens: 22170, Number of draft tokens: 40610, Number of emitted tokens: 6355.
>
> Also, why does the logger print 'Number of speculative tokens: 10'? I think you set '--num_speculative_tokens 5'.
>
> This parameter is not a case of 'the higher the better'; you can try lowering it to 5 or 3.

Thanks.

I tried "top_k": 1 with "temperature": 0 and set '--num_speculative_tokens 5', but the result is still the same as before, and the output is identical across multiple runs.

Can you tell me what "System efficiency: 0.142" and "Number of emitted tokens: 6355" mean in the log?

ShangmingCai commented 1 month ago

> Thanks.
>
> I tried "top_k": 1 with "temperature": 0 and set '--num_speculative_tokens 5', but the result is still the same as before, and the output is identical across multiple runs.
>
> Can you tell me what "System efficiency: 0.142" and "Number of emitted tokens: 6355" mean in the log?

Draft tokens might be -1. In my opinion, emitted tokens stand for the valid output tokens (not -1) that can be scored, including the bonus token, but I'm not quite sure. You can check the source code; I will give you some pointers below (a small worked example follows the links).

https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/metrics.py#L12-L45

https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/metrics.py#L163-L166

https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/metrics.py#L178-L196

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/spec_decode_base_sampler.py#L60-L129
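To see how the logged numbers fit together, here is a small sketch based on a reading of the metrics code linked above; the key assumption is that each scoring step can emit at most num_speculative_tokens + 1 tokens (the accepted drafts plus one bonus token):

```python
# Numbers copied from the log line quoted earlier in this thread.
num_spec_tokens = 10
accepted_tokens = 22170
draft_tokens = 40610
emitted_tokens = 6355

# Draft acceptance rate: fraction of proposed draft tokens the target model accepted.
draft_acceptance_rate = accepted_tokens / draft_tokens
print(f"{draft_acceptance_rate:.3f}")   # ~0.546, matches the log

# Each scoring step proposes num_spec_tokens drafts and can emit at most
# num_spec_tokens + 1 tokens (the accepted drafts plus one bonus token).
scoring_steps = draft_tokens / num_spec_tokens
max_emitted = scoring_steps * (num_spec_tokens + 1)

# System efficiency: emitted tokens relative to that theoretical maximum.
system_efficiency = emitted_tokens / max_emitted
print(f"{system_efficiency:.3f}")       # ~0.142, matches the log
```

An acceptance rate of roughly 0.55 combined with a system efficiency of only about 0.14 would mean most of the speculative work is thrown away, which is consistent with the speculative run being slower than the baseline here.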