yunll opened this issue 1 month ago
Can you share the prompt and sampling params you used to test?
In case you didn't know, you need to set the sampling params to use greedy decoding, because the spec decode module only supports a Top-1 proposer for now. Also, the performance gain of speculative decoding depends on the draft acceptance rate. Since you are using ngram for the test, your prompt should contain information that can be looked up and retrieved (i.e., spans the output is likely to repeat). Otherwise, use a draft model such as 'Qwen2-0.5B-Instruct'; then you should be able to observe some performance gain.
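For reference, a minimal offline sketch of that setup, assuming vLLM's `LLM`/`SamplingParams` API with the built-in ngram proposer; the target model name, lookup window, and draft-token count below are illustrative, not taken from this issue:

```python
from vllm import LLM, SamplingParams

# Greedy decoding, so spec decode should not change the output.
greedy = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=1000)

# ngram speculative decoding: drafts are looked up in the prompt itself,
# so it only helps when the output repeats spans already present in the prompt.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",   # illustrative target model
    speculative_model="[ngram]",      # built-in prompt-lookup proposer
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
    use_v2_block_manager=True,        # required by spec decode on some vLLM versions
)

print(llm.generate(["<your RAG prompt here>"], greedy)[0].outputs[0].text)
```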
My earlier results were incorrect. Below are the corrected test data:
7b-rag: first token 0.46736812591552734 s, decode time 15.003458261489868 s, output tokens 1000, decode speed 66.65130015836084 token/s
spec 7b rag: first token 0.45102858543395996 s, decode time 15.191270112991333 s, output tokens 830, decode speed 54.636642876239634 token/s
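(Decode speed here is consistent with output tokens divided by decode time, with the first-token latency excluded; a quick check against the numbers above:)

```python
# decode speed = output tokens / decode time (first-token latency excluded)
print(1000 / 15.003458261489868)  # ~66.65 token/s, plain 7b-rag
print(830 / 15.191270112991333)   # ~54.64 token/s, 7b-rag with ngram spec decode
```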
sampling params:
"n": 1,
"max_tokens": 1000,
"temperature": 0.0,
"top_k": -1,
"top_p": 1.0,
"ignore_eos": True,
"stream": stream
My prompt should already contain most of the needed information: it includes multiple retrieved news articles and asks for a summary based on those articles.
some vllm logs: INFO 09-13 16:58:22 metrics.py:373] Speculative metrics: Draft acceptance rate: 0.546, System efficiency: 0.142, Number of speculative tokens: 10, Number of accepted tokens: 22170, Number of draft tokens: 40610, Number of emitted tokens: 6355.
Can you try "top_k": 1 with "temperature": 0? The result is odd. Spec decode is not supposed to change the output if you are using greedy decoding. If the output is still different with and without spec decode, try this config:
--spec-decoding-acceptance-method=typical_acceptance_sampler
Also, why does the logger print 'Number of speculative tokens: 10'? I thought you had set '--num_speculative_tokens 5'.
This parameter is not a case of higher-is-better; you can try lowering it to 5 or 3.
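Concretely, the suggested retry only changes the client-side sampling params, plus (if outputs still differ) the server-side flags; a sketch, with the flags shown as comments since the full launch command isn't in this thread:

```python
# Strict greedy sampling: with top_k=1 and temperature=0, spec decode should
# produce bit-identical output to the non-speculative run.
retry_params = {
    "n": 1,
    "max_tokens": 1000,
    "temperature": 0.0,
    "top_k": 1,        # was -1 in the original test
    "top_p": 1.0,
    "ignore_eos": True,
}

# If the output still differs, relaunch the server with, e.g.:
#   --num-speculative-tokens 5
#   --spec-decoding-acceptance-method=typical_acceptance_sampler
```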
Thanks.
I tried "top_k": 1 with "temperature": 0 and set '--num_speculative_tokens 5', but the result is still the same as before, and it is the same across multiple runs.
Can you tell me what "System efficiency: 0.142" and "Number of emitted tokens: 6355" mean in the log?
Draft tokens might be -1. In my opinion, emitted tokens stand for the valid output tokens (the ones that are not -1) that can be scored, including the bonus token. But I'm not quite sure; you can check the source code. I will give you some pointers (see also the quick calculation after them):
https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/metrics.py#L12-L45
https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/metrics.py#L163-L166
https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/metrics.py#L178-L196
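Plugging the logged numbers into the formulas at those pointers reproduces both metrics; a quick check, assuming (as in metrics.py) that each speculated sequence can emit at most k accepted tokens plus one bonus token:

```python
k               = 10      # "Number of speculative tokens" from the log
draft_tokens    = 40610
accepted_tokens = 22170
emitted_tokens  = 6355

# Draft acceptance rate: fraction of proposed draft tokens the target model accepted.
print(accepted_tokens / draft_tokens)                # ~0.546

# System efficiency: emitted tokens vs. the maximum possible if every proposal
# were accepted (k accepted tokens + 1 bonus token per speculated sequence).
max_emitted = (draft_tokens // k) * (k + 1)
print(emitted_tokens / max_emitted)                  # ~0.142
```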
Your current environment
The startup command is as follows: it launches both a standard 7B model and an ngram speculative model. Speed tests show that the speculative setup runs more slowly.