Open cermeng opened 4 weeks ago
@cadedaniel I'm wondering if you can provide me with some feedback and suggestions. I'm glad to contribute.
The idea is good and a contribution here is welcome. My primary concern is latency overheads from metrics collection; i.e. the additional logic required to parse the acceptance rate into per-sequence acceptance info.
Suggestion (either/or):
BTW, similar discussion happening here https://github.com/vllm-project/vllm/discussions/7522
Also, you can enable the metrics to be printed in benchmark_latency.py via:
diff --git a/benchmarks/benchmark_latency.py b/benchmarks/benchmark_latency.py
index 97afd301c8f..0ee2bfabb82 100644
--- a/benchmarks/benchmark_latency.py
+++ b/benchmarks/benchmark_latency.py
@@ -47,6 +47,7 @@ def main(args: argparse.Namespace):
distributed_executor_backend=args.distributed_executor_backend,
otlp_traces_endpoint=args.otlp_traces_endpoint,
enable_prefix_caching=args.enable_prefix_caching,
+ disable_log_stats=False,
)
sampling_params = SamplingParams(
Hi, I'm evaluating speculative decoding, and I'm not able to get any gain from it.
I tested opt 2.7B/llama 3.1 8B/llama 3 8B with the following server configuration parameters:
Using a draft model
--port=8080 --model=/mnt/models --served-model-name={{.Name}} --distributed-executor-backend=mp --use-v2-block-manager --enforce-eager --speculative-model=/mnt/models/-accelerator --num-speculative-tokens=3 --ngram-prompt-lookup-max=3
Using n-gram
--port=8080 --model=/mnt/models --served-model-name={{.Name}} --distributed-executor-backend=mp --use-v2-block-manager --enforce-eager --speculative-model=[ngram] --num-speculative-tokens=3 --ngram-prompt-lookup-max=3
Speculative decoding disabled
--port=8080 --model=/mnt/models --served-model-name={{.Name}} --distributed-executor-backend=mp --use-v2-block-manager --enforce-eager
And the overall behavior is that the least performant approach is with the draft model, then n-gram, and the best case is with speculative decoding off. These results are on an A100-40GB with vLLM 0.5.3.post1.
Is there any guide on the best configuration, or on scenarios where we can get the most out of this feature?
Thanks!
Hey! Thanks for the interest. What's the draft model you are using for opt 2.7B/llama 3.1 8B/llama 3 8B?
ngram is normally good for document QA or summarization; it's not good for online chatting. The performance of speculative decoding is workload-, model-, and hardware-dependent.
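To make the workload dependence concrete: with a per-token draft acceptance rate α and k speculative tokens per step, the standard speculative-sampling analysis gives an expected (1 − α^(k+1)) / (1 − α) tokens emitted per target-model forward pass. A quick sketch (pure Python, no vLLM dependency; α values are illustrative):

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model step, assuming each of
    the k draft tokens is accepted independently with probability alpha
    (standard speculative-sampling analysis; +1 is the bonus token)."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# With k=3 draft tokens (as in the configs above):
for alpha in (0.3, 0.6, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, 3):.2f} tokens/step")
```

At low acceptance rates the expected gain is close to 1 token per step, so the draft model's overhead dominates and end-to-end latency gets worse, which matches what you observed.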
🚀 The feature, motivation and pitch
I am looking to assess the performance of vllm for speculative decode, but I have been unable to find an offline benchmark script similar to benchmark_latency.py that would allow me to test speculative decode performance. While I can use benchmark_latency.py to obtain e2e latency, it does not provide all of the spec-decode metrics such as the time spent on scoring, verifying, and proposing, as well as the acceptance rate.
Thanks to @cadedaniel's excellent contributions such as https://github.com/vllm-project/vllm/pull/6963 and https://github.com/vllm-project/vllm/pull/3103, we are now able to display spec-decode metrics, including scoring time, verification time, proposal time, and acceptance rate, in the server logging.
However, these metrics can only be viewed in online server logs and are implemented through an asynchronous collector, which could result in inaccuracies. I am considering adding a script called 'benchmark_spec_decode.py' for spec-decode benchmarking in order to capture more spec-decode-related metrics.
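To sidestep the async-collector inaccuracy, such a script could time each phase synchronously around the call sites. A minimal sketch of the idea only — the propose/score/verify stubs below stand in for the real vLLM components, whose APIs are not reproduced here:

```python
import time
from collections import defaultdict

def timed(phase_times, name, fn, *args, **kwargs):
    """Run fn and accumulate its wall-clock time under the given phase name."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    phase_times[name] += time.perf_counter() - start
    return result

# Stubs; a real benchmark_spec_decode.py would call into vLLM here.
def propose(prompt): return ["tok"] * 3          # draft proposal
def score(tokens):   return [0.5] * len(tokens)  # target-model scoring
def verify(scores):  return sum(s > 0.4 for s in scores)  # accepted count

phase_times = defaultdict(float)
accepted = emitted = drafted = 0
for _ in range(100):  # 100 simulated decode steps
    tokens = timed(phase_times, "proposal", propose, "prompt")
    scores = timed(phase_times, "scoring", score, tokens)
    n_ok = timed(phase_times, "verification", verify, scores)
    drafted += len(tokens)
    accepted += n_ok
    emitted += n_ok + 1  # bonus token from the target model

print({k: f"{v * 1e3:.3f} ms" for k, v in phase_times.items()})
print(f"draft acceptance rate: {accepted / drafted:.2f}")
```

Because each phase is timed at its call site rather than sampled by a background collector, the per-phase numbers add up to the measured step time.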
Proposal
Add a new field spec_decode_metrics to RequestMetrics:
https://github.com/vllm-project/vllm/blob/9587b050fba00c3c35da05d3512bf7e351914a50/vllm/sequence.py#L87-L112
We can also consolidate the class SpecDecodeWorkerMetrics to hold more spec-decode-related metrics:
https://github.com/vllm-project/vllm/blob/9587b050fba00c3c35da05d3512bf7e351914a50/vllm/spec_decode/metrics.py#L13
Alternatives
No response
Additional context
No response
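As a concrete sketch of the spec_decode_metrics field proposed above: the field names are borrowed from SpecDecodeWorkerMetrics, but the classes below are simplified stand-ins, not the actual vLLM definitions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpecDecodeMetrics:
    """Per-request spec-decode metrics (illustrative field names)."""
    draft_acceptance_rate: float   # accepted_tokens / draft_tokens
    system_efficiency: float       # emitted_tokens / max possible per step
    draft_tokens: int
    emitted_tokens: int
    accepted_tokens: int

@dataclass
class RequestMetrics:
    """Simplified stand-in for vllm.sequence.RequestMetrics."""
    arrival_time: float
    first_token_time: Optional[float] = None
    finished_time: Optional[float] = None
    # Proposed addition, None when spec decode is disabled:
    spec_decode_metrics: Optional[SpecDecodeMetrics] = None

m = RequestMetrics(arrival_time=0.0)
m.spec_decode_metrics = SpecDecodeMetrics(
    draft_acceptance_rate=0.62, system_efficiency=0.55,
    draft_tokens=300, emitted_tokens=220, accepted_tokens=186)
```

Making the field Optional keeps RequestMetrics unchanged for the non-speculative path, so existing consumers of the struct are unaffected.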