Closed: ShangmingCai closed this issue 5 days ago
These are great ideas! Contributions welcome :)
Thanks! Once we have verified the performance improvement in our internal version, I will submit a PR to begin integrating this feature into the open-source repository. I will keep you updated on our progress.
Also, if there is any progress in the integration of SmartSpec, please let me know. cc @LiuXiaoxuanPKU
Feel free to contact me anytime if there are any changes or additions you would like!
One other idea you should consider is using a multi-LoRA draft model.
Brilliant! The design philosophy of multi-proposers is similar to that of multiple LoRA support. Also, the choice should not be set through sampling_params, but should be left to the service provider to schedule autonomously in the generate() function, just as with LoRA.
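To make the idea concrete, here is a rough sketch of what the API could look like. The ProposerRequest type and the proposer_request argument are purely hypothetical; they only mirror how LoRARequest is already attached to generate() today.

```python
# Hypothetical sketch only: proposer_request / ProposerRequest do not exist in
# vLLM. They mirror the existing lora_request argument of LLM.generate().
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
params = SamplingParams(temperature=0.0, max_tokens=128)

# Existing pattern: the serving layer (not the end user) attaches a LoRARequest.
outputs = llm.generate(
    "What is speculative decoding?",
    params,
    lora_request=LoRARequest("sql-adapter", 1, "/path/to/adapter"),
)

# Proposed analogue (hypothetical): the serving layer attaches a proposer
# choice per request, selected by its own scheduling policy rather than by
# the user through SamplingParams.
# outputs = llm.generate(prompt, params,
#                        proposer_request=ProposerRequest(name="ngram"))
```

Keeping the knob at the generate() level would let the provider swap proposers per request without exposing the choice in the user-facing sampling parameters.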
Although SpecDecodeWorker does not support LoRA at this stage, I will keep the combination of spec decode and LoRA in mind and advance it step by step :)
Sounds good. Btw, I don't think we should let users decide the spec method, as it gives them too much flexibility to impact other users -- it should be set by the service provider.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
🚀 The feature, motivation and pitch
Speculative decoding has demonstrated significant potential in efficiently generating proposals and utilizing idle computing power to expedite the auto-regressive decoding process, particularly under lightweight workloads. Thanks to the remarkable work by @cadedaniel, we have verified the latency benefits brought by speculative decoding on the latest version of vLLM.
We have observed the following points that we believe could further enhance the utility of speculative decoding:
Ngram Proposer: While the 'Ngram' proposer can offer a 2x to 3x performance improvement in Retrieval-Augmented Generation (RAG) scenarios, its performance diminishes when the RAG module retrieves no relevant data for a query.
Draft-Model-Based Proposers: In contrast, draft-model-based proposers have exhibited higher acceptance rates when the RAG module retrieves no relevant data or the task is more creative. However, this type of implementation is not yet fully optimized (#4630, #5561), so the current performance gains are limited. We sincerely thank the open-source community for their efforts and hope this work continues to progress smoothly.
Creative Tasks with High Temperature: We have noticed that both proposer methods perform worse than the non-speculative implementation on creative tasks characterized by a high temperature or a large top_k. Perhaps speculative decoding should be disabled in this circumstance.
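For reference, this is roughly how the two proposer types are selected today (flag names follow the current vLLM speculative decoding docs and may change across versions). Note that the choice is fixed per engine at startup, which is exactly the limitation behind these observations.

```python
from vllm import LLM

# Ngram proposer: proposals are looked up in the prompt itself, which is why
# it works well for RAG prompts that already contain the answer text.
ngram_llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
    use_v2_block_manager=True,
)

# Draft-model proposer: a small model generates the proposals, which tends to
# hold up better when no relevant context is retrieved or the task is creative.
draft_llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)
```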
Apart from these observations, we were particularly interested in your latest work on scheduling the speculation length for different workload scenarios (#5886), Optimizing Speculative Decoding for Serving Large Language Models Using Goodput.
This led us to wonder whether vLLM could be enhanced to support multiple proposers and provide the flexibility to schedule them appropriately. Alternatively, enabling users to specify the proposer for different requests via SamplingParams could also be a viable solution.
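As a strawman for the scheduling side, the service provider could pick a proposer per request based on the observations above. The function below is purely illustrative and not part of vLLM, and the thresholds are placeholders.

```python
from typing import Optional

from vllm import SamplingParams


def choose_proposer(params: SamplingParams, has_retrieved_context: bool) -> Optional[str]:
    """Illustrative per-request proposer selection: returns "ngram",
    "draft-model", or None to disable speculation entirely."""
    # Creative tasks with high temperature and a large/unlimited top_k show
    # lower acceptance rates, so fall back to plain auto-regressive decoding.
    # (In vLLM, top_k == -1 means no truncation.)
    if params.temperature > 0.8 and (params.top_k == -1 or params.top_k > 50):
        return None
    # Ngram lookup shines when the RAG module retrieved text that overlaps
    # with the expected answer.
    if has_retrieved_context:
        return "ngram"
    # Otherwise a draft model usually gives a higher acceptance rate.
    return "draft-model"
```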
We believe this enhancement could unlock greater potential and adaptivity for vLLM's speculative decoding capabilities. We are working on an internal fork to verify whether we can achieve a higher goodput.
Thanks! Feel free to leave a message to let us know what you think.
Alternatives
No response
Additional context
No response