vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: API control over speculative decoding and prefix caching #7569

Open pseudotensor opened 2 months ago

pseudotensor commented 2 months ago

🚀 The feature, motivation and pitch

Speculative decoding can currently only be enabled at engine startup, but it incurs a loss compared to normal decoding for some workloads. It would therefore be useful to be able to control it at runtime, per request via the API; a hypothetical sketch of what that could look like follows.
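For illustration only, such control might be exposed as extra per-request fields on the OpenAI-compatible endpoint. The `use_speculative_decoding` and `use_prefix_caching` fields below are hypothetical names for the requested feature, not an existing vLLM API:

```python
# Hypothetical sketch of per-request control. These extra_body fields do NOT
# exist in vLLM today; they only illustrate what the requested API could look like.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "use_speculative_decoding": False,  # hypothetical per-request switch
        "use_prefix_caching": True,         # hypothetical per-request switch
    },
)
print(resp.choices[0].message.content)
```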

Alternatives

Restart vLLM for each mode and test each case separately. This is very challenging for large models such as a 405B model and leads to downtime.
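For reference, these features are currently fixed when the engine is constructed. A minimal sketch of the status quo, assuming the offline `LLM` entry point and the `speculative_model` / `num_speculative_tokens` / `enable_prefix_caching` engine arguments available around this vLLM release (model names are placeholders):

```python
# Minimal sketch: speculative decoding and prefix caching are chosen once at
# engine construction time and apply to every request served by this engine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # draft model
    num_speculative_tokens=5,
    enable_prefix_caching=True,
)

# Changing either setting requires tearing down and restarting the engine,
# which is the costly "restart for each mode" alternative described above.
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
```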

Additional context

No response

NickLucche commented 3 weeks ago

Related to this broader feature proposal https://github.com/vllm-project/vllm/issues/4565