vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: API control over speculative decoding and prefix caching #7569

Open pseudotensor opened 2 months ago

pseudotensor commented 2 months ago

🚀 The feature, motivation and pitch

Speculative decoding can currently only be enabled at engine startup, but it incurs a loss compared to normal decoding for some workloads. It would therefore be useful to be able to control it at runtime, per request via the API; a hypothetical sketch of what that could look like follows.
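For illustration only, such control might be exposed as extra per-request fields on the OpenAI-compatible endpoint. The `use_speculative_decoding` and `use_prefix_caching` fields below are hypothetical names for the requested feature, not an existing vLLM API:

```python
# Hypothetical sketch of per-request control. These extra_body fields do NOT
# exist in vLLM today; they only illustrate what the requested API could look like.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "use_speculative_decoding": False,  # hypothetical per-request switch
        "use_prefix_caching": True,         # hypothetical per-request switch
    },
)
print(resp.choices[0].message.content)
```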

Alternatives

Restart vLLM for each mode and test each case separately. This is very challenging for large models such as a 405B model and leads to downtime.
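For reference, these features are currently fixed when the engine is constructed. A minimal sketch of the status quo, assuming the offline `LLM` entry point and the `speculative_model` / `num_speculative_tokens` / `enable_prefix_caching` engine arguments available around this vLLM release (model names are placeholders):

```python
# Minimal sketch: speculative decoding and prefix caching are chosen once at
# engine construction time and apply to every request served by this engine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # draft model
    num_speculative_tokens=5,
    enable_prefix_caching=True,
)

# Changing either setting requires tearing down and restarting the engine,
# which is the costly "restart for each mode" alternative described above.
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
```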

Additional context

No response

NickLucche commented 3 weeks ago

Related to this broader feature proposal https://github.com/vllm-project/vllm/issues/4565