Open pseudotensor opened 2 months ago
Related to this broader feature proposal https://github.com/vllm-project/vllm/issues/4565
🚀 The feature, motivation and pitch
Speculative decoding can only be enabled at startup, but it incurs a loss compared to normal decoding. It would therefore be useful to be able to control it at runtime.
Alternatives
Restart vLLM for each mode and test each case. This is very challenging for models such as 405B and leads to downtime.
Additional context
No response