vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding #3398

Open · tchaton opened this issue 3 months ago

tchaton commented 3 months ago

This paper might be of interest: "Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding" (https://arxiv.org/pdf/2402.12374.pdf).
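
For context, Sequoia builds on standard speculative decoding: a small draft model proposes several tokens ahead, and the large target model verifies them, keeping the longest accepted prefix. Sequoia's contribution is organizing the draft tokens into a tree whose shape is chosen by dynamic programming and tuned to the hardware. The snippet below is only a minimal, self-contained sketch of the basic chain-style draft-and-verify loop that the paper generalizes; the names (`speculate_step`, `draft_next`, `target_next`, `k`) are illustrative and do not correspond to any vLLM API.

```python
# Minimal sketch of chain-style speculative decoding (draft-then-verify).
# The "models" are plain callables over toy integer tokens; this does not
# reflect vLLM internals or Sequoia's tree-structured algorithm.

from typing import Callable, List

Token = int


def speculate_step(
    draft_next: Callable[[List[Token]], Token],   # cheap draft model (greedy)
    target_next: Callable[[List[Token]], Token],  # expensive target model (greedy)
    context: List[Token],
    k: int = 4,                                   # speculation depth (a chain, not a tree)
) -> List[Token]:
    """Propose k tokens with the draft model, then keep the longest prefix the
    target model agrees with, plus one corrected token from the target."""
    # 1) Draft phase: roll the cheap model forward k steps.
    proposal: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) Verify phase: a real engine does this in ONE batched target forward
    #    pass over all k positions; here we just query position by position.
    accepted: List[Token] = []
    ctx = list(context)
    for t in proposal:
        correct = target_next(ctx)
        if correct == t:
            accepted.append(t)        # draft matched the target: keep it
            ctx.append(t)
        else:
            accepted.append(correct)  # first mismatch: take the target's token, stop
            break
    else:
        # All k drafts accepted: the target pass also yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted


if __name__ == "__main__":
    # Toy models over integer tokens: the draft agrees with the target except
    # when the context length is a multiple of 3, so some proposals get rejected.
    def target(ctx: List[Token]) -> Token:
        return (len(ctx) * 7) % 13

    def draft(ctx: List[Token]) -> Token:
        return target(ctx) if len(ctx) % 3 else (target(ctx) + 1) % 13

    seq: List[Token] = [0]
    for _ in range(5):
        seq += speculate_step(draft, target, seq, k=4)
    print(seq)
```

In a real engine the verify phase is a single batched forward pass over all proposed positions, which is where the speedup comes from; Sequoia's tree structure lets that one pass verify many alternative continuations at once.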

rkooo567 commented 3 months ago

cc @cadedaniel