[Feature]: Support for predicted outputs #10137

Open flozi00 opened 1 week ago

flozi00 commented 1 week ago

🚀 The feature, motivation and pitch

https://platform.openai.com/docs/guides/latency-optimization#use-predicted-outputs
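For reference, the linked guide has the client send the text it expects the response to largely reuse, and the server uses it to draft tokens cheaply. A rough sketch of that request shape (model name and code are placeholders, not from this issue):

```python
# Sketch of the client-side API from the linked OpenAI guide: the expected
# output is passed via `prediction`, so only the edited parts pay full
# decode cost. Model and code snippet are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

code = """def add(a, b):
    return a + b
"""

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Rename the function to sum_two and reply with code only."},
        {"role": "user", "content": code},
    ],
    # The predicted output: tokens that match it can be accepted quickly.
    prediction={"type": "content", "content": code},
)
print(completion.choices[0].message.content)
```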

Reminds me of REST (retrieval-based speculative decoding): https://github.com/FasterDecoding/REST, https://arxiv.org/html/2311.08252v2

Alternatives

No response

Additional context

I could give it a try and implement it based on n-gram speculation.
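For what it's worth, a minimal sketch of the existing n-gram (prompt-lookup) speculative decoding setup this could build on; a predicted-outputs implementation would additionally feed the user-supplied prediction into the lookup. Argument names follow the speculative decoding docs at the time of writing and may change; model and values are examples only:

```python
# Rough sketch, not a design commitment: configure vLLM's n-gram
# (prompt-lookup) speculative decoding, which drafts tokens by matching
# n-grams in the context rather than using a separate draft model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="[ngram]",   # draft via n-gram lookup, no draft model
    num_speculative_tokens=5,      # tokens proposed per speculation step
    ngram_prompt_lookup_max=4,     # longest n-gram to match against the context
)

prompt = "Rename the function add to sum_two:\n\ndef add(a, b):\n    return a + b\n"
outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=128))
print(outputs[0].outputs[0].text)
```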


simon-mo commented 2 days ago

I could give it a try and implement it based on n-gram speculation.

That sounds great! cc @LiuXiaoxuanPKU