Closed — pacman100 closed this 2 months ago
Thank you @pacman100! Lookup decoding and speculative decoding are important methods, and we are already aware of them. We plan to bring them to our serving project as a first step (under active development in the serving branch). We will then follow up with support for these algorithms in the current chat app.
Thank you @MasterJH5574 for the details, looking forward to it.
Let me know if I can contribute/help.
Any updates on this?
We have incorporated speculative decoding into MLC Engine.
🚀 Feature
Hello, thank you for all the great work, I truly like this project! ✨
It would be great to incorporate Prompt Lookup Decoding to speed up autoregressive decoding in LLMs. The project is here: https://github.com/apoorvumang/prompt-lookup-decoding.
Motivation
For input-grounded tasks such as code infilling/editing, summarization, document QA, and multi-turn chat, there is high n-gram overlap between the LLM input (prompt) and the LLM output. This could be entity names, phrases, or code chunks that the LLM directly copies from the input while generating the output. Prompt lookup exploits this pattern to speed up autoregressive decoding in LLMs.
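The core idea is simple: instead of running a draft model, search the prompt (and tokens generated so far) for an earlier occurrence of the most recent n-gram, and propose the tokens that followed it as draft candidates for the target model to verify in a single forward pass. A minimal sketch of the candidate search, assuming token IDs as a plain Python list (function name and defaults are illustrative, not taken from the linked repo):

```python
def find_candidate_tokens(input_ids, ngram_size=3, num_pred_tokens=10):
    """Propose draft tokens by matching the trailing n-gram against
    an earlier occurrence in the sequence (prompt lookup decoding)."""
    if len(input_ids) < ngram_size:
        return []
    ngram = input_ids[-ngram_size:]
    # Scan backwards for the most recent earlier occurrence of the n-gram,
    # excluding the trailing position itself.
    for start in range(len(input_ids) - ngram_size - 1, -1, -1):
        if input_ids[start:start + ngram_size] == ngram:
            end = start + ngram_size
            # Return up to num_pred_tokens tokens that followed the match.
            return input_ids[end:end + num_pred_tokens]
    return []  # no match: fall back to ordinary autoregressive decoding
```

The proposed tokens are then checked against the target model's own predictions in one batched forward pass; only the agreeing prefix is accepted, so output quality is unchanged.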
This would be a great addition to mlc-llm, and the whole algorithm is quite simple, as shown in https://github.com/apoorvumang/prompt-lookup-decoding.
Alternatives
Speculative decoding using a smaller draft model. This would also be a great feature addition, but it introduces the overhead of running a second, smaller model alongside the target model.
Additional context
I use mlc-llm to locally host a fine-tuned StarCoder model to assist me with code-infilling tasks in VS Code. However, currently only StarCoder-1B meets the latency requirements on a Mac M1 Pro. It would be great to be able to use the CodeLlama 7B model with prompt lookup decoding for better quality while still meeting the latency requirements.