Closed — pacman100 closed this 2 months ago
Thank you @pacman100! Lookup decoding and speculative decoding are important methods, and we are already aware of them. We plan to bring them to our serving project as a first step (under active development in the serving branch). We will then follow up with support for these algorithms in the current chat app.
Thank you @MasterJH5574 for the details, looking forward to it.
Let me know if I can contribute/help.
Any updates on this?
We have incorporated speculative decoding into MLC Engine.
🚀 Feature
Hello, thank you for all the great work, I truly like this project! ✨
It would be great to incorporate Prompt Lookup Decoding to speed up autoregressive decoding in LLMs. The project is here: https://github.com/apoorvumang/prompt-lookup-decoding.
Motivation
For input-grounded tasks such as code infilling/editing, summarization, document QA, and multi-turn chat, there is high n-gram overlap between the LLM input (prompt) and the LLM output. This could be entity names, phrases, or code chunks that the LLM directly copies from the input while generating the output. Prompt lookup exploits this pattern to speed up autoregressive decoding in LLMs.
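The core idea is simple: instead of running a draft model, search the prompt (and tokens generated so far) for an earlier occurrence of the most recent n-gram, and propose the tokens that followed it as draft candidates for the target model to verify in a single forward pass. A minimal sketch of the candidate search, assuming token IDs as a plain Python list (function name and defaults are illustrative, not taken from the linked repo):

```python
def find_candidate_tokens(input_ids, ngram_size=3, num_pred_tokens=10):
    """Propose draft tokens by matching the trailing n-gram against
    an earlier occurrence in the sequence (prompt lookup decoding)."""
    if len(input_ids) < ngram_size:
        return []
    ngram = input_ids[-ngram_size:]
    # Scan backwards for the most recent earlier occurrence of the n-gram,
    # excluding the trailing position itself.
    for start in range(len(input_ids) - ngram_size - 1, -1, -1):
        if input_ids[start:start + ngram_size] == ngram:
            end = start + ngram_size
            # Return up to num_pred_tokens tokens that followed the match.
            return input_ids[end:end + num_pred_tokens]
    return []  # no match: fall back to ordinary autoregressive decoding
```

The proposed tokens are then checked against the target model's own predictions in one batched forward pass; only the agreeing prefix is accepted, so output quality is unchanged.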
This would be a great addition to mlc-llm, and the whole algorithm is quite simple, as shown in https://github.com/apoorvumang/prompt-lookup-decoding.
Alternatives
Speculative decoding using a smaller draft model. This would also be a great feature addition, but it introduces the overhead of running a second, smaller model alongside the target model.
Additional context
I use mlc-llm to locally host a fine-tuned StarCoder model to assist me with code-infilling tasks in VS Code. However, currently only StarCoder-1B meets the latency requirements on a Mac M1 Pro. It would be great to be able to use the CodeLlama 7B model with prompt lookup decoding for better quality while still meeting the latency requirements.