melisa-writer opened this issue 4 weeks ago
That's a brilliant idea.
I think this algorithm can be implemented efficiently through prefix caching, and vLLM already supports prefix caching, so you can implement it without modifying any vLLM code.
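For prototyping, something along these lines could work on top of Automatic Prefix Caching (the model name, chunking, and prompt wording below are placeholders, not part of WiM):

```python
# Sketch: approximate WiM by resending a growing context prefix and letting
# vLLM's Automatic Prefix Caching reuse the KV blocks of that shared prefix.
# Model name, chunking, and prompt wording are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=64)

chunks = ["A gentle breeze stirred ", "the leaves as children played. "]
query = "What did the breeze stir?"

margins, prefix = [], ""
for chunk in chunks:
    prefix += chunk
    # Only the shared context prefix can get cache hits; the margin instruction
    # differs per request and is prefilled fresh each time.
    out = llm.generate([prefix + f"\nExtract what is relevant to: {query}\n"], params)
    margins.append(out[0].outputs[0].text)

final_prompt = prefix + "\n" + "\n".join(margins) + f"\nQuestion: {query}\nAnswer:"
print(llm.generate([final_prompt], params)[0].outputs[0].text)
```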
How do I get $5,000? (joke)
Thank you!
Indeed, Automatic Prefix Caching could be used to simulate the WiM algorithm. However, there are a few issues with this:

1. The real issue is the different tokenization you get by sending the text "A gentle breeze stirred" and then "the leaves as children", versus sending the full text "A gentle breeze stirred the leaves as children". To really apply WiM by exploiting Prefix Caching, you need to send requests trimmed at the exact points where you want the tokenizer to break the text (a quick way to check this is sketched after this list). But that means sending multiple requests to vLLM (which is what you do when you "simulate it"), and that takes much longer than prefill-generate-prefill-generate done within a single request.
2. In a high-workload scenario the KV-Cache would be evicted, and on a cache miss it would have to be re-created. Is that a problem? It may be: the whole point of WiM is to reuse the partially prefilled KV-Cache (in other words, KV prefixes).
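A quick way to check the boundary problem (gpt2 is used here only as an example tokenizer, not the model from the issue):

```python
# Does tokenizing the two halves separately give the same ids as tokenizing the
# whole string? A mismatch means the cached prefix blocks would not line up.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
a = "A gentle breeze stirred"
b = " the leaves as children"

ids_split = tok(a, add_special_tokens=False).input_ids + tok(b, add_special_tokens=False).input_ids
ids_whole = tok(a + b, add_special_tokens=False).input_ids

print(ids_split == ids_whole)
print(ids_split)
print(ids_whole)
```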
To summarise: good point about Automatic Prefix Caching, it can be used for prototyping, but we still need something different for a production use case.
Using prompt_token_ids as input can bypass the tokenizer (see the sketch below).
The previous margin generation is still in progress, so the KV-Cache is still on the GPU; you won't get a cache miss.
Prefix Caching is almost a perfect solution.
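Something like this, for example (placeholder model; depending on the vLLM version, TokensPrompt may need to be imported from vllm.inputs instead):

```python
# Sketch: bypass the server-side tokenizer by sending token ids directly, so the
# chunk boundaries (and therefore the cached prefixes) stay fully under our control.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams, TokensPrompt

model_name = "facebook/opt-125m"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, enable_prefix_caching=True)

# Tokenize the full context once and cut it at token boundaries of our choosing,
# so every request shares an identical prefix of token ids.
context_ids = tok("A gentle breeze stirred the leaves as children played.",
                  add_special_tokens=False).input_ids
margin_ids = tok("\nKey facts so far:", add_special_tokens=False).input_ids

out = llm.generate(TokensPrompt(prompt_token_ids=context_ids[:8] + margin_ids),
                   SamplingParams(temperature=0.0, max_tokens=32))
print(out[0].outputs[0].text)
```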
🚀 The feature, motivation and pitch
Writer has introduced the "Writing in the Margins" (WiM) algorithm, which boosts results for long-context-window retrieval. The task input is composed of a "context" and a "query" that is placed at the end.
The basic idea is to generate additional text while doing chunked prefill. The extra decoding step does not contribute to the KV-cache prefilling; the generated text is later concatenated and added to the final chunk.
There exists a pure HuggingFace transformers implementation: https://github.com/writer/writing-in-the-margins
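For intuition only, here is a rough sketch of the prefill-generate-prefill-generate loop using plain HuggingFace KV-cache reuse. This is not the reference implementation above; it needs a recent transformers version, and the model, prompts, and chunking are placeholders:

```python
# Rough sketch of the WiM-style loop: chunked prefill into a shared KV-cache,
# margin generation on a copy of that cache, then a final answer on top of it.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def enc(text):
    return tok(text, return_tensors="pt", add_special_tokens=False).input_ids

chunks = [enc("A gentle breeze stirred "), enc("the leaves as children played. ")]
query = "What did the breeze stir?"

cache = DynamicCache()
context_ids = torch.empty((1, 0), dtype=torch.long)
margins = []
for chunk_ids in chunks:
    # Chunked prefill: extend the shared KV-cache with the next context chunk.
    with torch.no_grad():
        cache = model(chunk_ids, past_key_values=cache, use_cache=True).past_key_values
    context_ids = torch.cat([context_ids, chunk_ids], dim=1)

    # Extra decoding step: generate a "margin" on a *copy* of the cache, so the
    # margin tokens never enter the prefilled context cache.
    margin_input = torch.cat([context_ids, enc(f"\nRelevant to '{query}': ")], dim=1)
    out = model.generate(margin_input, past_key_values=copy.deepcopy(cache),
                         max_new_tokens=30, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    margins.append(tok.decode(out[0, margin_input.shape[1]:], skip_special_tokens=True))

# Final chunk: concatenate the margins with the query and answer on top of the
# same, still-prefilled context cache.
final_input = torch.cat(
    [context_ids, enc("\n" + "\n".join(margins) + f"\nQuestion: {query}\nAnswer:")], dim=1)
out = model.generate(final_input, past_key_values=cache, max_new_tokens=30,
                     do_sample=False, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, final_input.shape[1]:], skip_special_tokens=True))
```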
This is a high level overview of the inference pattern:
And this is a more detailed explanation of how to do it efficiently by batching the margin-generation and prefill requests:
The algorithm itself:
The expected solution can be a feature added to vLLM or a vLLM fork; we are happy to maintain it. The WiM solution assumes extra input-preprocessing steps (nltk splitting) and a variable chunk size for chunked prefill, but those details can be left out of the solution.
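Since the issue mentions nltk splitting, one possible shape for that preprocessing step (the chunk-size budget is an arbitrary placeholder):

```python
# Sketch of the preprocessing mentioned above: split the long context into
# sentences with nltk, then pack them into variable-size chunks for chunked prefill.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # newer NLTK releases may also need "punkt_tab"

def split_into_chunks(text: str, max_chars: int = 2000) -> list[str]:
    chunks, current = [], ""
    for sentence in sent_tokenize(text):
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```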
We offer a $5,000 bounty for the main contributor (but the bounty can be shared if there is more than one developer involved).
Paper: ArXiv
Press coverage:
Alternatives
No response
Additional context
Github: https://github.com/writer/writing-in-the-margins