melisa-writer opened this issue 4 weeks ago
That's a brilliant idea.
I think this algorithm can be implemented efficiently through prefix caching, and vLLM already supports prefix caching, so you can implement it without modifying any vLLM code.
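For prototyping, something along these lines could work on top of Automatic Prefix Caching (the model name, chunking, and prompt wording below are placeholders, not part of WiM):

```python
# Sketch: approximate WiM by resending a growing context prefix and letting
# vLLM's Automatic Prefix Caching reuse the KV blocks of that shared prefix.
# Model name, chunking, and prompt wording are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=64)

chunks = ["A gentle breeze stirred ", "the leaves as children played. "]
query = "What did the breeze stir?"

margins, prefix = [], ""
for chunk in chunks:
    prefix += chunk
    # Only the shared context prefix can get cache hits; the margin instruction
    # differs per request and is prefilled fresh each time.
    out = llm.generate([prefix + f"\nExtract what is relevant to: {query}\n"], params)
    margins.append(out[0].outputs[0].text)

final_prompt = prefix + "\n" + "\n".join(margins) + f"\nQuestion: {query}\nAnswer:"
print(llm.generate([final_prompt], params)[0].outputs[0].text)
```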
How do I get $5,000? (joke)
Thank you!
Indeed, Automatic Prefix Caching could be used to simulate the WiM algorithm. However, there are a few issues with this:

1. The real issue is the different tokenization you get by sending the text "A gentle breeze stirred" and then "the leaves as children", versus sending the full text "A gentle breeze stirred the leaves as children". To really apply WiM by exploiting Prefix Caching, you need to send requests trimmed at the exact points where you want the tokenizer to break the text (a quick way to check this is sketched after this list). But that means sending multiple requests to vLLM (which is what you do when you "simulate it"), and that takes much longer than prefill-generate-prefill-generate done within a single request.
2. In a high-workload scenario the KV-Cache would be evicted, and on a cache miss it would have to be re-created. Is that a problem? It may be: the whole point of WiM is to reuse the partially prefilled KV-Cache (in other words, KV prefixes).
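A quick way to check the boundary problem (gpt2 is used here only as an example tokenizer, not the model from the issue):

```python
# Does tokenizing the two halves separately give the same ids as tokenizing the
# whole string? A mismatch means the cached prefix blocks would not line up.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
a = "A gentle breeze stirred"
b = " the leaves as children"

ids_split = tok(a, add_special_tokens=False).input_ids + tok(b, add_special_tokens=False).input_ids
ids_whole = tok(a + b, add_special_tokens=False).input_ids

print(ids_split == ids_whole)
print(ids_split)
print(ids_whole)
```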
To summarise: good point about Automatic Prefix Caching, it can be used for prototyping, but we still need something different for a production use case.
Using prompt_token_ids as input can bypass the tokenizer (see the sketch below).
The previous margin generation is still in progress, so the KV-Cache is still on the GPU; you won't get a cache miss.
Prefix Caching is almost a perfect solution.
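Something like this, for example (placeholder model; depending on the vLLM version, TokensPrompt may need to be imported from vllm.inputs instead):

```python
# Sketch: bypass the server-side tokenizer by sending token ids directly, so the
# chunk boundaries (and therefore the cached prefixes) stay fully under our control.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams, TokensPrompt

model_name = "facebook/opt-125m"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, enable_prefix_caching=True)

# Tokenize the full context once and cut it at token boundaries of our choosing,
# so every request shares an identical prefix of token ids.
context_ids = tok("A gentle breeze stirred the leaves as children played.",
                  add_special_tokens=False).input_ids
margin_ids = tok("\nKey facts so far:", add_special_tokens=False).input_ids

out = llm.generate(TokensPrompt(prompt_token_ids=context_ids[:8] + margin_ids),
                   SamplingParams(temperature=0.0, max_tokens=32))
print(out[0].outputs[0].text)
```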
🚀 The feature, motivation and pitch
Writer has introduced the "Writing in the Margins" (WiM) algorithm, which boosts results for long-context-window retrieval. The task input is composed of a "context" and a "query" that is placed at the end.
The basic idea is to generate additional text while doing chunked prefill. The extra decoding step does not contribute to the KV-cache prefilling; the generated text is later concatenated and added to the final chunk.
There exists a pure HuggingFace transformers implementation: https://github.com/writer/writing-in-the-margins
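For intuition only, here is a rough sketch of the prefill-generate-prefill-generate loop using plain HuggingFace KV-cache reuse. This is not the reference implementation above; it needs a recent transformers version, and the model, prompts, and chunking are placeholders:

```python
# Rough sketch of the WiM-style loop: chunked prefill into a shared KV-cache,
# margin generation on a copy of that cache, then a final answer on top of it.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def enc(text):
    return tok(text, return_tensors="pt", add_special_tokens=False).input_ids

chunks = [enc("A gentle breeze stirred "), enc("the leaves as children played. ")]
query = "What did the breeze stir?"

cache = DynamicCache()
context_ids = torch.empty((1, 0), dtype=torch.long)
margins = []
for chunk_ids in chunks:
    # Chunked prefill: extend the shared KV-cache with the next context chunk.
    with torch.no_grad():
        cache = model(chunk_ids, past_key_values=cache, use_cache=True).past_key_values
    context_ids = torch.cat([context_ids, chunk_ids], dim=1)

    # Extra decoding step: generate a "margin" on a *copy* of the cache, so the
    # margin tokens never enter the prefilled context cache.
    margin_input = torch.cat([context_ids, enc(f"\nRelevant to '{query}': ")], dim=1)
    out = model.generate(margin_input, past_key_values=copy.deepcopy(cache),
                         max_new_tokens=30, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    margins.append(tok.decode(out[0, margin_input.shape[1]:], skip_special_tokens=True))

# Final chunk: concatenate the margins with the query and answer on top of the
# same, still-prefilled context cache.
final_input = torch.cat(
    [context_ids, enc("\n" + "\n".join(margins) + f"\nQuestion: {query}\nAnswer:")], dim=1)
out = model.generate(final_input, past_key_values=cache, max_new_tokens=30,
                     do_sample=False, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, final_input.shape[1]:], skip_special_tokens=True))
```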
This is a high level overview of the inference pattern:
And this is a more detailed explanation of how to do it efficiently by batching the margin-generation and prefill requests:
The algorithm itself:
The expected solution can be a feature added to vLLM or a vLLM fork; we are happy to maintain it. The WiM solution assumes extra input-preprocessing steps (nltk splitting) and a variable chunk size for chunked prefill, but those details can be left out of the solution.
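Since the issue mentions nltk splitting, one possible shape for that preprocessing step (the chunk-size budget is an arbitrary placeholder):

```python
# Sketch of the preprocessing mentioned above: split the long context into
# sentences with nltk, then pack them into variable-size chunks for chunked prefill.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # newer NLTK releases may also need "punkt_tab"

def split_into_chunks(text: str, max_chars: int = 2000) -> list[str]:
    chunks, current = [], ""
    for sentence in sent_tokenize(text):
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```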
We offer a $5,000 bounty for the main contributor (but the bounty can be shared if there is more than one developer involved).
Paper: ArXiv
Press coverage:
Alternatives
No response
Additional context
Github: https://github.com/writer/writing-in-the-margins