vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Integrate Writing in the Margins inference pattern ($5,000 Bounty) #9807

Open melisa-writer opened 4 weeks ago

melisa-writer commented 4 weeks ago

🚀 The feature, motivation and pitch

Writer has introduced the "Writing in the Margins" (WiM) algorithm, which boosts results for long-context-window retrieval. The task is composed of a "context" and a "query" that is placed at the end.

The basic idea is to generate additional text while doing chunked prefill. The extra decoding step does not contribute to the KV-cache prefilling. The generated text is later concatenated and added to the final chunk.

There exists a pure HuggingFace transformers implementation: https://github.com/writer/writing-in-the-margins
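
Roughly, the pattern looks like this (a minimal sketch, not the reference implementation; it uses the Hugging Face cache-reuse pattern, with `gpt2` as a placeholder model and `margin_prompt` as a placeholder instruction; see the linked repo for the real margin extraction logic):

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; the paper targets long-context instruct models
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def wim(context_chunks, query, margin_prompt, margin_tokens=32, answer_tokens=64):
    past = None        # shared KV cache, extended chunk by chunk (chunked prefill)
    cached_ids = None  # token ids the shared cache already covers
    margins = []

    for chunk in context_chunks:
        ids = tok(chunk, return_tensors="pt", add_special_tokens=False).input_ids
        # Chunked prefill: extend the shared KV cache with this chunk only.
        past = model(ids, past_key_values=past, use_cache=True).past_key_values
        cached_ids = ids if cached_ids is None else torch.cat([cached_ids, ids], dim=-1)

        # Margin decode from a throwaway copy of the cache, so the extra
        # decoding step does not contribute to the shared KV-cache prefill.
        extra = tok(margin_prompt + query, return_tensors="pt",
                    add_special_tokens=False).input_ids
        full = torch.cat([cached_ids, extra], dim=-1)
        gen = model.generate(full, past_key_values=copy.deepcopy(past),
                             max_new_tokens=margin_tokens, do_sample=False)
        margins.append(tok.decode(gen[0, full.shape[1]:], skip_special_tokens=True))

    # Final step: the concatenated margins plus the query ride on the shared cache.
    tail = tok("\n".join(margins) + "\n" + query, return_tensors="pt",
               add_special_tokens=False).input_ids
    full = torch.cat([cached_ids, tail], dim=-1)
    gen = model.generate(full, past_key_values=past,
                         max_new_tokens=answer_tokens, do_sample=False)
    return tok.decode(gen[0, full.shape[1]:], skip_special_tokens=True)
```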

This is a high level overview of the inference pattern:

[screenshot]

And here is a more detailed explanation of how to do it efficiently with batched generation and prefill requests:

[screenshot]

The algorithm itself:

[screenshot]

The expected solution can be a feature added to vLLM or a vLLM fork; we are happy to maintain it. The WiM solution assumes extra input preprocessing steps (nltk sentence splitting) and a variable chunk size for chunked prefill, but those details can be left out of the solution.
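
For reference, the nltk preprocessing step could look roughly like this (a sketch only; the chunk-size heuristic is made up and the real pipeline may differ):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # newer nltk versions may need "punkt_tab"

def split_context(context: str, target_chars: int = 4000) -> list[str]:
    """Greedily pack sentences into chunks of roughly `target_chars` characters."""
    chunks, current = [], ""
    for sentence in sent_tokenize(context):
        if current and len(current) + len(sentence) > target_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```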

We offer a $5,000 bounty for the main contributor (the bounty can be shared if more than one developer is involved).

paper: ArXiv
press coverage:

Alternatives

No response

Additional context

Github: https://github.com/writer/writing-in-the-margins


noooop commented 3 weeks ago

That's a brilliant idea.

I think this algorithm can be implemented efficiently through the prefix cache, and vLLM supports prefix caching.

So you can implement this algorithm without modifying any vLLM code.

Specifically (a rough sketch follows after this list):

  1. Start vLLM with prefix caching enabled.
  2. Submit margin-generation requests in order.
  3. Immediately submit the next margin-generation request once the previous request returns its first token.
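
Something like this, for example (a rough sketch against vLLM's OpenAI-compatible server started with `--enable-prefix-caching`; the model name and margin prompt are placeholders, and in practice you would overlap the calls rather than running them strictly one after another):

```python
# Rough sketch only: simulate WiM margins with automatic prefix caching.
# Server assumed to be started with:  vllm serve <model> --enable-prefix-caching
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "your-model-name"  # placeholder

def wim_margins(chunks: list[str], query: str) -> list[str]:
    margins = []
    prefix = ""
    for chunk in chunks:
        prefix += chunk
        # Each request shares the accumulated context as a prompt prefix, so
        # vLLM's automatic prefix caching reuses the KV blocks computed for the
        # previous request and only prefills the newly appended chunk.
        resp = client.completions.create(
            model=MODEL,
            prompt=f"{prefix}\n\nNote the information relevant to: {query}\n",
            max_tokens=128,
        )
        margins.append(resp.choices[0].text)
    return margins
```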

How do I get $5,000? (joke)

melisa-writer commented 3 weeks ago

Thank you!

Indeed, Automatic Prefix Caching could be used to simulate the WiM algorithm. However, there are a few issues with this:

  1. Different tokenization:

The real issue is the different tokenization you get by sending the text "A gentle breeze stirred" and then "the leaves as children", versus sending "A gentle breeze stirred the leaves as children" in one go. To really apply WiM by exploiting Prefix Caching, you need to send multiple requests trimmed at exactly the points where you want the tokenizer to break the text. But that means sending multiple requests to vLLM (which is what you do when you "simulate it"), which takes much longer to process than prefill-generate-prefill-generate done within a single request. A small tokenizer example after this list illustrates the boundary effect.

  2. High-workload KV-cache eviction:

In a high-workload scenario the KV cache would be evicted, and on a cache miss it would have to be recomputed. But is that a problem? It may be: the whole point of WiM is to reuse the partially prefilled KV cache (or in other terms, the KV prefixes).

  3. Storing customer data: Automatic Prefix Caching means user data is stored in RAM. We would rather turn that feature off due to compliance issues.
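
To make point 1 concrete, here is a small example (token ids are model-dependent; `gpt2` is just a convenient stand-in tokenizer):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer shows the effect

whole = tok.encode("A gentle breeze stirred the leaves as children")
parts = tok.encode("A gentle breeze stirred") + tok.encode("the leaves as children")

print(whole)
print(parts)
# The two sequences typically differ around the split point (e.g. " the" vs
# "the"), so KV blocks computed for the first part do not line up with the
# prefix of the full prompt, which breaks exact prefix reuse.
```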

To summarise: good point about Automatic Prefix Caching, it can be used for prototyping. We still need something different for the production use case.

noooop commented 3 weeks ago

Using prompt_token_ids as input can bypass the tokenizer.
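
For example (a minimal sketch assuming vLLM's `TokensPrompt` input type; the model name is a placeholder and the exact import path may vary between vLLM versions):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

MODEL = "your-model-name"  # placeholder
tok = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, enable_prefix_caching=True)

# Tokenize the full text once, then slice the ids at the desired chunk
# boundaries, so every request sees exactly the same token sequence as the
# final full prompt and the prefix cache can match it.
full_ids = tok.encode("A gentle breeze stirred the leaves as children")
chunk_ids = full_ids[: len(full_ids) // 2]

out = llm.generate(
    TokensPrompt(prompt_token_ids=chunk_ids),
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```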

The previous margin generation is still in progress, so its KV cache is still on the GPU. You cannot miss it.

Prefix Caching is almost a perfect solution.