[Feature]: Add support for interchangeable radix attention

vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Apache License 2.0


🚀 The feature, motivation and pitch

I am currently working on adapting radix attention. Thank you for supporting it. At the moment, caching a prefix A allows for more efficient generation of A+B. However, in some tree-of-thoughts settings, we are also interested in caching A+B, A+C, and A+D, and then reusing those caches to generate A+B+C+D more efficiently. I think this feature could be developed with some adjustments to the current `enable_prefix_caching` functionality. I would also really appreciate it if you could share some insights on how to implement this on top of the current prefix caching implementation. Thank you very much for the great work!
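To make the request concrete, here is a minimal sketch of the gap, assuming a toy token-level prefix trie. The `PrefixCache` class and the token IDs are hypothetical and only for illustration; this is not vLLM's actual data structure or API. It shows that after caching A+B, A+C, and A+D, a request for A+B+C+D can only reuse the A+B part under plain prefix matching.

```python
from typing import Dict, List


class PrefixCache:
    """Toy prefix trie keyed by token ID (hypothetical, not vLLM's implementation)."""

    def __init__(self) -> None:
        self.root: Dict[int, dict] = {}

    def insert(self, tokens: List[int]) -> None:
        # Walk the trie, creating a child node for each token not seen yet.
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        # Count how many leading tokens of the request are already cached.
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


# Made-up token IDs standing in for the segments A, B, C, D.
A, B, C, D = [1, 2, 3], [10, 11], [20, 21], [30, 31]

cache = PrefixCache()
for branch in (B, C, D):
    cache.insert(A + branch)  # cache A+B, A+C, A+D

hit_ab = cache.longest_cached_prefix(A + B)
hit_abcd = cache.longest_cached_prefix(A + B + C + D)
recomputed = len(A + B + C + D) - hit_abcd

print(f"A+B hit: {hit_ab} tokens")
print(f"A+B+C+D hit: {hit_abcd} tokens, recomputed: {recomputed}")
```

In this sketch the entries for C and D exist in the cache, but only as continuations of A, so they are not matched when they follow A+B. Reusing them there is the "interchangeable" behavior this issue asks about.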

Alternatives

No response

Additional context

No response
