vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Long context window - Ring Attention with Blockwise Transformers for Near-Infinite Context #3573

Open chizhang118 opened 6 months ago

chizhang118 commented 6 months ago

This paper might be of interest: https://arxiv.org/pdf/2310.01889.pdf

This paper proposes Ring Attention with Blockwise Transformers, which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. This method handles long context windows by sharding the sequence across multiple devices and, in extreme cases, could support context windows of up to 16M tokens.
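
For illustration only, here is a minimal single-process NumPy sketch of the ring schedule (not vLLM's implementation and not the paper's distributed code): each "device" holds one query block plus a running online-softmax accumulator, and the key/value blocks rotate around the ring until every device has seen every block. In a real multi-device setup, each KV transfer would overlap with the blockwise attention compute.

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulated ring attention: KV blocks rotate around the ring while each
    device accumulates its output with a numerically stable online softmax,
    so no device ever materializes the full attention matrix."""
    n_dev = len(q_blocks)
    scale = 1.0 / np.sqrt(q_blocks[0].shape[-1])

    # Per-device running state for the online softmax.
    out = [np.zeros_like(q) for q in q_blocks]                 # weighted value sums
    row_max = [np.full(q.shape[0], -np.inf) for q in q_blocks]
    row_sum = [np.zeros(q.shape[0]) for q in q_blocks]

    k_cur, v_cur = list(k_blocks), list(v_blocks)
    for _ in range(n_dev):
        for i in range(n_dev):                                 # each device, in parallel
            scores = q_blocks[i] @ k_cur[i].T * scale          # only this block's scores
            new_max = np.maximum(row_max[i], scores.max(axis=-1))
            correction = np.exp(row_max[i] - new_max)          # rescale prior accumulator
            p = np.exp(scores - new_max[:, None])
            row_sum[i] = row_sum[i] * correction + p.sum(axis=-1)
            out[i] = out[i] * correction[:, None] + p @ v_cur[i]
            row_max[i] = new_max
        # "Send" each KV block to the next device in the ring; on real hardware
        # this transfer overlaps with the block computation above.
        k_cur = k_cur[-1:] + k_cur[:-1]
        v_cur = v_cur[-1:] + v_cur[:-1]

    return [o / s[:, None] for o, s in zip(out, row_sum)]

# Sanity check against full (non-blockwise) attention.
rng = np.random.default_rng(0)
n_dev, blk, d = 4, 8, 16
q = rng.normal(size=(n_dev * blk, d))
k = rng.normal(size=(n_dev * blk, d))
v = rng.normal(size=(n_dev * blk, d))
ref_scores = q @ k.T / np.sqrt(d)
ref = np.exp(ref_scores - ref_scores.max(-1, keepdims=True))
ref = (ref / ref.sum(-1, keepdims=True)) @ v
ring = np.concatenate(ring_attention(np.split(q, n_dev), np.split(k, n_dev), np.split(v, n_dev)))
assert np.allclose(ref, ring, atol=1e-6)
```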

@simon-mo Is this a feature you'd like to see implemented?

hassan-twelvelabs commented 2 months ago

Hello. Is this feature being actively worked on? Thanks.

chizhang118 commented 2 months ago

> Hello. Is this feature being actively worked on? Thanks.

Not at the moment, no.