This paper might be of interest: https://arxiv.org/pdf/2310.01889.pdf
The paper proposes Ring Attention with Blockwise Transformers, which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices, while fully overlapping the communication of key-value blocks with the computation of blockwise attention. The method handles long context windows by spreading the sequence across multiple devices and, in extreme cases, could support context windows of up to 16M tokens.
@simon-mo Is this a feature you'd like to see implemented?
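To make the idea concrete, here is a minimal single-host sketch of the ring pattern (not the paper's implementation, and not tied to vLLM's internals): it simulates the ring by rotating key-value blocks across a list of per-"device" query blocks and accumulating attention with an online softmax. Names and shapes are made up for illustration, causal masking is omitted, and the real win (overlapping the KV send/recv with each block's compute) is only noted in a comment since a single-host loop can't show it.

```python
import numpy as np

def ring_attention_simulated(q_blocks, k_blocks, v_blocks):
    """Single-host simulation of Ring Attention: each 'device' holds one
    query block, and key/value blocks are rotated around a ring.  Blockwise
    attention is accumulated with an online (streaming) softmax, so no device
    ever materializes attention over the full sequence.  On real hardware the
    send/recv of the next KV block would be issued asynchronously and
    overlapped with the current block's computation."""
    num_devices = len(q_blocks)
    d = q_blocks[0].shape[-1]
    outputs = []
    for i in range(num_devices):
        q = q_blocks[i]
        # Per-device accumulators for the online softmax.
        numer = np.zeros_like(q)                    # running weighted sum of values
        denom = np.zeros(q.shape[0])                # running softmax normalizer
        running_max = np.full(q.shape[0], -np.inf)  # row-wise max of scores so far
        for step in range(num_devices):
            # KV block that has rotated to this device at this step of the ring.
            j = (i + step) % num_devices
            k, v = k_blocks[j], v_blocks[j]
            scores = q @ k.T / np.sqrt(d)
            new_max = np.maximum(running_max, scores.max(axis=-1))
            # Rescale previous accumulators to the new max, then add this block.
            scale = np.exp(running_max - new_max)
            exp_scores = np.exp(scores - new_max[:, None])
            numer = numer * scale[:, None] + exp_scores @ v
            denom = denom * scale + exp_scores.sum(axis=-1)
            running_max = new_max
        outputs.append(numer / denom[:, None])
    return np.concatenate(outputs, axis=0)

# Tiny usage example: 4 "devices", block size 8, head dim 16.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((8, 16)) for _ in range(4)]
out = ring_attention_simulated(blocks, blocks, blocks)
print(out.shape)  # (32, 16)
```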