This paper might be of interest: https://arxiv.org/pdf/2310.01889.pdf
The paper proposes Ring Attention with Blockwise Transformers, which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices, while fully overlapping the communication of key-value blocks with the computation of blockwise attention. The method handles long context windows by spreading the sequence across multiple devices and, in extreme cases, could support context windows of up to 16M tokens.
@simon-mo Is this a feature you'd like to see implemented?
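To make the idea concrete, here is a minimal single-host sketch of the ring pattern (not the paper's implementation, and not tied to vLLM's internals): it simulates the ring by rotating key-value blocks across a list of per-"device" query blocks and accumulating attention with an online softmax. Names and shapes are made up for illustration, causal masking is omitted, and the real win (overlapping the KV send/recv with each block's compute) is only noted in a comment since a single-host loop can't show it.

```python
import numpy as np

def ring_attention_simulated(q_blocks, k_blocks, v_blocks):
    """Single-host simulation of Ring Attention: each 'device' holds one
    query block, and key/value blocks are rotated around a ring.  Blockwise
    attention is accumulated with an online (streaming) softmax, so no device
    ever materializes attention over the full sequence.  On real hardware the
    send/recv of the next KV block would be issued asynchronously and
    overlapped with the current block's computation."""
    num_devices = len(q_blocks)
    d = q_blocks[0].shape[-1]
    outputs = []
    for i in range(num_devices):
        q = q_blocks[i]
        # Per-device accumulators for the online softmax.
        numer = np.zeros_like(q)                    # running weighted sum of values
        denom = np.zeros(q.shape[0])                # running softmax normalizer
        running_max = np.full(q.shape[0], -np.inf)  # row-wise max of scores so far
        for step in range(num_devices):
            # KV block that has rotated to this device at this step of the ring.
            j = (i + step) % num_devices
            k, v = k_blocks[j], v_blocks[j]
            scores = q @ k.T / np.sqrt(d)
            new_max = np.maximum(running_max, scores.max(axis=-1))
            # Rescale previous accumulators to the new max, then add this block.
            scale = np.exp(running_max - new_max)
            exp_scores = np.exp(scores - new_max[:, None])
            numer = numer * scale[:, None] + exp_scores @ v
            denom = denom * scale + exp_scores.sum(axis=-1)
            running_max = new_max
        outputs.append(numer / denom[:, None])
    return np.concatenate(outputs, axis=0)

# Tiny usage example: 4 "devices", block size 8, head dim 16.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((8, 16)) for _ in range(4)]
out = ring_attention_simulated(blocks, blocks, blocks)
print(out.shape)  # (32, 16)
```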