sustcsonglin / flash-linear-attention

Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
MIT License

Quick question: Is there a non-causal optimized form of Flash Linear Attention? #31

Closed — yzeng58 closed this 2 months ago

yzeng58 commented 4 months ago

Great work!

It appears that both GLA and RetNet are optimized only for the causal case. Is there an optimized linear attention implementation for non-causal scenarios?
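For context, the non-causal form I have in mind needs no chunking at all, since the whole sum over keys/values can be computed once. A minimal sketch in plain PyTorch (the `1 + elu` feature map and the shapes are my own assumptions, not code from this repo):

```python
import torch

def noncausal_linear_attention(q, k, v, eps=1e-6):
    """Non-causal (bidirectional) linear attention, O(N * d^2).

    q, k, v: (batch, heads, seq_len, head_dim). The 1 + elu feature map
    is just one common choice; it is an assumption here, not this repo's.
    """
    q = torch.nn.functional.elu(q) + 1
    k = torch.nn.functional.elu(k) + 1
    kv = torch.einsum('bhnd,bhne->bhde', k, v)   # sum over all positions at once
    z = k.sum(dim=2)                             # normalizer, shape (b, h, d)
    out = torch.einsum('bhnd,bhde->bhne', q, kv)
    out = out / (torch.einsum('bhnd,bhd->bhn', q, z).unsqueeze(-1) + eps)
    return out
```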

sustcsonglin commented 4 months ago

Thanks for your interest - we plan to implement this soon

yzeng58 commented 4 months ago

Thanks! Another quick question: is there anywhere I can directly use plain flash linear attention with Triton, without the forget gate or the chunkwise form?

sustcsonglin commented 4 months ago

https://github.com/sustcsonglin/flash-linear-attention/blob/main/fla/layers/linear_attn.py
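A minimal usage sketch of that layer, assuming it exposes `d_model`, `num_heads`, and a `mode` argument; the actual constructor signature and return convention in linear_attn.py may differ, so please check the file before use:

```python
import torch
from fla.layers.linear_attn import LinearAttention

# Argument names below are assumptions based on the file linked above;
# verify them against the LinearAttention constructor in linear_attn.py.
attn = LinearAttention(d_model=1024, num_heads=8, mode='fused_chunk')
attn = attn.to('cuda', torch.bfloat16)

x = torch.randn(2, 512, 1024, device='cuda', dtype=torch.bfloat16)
y = attn(x)  # plain (ungated) causal linear attention over x
```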

yzeng58 commented 4 months ago

Thanks so much for your quick response!

yzeng58 commented 4 months ago

In this case, there are only the chunk, fused_chunk, and recurrent modes, right? In the figure below (from the GLA paper), there is a green line that does not use chunkwise parallelism at all. I think that is different from the "recurrent" mode defined in https://github.com/sustcsonglin/flash-linear-attention/blob/main/fla/layers/linear_attn.py, right?

[image: figure from the GLA paper]
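To make the question concrete: if the green line refers to the fully parallel (non-chunked) form, here is my own rough plain-PyTorch reference for the three computation patterns of ungated causal linear attention (not the repo's kernels; normalization omitted). All three produce the same output, they just trade off memory and parallelism differently:

```python
import torch

def parallel_form(q, k, v):
    # Fully parallel form: materialize the full N x N score matrix with a causal mask, O(N^2).
    n = q.shape[-2]
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    return ((q @ k.transpose(-1, -2)) * mask) @ v

def recurrent_form(q, k, v):
    # Token-by-token state update, O(N * d^2); the sequential recurrence.
    b, h, n, d = q.shape
    state = torch.zeros(b, h, d, v.shape[-1], device=q.device, dtype=q.dtype)
    outs = []
    for t in range(n):
        state = state + k[:, :, t, :, None] * v[:, :, t, None, :]
        outs.append(torch.einsum('bhd,bhde->bhe', q[:, :, t], state))
    return torch.stack(outs, dim=2)

def chunkwise_form(q, k, v, chunk_size=64):
    # Inter-chunk contribution via the running state, intra-chunk via a masked matmul.
    b, h, n, d = q.shape
    state = torch.zeros(b, h, d, v.shape[-1], device=q.device, dtype=q.dtype)
    outs = []
    for s in range(0, n, chunk_size):
        qc = q[:, :, s:s + chunk_size]
        kc = k[:, :, s:s + chunk_size]
        vc = v[:, :, s:s + chunk_size]
        c = qc.shape[-2]
        mask = torch.tril(torch.ones(c, c, dtype=torch.bool, device=q.device))
        inter = qc @ state                                   # contribution from earlier chunks
        intra = ((qc @ kc.transpose(-1, -2)) * mask) @ vc    # causal within the chunk
        outs.append(inter + intra)
        state = state + kc.transpose(-1, -2) @ vc            # update state after the chunk
    return torch.cat(outs, dim=2)
```

My understanding (an assumption on my side) is that the repo's fused_chunk / fused_recurrent modes are kernel-fused variants of the chunkwise and recurrent patterns above, whereas the green line would correspond to the fully parallel form.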

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

YicongHong commented 1 month ago

> Thanks for your interest - we plan to implement this soon

Really looking forward to it! Any updates? Thanks! 😄