Thanks for your interest - we plan to implement this soon
Thanks! Another quick question: is there anywhere I can directly use plain flash linear attention with Triton, without the forget gate and the chunkwise form?
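For concreteness, this is roughly what I mean by "plain" linear attention: the recurrent form with no forget gate and no chunking. A minimal pure-PyTorch sketch (my own illustration, not code from this repo):

```python
import torch

def plain_linear_attn_recurrent(q, k, v):
    """Plain (ungated) causal linear attention in recurrent form:
    S_t = S_{t-1} + k_t^T v_t,   o_t = q_t S_t.
    No forget gate, no chunking; pure PyTorch, just to illustrate the question.
    q, k, v: (batch, heads, seq_len, head_dim)
    """
    B, H, T, D = q.shape
    S = q.new_zeros(B, H, D, D)                        # running state: sum of k_t^T v_t
    outputs = []
    for t in range(T):
        kt, vt = k[:, :, t], v[:, :, t]                # (B, H, D) each
        S = S + kt.unsqueeze(-1) * vt.unsqueeze(-2)    # outer product k_t v_t^T
        outputs.append(torch.einsum('bhd,bhde->bhe', q[:, :, t], S))
    return torch.stack(outputs, dim=2)                 # (B, H, T, D)
```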
Thanks so much for your quick response!
In this case, there are only the chunk, fused_chunk, and recurrent modes, right? In the figure below (from the GLA paper), there is a green line that does not use chunkwise parallelism at all. I thought that was different from the "recurrent" mode defined in https://github.com/sustcsonglin/flash-linear-attention/blob/main/fla/layers/linear_attn.py, right?
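To show how I read that figure, here is a sketch of the fully parallel (non-chunkwise) form, using standard linear-attention notation; again, this is just my own illustration, not a kernel from this repo:

```python
import torch

def parallel_linear_attn(q, k, v):
    """Fully parallel (non-chunkwise) causal linear attention:
    O = (Q K^T ⊙ M) V with a lower-triangular mask M and no softmax.
    Quadratic in seq_len: how I understand the non-chunkwise "parallel" line,
    as opposed to the O(T) recurrent mode.
    q, k, v: (batch, heads, seq_len, head_dim)
    """
    T = q.shape[2]
    scores = q @ k.transpose(-1, -2)                                   # (B, H, T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    return scores.masked_fill(~mask, 0.0) @ v                          # (B, H, T, D)
```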
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.
> Thanks for your interest - we plan to implement this soon
Really looking forward to it! Any updates? Thanks! 😄
Great work!
It appears that both GLA and RetNet are optimized only for causal cases. Is there an optimized linear attention for non-causal scenarios?
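For context, by the non-causal case I mean the setting where the matmuls can simply be reassociated; a minimal sketch (my own illustration, not this repo's API):

```python
import torch

def noncausal_linear_attn(q, k, v):
    """Non-causal (bidirectional) linear attention.
    With no causal mask, (Q K^T) V can be reassociated as Q (K^T V),
    which costs O(T * d^2) and needs no recurrent or chunkwise scan.
    q, k, v: (batch, heads, seq_len, head_dim)
    """
    kv = k.transpose(-1, -2) @ v    # (B, H, D, D)
    return q @ kv                   # (B, H, T, D)
```

Since there is no per-step causal dependency here, this is already just two matmuls; I am mainly asking whether an optimized variant beyond this plain reassociation is available or planned.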