sustcsonglin/flash-linear-attention

Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
MIT License

[DRAFT] Beta gradient does not match #43

Closed · hypnopump closed this issue 1 month ago

hypnopump commented 1 month ago