ouusan opened this issue 1 month ago
paper: https://arxiv.org/pdf/2410.05258
Diff attention implementation: https://github.com/microsoft/unilm/blob/master/Diff-Transformer/multihead_diffattn.py#L99-L117
Diff attention with FlashAttention-1/2 for different QKV dims: https://github.com/microsoft/unilm/blob/master/Diff-Transformer/multihead_flashdiff_1.py#L91
Related repositories:
1. https://github.com/facebookresearch/xformers
2. https://github.com/Dao-AILab/flash-attention
3. https://aka.ms/flash-diff
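For reference, the core of the `multihead_diffattn.py` computation linked above (omitting the multi-head split, causal masking, and the per-head GroupNorm) can be sketched in plain numpy. The function and argument names here are illustrative, not the repo's:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(q1, k1, q2, k2, v, lam):
    # Single-head differential attention from the paper: the
    # difference of two softmax attention maps, weighted by lambda,
    # applied to a shared value projection.
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (a1 - lam * a2) @ v
```

With `lam = 0` this reduces to standard softmax attention over the first Q/K pair, which is a useful sanity check when porting the subtraction into a FlashAttention kernel that only exposes the fused map.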
Background reading:
- pre-RMSNorm: https://blog.csdn.net/qq_39970492/article/details/131125752
- SwiGLU: https://zhuanlan.zhihu.com/p/650237644
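Since the issue also links pre-RMSNorm and SwiGLU write-ups, here is a minimal numpy sketch of both building blocks as they are commonly used in such layers (RMSNorm applied before the sublayer, SwiGLU as the FFN). Shapes and names are assumptions for illustration, not the repo's API:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale features by their root-mean-square; unlike
    # LayerNorm there is no mean subtraction and no bias term.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: SiLU-gated linear unit, then down-projection.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```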