📝 What does this PR do?
For cases where a materialized attention mask of shape (s_q, s_kv) would occupy 10 GB or more (i.e., s_q * s_kv * element_size >= 10 GB), dispatch only to the FlashAttentionDaoLoader kernel and pass an empty tensor as a placeholder for attention_mask. In this regime, only causal and padded causal mask types are supported.
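A minimal sketch of the dispatch logic described above. The function and constant names (`dispatch_attention_kernel`, `MEMORY_BOUND`) are illustrative assumptions, not the exact identifiers in this PR:

```python
import torch

# Hypothetical threshold; the PR describes a 10 GB bound on the mask size.
MEMORY_BOUND = 10 * 1024**3  # 10 GB in bytes

def dispatch_attention_kernel(q: torch.Tensor, k: torch.Tensor, attn_mask_type: str):
    """Illustrative sketch: pick a kernel based on the would-be mask size."""
    s_q, s_kv = q.size(-2), k.size(-2)
    # Bytes a full (s_q, s_kv) mask would occupy in q's dtype.
    mask_bytes = s_q * s_kv * q.element_size()
    if mask_bytes >= MEMORY_BOUND:
        # Materializing the mask would be too expensive: fall back to the
        # FlashAttention (Dao) kernel, which handles causal masking
        # internally, and pass an empty tensor as a placeholder.
        assert attn_mask_type in ("causal", "padded_causal"), (
            "Only causal and padded causal masks are supported above the bound"
        )
        attention_mask = torch.empty(0, dtype=q.dtype, device=q.device)
        return "FlashAttentionDaoLoader", attention_mask
    # Below the bound, other loaders with an explicit mask remain eligible
    # (omitted here).
    ...
```

The key design point is that the empty placeholder avoids allocating an O(s_q * s_kv) tensor entirely, which is why only mask types the kernel can reconstruct on its own (causal and padded causal) are allowed on this path.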