Open NonvolatileMemory opened 1 month ago
Triton does not support bfloat16
https://triton-lang.org/main/programming-guide/chapter-3/debugging.html#limitations
Thanks for your reply!
However, in my test case with grouped-query attention, the gradients of k and v fail a `torch.allclose` check when comparing the PyTorch reference implementation against fused-attention.
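For reference, here is a minimal sketch of the kind of comparison being described: a plain PyTorch grouped-query attention run in float32 as the reference, then re-run in bfloat16, with the k/v gradients compared via `torch.allclose`. This is an illustrative stand-in (the function name and shapes are my own, not from the original test case), and it substitutes a bfloat16 eager run for the actual Triton fused-attention kernel:

```python
import torch

def gqa_attention(q, k, v):
    # Illustrative grouped-query attention (not the original test case).
    # q: (B, Hq, S, D); k, v: (B, Hkv, S, D) with Hq divisible by Hkv.
    groups = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(groups, dim=1)  # expand kv heads to match q heads
    v = v.repeat_interleave(groups, dim=1)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
B, Hq, Hkv, S, D = 1, 8, 2, 64, 32
q = torch.randn(B, Hq, S, D, requires_grad=True)
k = torch.randn(B, Hkv, S, D, requires_grad=True)
v = torch.randn(B, Hkv, S, D, requires_grad=True)

# Reference gradients in float32.
gqa_attention(q, k, v).sum().backward()
dk_ref, dv_ref = k.grad.clone(), v.grad.clone()

# Same computation in bfloat16 (standing in for the fused kernel).
q2 = q.detach().to(torch.bfloat16).requires_grad_()
k2 = k.detach().to(torch.bfloat16).requires_grad_()
v2 = v.detach().to(torch.bfloat16).requires_grad_()
gqa_attention(q2, k2, v2).sum().backward()

# bfloat16 has ~8 bits of mantissa, so a loose tolerance is needed.
print("dk close:", torch.allclose(dk_ref, k2.grad.float(), atol=1e-1, rtol=1e-2))
print("dv close:", torch.allclose(dv_ref, v2.grad.float(), atol=1e-1, rtol=1e-2))
```

With default `allclose` tolerances such a bfloat16 comparison will generally fail, which is why tolerance choice matters when judging whether a kernel's gradients are actually wrong or just low-precision.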
Both fused-attention and flash-attn-og fail the bfloat16 test.