@WeijieMax
The self-attention matrix here is only $900 \times 900$, so the memory allocation is not time-consuming and FlashAttention does not help much in this case. FlashAttention also cannot support an arbitrary mask, since an arbitrary mask is a very large tensor on the scale of the attention map itself. But DN needs masking among groups, which would require very complex CUDA programming.

Thanks for your instant reply. I understand your answer, thanks!
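To make the group-mask point concrete, here is a minimal sketch (not code from this repo; the layout, group sizes, and helper name are assumptions) of the kind of block-wise DN attention mask the decoder self-attention needs. An arbitrary boolean mask like this is fine for the standard `nn.MultiheadAttention` kernel, but it is not the kind of mask the fused flash kernels accept, which is why they cannot simply be dropped in here.

```python
import torch
from torch import nn

def build_dn_attn_mask(num_dn_groups: int, dn_group_size: int, num_matching_queries: int) -> torch.Tensor:
    """Sketch of a DN-style group attention mask (True = attention is NOT allowed).

    Assumed query layout: [group_0 | group_1 | ... | matching queries].
    Each denoising group may only attend within itself, and matching queries
    must not see any denoising queries (to avoid label leakage).
    """
    pad = num_dn_groups * dn_group_size
    total = pad + num_matching_queries
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Matching queries cannot attend to any denoising queries.
    mask[pad:, :pad] = True

    # Each denoising group is isolated from the other groups and from matching queries.
    for g in range(num_dn_groups):
        s, e = g * dn_group_size, (g + 1) * dn_group_size
        mask[s:e, :s] = True   # cannot see earlier groups
        mask[s:e, e:] = True   # cannot see later groups or matching queries
    return mask

# Example: 5 groups of 20 noised queries plus 900 matching queries -> 1000 x 1000 mask.
attn_mask = build_dn_attn_mask(num_dn_groups=5, dn_group_size=20, num_matching_queries=900)
self_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
tgt = torch.randn(2, attn_mask.shape[0], 256)
out, _ = self_attn(tgt, tgt, tgt, attn_mask=attn_mask)  # standard kernel handles arbitrary masks
```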
Sorry for the interruption. I have a question: since FlashAttention is faster than vanilla attention, why not replace all the attention modules with the flash version? I see there is still MultiheadAttention in the config. Am I missing some other detail?