@WeijieMax
The self-attention matrix here is only $900 \times 900$, so the memory allocation is not time-consuming and FlashAttention does not help much in this case. FlashAttention also cannot support an arbitrary mask, since an arbitrary mask is a very large tensor on the scale of the attention map itself. But DN needs masking among groups, which would require very complex CUDA programming.

Thanks for your instant reply. I understand your answer, thanks!
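To make the group-mask point concrete, here is a minimal sketch (not code from this repo; the layout, group sizes, and helper name are assumptions) of the kind of block-wise DN attention mask the decoder self-attention needs. An arbitrary boolean mask like this is fine for the standard `nn.MultiheadAttention` kernel, but it is not the kind of mask the fused flash kernels accept, which is why they cannot simply be dropped in here.

```python
import torch
from torch import nn

def build_dn_attn_mask(num_dn_groups: int, dn_group_size: int, num_matching_queries: int) -> torch.Tensor:
    """Sketch of a DN-style group attention mask (True = attention is NOT allowed).

    Assumed query layout: [group_0 | group_1 | ... | matching queries].
    Each denoising group may only attend within itself, and matching queries
    must not see any denoising queries (to avoid label leakage).
    """
    pad = num_dn_groups * dn_group_size
    total = pad + num_matching_queries
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Matching queries cannot attend to any denoising queries.
    mask[pad:, :pad] = True

    # Each denoising group is isolated from the other groups and from matching queries.
    for g in range(num_dn_groups):
        s, e = g * dn_group_size, (g + 1) * dn_group_size
        mask[s:e, :s] = True   # cannot see earlier groups
        mask[s:e, e:] = True   # cannot see later groups or matching queries
    return mask

# Example: 5 groups of 20 noised queries plus 900 matching queries -> 1000 x 1000 mask.
attn_mask = build_dn_attn_mask(num_dn_groups=5, dn_group_size=20, num_matching_queries=900)
self_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
tgt = torch.randn(2, attn_mask.shape[0], 256)
out, _ = self_attn(tgt, tgt, tgt, attn_mask=attn_mask)  # standard kernel handles arbitrary masks
```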
Sorry for the interruption. I have a question: since FlashAttention is faster than vanilla attention, why not replace all the attention modules with the flash version? I see there is still MultiheadAttention in the config. Am I missing some other detail?