In the article, sparse cross attention is proposed, but I don't understand how this module is implemented. It seems that you don't split the queries into N groups and run N separate attention operations. My guess is that you use a cross-attention mask and a key-padding mask so that each query only attends to its selected object features. Is that true?
However, using an attn_mask still leaves the attention operation with unacceptable computational complexity, since every query still computes scores against all object features. I wonder whether you use a different implementation, or whether my guess is right; a minimal sketch of what I'm imagining is below.
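Just to make my guess concrete, here is a minimal sketch (not your code) of the masked cross attention I have in mind. The names `query`, `obj_feats`, and `select_mask` are hypothetical placeholders; `select_mask[i, j] = True` would mean query i is allowed to attend to object feature j:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, only for illustration.
num_queries, num_objs, dim = 100, 300, 256

query = torch.randn(num_queries, 1, dim)    # (L, N, E), batch size 1
obj_feats = torch.randn(num_objs, 1, dim)   # (S, N, E)

# select_mask[i, j] = True: query i may attend to object j.
select_mask = torch.rand(num_queries, num_objs) > 0.5
select_mask[:, 0] = True  # avoid fully-masked rows (softmax would give NaN)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8)

# In nn.MultiheadAttention, a boolean attn_mask marks positions that are
# NOT allowed to attend, so it is the negation of the selection mask.
attn_mask = ~select_mask

out, _ = attn(query, obj_feats, obj_feats, attn_mask=attn_mask)
```

Even with such a mask, the query-key scores are still computed for all (query, object) pairs before masking, which is where my complexity concern comes from.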
Thank you for your great work!