In the article, sparse cross attention is proposed, but I don't understand how this module is implemented. It seems that you don't split the queries into N groups and run N separate attention operations. My guess is that you use a cross-attention mask and a key-padding mask so that each query only attends to its selected object features. Is that true?
However, using an attn_mask still leaves the attention operation with unacceptable computational complexity, since every query still computes scores against all object features. I wonder whether you use a different implementation, or whether my guess is right; a minimal sketch of what I'm imagining is below.
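Just to make my guess concrete, here is a minimal sketch (not your code) of the masked cross attention I have in mind. The names `query`, `obj_feats`, and `select_mask` are hypothetical placeholders; `select_mask[i, j] = True` would mean query i is allowed to attend to object feature j:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, only for illustration.
num_queries, num_objs, dim = 100, 300, 256

query = torch.randn(num_queries, 1, dim)    # (L, N, E), batch size 1
obj_feats = torch.randn(num_objs, 1, dim)   # (S, N, E)

# select_mask[i, j] = True: query i may attend to object j.
select_mask = torch.rand(num_queries, num_objs) > 0.5
select_mask[:, 0] = True  # avoid fully-masked rows (softmax would give NaN)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8)

# In nn.MultiheadAttention, a boolean attn_mask marks positions that are
# NOT allowed to attend, so it is the negation of the selection mask.
attn_mask = ~select_mask

out, _ = attn(query, obj_feats, obj_feats, attn_mask=attn_mask)
```

Even with such a mask, the query-key scores are still computed for all (query, object) pairs before masking, which is where my complexity concern comes from.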
Thank you for your great work!