Thanks for your interest in our work!
You are right that in our code the self-attention is computed using the input of the k-means cross-attention instead of its output. The main reason is that we build kMaX-DeepLab on top of MaX-DeepLab, where the self-attention and cross-attention are computed in parallel instead of sequentially (so the qkv used for self-attention and cross-attention are computed at the beginning of the block). To reuse as much code as we can and keep things simple, we did the same thing here and use the input to generate qkv.
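For illustration, here is a minimal NumPy sketch of that parallel-style ordering, where the self-attention q/k/v are projected from the block input before the cross-attention updates the queries. The function names, shapes, and random projections are hypothetical, everything is single-head with norms, MLPs, and residual details omitted, and plain softmax attention stands in for the k-means (hard-assignment) cross-attention, so this is not the deeplab2 implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def parallel_style_block(queries, pixels, rng):
    """One decoder block where self-attention q/k/v come from the block input."""
    d = queries.shape[-1]
    proj = lambda: rng.normal(scale=d ** -0.5, size=(d, d))  # random linear projection

    # q/k/v for the query self-attention, computed from the *input* queries
    # at the beginning of the block (before cross-attention updates them).
    q_self, k_self, v_self = queries @ proj(), queries @ proj(), queries @ proj()

    # Cross-attention between queries and pixel features (plain softmax
    # attention standing in for the k-means cross-attention).
    cross = softmax(queries @ proj() @ (pixels @ proj()).T / np.sqrt(d))
    queries = queries + cross @ (pixels @ proj())

    # Self-attention still consumes the q/k/v derived from the input queries.
    self_attn = softmax(q_self @ k_self.T / np.sqrt(d))
    return queries + self_attn @ v_self

rng = np.random.default_rng(0)
centers = rng.normal(size=(128, 256))   # e.g. 128 cluster centers, dim 256
pixels = rng.normal(size=(1024, 256))   # e.g. 1024 flattened pixel features
print(parallel_style_block(centers, pixels, rng).shape)  # (128, 256)
```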
I agree with you that, intuitively speaking, computing the self-attention qkv from the output of the cross-attention sounds better, but we did not ablate it. Hope this addresses your question, and please let me know if you have any other questions!
Thanks a lot for your prompt reply! It addresses my problem.
First, thanks a lot for sharing the code for this solid work. However, I have a question regarding the architecture of the kMaX Transformer Decoder.
In https://github.com/google-research/deeplab2/blob/main/model/layers/dual_path_transformer.py#L611, the self-attention for the cluster centers is performed. However, the inputs `memory_query`, `memory_key`, and `memory_value` are all computed at https://github.com/google-research/deeplab2/blob/main/model/layers/dual_path_transformer.py#L524, which is prior to the k-means cross-attention (https://github.com/google-research/deeplab2/blob/main/model/layers/dual_path_transformer.py#L567). So the self-attention for the cluster centers is actually computed from the input of the k-means cross-attention, rather than its output.

This is somewhat counterintuitive, because in previous works (e.g., Mask2Former) the self-attention between object queries is computed from the output of the cross-attention. It also does not seem consistent with Fig. 1 of the paper. Am I doing something wrong? Would you give some hints? Thank you very much! @YknZhu @yucornetto @csrhddlam @joe-siyuan-qiao
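For comparison, here is a minimal NumPy sketch of the sequential ordering described above (the Mask2Former-style one), where the self-attention q/k/v are projected from the cross-attention output. The names, shapes, random projections, and the use of plain softmax attention are all simplifying assumptions, not code from either repository.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sequential_style_block(queries, pixels, rng):
    """One decoder block where self-attention q/k/v come from the cross-attention output."""
    d = queries.shape[-1]
    proj = lambda: rng.normal(scale=d ** -0.5, size=(d, d))  # random linear projection

    # 1) Cross-attention between queries and pixel features first.
    cross = softmax(queries @ proj() @ (pixels @ proj()).T / np.sqrt(d))
    queries = queries + cross @ (pixels @ proj())

    # 2) Self-attention q/k/v are projected from the *updated* queries,
    #    i.e. from the output of the cross-attention.
    q, k, v = queries @ proj(), queries @ proj(), queries @ proj()
    return queries + softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
centers = rng.normal(size=(128, 256))   # e.g. 128 object queries, dim 256
pixels = rng.normal(size=(1024, 256))   # e.g. 1024 flattened pixel features
print(sequential_style_block(centers, pixels, rng).shape)  # (128, 256)
```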