google-research / deeplab2

DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a unified and state-of-the-art TensorFlow codebase for dense pixel labeling tasks.
Apache License 2.0

The architecture of kMaX Transformer Decoder seems not consistent with Fig.1 in the paper #157

Closed. X-Lai closed this issue 1 year ago.

X-Lai commented 1 year ago

First, thanks a lot for sharing the code for this solid work. However, I have a question regarding the architecture of the kMaX Transformer Decoder.

In https://github.com/google-research/deeplab2/blob/main/model/layers/dual_path_transformer.py#L611, the self-attention for the cluster centers is performed. However, the inputs memory_query, memory_key, and memory_value are all computed at https://github.com/google-research/deeplab2/blob/main/model/layers/dual_path_transformer.py#L524, which is prior to the k-means cross-attention (https://github.com/google-research/deeplab2/blob/main/model/layers/dual_path_transformer.py#L567). So the self-attention for the cluster centers is actually computed from the input of the k-means cross-attention, rather than from its output.

This is somewhat counterintuitive, because in previous works (e.g., Mask2Former) the self-attention between object queries is computed from the output of the cross-attention. It also does not seem consistent with Fig.1 of the paper. Am I misunderstanding something? Could you give some hints? Thank you very much! @YknZhu @yucornetto @csrhddlam @joe-siyuan-qiao
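To make the ordering I am describing concrete, here is a minimal pseudo-TensorFlow sketch of what I see in the code. The layer names (qkv_proj, kmeans_cross_attention, query_self_attention) are placeholders for illustration only, not the actual deeplab2 modules:

```python
def kmax_decoder_block_as_coded(cluster_centers, pixel_features, layers):
  # qkv for the query self-attention is projected from the block INPUT
  # (roughly corresponds to dual_path_transformer.py#L524).
  q, k, v = layers.qkv_proj(cluster_centers)

  # The k-means cross-attention then updates the cluster centers from the
  # pixel features (roughly corresponds to #L567).
  cluster_centers = layers.kmeans_cross_attention(cluster_centers,
                                                  pixel_features)

  # The self-attention (roughly corresponds to #L611) consumes the qkv
  # projected from the cross-attention INPUT, not from its output.
  cluster_centers = cluster_centers + layers.query_self_attention(q, k, v)
  return cluster_centers
```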

yucornetto commented 1 year ago

Thanks for your interest in our work!

You are right that in our code the self-attention is computed using the input of the k-means cross-attention instead of its output. The main reason is that we built kMaX-DeepLab on top of MaX-DeepLab, where the self-attention and cross-attention are computed in parallel rather than sequentially (so the qkv used for self-attention and cross-attention are computed at the beginning of the block). To reuse as much code as possible and keep things simple, we did the same thing here and use the input to generate the qkv.

I agree that, intuitively speaking, computing the self-attention qkv from the output of the cross-attention sounds better, but we did not ablate it. Hope this addresses your question, and please let me know if you have any other problems!
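For contrast, a sketch of the strictly sequential ordering implied by Fig.1 (and used in Mask2Former-style decoders) would project the self-attention qkv from the cross-attention output. Same placeholder layer names as in the sketch above; this is not code from the repository:

```python
def kmax_decoder_block_sequential(cluster_centers, pixel_features, layers):
  # k-means cross-attention runs first and updates the cluster centers.
  cluster_centers = layers.kmeans_cross_attention(cluster_centers,
                                                  pixel_features)

  # qkv is projected from the UPDATED cluster centers, so the self-attention
  # operates on the cross-attention OUTPUT.
  q, k, v = layers.qkv_proj(cluster_centers)
  cluster_centers = cluster_centers + layers.query_self_attention(q, k, v)
  return cluster_centers
```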

X-Lai commented 1 year ago

Thanks a lot for your prompt reply! It addresses my problem.