Open ChenDirk opened 2 years ago
Thank you for your reply. I checked the code, and the implementation is amazing! I also found that LayerNorm is very slow at inference, while BatchNorm can be merged into the preceding convolutional layer, so BN adds no FLOPs at inference. Amazing design! But I noticed that the Attention block contains an activation function, which differs from a standard MultiHeadAttention layer. Did you compare the performance with and without the activation function?
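For context, the BN-into-conv merging mentioned above can be sketched as follows. This is a minimal NumPy illustration of the standard folding identity, not the repo's actual fusion code; the function name `fuse_conv_bn` and its argument layout are hypothetical:

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into the preceding conv's weight and bias.

    w: (out_ch, in_ch, kh, kw) conv weight; b: (out_ch,) conv bias.
    gamma, beta, mean, var: (out_ch,) BN affine parameters and running stats.
    Returns a fused (w, b) such that conv_fused(x) == bn(conv(x)).
    """
    scale = gamma / np.sqrt(var + eps)            # per-output-channel scale
    w_fused = w * scale[:, None, None, None]      # rescale each output filter
    b_fused = (b - mean) * scale + beta           # fold the shift into the bias
    return w_fused, b_fused
```

Since the fused layer is just another convolution with rescaled weights, the BN costs nothing at inference, which is why folding it away (unlike LayerNorm, whose statistics depend on the input) adds no FLOPs.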
The inserted activation function was initially intended to increase non-linearity. However, we found that removing it achieves slightly better performance, so I suggest removing it when using Topformer.
Besides reducing the dimension of Q and K, we use multi-head self-attention rather than the 1-head self-attention used in the Non-Local block.
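A minimal sketch of that design point: multi-head self-attention where the per-head Q/K dimension is smaller than the value dimension, and no activation is applied inside the attention. All names and weight shapes here are hypothetical stand-ins, not the repo's actual layers:

```python
import numpy as np

def multi_head_attention(x, wq, wk, wv, num_heads, key_dim):
    """Multi-head self-attention with a reduced Q/K dimension.

    x: (n, d) token features.
    wq, wk: (d, num_heads * key_dim), projecting to a small key_dim per head.
    wv: (d, num_heads * v_dim), where v_dim may be larger than key_dim.
    """
    n, _ = x.shape
    q = (x @ wq).reshape(n, num_heads, key_dim).transpose(1, 0, 2)  # (h, n, key_dim)
    k = (x @ wk).reshape(n, num_heads, key_dim).transpose(1, 0, 2)  # (h, n, key_dim)
    v_dim = wv.shape[1] // num_heads
    v = (x @ wv).reshape(n, num_heads, v_dim).transpose(1, 0, 2)    # (h, n, v_dim)

    # Scaled dot-product scores: cost scales with key_dim, so shrinking
    # key_dim directly reduces the Q/K projection and score FLOPs.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(key_dim)            # (h, n, n)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)              # softmax over keys

    out = attn @ v                                                  # (h, n, v_dim)
    return out.transpose(1, 0, 2).reshape(n, num_heads * v_dim)
```

Compared with a 1-head Non-Local block, splitting the same projection budget across several heads lets each head attend with a different similarity pattern at the same overall cost.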