Open ChenDirk opened 2 years ago
Thank you for your reply. I checked the code, and the implementation is amazing! I also found that LayerNorm is very slow at inference, while BatchNorm can be merged into the preceding convolutional layer, so BN adds no FLOPs at inference. Amazing design! But I noticed that the Attention block contains an activation function, which differs from a standard MultiHeadAttention layer. Did you compare the performance with and without the activation function?
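For context, the BN-into-conv merging mentioned above can be sketched as follows. This is a minimal NumPy illustration of the standard folding identity, not the repo's actual fusion code; the function name `fuse_conv_bn` and its argument layout are hypothetical:

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into the preceding conv's weight and bias.

    w: (out_ch, in_ch, kh, kw) conv weight; b: (out_ch,) conv bias.
    gamma, beta, mean, var: (out_ch,) BN affine parameters and running stats.
    Returns a fused (w, b) such that conv_fused(x) == bn(conv(x)).
    """
    scale = gamma / np.sqrt(var + eps)            # per-output-channel scale
    w_fused = w * scale[:, None, None, None]      # rescale each output filter
    b_fused = (b - mean) * scale + beta           # fold the shift into the bias
    return w_fused, b_fused
```

Since the fused layer is just another convolution with rescaled weights, the BN costs nothing at inference, which is why folding it away (unlike LayerNorm, whose statistics depend on the input) adds no FLOPs.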
The inserted activation function was initially intended to increase non-linearity. However, we found that removing it achieves slightly better performance, so I suggest removing it when using Topformer.
Besides reducing the dimension of Q and K, we use multi-head self-attention rather than the 1-head self-attention used in the Non-Local block.
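A minimal sketch of that design point: multi-head self-attention where the per-head Q/K dimension is smaller than the value dimension, and no activation is applied inside the attention. All names and weight shapes here are hypothetical stand-ins, not the repo's actual layers:

```python
import numpy as np

def multi_head_attention(x, wq, wk, wv, num_heads, key_dim):
    """Multi-head self-attention with a reduced Q/K dimension.

    x: (n, d) token features.
    wq, wk: (d, num_heads * key_dim), projecting to a small key_dim per head.
    wv: (d, num_heads * v_dim), where v_dim may be larger than key_dim.
    """
    n, _ = x.shape
    q = (x @ wq).reshape(n, num_heads, key_dim).transpose(1, 0, 2)  # (h, n, key_dim)
    k = (x @ wk).reshape(n, num_heads, key_dim).transpose(1, 0, 2)  # (h, n, key_dim)
    v_dim = wv.shape[1] // num_heads
    v = (x @ wv).reshape(n, num_heads, v_dim).transpose(1, 0, 2)    # (h, n, v_dim)

    # Scaled dot-product scores: cost scales with key_dim, so shrinking
    # key_dim directly reduces the Q/K projection and score FLOPs.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(key_dim)            # (h, n, n)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)              # softmax over keys

    out = attn @ v                                                  # (h, n, v_dim)
    return out.transpose(1, 0, 2).reshape(n, num_heads * v_dim)
```

Compared with a 1-head Non-Local block, splitting the same projection budget across several heads lets each head attend with a different similarity pattern at the same overall cost.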