sail-sg / volo

VOLO: Vision Outlooker for Visual Recognition
Apache License 2.0

Main difference between "outlook attention" and "Involution". #1

Closed wuyongfa-genius closed 3 years ago

wuyongfa-genius commented 3 years ago

Thanks for your excellent work!!! I noticed that your "outlook attention" is very similar to "Involution" (https://github.com/d-li14/involution). I just want to know the main difference. As I see it, the main difference is that you use an extra linear projection on the input itself and an extra softmax to generate the attention weights. However, I did not find a detailed comparison between these two methods in your paper.

yuanli2333 commented 3 years ago

As you have said, the difference is that we use an extra linear projection on the input itself and use softmax to generate the attention map. Our overall idea is to encode fine-level token representations with a sliding window, which is different from the idea of involution. But we have also noticed the similarities and dissimilarities between outlook attention and involution, so we will discuss them in an updated version soon.
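For concreteness, here is a minimal numpy sketch of the mechanism as described above: a value projection, softmax-normalized K²×K² weights predicted per location, applied to each K×K window, then folded (scatter-added) back. The function name, weight shapes, zero padding, and single-head simplification are my assumptions for illustration, not the repo's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def outlook_attention(X, Wv, Wa, K=3):
    """Simplified single-head outlook attention (stride 1, zero padding).

    X  : (H, W, C) input tokens
    Wv : (C, C)    value projection
    Wa : (C, K**4) attention-weight projection (K^2 x K^2 weights per location)
    """
    H, W, C = X.shape
    r = K // 2
    V = X @ Wv                                        # value projection
    A = softmax((X @ Wa).reshape(H, W, K*K, K*K))     # per-location attention
    Vp = np.zeros((H + 2*r, W + 2*r, C))
    Vp[r:H+r, r:W+r] = V                              # zero-padded values
    Y = np.zeros_like(Vp)
    for i in range(H):
        for j in range(W):
            win = Vp[i:i+K, j:j+K].reshape(K*K, C)    # K^2 values in the window
            out = A[i, j] @ win                       # weighted aggregation
            Y[i:i+K, j:j+K] += out.reshape(K, K, C)   # fold: scatter-add back
    return Y[r:H+r, r:W+r]
```

With K=1 the attention collapses to a single normalized weight of 1, so the output reduces to the value projection alone, which is a quick sanity check on the sketch.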

wuyongfa-genius commented 3 years ago

Thanks for your explanation. Maybe the motivation is the main difference here: involution aims at improving convnets, while yours aims at improving ViTs, and apparently you have designed a much better model architecture based on outlook attention. Either way, this kind of attention is indeed helpful in both convnets and ViTs. Again, thanks for your excellent work...

houqb commented 3 years ago

Thanks for your question. Our observation is that computing the similarity between pairs of token representations is essential. You may refer to the differences between Dynamic Convolution (https://openreview.net/pdf?id=SkVhlh09tX) and self-attention.
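To illustrate that distinction: self-attention derives its weights from pairwise similarity between tokens, whereas dynamic-convolution-style operators predict the weights from each token alone, with no pairwise comparison. A toy numpy sketch (all names and shapes here are hypothetical, chosen only for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

np.random.seed(0)
T = np.random.randn(5, 8)      # 5 tokens, dim 8

# self-attention: weights come from pairwise token similarity
attn = softmax(T @ T.T)        # (5, 5): every token pair is compared

# dynamic-convolution style: weights predicted from each token alone
Wg = np.random.randn(8, 3)     # generator for a size-3 dynamic kernel
dyn = softmax(T @ Wg)          # (5, 3): no pairwise comparison involved
```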

theFoxofSky commented 3 years ago

Involution can also be easily implemented with the DDF operation, which is a faster CUDA-based operation. It would be great if DDF could help speed up VOLO. Here is the link: https://github.com/theFoxofSky/ddfnet/

lartpang commented 3 years ago

@theFoxofSky

> Involution can also be easily implemented with the DDF operation, which is a faster CUDA-based operation. It would be great if DDF could help speed up VOLO. Here is the link: https://github.com/theFoxofSky/ddfnet/

In fact, due to the Fold operation, there is a big difference between this and general dynamic convolution: the fold additionally realizes communication between different windows.

You can look at the analysis here. https://github.com/sail-sg/volo/issues/7

If you think there is a problem with my understanding, please point it out.

wuyongfa-genius commented 3 years ago

@yuanli2333 @Andrew-Qibin Thanks for your replies. I finally figured out the difference through the explanation below:

Note the definition of the local window in equation 3:

[image: equation 3 from the paper]

It's a local window centered at (i, j). And here in equation 5:

[image: equation 5 from the paper]

the summation is over (m, n), which means gathering vectors from different windows but the same original element location (i, j), the same as steps 1 and 2 in your comment @lartpang.

The notation is a little tricky: it does not calculate the sum of features in the neighborhood of (i, j), but performs a fold op.

I think the main difference is that you use a kind of attention mechanism named outlook attention, which is performed within local windows (the attention weights are predicted from the central pixel feature) and then "folds" the values back into the feature maps (in involution, they just predict K^2 weights and weighted-sum the local neighbors to get the central pixel feature). Thank you for your excellent work!!! You may close this issue if I am right. >_<...
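The cross-window communication that the fold introduces can be made visible by counting how many sliding windows write into each output pixel. This toy sketch (my own, not the repo's code) shows that interior pixels receive K² overlapping contributions, whereas in involution each pixel is written exactly once by its own window:

```python
import numpy as np

def fold_count(H, W, K=3):
    """Count how many K x K windows contribute to each output pixel
    after a fold (scatter-add) over all stride-1 window positions."""
    r = K // 2
    cnt = np.zeros((H + 2*r, W + 2*r))
    for i in range(H):
        for j in range(W):
            cnt[i:i+K, j:j+K] += 1   # each window writes K*K values back
    return cnt[r:H+r, r:W+r]

cnt = fold_count(6, 6, K=3)
# interior pixels are written by K*K = 9 windows; a corner only by 4
```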

monney commented 3 years ago

Since you are aggregating across neighboring windows, you are looking at a neighborhood of size (2K−1)×(2K−1). The attention weights end up being based on the center K×K pixels once you take the fold into account. So to me it seems the main difference from involution is that you base the weighted average on the center K×K pixels rather than on the center pixel alone.
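The (2K−1)×(2K−1) neighborhood can be checked numerically with a toy unfold-average-fold pipeline (a hypothetical stand-in for attention, using uniform weights): perturbing inputs one at a time reveals exactly which of them reach a given output pixel.

```python
import numpy as np

K = 3
H = W = 9
r = K // 2

def fold_pipeline(X):
    """Each stride-1 window averages its K x K inputs, and the result is
    scattered back (fold) over the same K x K footprint."""
    Xp = np.pad(X, r)
    Y = np.zeros_like(Xp)
    for i in range(H):
        for j in range(W):
            Y[i:i+K, j:j+K] += Xp[i:i+K, j:j+K].mean()
    return Y[r:H+r, r:W+r]

center = (H // 2, W // 2)
affected = np.zeros((H, W), dtype=bool)
for a in range(H):
    for b in range(W):
        X = np.zeros((H, W))
        X[a, b] = 1.0                       # perturb one input pixel
        affected[a, b] = fold_pipeline(X)[center] != 0.0
# exactly a (2K-1) x (2K-1) block of inputs influences the center output
```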