vpulab / Semantic-Aware-Scene-Recognition

Code repository for paper https://www.sciencedirect.com/science/article/pii/S0031320320300613 @ Pattern Recognition 2020
MIT License

Question about two attention modules #26

Open Heither666 opened 3 years ago

Heither666 commented 3 years ago

There are two attention modules used in SASceneNet: one is the chain of three ChAM modules (3xChAM) and the other is the "Attention Module".

Q1: What happens when features pass through the 3xChAM (do they concentrate on several specific channels strongly related to the scene?)
Q2: Why do we need 3 ChAM modules and not fewer or more (is it because 3 modules make the features concentrate more on the decisive cues that determine the scene?)
Q3: Why do we need the "Attention Module", and how does it differ from ChAM in function (is it like one judges "what" and the other judges "where", as in CBAM?)

I very much look forward to your reply

alexlopezcifuentes commented 3 years ago

Hi!

Thanks for the message, I will try to answer the three questions. The first thing I want to clarify is that the Attention Module is one of the contributions of the paper, but ChAM is not a contribution of ours; it is just applied in the method. Because of this, I suggest you take a look at the original ChAM paper, which explains it really nicely.

Q1: The aim of ChAM, as explained in the original paper, is to compute self-attention over the channel dimension. We use it to attend more to specific channels of the Semantic Branch. Since the features of the Semantic Branch depend on the semantic segmentation input tensor, our idea is that ChAM will help the network attend more to specific objects (channels).
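For reference, here is a minimal sketch of a CBAM-style channel attention module, following the standard formulation from the CBAM paper (shared MLP over average- and max-pooled descriptors, then a sigmoid). The class and parameter names are illustrative, not taken from the repository code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention (ChAM); output shape equals input shape."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to both pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))               # (B, C) from average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))                # (B, C) from max pooling
        attn = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # per-channel weights in (0, 1)
        return x * attn                                  # reweight channels, shape unchanged
```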

Q2: You can use as many ChAM modules as you want. The whole design of the proposed architecture is based on the Residual Network construction, so we use the space between ResNet Basic Blocks to introduce them (see the sketch below). If I remember properly, the original authors did the same thing, but again, it is a matter of design and you can place them wherever you want.
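One way to picture this placement, assuming a standard torchvision ResNet-18 and reusing the `ChannelAttention` sketch above; the exact placement in the paper's code may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ResNetWithChAM(nn.Module):
    """Illustrative backbone with a channel-attention module after each residual stage."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        # Attention modules are slotted into the "space" between ResNet stages.
        self.stages = nn.Sequential(
            backbone.layer1, ChannelAttention(64),
            backbone.layer2, ChannelAttention(128),
            backbone.layer3, ChannelAttention(256),
            backbone.layer4, ChannelAttention(512),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stages(self.stem(x))  # (B, 512, 7, 7) for a 224x224 input
```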

Q3: "Attention Module" and ChAM are both "Attention Mechanisms" but the aim of both is totally different. As explained before, ChAM aims to enhance the focus on specific channels (objects in our case) in the Semantic Branch. However, the Attention Module aims to force the RGB Branch network to focus on specific areas indicated by the final Semantic Branch Feature tensor. With this process, we try to focus RG Branch attention on specific objects from the image, the ones learned by the Semantic Branch.

Heither666 commented 3 years ago

Thank you for your quick and detailed reply!

About Q2, I notice that the output of the RGB Branch is 512x7x7 and the 3 ChAM modules change the input of the Semantic Branch from 128x28x28 to 256x14x14 to 512x7x7. So does that mean the number of ChAM modules is determined by the output of the RGB Branch? Is it because we need a 512x7x7 output (decided by the shape of the RGB Branch's output) and the input of the Semantic Branch is 128x28x28, so we need 3 ChAM modules to complete the shape change?

Looking forward to your reply.

alexlopezcifuentes commented 3 years ago

Actually, it is the other way around. We started with the RGB Branch as a common ResNet-18 architecture. The Semantic Segmentation Branch is built to match exactly the feature sizes obtained in the RGB Branch. It is a ResNet-like architecture, but it only includes the layers that perform the downsampling in size.

The ChAM module does not change the size of any tensor; it just computes an attention feature tensor and applies it. The layers reducing the size are the convolutional layers of the Semantic Branch.
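A small shape check of that point, reusing the `ChannelAttention` sketch from the earlier comment: attention preserves the shape, while a stride-2 convolution (illustrative, not the repository's exact layers) performs the 128x28x28 to 256x14x14 style reduction.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 128, 28, 28)

cham = ChannelAttention(128)         # sketch from above
assert cham(x).shape == x.shape      # attention does not change the shape

downsample = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
print(downsample(x).shape)           # torch.Size([1, 256, 14, 14]) -- the conv does the resizing
```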

Heither666 commented 3 years ago

Thank you so much! I understand what you mean. So, if we use ResNet-50 as the backbone, are the 3 ChAM modules still added after the conv blocks? (As you can see, the output shape changes to 2048x7x7, and the Semantic Branch will change as well.) I will run evaluation.py later and see the difference (my computer is currently running another program).