Sorry for the delay. This is a good question. Honestly, this was also our concern about the CGNL module when we first found that it works across various tasks. We try to understand it from another point of view.
Indeed, if the kernel is linear, it is a scalar [b, 1] for the whole feature [b, c, h, w]. If we used a BN or a fully-connected layer to learn this scalar from the features, as in SE-Net, we think it would be okay, but it is not exactly the same.
If this scalar is learned in the BN manner, it loses the important part of the attention pipeline: generating weights for the features that are aware of the semantic feature structure.
If this scalar is learned in the SE manner, i.e. regressed by a fully-connected layer, the feature maps are first squeezed into a scalar along the [h, w] axes. This process also discards the semantic feature structure information needed for calculating the attention weights.
So CGNL computes the weights by aggregating over the whole semantic feature structure, even though the result is a simple scalar, as sketched below.
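To make the contrast concrete, here is a minimal sketch, not the repo's actual code: the tensor names theta, phi, se_squeeze and the toy shapes are illustrative assumptions. It shows how an SE-style squeeze collapses the spatial structure before any weight can be computed, while the dot production kernel takes one inner product over the full flattened feature so every channel and position contributes to the scalar.

```python
import torch

b, c, h, w = 2, 8, 4, 4
theta = torch.randn(b, c, h, w)   # stands in for the p projection
phi = torch.randn(b, c, h, w)     # stands in for the g projection

# SE-style: pool away the spatial dims first; a per-channel scale would then
# be regressed from this vector, after the spatial structure is already gone.
se_squeeze = theta.mean(dim=(2, 3))          # [b, c]

# CGNL dot production kernel: flatten the whole [c, h, w] block and take one
# inner product, aggregating over all channels and positions at once.
p = theta.view(b, 1, c * h * w)              # [b, 1, c*h*w]
g = phi.view(b, c * h * w, 1)                # [b, c*h*w, 1]
att = torch.bmm(p, g)                        # [b, 1, 1]

print(se_squeeze.shape, att.shape)           # torch.Size([2, 8]) torch.Size([2, 1, 1])
```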
We also considered the hypothesis that the improvements come from the additional conv layers in the CGNL module rather than from its ability to capture long-range relationships. But if this hypothesis were right, we could not explain why both the NL and the CGNL models work, nor why the CGNL module achieves comparable or better results than the NL module.
Thank you for your reply. :)
Thank you for your work.
I have a question about the SpatialCGNL dot production kernel.
In your code, the dot production kernel is computed as p = p.view(b, 1, c * h * w), g = g.view(b, c * h * w, 1), att = torch.bmm(p, g), so the shape of att is (b, 1, 1). What is the meaning of this shape of att?
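For reference, here is a self-contained shape check of that computation. The third projection t and the final view back to [b, c, h, w] are assumptions following the usual non-local formulation, not necessarily the repo's exact code (scaling factors are omitted).

```python
import torch

b, c, h, w = 2, 8, 4, 4
t = torch.randn(b, c, h, w).view(b, 1, c * h * w)   # assumed third projection, [b, 1, c*h*w]
p = torch.randn(b, c, h, w).view(b, 1, c * h * w)   # [b, 1, c*h*w]
g = torch.randn(b, c, h, w).view(b, c * h * w, 1)   # [b, c*h*w, 1]

att = torch.bmm(p, g)        # [b, 1, 1]: one attention scalar per sample
out = torch.bmm(att, t)      # [b, 1, c*h*w]: the scalar rescales every element of t
out = out.view(b, c, h, w)   # restored to the original feature shape

print(att.shape, out.shape)  # torch.Size([2, 1, 1]) torch.Size([2, 8, 4, 4])
```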