hszhao / SAN

Exploring Self-Attention for Image Recognition, CVPR 2020.

Clarification on Aggregation #7

Closed MSiam closed 4 years ago

MSiam commented 4 years ago

Hello, I have a question regarding your code. In your patchwise attention model you are using these Cython kernels for aggregation.

I am a bit confused about what exactly it is doing; could you explain its functionality? Suppose, for example, that the input data has shape 1x256x1xHW and the weights have shape 1x32x7²xHW.

How does it perform the Hadamard product described in Equation 4 with these shapes?

Also, can I confirm the following mapping between the SAM module and Equations 4 and 5 in your paper? (A rough sketch of my understanding follows the list.)

1. conv1: phi, conv2: psi, conv3: beta
2. delta is simple concatenation
3. conv_w: gamma
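To make the question concrete, here is how I currently read that mapping for the patchwise model. The layer names are only the ones listed above, the channel sizes come from my example shapes, and the actual code in this repo may well differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# My guess at the patchwise weight computation (Eq. 5); not the actual repo code.
in_planes, rel_planes, out_planes, share_planes, k = 256, 32, 256, 8, 7

conv1 = nn.Conv2d(in_planes, rel_planes, 1)    # phi
conv2 = nn.Conv2d(in_planes, rel_planes, 1)    # psi
conv3 = nn.Conv2d(in_planes, out_planes, 1)    # beta: the values aggregated in Eq. 4
# gamma: maps the concatenated relation features to out_planes // share_planes = 32
# attention channels for each of the k*k footprint positions.
conv_w = nn.Conv2d(rel_planes * (k * k + 1), (out_planes // share_planes) * k * k, 1)

x = torch.randn(1, in_planes, 14, 14)
h, w = x.shape[2:]
phi = conv1(x)                                                   # (1, 32, H, W)
psi = F.unfold(conv2(x), k, padding=k // 2).view(1, -1, h, w)    # (1, 32*49, H, W)
delta = torch.cat([phi, psi], dim=1)                             # delta = concatenation
weights = conv_w(delta).view(1, out_planes // share_planes, k * k, h * w)  # (1, 32, 49, H*W)
values = conv3(x)                                                # (1, 256, H, W), fed to the aggregation kernel
```

The weights tensor ends up with the 1x32x7²xHW shape I quoted above, which is what makes me think conv_w plays the role of gamma.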

Thanks for your help.

hszhao commented 4 years ago

Hi, for the input data, the number of positions is HW. The weight for each local region is 32×7², with a local footprint size of 7² (a 7×7 neighborhood). The attention weights have 32 channels, shared by the 256 feature channels; the number of channels sharing the same attention weight is set to 8, as described in the last part of Sec. 5.1. Thanks.
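For reference, the operation the aggregation performs can be sketched in plain PyTorch as follows. This is an unoptimized reading of Eq. 4 with channel sharing, not the actual Cython/CUDA kernel (which presumably avoids materializing the unfolded tensor); shapes follow the example above:

```python
import torch
import torch.nn.functional as F

def aggregation_reference(values, weights, kernel_size=7, share_planes=8):
    # values:  (N, C, H, W), e.g. (1, 256, H, W) -- the beta features
    # weights: (N, C // share_planes, kernel_size**2, H*W), e.g. (1, 32, 49, H*W)
    # returns: (N, C, H, W)
    n, c, h, w = values.shape
    # Gather the kernel_size x kernel_size footprint around every position:
    # (N, C*k*k, H*W) -> (N, C, k*k, H*W)
    unfolded = F.unfold(values, kernel_size, padding=kernel_size // 2)
    unfolded = unfolded.view(n, c, kernel_size ** 2, h * w)
    # Each attention channel is shared by `share_planes` feature channels,
    # so expand the 32 weight channels to cover all 256 feature channels.
    w_full = weights.repeat_interleave(share_planes, dim=1)      # (N, C, k*k, H*W)
    # Eq. 4: Hadamard product with the attention weights, summed over the footprint.
    out = (unfolded * w_full).sum(dim=2)                         # (N, C, H*W)
    return out.view(n, c, h, w)

# Example with the shapes from the question: 7x7 footprint, 32 attention channels, 256 feature channels.
y = aggregation_reference(torch.randn(1, 256, 14, 14), torch.randn(1, 32, 49, 196))
```

In this sketch, feature channel c uses attention channel c // share_planes, which is one way to realize "32 attention channels shared by 256 feature channels with a share factor of 8".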