leaderj1001 / Stand-Alone-Self-Attention

Implementing Stand-Alone Self-Attention in Vision Models using PyTorch

matrix multiplication instead of scalar dot product #16

Open vainaixr opened 4 years ago

vainaixr commented 4 years ago

I think the paper uses matrix multiplication, not the element-wise product used in this implementation.

vainaixr commented 4 years ago

Therefore, I think the final lines should look something like this:

        # (batch, groups, H, W, k*k, C // groups): one key/position/value vector per window position
        k_out = k_out.contiguous().view(batch, self.groups, height, width, -1, self.out_channels // self.groups)
        p_out = p_out.contiguous().view(batch, self.groups, height, width, -1, self.out_channels // self.groups)
        v_out = v_out.contiguous().view(batch, self.groups, height, width, -1, self.out_channels // self.groups)

        # (batch, groups, H, W, 1, C // groups): a single query vector per output position
        q_out = q_out.view(batch, self.groups, height, width, 1, self.out_channels // self.groups)

        # dot-product logits over the window: (batch, groups, H, W, 1, k*k)
        out = torch.matmul(q_out, k_out.transpose(-1, -2))
        out = F.softmax(out, dim=-1)

        # weighted sum of the values: (batch, groups, H, W, 1, C // groups);
        # move channels ahead of the spatial dims before flattening to (batch, C, H, W)
        out = torch.matmul(out, v_out).squeeze(-2).permute(0, 1, 4, 2, 3)
        return out.contiguous().view(batch, -1, height, width)
KinWaiCheuk commented 3 years ago

After studying the code line by line, I also have this doubt. I was wondering if the element-wise product used in this code is another type of attention mechanism.

But at least equation (2) in the paper is of the form

$$y_{ij} \;=\; \sum_{a,b \,\in\, \mathcal{N}_k(i,j)} \operatorname{softmax}_{ab}\!\left(q_{ij}^{\top} k_{ab}\right) v_{ab},$$

which can be achieved by the code proposed by @vainaijr
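
To make the difference concrete, here is a minimal, self-contained sketch of the two aggregation rules for a single output position and its local window. The shapes and names are toy choices of mine, not taken from the repo:

    import torch
    import torch.nn.functional as F

    # One output position (i, j) with a k*k = 9 window and d channels per head.
    d, win = 8, 9
    q = torch.randn(d)         # query q_ij
    k = torch.randn(win, d)    # keys k_ab for every (a, b) in the window
    v = torch.randn(win, d)    # values v_ab for every (a, b) in the window

    # Eq. (2): scalar logits q_ij^T k_ab, one softmax over the window, weighted sum of values.
    attn = F.softmax(k @ q, dim=0)        # (win,)
    y_dot = attn @ v                      # (d,)

    # Element-wise variant (as in the repo): q * k keeps a separate logit per channel,
    # so every channel gets its own softmax over the window.
    attn_ew = F.softmax(q * k, dim=0)     # (win, d)
    y_ew = (attn_ew * v).sum(dim=0)       # (d,)

In the dot-product form every window position gets one scalar weight shared by all channels, while the element-wise form gives every channel its own softmax over the window, so it really is a different (per-channel) attention mechanism.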

KinWaiCheuk commented 3 years ago

I realized that this issue is a duplicate of #10

MartinGer commented 3 years ago

Are there any updates on this issue? Even though this repo now has over 300 stars, the implementation differs from what is described in the paper and from other implementations such as https://arxiv.org/pdf/1904.09925.pdf. For example, the number of groups/heads in this implementation doesn't seem to make any difference. What confuses me is that in my tests I achieved better results with this implementation than with the ones described in the papers, and I couldn't find an explanation for that. A toy check that illustrates the groups observation is sketched below.
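
A quick numerical check shows why the group count is a no-op in the element-wise formulation but does matter once the dot product of Eq. (2) is used. Everything below is a toy sketch with made-up shapes and helper names, not the repo's AttentionConv:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    b, c, h, w, win = 1, 16, 4, 4, 9            # toy sizes; win = kernel_size ** 2
    q = torch.randn(b, c, h, w, 1)              # one query per position
    k = torch.randn(b, c, h, w, win)            # keys gathered from each local window
    v = torch.randn(b, c, h, w, win)            # values gathered from each local window

    def elementwise_attention(q, k, v, groups):
        # Repo-style: one logit per channel, softmax over the window for each channel.
        qg = q.view(b, groups, c // groups, h, w, 1)
        kg = k.view(b, groups, c // groups, h, w, win)
        vg = v.view(b, groups, c // groups, h, w, win)
        attn = F.softmax(qg * kg, dim=-1)
        return (attn * vg).sum(dim=-1).view(b, c, h, w)

    def dot_product_attention(q, k, v, groups):
        # Eq. (2)-style: logits are dot products over the channels of each group.
        qg = q.view(b, groups, c // groups, h, w, 1).permute(0, 1, 3, 4, 5, 2)    # (b, g, h, w, 1, c/g)
        kg = k.view(b, groups, c // groups, h, w, win).permute(0, 1, 3, 4, 5, 2)  # (b, g, h, w, win, c/g)
        vg = v.view(b, groups, c // groups, h, w, win).permute(0, 1, 3, 4, 5, 2)
        attn = F.softmax(torch.matmul(qg, kg.transpose(-1, -2)), dim=-1)          # (b, g, h, w, 1, win)
        out = torch.matmul(attn, vg).squeeze(-2).permute(0, 1, 4, 2, 3)           # (b, g, c/g, h, w)
        return out.contiguous().view(b, c, h, w)

    # Element-wise: identical output for any group count (grouping is only a reshape).
    print(torch.allclose(elementwise_attention(q, k, v, 1), elementwise_attention(q, k, v, 4)))   # True
    # Dot-product: the group size decides which channels share a softmax, so outputs differ.
    print(torch.allclose(dot_product_attention(q, k, v, 1), dot_product_attention(q, k, v, 4)))   # False

In the element-wise version the softmax is taken per channel, so splitting the channels into groups is only a reshape; in the dot-product version the logits are summed over the channels of each group, so the group size actually changes the attention weights.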