leaderj1001 / Stand-Alone-Self-Attention

Implementing Stand-Alone Self-Attention in Vision Models using Pytorch
MIT License

The wrong implementation of the inner-product operation #10

Open XiaLiPKU opened 4 years ago

XiaLiPKU commented 4 years ago

In Equation 2 of the paper, the query and the key are combined by an inner-product (dot-product) operation, not an element-wise (pointwise) multiplication.

So the following line https://github.com/leaderj1001/Stand-Alone-Self-Attention/blob/e0a168ef8d4a7b93ae706a7d7c68b982e112821e/attention.py#L48 should be `out = (q_out * k_out).sum(dim=2)`.
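
To make the difference concrete, here is a minimal sketch; the shapes are placeholders roughly matching the ones built in forward() (K = kernel_size ** 2 neighborhood positions). The current line produces one logit per channel and position, while the sum over dim=2 produces one logit per head and position, as in Eq. 2:

```python
import torch

# Placeholder sizes: batch, groups (heads), channels per group, height, width,
# and K = kernel_size ** 2 neighborhood positions.
batch, groups, cpg, H, W, K = 2, 4, 8, 5, 5, 9

q_out = torch.randn(batch, groups, cpg, H, W, 1)
k_out = torch.randn(batch, groups, cpg, H, W, K)

# Current attention.py#L48: element-wise product keeps the channel dimension,
# so the later softmax yields a separate distribution for every channel.
out_elementwise = q_out * k_out              # (batch, groups, cpg, H, W, K)

# Proposed: sum over the channel dimension (dim=2), i.e. the inner product of
# Eq. 2, giving a single scalar logit per head and neighborhood position.
out_inner = (q_out * k_out).sum(dim=2)       # (batch, groups, H, W, K)

print(out_elementwise.shape)  # torch.Size([2, 4, 8, 5, 5, 9])
print(out_inner.shape)        # torch.Size([2, 4, 5, 5, 9])
```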

20171130 commented 4 years ago

I found the same problem. It seems the implementation in the code is equivalent to having #attention heads = #embed dimensions.
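
A quick check of that claim with placeholder tensors (the sizes are arbitrary, just for illustration): reshaping so that every channel is its own head of size 1 reproduces the repo's element-wise behaviour exactly, since the inner product over a size-1 head dimension is just the element-wise product.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, c, H, W, K = 1, 6, 4, 4, 9

q = torch.randn(b, c, H, W, 1)
k = torch.randn(b, c, H, W, K)

# Repo behaviour: element-wise product, then softmax over the K neighborhood
# positions independently for every channel.
attn_repo = F.softmax(q * k, dim=-1)                      # (b, c, H, W, K)

# Same computation phrased as "one head per channel": each head has size 1,
# so summing over the head dimension changes nothing numerically.
q_heads = q.view(b, c, 1, H, W, 1)
k_heads = k.view(b, c, 1, H, W, K)
attn_heads = F.softmax((q_heads * k_heads).sum(dim=2), dim=-1)

print(torch.allclose(attn_repo, attn_heads))  # True
```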

ifeherva commented 4 years ago

@XiaLiPKU How would that modify lines 49 and 50?
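
My guess, sketched with placeholder shapes and not tested against the repo: the softmax would stay on dim=-1, and the einsum would drop the channel index from its first operand so one attention weight is broadcast across all channels of a head.

```python
import torch
import torch.nn.functional as F

# Placeholder shapes mimicking forward() (K = kernel_size ** 2).
batch, groups, cpg, H, W, K = 2, 4, 8, 5, 5, 9
q_out = torch.randn(batch, groups, cpg, H, W, 1)
k_out = torch.randn(batch, groups, cpg, H, W, K)
v_out = torch.randn(batch, groups, cpg, H, W, K)

# L48 (proposed): one scalar logit per head and neighborhood position.
out = (q_out * k_out).sum(dim=2)                  # (batch, groups, H, W, K)
# L49: the softmax still runs over the neighborhood dimension.
out = F.softmax(out, dim=-1)
# L50: drop the channel index from the first einsum operand; the single
# attention weight per position is broadcast over the head's channels.
out = torch.einsum('bnhwk,bnchwk -> bnchw', out, v_out).view(batch, -1, H, W)

print(out.shape)  # torch.Size([2, 32, 5, 5])
```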

canaltin commented 4 years ago

@20171130 That was my first impression as well, but then there is an inconsistency in how "groups" is defined (to replicate the "attention heads") between the paper and the code.

Anyway, your alternative implementation helped me understand the general concept: https://github.com/20171130/AttentionLite/blob/master/model.py