Closed: apeguero1 closed this issue 4 years ago
Hi @apeguero1, thank you for pointing this out!
It seems as though you are correct. Looking at the code, I don't need the `w_q`, `w_k`, and `w_v` matrices if I make sure that, for each head, there is a `to_q`, `to_k`, and `to_v` matrix that is an `nn.Linear(channels, dim)` layer. This would mean that instead of having only one `to_q`, `to_k`, and `to_v` matrix for each `MHAttention`, there would be `nhead` of these matrices. Essentially, the `LinearAttentionHead` module would then just become a matrix multiplication with no learnable parameters, besides the `E` and `F` layers.
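For reference, here is a rough sketch of what that restructuring could look like. This is not the actual repo code; the names and shapes (`channels`, `dim`, `seq_len`, `proj_k`, `nhead`) are assumptions for illustration. Each head owns its own `to_q`/`to_k`/`to_v` as `nn.Linear(channels, dim)`, and `LinearAttentionHead` keeps only the `E` and `F` sequence-length projections as parameters:

```python
import torch
import torch.nn as nn


class LinearAttentionHead(nn.Module):
    """Sketch: an attention head with no learnable q/k/v weights of its own.

    Only the E and F sequence-length projections carry parameters here;
    q, k, v arrive already projected by the per-head to_q/to_k/to_v layers.
    """
    def __init__(self, dim, E_proj, F_proj):
        super().__init__()
        self.E = E_proj   # nn.Linear(seq_len, proj_k): projects keys along the sequence dim
        self.F = F_proj   # nn.Linear(seq_len, proj_k): projects values along the sequence dim
        self.dim = dim

    def forward(self, q, k, v):
        # k, v: (batch, seq_len, dim) -> project the sequence dimension down to proj_k
        k = self.E(k.transpose(1, 2)).transpose(1, 2)          # (batch, proj_k, dim)
        v = self.F(v.transpose(1, 2)).transpose(1, 2)          # (batch, proj_k, dim)
        scores = torch.matmul(q, k.transpose(1, 2)) / (self.dim ** 0.5)
        attn = torch.softmax(scores, dim=-1)                   # (batch, seq_len, proj_k)
        return torch.matmul(attn, v)                           # (batch, seq_len, dim)


class MHAttention(nn.Module):
    """Sketch: one to_q/to_k/to_v Linear per head instead of a single shared one."""
    def __init__(self, channels, dim, seq_len, proj_k, nhead):
        super().__init__()
        self.heads = nn.ModuleList()
        self.to_q = nn.ModuleList()
        self.to_k = nn.ModuleList()
        self.to_v = nn.ModuleList()
        for _ in range(nhead):
            self.to_q.append(nn.Linear(channels, dim))
            self.to_k.append(nn.Linear(channels, dim))
            self.to_v.append(nn.Linear(channels, dim))
            self.heads.append(LinearAttentionHead(dim,
                                                  nn.Linear(seq_len, proj_k),
                                                  nn.Linear(seq_len, proj_k)))
        self.w_o = nn.Linear(dim * nhead, channels)

    def forward(self, x):
        # x: (batch, seq_len, channels)
        outs = [head(q(x), k(x), v(x))
                for head, q, k, v in zip(self.heads, self.to_q, self.to_k, self.to_v)]
        return self.w_o(torch.cat(outs, dim=-1))
```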
I will make a pull request with this change later; let me know what you think!
Great, sounds good, I'll take a look!
Check it out and see what you think, let me know if you see any errors :+1:
The only problem is that training takes longer. I don't know if that's just my computer not running well, so if you see something that could affect that, please let me know! <- Never mind that
Yep looks good to me. Thanks for the quick response! (:
No prob :) Merged, and the latest version, `0.10.0`, is available on pip
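For anyone landing here later: assuming the package is published under the same name as the repo (an assumption on my part), that release should be installable with `pip install linformer-pytorch==0.10.0`.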
Hey @tatp22 great repo!
I'm having trouble wrapping my head around the `w_q`, `w_k`, and `w_v` linear layers in the `LinearAttentionHead` module. Are they needed? There's no activation between the previous linear layers, `to_q`, `to_k`, and `to_v` in `MHAttention`, and those weights, so they wouldn't add any expressivity to the model, since you would just be multiplying two matrices together, which is equivalent to one linear layer. The `E` and `F` projections also seem to be composed with `w_k` and `w_v` without a non-linearity. Looking at Eq. 7 from the paper, your implementation seems correct though.
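To make the "two matrices collapse into one" point concrete, here is a tiny self-contained check with toy dimensions (bias omitted for simplicity; this is not the repo's code, just an illustration of the algebra):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear maps with no activation in between...
to_q = nn.Linear(64, 32, bias=False)   # stands in for to_q in MHAttention
w_q = nn.Linear(32, 32, bias=False)    # stands in for w_q in LinearAttentionHead

# ...collapse into a single linear map whose weight is the product of the two.
combined = nn.Linear(64, 32, bias=False)
combined.weight.data = w_q.weight @ to_q.weight

x = torch.randn(8, 64)
print(torch.allclose(w_q(to_q(x)), combined(x), atol=1e-5))  # True
```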
Any thoughts on this?