Sorry for the late reply.
A1: Qk refers to the k-th session; the h here is the multi-head split used in self-attention.
A2: They are the same. We want the bias encoding to distinguish different sessions, so the self-attention parameters are shared across all sessions. We did run experiments on this, though, and the impact was not particularly large.
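(For anyone reading later: a minimal numpy sketch of the bias encoding as I understand it from the paper, where BE(k, t, c) = w^K_k + w^T_t + w^C_c. The variable names and shapes are illustrative only, not taken from this repo.)

```python
import numpy as np

K, T, d = 4, 10, 32          # sessions, items per session, embedding size (example values)

# One learnable bias per session index, per position inside a session, per embedding unit
w_session = np.random.randn(K, 1, 1)   # w^K_k: tells sessions apart
w_position = np.random.randn(1, T, 1)  # w^T_t: tells positions inside a session apart
w_unit = np.random.randn(1, 1, d)      # w^C_c: tells embedding units apart

# Bias encoding BE[k, t, c] = w^K_k + w^T_t + w^C_c, broadcast to shape (K, T, d)
bias_encoding = w_session + w_position + w_unit

# Added to the stacked session behavior embeddings before the (shared) self-attention
session_embeddings = np.random.randn(K, T, d)
session_input = session_embeddings + bias_encoding
```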
@649435349 Thanks for your clarification. So I guess the question here is: I agree that the self-attention parameters are the same across different sessions. But in the original implementation of self-attention in "Attention Is All You Need", the parameters for different heads are different, whereas in your paper and implementation the parameters for all heads are the same. As far as I can see, you divide the embedding space into several parts and then apply the same head to each part. So do you mean that different parts of the embedding should not be mixed together, or do you have some other motivation here?
Well, it could be an unintentional error. In fact, I implemented self-attention in the same way as "Attention Is All You Need", except for the bias encoding.
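(To make the difference being discussed concrete, here is a minimal numpy sketch of the two variants; it is an illustration only, not the repo's TensorFlow code.)

```python
import numpy as np

T, d, n_heads = 10, 32, 4
d_h = d // n_heads
x = np.random.randn(T, d)   # one session's behavior embeddings

# Standard multi-head attention ("Attention Is All You Need"):
# each head i has its own projection W_i^Q applied to the *full* embedding
W_q_per_head = [np.random.randn(d, d_h) for _ in range(n_heads)]
Q_standard = np.stack([x @ W for W in W_q_per_head])           # (n_heads, T, d_h)

# Variant described in the question: split the embedding into parts first,
# then apply the *same* small projection to every slice
W_q_small = np.random.randn(d_h, d_h)
x_slices = x.reshape(T, n_heads, d_h).transpose(1, 0, 2)       # (n_heads, T, d_h)
Q_shared_head = np.stack([s @ W_q_small for s in x_slices])    # same W for every head
```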
I see. Thanks for your reply!
Hi, I have two questions below and hope you can answer them.
First: In your paper, Qk should refer to the user's k-th session matrix, but later in the paper Qk seems to mean something different. Could you explain what it specifically refers to?
Second: I have read the "Attention Is All You Need" paper, and I think the parameter matrices should be different across sessions, yet in your paper it looks like all the W^Q parameter matrices are the same.