Open · zziC7 opened this issue 1 month ago
Hello,

I noticed that in your code, the q, k, v projection is:

self.W_q = nn.Linear(d_model, 2 * self.d_head * num_heads, bias=False)

However, in another repository I found that they compute q, k, v as:

self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)

(code from this link). The shape difference leads to differences in the subsequent differential attention computation, so I wonder which version is the method in the paper, or whether the two are just different ways of writing the same thing.

Thanks.

Hello,

The difference you've noticed stems from two different parameterizations of the same mathematical formulation.

In the paper's notation (Equation 1), we have [Q₁; Q₂] = XW^Q, where W^Q ∈ ℝ^(d_model × 2d). This repository implements it as:

self.W_q = nn.Linear(d_model, 2 * self.d_head * num_heads, bias=False)

This directly implements the paper's formulation: a single projection matrix outputs the concatenated Q₁ and Q₂ in one operation.
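To make the equivalence concrete, here is a minimal sketch (not taken from either repository) under the assumption that d_model equals the other snippet's embed_dim and that d_head = d_model // num_heads // 2, so that 2 * d_head * num_heads == embed_dim. Under that assumption the two Linear layers have identical weight shapes, and splitting the output into Q₁ and Q₂ is just a reshape; the sizes and the particular reshape convention below are illustrative, not necessarily what either repository uses.

```python
import torch
import torch.nn as nn

# Illustrative sizes. The relation d_head = d_model // num_heads // 2 is an
# assumption made so that the two parameterizations line up exactly.
d_model = 512                         # also plays the role of embed_dim
num_heads = 8                         # number of differential heads
d_head = d_model // num_heads // 2    # per-component head dimension -> 32

x = torch.randn(2, 16, d_model)       # (batch, seq_len, d_model)

# Style A: fused projection, i.e. the paper's Eq. (1): [Q1; Q2] = X W^Q
W_q = nn.Linear(d_model, 2 * d_head * num_heads, bias=False)

# Style B: "square" projection, as in the other repository's snippet
q_proj = nn.Linear(d_model, d_model, bias=False)

# Under the assumption above, 2 * d_head * num_heads == d_model, so the two
# weight matrices have the same shape and one can be loaded into the other.
with torch.no_grad():
    q_proj.weight.copy_(W_q.weight)

qa = W_q(x)      # (2, 16, 2 * d_head * num_heads)
qb = q_proj(x)   # (2, 16, d_model) -- numerically identical to qa
assert torch.allclose(qa, qb)

# Either output is split into Q1 and Q2 per head with a reshape; grouping the
# last dimension as (num_heads, 2, d_head) is one possible convention.
q1, q2 = qa.view(2, 16, num_heads, 2, d_head).unbind(dim=3)
print(q1.shape, q2.shape)   # torch.Size([2, 16, 8, 32]) twice
```

In other words, whether the output width is written as 2 * d_head * num_heads or as embed_dim is only a difference in how the head bookkeeping is expressed; what matters for matching the paper is how the projected tensor is subsequently split into the Q₁ and Q₂ halves before the differential attention is computed.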