XavierXiao (issue closed 2 years ago):
In the Attention block, the projection to k and v is defined as
self.to_kv = nn.Linear(dim, dim_head * 2, bias = False)
So the parameters are shared across heads? I think this is not the case in common multi-head attention. Are there any references?
@XavierXiao it is from https://arxiv.org/abs/1911.02150 (multi-query attention) and was adopted for PaLM (https://arxiv.org/abs/2204.02311). AlphaCode uses it too.
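For illustration, here is a minimal sketch of the idea, assuming hyperparameter names (dim, heads, dim_head) like those in the repo's Attention block; it is not the repo's exact implementation. Queries keep one projection per head, while a single key/value head is shared across all query heads:

```python
import torch
from torch import nn, einsum

class MultiQueryAttention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        self.heads = heads
        self.scale = dim_head ** -0.5
        # queries get one projection per head ...
        self.to_q = nn.Linear(dim, dim_head * heads, bias = False)
        # ... but keys and values are projected to a single head, shared by all query heads
        self.to_kv = nn.Linear(dim, dim_head * 2, bias = False)
        self.to_out = nn.Linear(dim_head * heads, dim, bias = False)

    def forward(self, x):
        b, n, _ = x.shape
        h = self.heads

        q = self.to_q(x).reshape(b, n, h, -1).transpose(1, 2)   # (b, h, n, d)
        k, v = self.to_kv(x).chunk(2, dim = -1)                 # (b, n, d) each, no head dimension

        # every query head attends to the same shared keys and values
        sim = einsum('b h i d, b j d -> b h i j', q, k) * self.scale
        attn = sim.softmax(dim = -1)
        out = einsum('b h i j, b j d -> b h i d', attn, v)

        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)
```

The motivation in the papers above is mostly decoding speed and memory: the k/v projections (and the k/v cache at inference time) shrink by a factor of heads, while quality stays close to standard multi-head attention.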