XavierXiao (issue closed 2 years ago):
In the Attention block, the projection to k and v is defined as
self.to_kv = nn.Linear(dim, dim_head * 2, bias = False)
So the parameters are shared across heads? I think this is not the case in common multi-head attention. Are there any references?
@XavierXiao it is from https://arxiv.org/abs/1911.02150 (multi-query attention) and was adopted for PaLM (https://arxiv.org/abs/2204.02311). AlphaCode uses it too.
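For illustration, here is a minimal sketch of the idea, assuming hyperparameter names (dim, heads, dim_head) like those in the repo's Attention block; it is not the repo's exact implementation. Queries keep one projection per head, while a single key/value head is shared across all query heads:

```python
import torch
from torch import nn, einsum

class MultiQueryAttention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        self.heads = heads
        self.scale = dim_head ** -0.5
        # queries get one projection per head ...
        self.to_q = nn.Linear(dim, dim_head * heads, bias = False)
        # ... but keys and values are projected to a single head, shared by all query heads
        self.to_kv = nn.Linear(dim, dim_head * 2, bias = False)
        self.to_out = nn.Linear(dim_head * heads, dim, bias = False)

    def forward(self, x):
        b, n, _ = x.shape
        h = self.heads

        q = self.to_q(x).reshape(b, n, h, -1).transpose(1, 2)   # (b, h, n, d)
        k, v = self.to_kv(x).chunk(2, dim = -1)                 # (b, n, d) each, no head dimension

        # every query head attends to the same shared keys and values
        sim = einsum('b h i d, b j d -> b h i j', q, k) * self.scale
        attn = sim.softmax(dim = -1)
        out = einsum('b h i j, b j d -> b h i d', attn, v)

        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)
```

The motivation in the papers above is mostly decoding speed and memory: the k/v projections (and the k/v cache at inference time) shrink by a factor of heads, while quality stays close to standard multi-head attention.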