lucidrains / imagen-pytorch

Implementation of Imagen, Google's Text-to-Image Neural Network, in PyTorch
MIT License

Why is there only one head for key and value? #255

Closed XavierXiao closed 2 years ago

XavierXiao commented 2 years ago

In the Attention block, the projection to k and v is defined as `self.to_kv = nn.Linear(dim, dim_head * 2, bias = False)`

So the key/value parameters are shared across heads? I don't think this is how common multi-head attention works. Are there any references?

lucidrains commented 2 years ago

@XavierXiao it is from https://arxiv.org/abs/1911.02150 and was adopted for PaLM https://arxiv.org/abs/2204.02311 . AlphaCode too
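For anyone landing here, the scheme in the linked paper (Shazeer's "multi-query attention") can be sketched roughly as below. This is a minimal illustration of the idea, not the repo's actual Attention module: queries keep one projection per head, while a single key/value head is broadcast across all query heads.

```python
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    # Sketch of multi-query attention (https://arxiv.org/abs/1911.02150):
    # per-head queries, but one shared head of keys and values.
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        self.heads = heads
        self.dim_head = dim_head
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim, dim_head * heads, bias = False)
        # only dim_head * 2 outputs: a single key/value head, as in the issue
        self.to_kv = nn.Linear(dim, dim_head * 2, bias = False)
        self.to_out = nn.Linear(dim_head * heads, dim, bias = False)

    def forward(self, x):
        b, n, _ = x.shape
        # queries: (batch, heads, seq, dim_head)
        q = self.to_q(x).view(b, n, self.heads, self.dim_head).transpose(1, 2)
        # keys / values: (batch, seq, dim_head), no head dimension
        k, v = self.to_kv(x).chunk(2, dim = -1)
        # the single k/v head is broadcast over the query-head dimension
        sim = torch.einsum('b h i d, b j d -> b h i j', q, k) * self.scale
        attn = sim.softmax(dim = -1)
        out = torch.einsum('b h i j, b j d -> b h i d', attn, v)
        out = out.transpose(1, 2).reshape(b, n, self.heads * self.dim_head)
        return self.to_out(out)
```

The practical win is a much smaller key/value cache at inference time (one head instead of `heads`), which is why PaLM and AlphaCode picked it up; quality is reported to be close to full multi-head attention.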