openai / guided-diffusion


Bug in attention? #38

Open LucasSloan opened 2 years ago

LucasSloan commented 2 years ago

The scale is being calculated here as 1 / math.sqrt(math.sqrt(ch)). The comment says it was adapted from the attention implementation here, where the scale is int(C) ** (-0.5), which is 1 / math.sqrt(ch), not 1 / math.sqrt(math.sqrt(ch)).

Is this change to use 2 square roots intentional?

unrealwill commented 2 years ago

Hello. At first it seems like a bug, but it isn't one, because the scale appears twice in the einsum product: https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/unet.py#L348-L351
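
For illustration, the linked lines do something like the following (a rough sketch with made-up shapes, not the exact file contents):

```python
import math
import torch as th

bs, ch, length = 2, 64, 16            # illustrative shapes: batch, per-head channels, tokens
q = th.randn(bs, ch, length)
k = th.randn(bs, ch, length)

scale = 1 / math.sqrt(math.sqrt(ch))  # ch ** -0.25
# The scale multiplies q and k separately, so it enters the product twice.
weight = th.einsum("bct,bcs->bts", q * scale, k * scale)
```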

An extra comment would have been welcome, because it looks like a typo at first sight.

So we are computing the right thing: Q K^T / math.sqrt(d_k) = (Q / math.sqrt(math.sqrt(d_k))) * (K^T / math.sqrt(math.sqrt(d_k)))
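
A quick numerical check of that identity (a toy example, not repo code):

```python
import numpy as np

d_k = 64
q = np.random.randn(d_k)
k = np.random.randn(d_k)

lhs = (q / d_k ** 0.25) @ (k / d_k ** 0.25)  # scale applied to q and k separately
rhs = (q @ k) / d_k ** 0.5                   # scale applied once to the product
assert np.isclose(lhs, rhs)                  # identical up to float rounding
```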

The justification for the scaling factor is given in the "Attention Is All You Need" paper: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

In the footnote of the above paper: "To illustrate why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, q · k = ∑_{i=1}^{d_k} q_i k_i, has mean 0 and variance d_k."
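
A quick Monte Carlo check of that footnote (a toy illustration with independent standard-normal components, not repo code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 64, 100_000
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

dots = np.einsum("nd,nd->n", q, k)  # n_samples independent dot products q · k
print(dots.mean(), dots.var())      # mean ~ 0, variance ~ d_k = 64
```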

This probably means that to really do the right thing, q and k should have unit variance, which probably isn't the case here because the block normalizes with a GroupNorm(32):
https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/unet.py#L302
https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/nn.py#L93-L100

So depending on the number of heads and the number of channels, the per-head variance of q and k may not be 1 (this would need further checking).
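
A toy way to see this, with randomly initialized layers standing in for the real block (so purely illustrative, not the trained model):

```python
import torch as th
import torch.nn as nn

channels, heads, length = 128, 4, 256
x = th.randn(8, channels, length)

norm = nn.GroupNorm(32, channels)            # same kind of normalization as the block
qkv = nn.Conv1d(channels, 3 * channels, 1)   # stand-in for the qkv projection

h = qkv(norm(x))
q, k, v = h.chunk(3, dim=1)
q = q.reshape(8 * heads, channels // heads, length)
print(q.var().item())  # generally not 1: it depends on the projection weights
```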

LucasSloan commented 2 years ago

Got it - thanks for the info!