lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
MIT License
11.03k stars · 1.07k forks

Regarding L2 norm clamping in Diffusion Prior #68

Closed xiankgx closed 2 years ago

xiankgx commented 2 years ago

Why do we apply the L2 norm clamping only during sampling and not during training? Shouldn't the two match? Please enlighten me.

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L843-L844

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L859-L860

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L885-L900
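For context, here is a minimal sketch of what the sampling-time clamp in question does (function names and the scale are hypothetical stand-ins, not the repo's actual code): after each step the predicted image embedding is re-projected onto the hypersphere at a fixed scale, whereas during training no such projection is applied to the network's prediction.

```python
import torch

def l2norm(t: torch.Tensor) -> torch.Tensor:
    # project each row onto the unit hypersphere
    return t / t.norm(dim=-1, keepdim=True).clamp(min=1e-12)

# hypothetical sampling-time clamp: after the prior predicts x_start
# (the denoised image embedding), re-normalize it so its L2 norm matches
# the scale that CLIP image embeddings are assumed to live at
def clamp_pred_embed(x_start: torch.Tensor, image_embed_scale: float) -> torch.Tensor:
    return l2norm(x_start) * image_embed_scale

pred = torch.randn(2, 512)  # stand-in for the prior's x_start prediction
clamped = clamp_pred_embed(pred, image_embed_scale=512 ** 0.5)
```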

xiankgx commented 2 years ago

Also, here we multiply by a scale without first applying l2norm:

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L986

which is fine if we use XClip, because we apply l2norm here:

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L180

But we are not applying l2norm when using OpenAI CLIP:

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L274-L275
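To illustrate the point (a sketch with stand-in tensors and a stand-in scale, not the repo's actual code): multiplying an un-normalized embedding by the scale leaves the resulting norms dependent on the raw embedding, while applying l2norm first pins every norm to exactly the scale.

```python
import torch

def l2norm(t: torch.Tensor) -> torch.Tensor:
    return t / t.norm(dim=-1, keepdim=True).clamp(min=1e-12)

image_embed = torch.randn(4, 512)  # stand-in for a CLIP image embedding
scale = 4.0                        # stand-in for the scaling factor in question

# without l2norm first, the resulting norms vary with the raw embedding
scaled_raw = image_embed * scale

# with l2norm first, every embedding ends up with norm exactly `scale`
scaled_unit = l2norm(image_embed) * scale
```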

lucidrains commented 2 years ago

@xiankgx good idea! i've added it here https://github.com/lucidrains/DALLE2-pytorch/commit/14e63a3f67674435a1a15b45e170c6a1146484d3 although i think the whole l2norm clamping thing is not proven out yet

lucidrains commented 2 years ago

> Also, here we multiply with a scale without first doing l2norm.
>
> https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L986
>
> which is ok if we use XClip because we are doing l2norm here.
>
> https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L180
>
> But, we are not doing l2norm when using OpenAI CLIP.
>
> https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L213

ohh, this isn't OpenAIClip, it is actually from CoCa (https://arxiv.org/abs/2205.01917), which debuted yesterday. i think it is a better version of CLIP

however, it is unclear from the CoCa paper whether they apply l2norm for the cosine-similarity contrastive learning

in the paper, they seem to use layernorms on both the image and text CLS tokens, but i'm not sure the l2norm is present
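the distinction matters because the two normalizations behave differently; a quick torch sketch (illustrative only): layernorm gives zero mean and unit variance per vector, so its L2 norm comes out near sqrt(d), while l2norm gives unit L2 norm, which is what a cosine-similarity loss assumes.

```python
import torch
import torch.nn.functional as F

d = 64
x = torch.randn(3, d)

# layernorm: zero mean, unit variance per vector -> L2 norm ~= sqrt(d), not 1
ln_out = F.layer_norm(x, normalized_shape=(d,))

# l2norm: unit L2 norm per vector, as cosine similarity expects
l2_out = F.normalize(x, dim=-1)
```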

xiankgx commented 2 years ago

Sorry, wrong line quote.

xiankgx commented 2 years ago

Lol, don't take my word for it, I'm a newbie in diffusion models.

lucidrains commented 2 years ago

> newbie

@xiankgx same, i think we all are, except for a few researchers around the world and maybe @crowsonkb lol

you are right! https://github.com/openai/CLIP/blob/main/clip/model.py#L364 they normalize it outside of the encoding functions, let me fix it now :pray:
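in other words (a sketch, with a random tensor standing in for the actual `encode_image` output): OpenAI CLIP's `encode_image` / `encode_text` return un-normalized embeddings, and the division by the norm happens later in `CLIP.forward`, so any adapter that calls the encode functions directly has to reproduce that normalization itself.

```python
import torch

# stand-in for clip_model.encode_image(images), which returns
# un-normalized features in the OpenAI CLIP implementation
image_features = torch.randn(3, 512)

# the normalization that CLIP.forward would otherwise apply
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
```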

xiankgx commented 2 years ago

Maybe we can ask crowsonkb for advice.

lucidrains commented 2 years ago

https://github.com/lucidrains/DALLE2-pytorch/releases/tag/0.1.4