lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
MIT License

What is the purpose of the Prior Network? #125

Closed egeozsoy closed 2 years ago

egeozsoy commented 2 years ago

CLIP is trained to bring the image and text embeddings of an image-text pair as close to each other as possible. So the question is, why can't we just skip the prior network and directly feed the CLIP text embeddings to the decoder network? The prior network tries to predict CLIP image embeddings from CLIP text embeddings, but by the definition of CLIP training, these two should already be very close to each other. Am I missing something?

rom1504 commented 2 years ago

CLIP training does not try to make the text and image embeddings close in the embedding space. It only tries to make matching pairs closer than non-matching ones, i.e. CLIP is trained for ranking, not really for aligning the modalities in latent space.
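
For intuition (not code from this repo), here is a minimal sketch of the symmetric contrastive (InfoNCE) loss CLIP is trained with, on toy random embeddings. Note that it only pushes matching pairs to out-rank non-matching ones; nothing forces the paired image and text embeddings to coincide:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for a batch of CLIP image/text embeddings (hypothetical data).
image_emb = F.normalize(torch.randn(8, 512), dim=-1)
text_emb = F.normalize(torch.randn(8, 512), dim=-1)

temperature = 0.07
logits = image_emb @ text_emb.t() / temperature  # pairwise cosine similarities

# Symmetric InfoNCE: the diagonal (matching pairs) must out-rank the
# off-diagonal (non-matching pairs). Nothing forces image_emb[i] to equal
# text_emb[i], so the two modalities can remain far apart in the space.
labels = torch.arange(logits.size(0))
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```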

rom1504 commented 2 years ago

In practice, however, you can indeed plug the text embeddings directly into the decoder; OpenAI reports trying it and it works, but worse than using the prior.
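
As a rough illustration of the two decoding paths, here is a sketch with hypothetical stand-ins for a trained prior and decoder (placeholder functions, not this repo's API):

```python
import torch

# Hypothetical stand-ins for a trained prior and decoder (not this repo's API).
def prior(text_emb: torch.Tensor) -> torch.Tensor:
    # Predicts a CLIP image embedding from a CLIP text embedding.
    return text_emb  # placeholder

def decoder(image_emb: torch.Tensor) -> torch.Tensor:
    # Generates an image conditioned on a CLIP image embedding.
    return torch.zeros(3, 256, 256)  # placeholder

text_emb = torch.randn(512)

# Option A (the ablation mentioned above): condition the decoder directly
# on the text embedding. It works, but quality is worse.
image_a = decoder(text_emb)

# Option B (the full DALL-E 2 pipeline): translate the text embedding into a
# predicted image embedding with the prior, then decode. Works better.
image_b = decoder(prior(text_emb))
```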

egeozsoy commented 2 years ago

Thank you for the quick answer, that makes a lot of sense. Do you have a link for the second thing you mentioned, that "openai reports trying it and it works"?

egeozsoy commented 2 years ago

Apparently it is directly in their paper. Refer to section 5.1. Thanks for pointing this out @rom1504

Weixin-Liang commented 2 years ago

Just want to share our paper, which might help answer this question:

CLIP is trained to bring the image and text embeddings of an image-text pair as close to each other as possible. So the question is, why can't we just skip the prior network, and directly input the clip text embeddings to the decoder network? Prior network tries to predict clip image embeddings from clip text embeddings, but by definition (clip training), these two should be already very close to each other. Am I missing something?

arXiv:2203.02053 uploaded on 3 Mar 2022

[Figure 1: The pervasive modality gap in multi-modal contrastive representation learning]

In the paper, we present modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. As shown in Figure 1(b), CLIP’s image embeddings and text embeddings are located in two completely separate regions of the embedding space. We find this phenomenon consistently across various multi-modal models, covering texts, natural images [Radford et al., 2021], videos [Xu et al., 2021], medical images [Zhang et al., 2020], and amino-acid sequences [EleutherAI]. Interestingly, this phenomenon still holds even when we embed using multi-modal models with random weights (Figure 1 (c)).

Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. In model initialization, we show empirically and theoretically that the representations of a common deep neural network are restricted to a narrow cone. As a consequence, in a multi-modal model with two encoders, the representations of the two modalities are clearly separated when the model is initialized. During optimization, contrastive learning keeps the different modalities separated by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance can significantly improve the model's downstream zero-shot classification performance and fairness.
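
For readers who want to check this on their own embeddings, here is a minimal sketch on toy random data, assuming the paper's centroid-difference notion of the gap (the difference between the mean image embedding and the mean text embedding):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for CLIP embeddings of N paired images and texts.
# The constant shift mimics the modality gap seen in real CLIP embeddings.
image_emb = F.normalize(torch.randn(1000, 512), dim=-1)
text_emb = F.normalize(torch.randn(1000, 512) + 2.0, dim=-1)

# Modality gap vector: difference between the two modalities' centroids.
gap = image_emb.mean(dim=0) - text_emb.mean(dim=0)
print("modality gap magnitude:", gap.norm().item())
```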

Hope that it will be helpful for future readers! arXiv:2203.02053

https://modalitygap.readthedocs.io/

REBELa23187 commented 1 year ago

Cool, this is great work! Just curious: do you think it's possible to eliminate, or nearly eliminate, this modality gap, so that the blue and red dots group together and interleave?