openai / CLIP

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
MIT License
24.55k stars 3.2k forks

Issue with Text Encoder Output Dimensions in Fine-Tuned CLIP Model When Using with Stable Diffusion #452

Open QXGeraldMo opened 2 months ago

QXGeraldMo commented 2 months ago

I'm running into a shape mismatch with the text encoder output of a fine-tuned CLIP model. My model is based on RN50, and its text encoder output has shape (1, 1024), whereas the output of CLIPTextModel in transformers has shape (1, 77, 768). This makes it difficult to integrate with Stable Diffusion, because the encoder_hidden_states parameter of the U-Net expects an embedding of shape (x, x, 768).

Can someone help me with this? Would truncating the output of encode_text to (1, 1, 768) resolve the issue?
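
To make the shapes in question concrete, here is a minimal, shape-only sketch. It uses nested Python lists as stand-ins for tensors so it runs without PyTorch. The helper names `shape` and `fits_unet` and the parameter `cross_attention_dim` are hypothetical and exist only for this illustration; they are not part of CLIP or of any Stable Diffusion API.

```python
# Shape-only sketch: nested lists stand in for tensors (no PyTorch needed).
# `shape`, `fits_unet`, and `cross_attention_dim` are illustrative names only.

def shape(x):
    """Return the shape of a rectangular, non-empty nested list as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


def fits_unet(s, cross_attention_dim=768):
    """True if shape `s` has the form (batch, seq_len, cross_attention_dim)."""
    return len(s) == 3 and s[-1] == cross_attention_dim


# Stand-in for the fine-tuned RN50 text encoder output: one vector per prompt.
pooled = [[0.0] * 1024]                          # shape (1, 1024)

# Stand-in for the CLIPTextModel output: one 768-d vector per token position.
per_token = [[[0.0] * 768 for _ in range(77)]]   # shape (1, 77, 768)

# The truncation proposed above: keep the first 768 values of each vector
# and add a sequence axis of length 1.
truncated = [[row[:768]] for row in pooled]      # shape (1, 1, 768)

for name, t in [("pooled", pooled), ("per_token", per_token), ("truncated", truncated)]:
    print(name, shape(t), fits_unet(shape(t)))
```

The truncated stand-in passes the shape check, but that only shows the dimensions line up. It says nothing about whether a single truncated vector carries the information the U-Net expects from 77 per-token vectors.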

supersonicMaclaurin commented 1 month ago

same question