Confused with image_features minus text_features

instantX-research / InstantStyle

InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation 🔥

https://instantstyle.github.io/


Confused with image_features minus text_features #48

Open CharlesGong12 opened 4 months ago

CharlesGong12 commented 4 months ago

Hi, thanks for your amazing work! I am confused by the subtraction image_features minus text_features. The image features are encoded by CLIPVisionModelWithProjection, but the text features are encoded by CLIPTextModel, which has no projection layer. So why can we directly compute image_features minus text_features? It seems that the image features and the text features are not in the same space.

CharlesGong12 commented 4 months ago

The SD pipeline's encode_prompt is here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L302

haofanwang commented 4 months ago

Good eye! Thanks for your feedback! @CharlesGong12

To clarify: for the SDXL model, the second text encoder is CLIPTextModelWithProjection, and text_features is the pooled feature taken only from this second encoder. For the SD1.5 model, the text encoder is indeed a CLIPTextModel, so in our inference code its text_features is extracted manually, as here.

Hope this helps. Please let me know if you have further questions.
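For anyone following along, here is a toy sketch of why the projection matters. It is not the InstantStyle code. The sizes, the random weights, and the helper names (random_matrix, project, subtract) are all hypothetical stand-ins. The idea is only that the pooled outputs of the two towers may have different widths, and that the projection heads of the "WithProjection" variants map both into one shared dimension where a subtraction is well defined.

```python
import random

random.seed(0)

# Hypothetical toy sizes; real CLIP towers are far wider.
VISION_HIDDEN, TEXT_HIDDEN, SHARED_DIM = 6, 4, 3


def random_matrix(rows, cols):
    """Stand-in for a learned weight matrix of shape (rows, cols)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(vec, weight):
    """Bias-free linear projection: vec (in_dim) times weight (in_dim, out_dim)."""
    if len(vec) != len(weight):
        raise ValueError("input size does not match projection weight")
    out_dim = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(out_dim)]


def subtract(a, b):
    """Element-wise a - b; only defined when both vectors have the same size."""
    if len(a) != len(b):
        raise ValueError(f"cannot subtract vectors of size {len(a)} and {len(b)}")
    return [x - y for x, y in zip(a, b)]


# Pooled outputs of the two towers before any projection (random stand-ins).
image_pooled = [random.gauss(0.0, 1.0) for _ in range(VISION_HIDDEN)]
text_pooled = [random.gauss(0.0, 1.0) for _ in range(TEXT_HIDDEN)]

# Stand-ins for the projection heads of the "WithProjection" model variants.
visual_projection = random_matrix(VISION_HIDDEN, SHARED_DIM)
text_projection = random_matrix(TEXT_HIDDEN, SHARED_DIM)

# After projection both vectors live in the same SHARED_DIM-sized space.
image_features = project(image_pooled, visual_projection)
text_features = project(text_pooled, text_projection)

style_features = subtract(image_features, text_features)
print(len(style_features))
```

In the transformers library, the projected vectors correspond to the image_embeds output of CLIPVisionModelWithProjection and the text_embeds output of CLIPTextModelWithProjection, while a plain CLIPTextModel only returns unprojected outputs such as pooler_output. Calling subtract(image_pooled, text_pooled) in the sketch raises a ValueError, which is the toy version of the mismatch raised in the question.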