CharlesGong12 opened 4 months ago
The StableDiffusionPipeline's encode_prompt is here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L302
Good eye! Thanks for your feedback! @CharlesGong12
To clarify: for the SDXL model, the second text encoder is a CLIPTextModelWithProjection, and the pooled feature used as text_features comes only from that second encoder. For the SD1.5 model, the text encoder is indeed a plain CLIPTextModel, so in our inference code its text_features are extracted manually, as here.
Hope this helps. Please let me know if you have further questions.
Hi, thanks for your amazing work! I am confused by the subtraction operation image_features - text_features. The image features are encoded by CLIPVisionModelWithProjection, but the text features are encoded by CLIPTextModel, which has no projection layer. Why, then, can we directly compute image_features - text_features? It seems the image features and text features are not in the same embedding space.
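The mismatch being asked about can be sketched in a few lines of plain torch. This is a hypothetical illustration, not the repository's actual code: the dimensions are made up for readability, and the Linear layer stands in for CLIP's learned text_projection. It shows why a pooled CLIPTextModel output needs a projection before it can be subtracted from CLIPVisionModelWithProjection's image_embeds, which are already projected into the joint CLIP space.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes for illustration only (real CLIP-L uses 768/768).
hidden_dim, proj_dim = 8, 4

# Pooled output of a plain CLIPTextModel (no projection head):
# it lives in the text encoder's hidden space of size hidden_dim.
pooled_text = torch.randn(1, hidden_dim)

# CLIPVisionModelWithProjection applies its visual_projection internally,
# so its image_embeds live in the joint CLIP space of size proj_dim.
image_features = torch.randn(1, proj_dim)

# Without a text projection, the subtraction is not even shape-compatible:
# image_features - pooled_text  # would fail (4 vs 8)

# Applying the (learned) text projection maps text features into the
# same joint space, making the subtraction well-defined.
text_projection = torch.nn.Linear(hidden_dim, proj_dim, bias=False)
text_features = text_projection(pooled_text)

delta = image_features - text_features
print(delta.shape)  # torch.Size([1, 4])
```

Shape compatibility alone does not guarantee the two features are semantically aligned, of course; that alignment comes from CLIP's contrastive training of the two projection heads into one shared space.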