Using SigLIP as text encoder

holwech commented 9 months ago

Is your feature request related to a problem? Please describe. I'm looking into testing SigLIP as an alternative text encoder. Is this supported, or even possible? I don't know enough about this area to tell whether it makes sense.

This notebook mentions it as an option: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/SigLIP/Inference_with_(multilingual)_SigLIP%2C_a_better_CLIP_model.ipynb

Describe the solution you'd like. Load and use SigLIP with diffusers models as an alternative text encoder.

fabiorigano commented 9 months ago

Hi @holwech,

You can load the SigLIP text model and tokenizer as follows:

from transformers import SiglipTextModel, SiglipTokenizer
from diffusers import StableDiffusionPipeline
import torch

# Load the SigLIP text tower in half precision to match the pipeline dtype.
text_encoder = SiglipTextModel.from_pretrained("google/siglip-base-patch16-512", torch_dtype=torch.float16)
# Tokenizers hold no weights, so torch_dtype does not apply here.
tokenizer = SiglipTokenizer.from_pretrained("google/siglip-base-patch16-512")

# Override the pipeline's default CLIP text encoder and tokenizer.
pipe = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V4.0_noVAE",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
)

pipe.to("cuda")

However, text-to-image latent diffusion models are usually trained on CLIP text embeddings, so I think swapping the text encoder only at inference time won't give good results. The ideal approach might be to first train a text-to-image model on SigLIP text embeddings.
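A rough way to see why the swap can load and run at all, yet still produce poor images, is to compare the text encoder's hidden size with the UNet's cross-attention width. The sketch below is only illustrative: `shape_compatible` is a hypothetical helper, and the stand-in values (768 for both text encoders and for the Stable Diffusion 1.x UNet, 64 vs. 77 token positions) are assumptions based on the published configs, not read from the checkpoints.

```python
from types import SimpleNamespace


def shape_compatible(text_config, unet_config):
    """Return True if the encoder's hidden size matches the UNet's cross-attention width.

    Hypothetical helper: it only checks tensor shapes, not whether the
    embeddings mean anything to the UNet.
    """
    return text_config.hidden_size == unet_config.cross_attention_dim


# Stand-in configs with assumed values; with real models you could pass
# pipe.text_encoder.config and pipe.unet.config instead.
siglip_text = SimpleNamespace(hidden_size=768, max_position_embeddings=64)
clip_text = SimpleNamespace(hidden_size=768, max_position_embeddings=77)
sd15_unet = SimpleNamespace(cross_attention_dim=768)

for name, cfg in [("CLIP", clip_text), ("SigLIP", siglip_text)]:
    ok = shape_compatible(cfg, sd15_unet)
    print(f"{name}: shapes match = {ok}, max tokens = {cfg.max_position_embeddings}")
```

If the shapes match, cross-attention accepts the SigLIP embeddings without error, but the UNet was trained to interpret CLIP's embedding space, so a passing shape check says nothing about image quality.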

holwech commented 9 months ago

Thanks @fabiorigano! I guess this approach is not that viable then, but it was worth a try :)

Here are some example images for those who are curious:

Prompt: "a photo of an astronaut riding a horse on mars"

Seed 3: [image: CLIP] [image: SigLIP]

Seed 4: [image: CLIP] [image: SigLIP]

huggingface / diffusers

Using SigLIP as text encoder #6722