holwech closed this issue 9 months ago
Hi @holwech,
You can load the SigLIP text model and tokenizer as follows:
from transformers import SiglipTextModel, SiglipTokenizer
from diffusers import StableDiffusionPipeline
import torch
# Load the SigLIP text encoder in half precision.
text_encoder = SiglipTextModel.from_pretrained("google/siglip-base-patch16-512", torch_dtype=torch.float16)
# The tokenizer has no weights, so torch_dtype does not apply to it.
tokenizer = SiglipTokenizer.from_pretrained("google/siglip-base-patch16-512")
# Swap both components into the Stable Diffusion pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V4.0_noVAE",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
However, text-to-image latent diffusion models are usually trained on CLIP text embeddings, so I think that swapping the text encoder only at inference time won't give good results: the UNet has never seen SigLIP embeddings. The ideal approach would probably be to first train a text-to-image model on SigLIP text embeddings.
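To make the distinction concrete, here is a minimal sketch of a shape check. It only tests whether the text encoder's output width equals the width the UNet's cross-attention layers expect, which is what lets a swapped encoder run at all; it says nothing about image quality. The helper name `check_text_encoder_fit` is hypothetical, and the value 768 used for both stand-in configs is an assumption that should be verified against the real checkpoints.

```python
from types import SimpleNamespace


def check_text_encoder_fit(text_encoder_config, unet_config) -> str:
    """Compare the text encoder's output width with the UNet's cross-attention width.

    Both arguments only need the attributes read below, so real model configs
    or simple stand-ins work equally well.
    """
    hidden = text_encoder_config.hidden_size
    expected = unet_config.cross_attention_dim
    if hidden != expected:
        return f"mismatch: encoder emits {hidden}-dim tokens, UNet expects {expected}"
    # Matching shapes only mean the forward pass will run.
    return f"shapes match ({hidden}); output quality still depends on training"


# Stand-in configs. 768 is an assumed value for both models, not a verified one.
siglip_cfg = SimpleNamespace(hidden_size=768)
unet_cfg = SimpleNamespace(cross_attention_dim=768)
print(check_text_encoder_fit(siglip_cfg, unet_cfg))
```

With the objects from the snippet above, the same check would be `check_text_encoder_fit(text_encoder.config, pipe.unet.config)`.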
Thanks @fabiorigano! I guess this approach is not that viable then, but it was worth a try :)
Here are some example images for those who are curious:
Prompt: "a photo of an astronaut riding a horse on mars"
Seed 3: CLIP [image], SigLIP [image]
Seed 4: CLIP [image], SigLIP [image]
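For anyone who wants to reproduce a comparison like this, the sketch below shows one possible way to render the same prompt and seeds through several pipelines. It is not necessarily how the images above were made. `output_path` and `render_comparison` are hypothetical helpers, the file naming scheme is an assumption, and `pipes` is assumed to map a label to an already loaded `StableDiffusionPipeline`. A CPU generator is re-created for every image so that, as long as the latent shape and dtype are identical, each encoder starts from the same initial noise.

```python
from pathlib import Path

PROMPT = "a photo of an astronaut riding a horse on mars"
SEEDS = (3, 4)


def output_path(encoder_name: str, seed: int, out_dir: str = "comparison") -> Path:
    """Hypothetical naming scheme, e.g. comparison/clip_seed3.png."""
    return Path(out_dir) / f"{encoder_name.lower()}_seed{seed}.png"


def render_comparison(pipes: dict, prompt: str = PROMPT, seeds=SEEDS,
                      out_dir: str = "comparison") -> list:
    """Render every (seed, encoder) pair and return the saved file paths.

    `pipes` maps a label such as "CLIP" or "SigLIP" to a loaded pipeline.
    """
    import torch  # imported lazily so output_path works without a GPU stack

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    for seed in seeds:
        for name, pipe in pipes.items():
            # A fresh, identically seeded generator per image keeps the
            # starting noise the same across encoders.
            generator = torch.Generator(device="cpu").manual_seed(seed)
            image = pipe(prompt, generator=generator).images[0]
            path = output_path(name, seed, out_dir)
            image.save(path)
            saved.append(path)
    return saved
```

A call might look like `render_comparison({"CLIP": clip_pipe, "SigLIP": siglip_pipe})`, where both pipeline variables are placeholders for pipelines loaded beforehand.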
Is your feature request related to a problem? Please describe.
I'd like to test SigLIP as an alternative text encoder. Is this supported, or even possible? I don't know enough about this area to tell whether it makes sense.
There is some mention of this as an option in this notebook https://github.com/NielsRogge/Transformers-Tutorials/blob/master/SigLIP/Inference_with_(multilingual)_SigLIP%2C_a_better_CLIP_model.ipynb
Describe the solution you'd like.
Load and use SigLIP as an alternative text encoder with diffusers models.