Dear Justin,
Thank you for your work on Stable Diffusion; it has helped me a lot.
Could you elaborate on how you trained the 'Image Mixer' model? Why can it accept text input if it was fine-tuned with only a CLIP image encoder? Or was it fine-tuned with both the CLIP image and text encoders?
I appreciate your help.
Best wishes,
Zongze