Text input for Image Mixer

Dear Justin,

Thank you for your work on Stable Diffusion; it has helped me a lot.

Could you elaborate on how you trained the 'Image Mixer' model? Why can it accept text input if it was fine-tuned with only the CLIP image encoder? Or was it fine-tuned with both the CLIP image and text encoders?