Mixing the embeddings of CLIP(s) and T5 may help both simplify training, by relying on the more image-aligned CLIP embeddings, and improve prompt alignment, by adding the T5-XXL large language model. This leverages one of the crucial findings of SD3 (see their text encoder ablation studies) while requiring no changes to the model's structure beyond the text embedding dims.
Possible implementation (uses tensor concatenation and padding):
https://github.com/NUS-HPC-AI-Lab/OpenDiT/blob/adef0a537bce08130526027eceda82661dafd372/opendit/embed/clip_and_t5_text_emb.py
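A minimal sketch of that concatenation-and-padding idea (the linked OpenDiT file is the actual reference; the function name and the CLIP-L/T5-XXL shapes below are illustrative assumptions): the narrower embedding is zero-padded on the channel axis to a common width, then the two token sequences are concatenated along the sequence axis.

```python
import torch
import torch.nn.functional as F

def mix_text_embeddings(clip_emb: torch.Tensor, t5_emb: torch.Tensor) -> torch.Tensor:
    """Combine CLIP and T5 token embeddings into one conditioning sequence.

    clip_emb: (batch, seq_clip, dim_clip)
    t5_emb:   (batch, seq_t5, dim_t5)
    """
    dim = max(clip_emb.shape[-1], t5_emb.shape[-1])
    # Zero-pad the last (channel) dimension of each tensor up to the common width.
    clip_emb = F.pad(clip_emb, (0, dim - clip_emb.shape[-1]))
    t5_emb = F.pad(t5_emb, (0, dim - t5_emb.shape[-1]))
    # Concatenate along the token (sequence) axis.
    return torch.cat([clip_emb, t5_emb], dim=1)

# Hypothetical shapes: CLIP-L outputs 77 tokens at 768 dims,
# T5-XXL outputs 77 tokens at 4096 dims.
clip = torch.randn(2, 77, 768)
t5 = torch.randn(2, 77, 4096)
mixed = mix_text_embeddings(clip, t5)
print(mixed.shape)  # torch.Size([2, 154, 4096])
```

The model's cross-attention then only needs its text embedding dim set to the padded width; the rest of the architecture is untouched.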