Mixing the embeddings of CLIP(s) and T5 may help both simplify training, by relying on the more image-aligned CLIP embeddings, and improve prompt alignment, by adding the T5-XXL large language model. This leverages one of the crucial findings of SD3 (see their text encoder ablation studies) while requiring no changes to the model's structure beyond the text embedding dims.
Possible implementation (uses tensor concatenation and padding):
https://github.com/NUS-HPC-AI-Lab/OpenDiT/blob/adef0a537bce08130526027eceda82661dafd372/opendit/embed/clip_and_t5_text_emb.py
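A minimal sketch of that concatenation-and-padding idea (the linked OpenDiT file is the actual reference; the function name and the CLIP-L/T5-XXL shapes below are illustrative assumptions): the narrower embedding is zero-padded on the channel axis to a common width, then the two token sequences are concatenated along the sequence axis.

```python
import torch
import torch.nn.functional as F

def mix_text_embeddings(clip_emb: torch.Tensor, t5_emb: torch.Tensor) -> torch.Tensor:
    """Combine CLIP and T5 token embeddings into one conditioning sequence.

    clip_emb: (batch, seq_clip, dim_clip)
    t5_emb:   (batch, seq_t5, dim_t5)
    """
    dim = max(clip_emb.shape[-1], t5_emb.shape[-1])
    # Zero-pad the last (channel) dimension of each tensor up to the common width.
    clip_emb = F.pad(clip_emb, (0, dim - clip_emb.shape[-1]))
    t5_emb = F.pad(t5_emb, (0, dim - t5_emb.shape[-1]))
    # Concatenate along the token (sequence) axis.
    return torch.cat([clip_emb, t5_emb], dim=1)

# Hypothetical shapes: CLIP-L outputs 77 tokens at 768 dims,
# T5-XXL outputs 77 tokens at 4096 dims.
clip = torch.randn(2, 77, 768)
t5 = torch.randn(2, 77, 4096)
mixed = mix_text_embeddings(clip, t5)
print(mixed.shape)  # torch.Size([2, 154, 4096])
```

The model's cross-attention then only needs its text embedding dim set to the padded width; the rest of the architecture is untouched.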