tianweiy / DMD2

(NeurIPS 2024 Oral 🔥) Improved Distribution Matching Distillation for Fast Image Synthesis

prompt input to unet #28

Closed LifuWang-66 closed 4 months ago

LifuWang-66 commented 4 months ago

I found that the input to the UNet is the output of the CLIP tokenizer, but in both the SD and SDXL pipelines the input is the output of the CLIP text encoder. Is there a specific reason for this choice?

tianweiy commented 4 months ago

We convert the tokenizer output into text encoder output at https://github.com/tianweiy/DMD2/blob/0f8a481716539af7b2795740c9763a7d0d05b83b/main/sd_unified_model.py#L176

LifuWang-66 commented 4 months ago

Thanks for your swift response!

I think that is specific to SDXL, and the other models still use the tokenizer output. Maybe it would be better to use the text encoder output for the other models as well?

tianweiy commented 4 months ago

For SD, it also uses the text embedding: https://github.com/tianweiy/DMD2/blob/0f8a481716539af7b2795740c9763a7d0d05b83b/main/sd_unified_model.py#L222

I mean, there is no way for the UNet to take any other conditioning, right?
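
For readers following the thread, here is a toy sketch of the data flow being discussed. It is not the DMD2 code and does not use the real CLIP or UNet: `toy_tokenize`, `toy_text_encoder`, `toy_cross_attention`, `PAD_ID`, and `EMBED_DIM` are all hypothetical stand-ins written with the standard library only. The real text encoder is a transformer rather than a per-token lookup, and real cross-attention has learned projections. The sketch only shows that the tokenizer yields integer IDs, the text encoder turns them into float vectors, and the attention step operates on those vectors.

```python
import math
import random

VOCAB_SIZE = 49408  # CLIP BPE vocabulary size
MAX_LENGTH = 77     # CLIP context length
EMBED_DIM = 8       # toy width (assumption); the SD v1.5 text encoder uses 768
PAD_ID = 0          # hypothetical padding ID for this toy


def toy_tokenize(prompt):
    """Hypothetical tokenizer: one integer ID per word, padded to MAX_LENGTH."""
    ids = [sum(ord(c) for c in word) % VOCAB_SIZE for word in prompt.split()]
    ids = ids[:MAX_LENGTH]
    return ids + [PAD_ID] * (MAX_LENGTH - len(ids))


def toy_text_encoder(token_ids):
    """Hypothetical encoder: maps each integer ID to a fixed float vector."""
    embeddings = []
    for token_id in token_ids:
        # Seed by ID so the same token always gets the same vector.
        rng = random.Random(token_id)
        embeddings.append([rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)])
    return embeddings


def toy_cross_attention(query, text_embeddings):
    """Single-query dot-product attention over the text embeddings."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale
        for key in text_embeddings
    ]
    # Numerically stable softmax over the 77 token positions.
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the embedding vectors.
    return [
        sum(w * vec[d] for w, vec in zip(weights, text_embeddings))
        for d in range(len(query))
    ]


token_ids = toy_tokenize("a photo of a cat")      # 77 integers
text_embeddings = toy_text_encoder(token_ids)     # 77 vectors of EMBED_DIM floats
latent_query = [0.1] * EMBED_DIM                  # stand-in for one UNet feature
context = toy_cross_attention(latent_query, text_embeddings)
```

In this toy, `token_ids` on its own carries no geometry that attention could use. Only after the encoder step is there a sequence of vectors to attend over, which matches the maintainer's point that the UNet is conditioned on text embeddings.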