sail-sg / MDT

Masked Diffusion Transformer is the SOTA for image synthesis. (ICCV 2023)
Apache License 2.0
516 stars, 38 forks

Support for Training with Custom Prompts Instead of Image IDs #52

Open rutuja1409 opened 1 week ago

rutuja1409 commented 1 week ago

Hi @gasvn,

I would like to train a model using my custom dataset. However, I noticed that the current training process only supports using image IDs. Is there a way to provide a custom prompt for each image instead of using just the image ID?

If this feature is not currently available, is there a plan to include it in any upcoming releases?

Thank you!

gasvn commented 1 week ago

I suggest following Stable Diffusion: use the output embeddings from CLIP as the condition, and feed them into the model through cross-attention.

rutuja1409 commented 1 week ago

Thank you for your reply. I understand the first part about using CLIP embeddings, but could you please clarify how you suggest changing the condition in the masked diffusion transformer code? Specifically, what modifications should I make to integrate the CLIP embeddings with the cross-attention mechanism in the training process?
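
For readers following the thread, the kind of change being discussed can be sketched roughly as below. This is only a minimal illustration in PyTorch under stated assumptions, not the MDT repository's actual code: `CrossAttentionBlock`, `hidden_dim`, `text_dim`, and `text_pad_mask` are hypothetical names, timestep conditioning and the masking scheme are omitted, and a random tensor stands in for the per-token text embeddings that a frozen CLIP text encoder would produce for each prompt.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Hypothetical block: self-attention, cross-attention to text tokens, MLP.

    Not the repository's block; timestep conditioning is left out for brevity.
    """

    def __init__(self, hidden_dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim)
        # kdim/vdim let keys and values keep the text encoder's width.
        self.cross_attn = nn.MultiheadAttention(
            hidden_dim, num_heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        self.norm3 = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, x, text_emb, text_pad_mask=None):
        # x: (batch, image_tokens, hidden_dim)
        # text_emb: (batch, text_tokens, text_dim)
        # text_pad_mask: (batch, text_tokens) bool, True marks padding to ignore
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Image tokens are the queries; text tokens supply keys and values.
        h = self.norm2(x)
        x = x + self.cross_attn(
            h, text_emb, text_emb,
            key_padding_mask=text_pad_mask, need_weights=False,
        )[0]
        return x + self.mlp(self.norm3(x))


# Stand-in shapes (assumed): 256 latent patches, 77 text tokens of width 768.
block = CrossAttentionBlock(hidden_dim=384, text_dim=768, num_heads=6)
image_tokens = torch.randn(2, 256, 384)
text_emb = torch.randn(2, 77, 768)  # placeholder for CLIP text-encoder output
out = block(image_tokens, text_emb)
print(out.shape)
```

In such a setup the dataset would return an (image, prompt) pair instead of an (image, class ID) pair, the prompt would be encoded once per batch, and `text_emb` would be passed to every block in place of the class embedding. Whether to keep or drop the existing class-conditioning path is a design choice the thread does not settle.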