I suggest following Stable Diffusion's approach: use the CLIP text encoder's output embeddings as the condition and inject them into the model through cross-attention.
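To make that concrete, here is a minimal sketch of a cross-attention block that consumes CLIP token embeddings as the context. This is illustrative PyTorch, not the actual MDT code; the class name and dimensions are placeholders.

```python
# Minimal sketch: condition image/latent tokens on CLIP text embeddings
# via cross-attention, in the spirit of Stable Diffusion's UNet blocks.
# Names and dimensions are hypothetical, not taken from the MDT repo.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim: int, context_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # kdim/vdim let keys and values live in the CLIP embedding space.
        self.attn = nn.MultiheadAttention(
            embed_dim=dim, num_heads=num_heads,
            kdim=context_dim, vdim=context_dim, batch_first=True,
        )

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (B, N, dim)         image/latent tokens (queries)
        # context: (B, L, context_dim) CLIP text token embeddings (keys/values)
        h = self.norm(x)
        out, _ = self.attn(query=h, key=context, value=context)
        return x + out  # residual connection

# Quick shape check with placeholder sizes.
x = torch.randn(2, 256, 512)    # 256 latent tokens, width 512
ctx = torch.randn(2, 77, 768)   # 77 CLIP tokens, width 768
block = CrossAttentionBlock(dim=512, context_dim=768)
print(block(x, ctx).shape)      # torch.Size([2, 256, 512])
```

The idea is to insert a block like this inside each transformer layer (after self-attention), so every layer can attend to the text condition instead of a single class-ID embedding.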
Thank you for your reply. I understand the part about using the CLIP embeddings, but could you clarify how to change the conditioning in the Masked Diffusion Transformer code? Specifically, what modifications should I make to integrate the CLIP embeddings with the cross-attention mechanism during training?
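For concreteness, is something like the following what you mean for producing the condition during training? This assumes the Hugging Face `transformers` CLIP classes and is just my guess at the intended interface, not the repo's actual pipeline.

```python
# Hedged sketch: encode per-image prompts into CLIP token embeddings,
# which would then be fed as the cross-attention context above.
# Assumes Hugging Face transformers; the MDT code may differ.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
text_encoder.eval()

@torch.no_grad()
def encode_prompts(prompts: list[str]) -> torch.Tensor:
    tokens = tokenizer(
        prompts, padding="max_length", truncation=True,
        max_length=tokenizer.model_max_length, return_tensors="pt",
    )
    # last_hidden_state: (B, 77, 768) per-token embeddings, the usual
    # cross-attention context in Stable Diffusion-style models.
    return text_encoder(**tokens).last_hidden_state

context = encode_prompts(["a photo of a cat"])  # shape (1, 77, 768)
```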
Hi @gasvn,
I would like to train a model using my custom dataset. However, I noticed that the current training process only supports using image IDs. Is there a way to provide a custom prompt for each image instead of using just the image ID?
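For example, I imagine a loader along these lines, returning an (image, prompt) pair instead of an image ID. This is purely hypothetical; the `captions.json` file mapping filenames to prompts is something I would supply myself, not part of the repo.

```python
# Hypothetical dataset that pairs each image with a free-form prompt
# instead of an integer image/class ID; all names are illustrative.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class ImagePromptDataset(Dataset):
    def __init__(self, root: str, captions_file: str, transform=None):
        self.root = Path(root)
        # captions_file maps image filename -> prompt string.
        with open(captions_file) as f:
            self.captions = json.load(f)
        self.files = sorted(self.captions.keys())
        self.transform = transform

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int):
        name = self.files[idx]
        image = Image.open(self.root / name).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        # Return the prompt text instead of an image ID.
        return image, self.captions[name]
```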
If this feature is not currently available, is there a plan to include it in any upcoming releases?
Thank you!