I haven't played with this myself yet, but I think it should be possible by taking the embeddings from the new modality, mapping them to the existing embedding space with a head
module, and then fine-tuning the model using that. There is some relevant work here https://github.com/PITI-Synthesis/PITI which fine-tunes GLIDE to do image-to-image tasks. The same could be applied to latent-diffusion.
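As a rough illustration of the head-module idea (a minimal sketch, not taken from the PITI repo; the class name, token count, and dimensions are assumptions based on Stable Diffusion v1-style cross-attention):

```python
import torch
import torch.nn as nn

class ModalityProjectionHead(nn.Module):
    """Hypothetical head that maps a pooled embedding from a new modality
    into a short sequence of tokens in the UNet's cross-attention space
    (cross_attention_dim=768 for Stable Diffusion v1)."""

    def __init__(self, in_dim: int, cross_attention_dim: int = 768, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.cross_attention_dim = cross_attention_dim
        self.proj = nn.Linear(in_dim, num_tokens * cross_attention_dim)
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, in_dim) embedding produced by the new-modality encoder
        b = features.shape[0]
        tokens = self.proj(features).view(b, self.num_tokens, self.cross_attention_dim)
        return self.norm(tokens)  # (batch, num_tokens, cross_attention_dim)
```

The output would then be passed to the UNet as `encoder_hidden_states` in place of the CLIP text embeddings, and the head (and optionally the UNet) fine-tuned with the usual denoising loss.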
Also cc @anton-l @patrickvonplaten
Sounds like a cool idea @ethancohen123! @anton-l if we have a text2image fine-tuning / training script, it would be quite trivial to experiment with such ideas in diffusers.
Yep, if there is a training script for text2image that would be great!
Coming soon #356
Hey, has it been released yet? Looking at #356, it seems like the code is for fine-tuning. Is it planned to have a training pipeline from scratch (or maybe it can be easily adapted from that code), for example on my own data? I'm working with custom data that are not natural images and texts, and I also want to explore the best latent conditioning (a pretrained conditioned encoder, CLIP pretrained between modalities, or training from scratch). Could you please point me to the best code base (if there is one) to start with? Thanks!
The script should be ready to merge by tomorrow! Fine-tuning and training are exactly the same; if you want to train from scratch, you'll just have to load randomly initialized models instead of pre-trained ones.
Perfect, thank you very much for your reactivity :) When you say load random models, is it like with the transformers library, where you define the config for the model and then do model = model_from_diffusers(config)?
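For reference, the analogous config-based instantiation in diffusers looks roughly like this (a sketch; the exact config helpers can vary between diffusers versions, and the repo id is just an example):

```python
from diffusers import UNet2DConditionModel

# Fine-tuning: start from pretrained weights.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

# Training from scratch: build a randomly initialized model from the same config.
config = UNet2DConditionModel.load_config(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
unet_from_scratch = UNet2DConditionModel.from_config(config)
```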
Also, do you have any idea about the difference in performance of Stable Diffusion based on the way the conditioning is incorporated (CLIP-based vs from scratch vs a pretrained text encoder, for example)? I was not able to find such experiments in the research paper.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, is there a way to make the latent diffusion model able to get the context from any modality (user defined), with a training script associated with it? (A modality as features, text, or basically any encoder that outputs some latent vector of another modality.) Let me know if it's not clear :) Thank you!
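For what it's worth, a minimal sketch of one training step with an arbitrary user-defined conditioning encoder might look like the following; `vae`, `unet`, `noise_scheduler`, `my_encoder`, and `projection_head` are assumed to be defined elsewhere (e.g. diffusers' AutoencoderKL, UNet2DConditionModel, and DDPMScheduler, plus the head sketched earlier in this thread):

```python
import torch
import torch.nn.functional as F

def training_step(images, modality_inputs):
    # Encode images into the latent space (0.18215 is the SD v1 latent scaling factor).
    latents = vae.encode(images).latent_dist.sample() * 0.18215

    # Sample noise and random timesteps, then add noise to the latents.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Condition on the new modality instead of CLIP text embeddings.
    cond = projection_head(my_encoder(modality_inputs))

    # Predict the noise and regress against the true noise.
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=cond).sample
    return F.mse_loss(noise_pred, noise)
```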