huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Latent diffusion model #163

Closed · ethancohen123 closed 2 years ago

ethancohen123 commented 2 years ago

Hi, is there a way to make the latent diffusion model accept context from any user-defined modality, with an associated training script? By modality I mean features, text, or basically any encoder that outputs a latent vector for another modality. Let me know if it's not clear :) Thank you!

patil-suraj commented 2 years ago

I haven't played with this myself yet, but I think it should be possible by taking the embeddings from the new modality, mapping them to the existing embedding space with a head module, and then fine-tuning the model. There is some relevant work here https://github.com/PITI-Synthesis/PITI which fine-tunes GLIDE to do image-to-image tasks. The same could be applied to latent-diffusion.
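A rough sketch of what I mean (the dimensions, the head architecture, and the encoder here are placeholders, not a tested recipe):

```python
import torch
import torch.nn as nn

class ModalityProjectionHead(nn.Module):
    """Maps embeddings from a new modality into the conditioning space
    the diffusion model already understands (e.g. the text-encoder
    hidden size used for cross-attention)."""

    def __init__(self, new_modality_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(new_modality_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, new_modality_dim) -> (batch, seq_len, cond_dim)
        return self.proj(emb)

# Hypothetical fine-tuning step: project the new-modality embeddings and
# feed them to the UNet where the text embeddings would normally go,
# then fine-tune the UNet (and the head) on your paired data.
# cond = head(my_encoder(batch))   # my_encoder is whatever you define
# noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=cond).sample
```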

Also cc @anton-l @patrickvonplaten

patrickvonplaten commented 2 years ago

Sounds like a cool idea @ethancohen123! @anton-l, if we have a text2image fine-tuning / training script, it would be quite trivial to experiment with such ideas in diffusers.

ethancohen123 commented 2 years ago

Yep, if there is a training script for text2image that would be great!

patil-suraj commented 2 years ago

Coming soon #356

ethancohen123 commented 2 years ago

Hey, has it been released yet? Looking at #356, it seems like the code is for fine-tuning. Is it planned to have a training pipeline from scratch (or can the code be easily modified for that), for example on my own data? I'm working with custom data that are not natural images and text, and I also want to explore the best latent conditioning (a pretrained conditioned encoder, CLIP pretrained between modalities, or from scratch). Could you please point me to the best code base to start with, if there is one? Thanks!

patil-suraj commented 2 years ago

The script should be ready to merge by tomorrow! Fine-tuning and training from scratch work exactly the same way; if you want to train from scratch, you'll just have to load randomly initialized models instead of pre-trained ones.

ethancohen123 commented 2 years ago

Perfect, thank you very much for your responsiveness :) When you say load random models, is it like the transformers library, where you define a config for the model and then do something like model = model_from_diffusers(config)?
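Something like this, for instance? (A sketch of what I have in mind; the constructor arguments are just my assumptions of what the script will expect.)

```python
from diffusers import UNet2DConditionModel

# Fine-tuning: start from pre-trained weights
# unet = UNet2DConditionModel.from_pretrained("path/to/checkpoint", subfolder="unet")

# Training from scratch: same architecture, randomly initialized from a config
unet = UNet2DConditionModel(
    sample_size=64,            # latent resolution, assumed
    cross_attention_dim=768,   # must match the conditioning embedding size
)
```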

ethancohen123 commented 2 years ago

Also, do you have any idea how the performance of stable diffusion differs depending on how the conditioning is incorporated (CLIP-based vs from scratch vs pretrained text-based, for example)? I was not able to find such experiments in the research paper.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.