huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

multiple training pairs of diffusion models? #2867

Closed: pure-rgb closed this issue 1 year ago

pure-rgb commented 1 year ago

I went through the documentation about training diffusion models. There are now many types of diffusion models and different modalities. In general, a training sample consists of an image and a prompt (or a prompt and an image). But for models like InstructPix2Pix and ControlNet, what would the training pairs be? And would they be the same for text2image and image2image translation? It isn't clear in the documentation.

In the official InstructPix2Pix repo, they mention custom data generation: for example, input, prompt, edited_input (output). When using the diffusers library, would it be possible to start training with only input and prompt, without edited_input (output)?

And for ControlNet, what would the training pairs be if I want to train the model in a text2image or image2image manner? How should the dataset be organized? I couldn't find any clear instructions in the docs.

patrickvonplaten commented 1 year ago

For pix2pix, ControlNet, and plain fine-tuning alike, one needs a prompt as input x and an image as output y.
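
For instance, a minimal sketch of what such (prompt, image) pairs look like when loaded with the `datasets` library. I'm assuming the `lambdalabs/pokemon-blip-captions` dataset from the text-to-image fine-tuning example here; any dataset with a caption column and an image column is organized the same way:

```python
from datasets import load_dataset

# Assumed example dataset (the one used in the text-to-image fine-tuning docs);
# other datasets may use different column names.
ds = load_dataset("lambdalabs/pokemon-blip-captions", split="train")

sample = ds[0]
print(sample["text"])   # input x: the text prompt
print(sample["image"])  # label y: a PIL image the model learns to generate
```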

pure-rgb commented 1 year ago

@patrickvonplaten Understood, thanks.

For pix2pix, the doc uses this demo dataset. In that dataset, I can see there are 3 items (input image, prompt, edited image). From this point I'm a bit confused: how could one train or fine-tune with only x: prompt and y: edited_image? Sorry if I misunderstood and it already does; could you please point me to the source code? Same goes for https://huggingface.co/docs/diffusers/main/en/training/controlnet#training. See the sketch below for what I mean.
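
For reference, here is roughly how I inspect that dataset (I'm assuming the demo dataset is `fusing/instructpix2pix-1000-samples`, the one referenced in the current docs; column names may differ for other datasets):

```python
from datasets import load_dataset

# Assumed demo dataset from the InstructPix2Pix training docs.
ds = load_dataset("fusing/instructpix2pix-1000-samples", split="train")
print(ds.column_names)
# Expected something like: ['input_image', 'edit_prompt', 'edited_image']
```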

pure-rgb commented 1 year ago

@patrickvonplaten Please let me know if my question isn't clear.

patrickvonplaten commented 1 year ago

OK, to clarify:

- Text-to-image: Input (X): prompt. Label (Y): image.
- Pix2Pix: Input (X): prompt, original image. Label (Y): edited image.
- ControlNet: Input (X): prompt, conditioning image (e.g. a human pose map). Label (Y): image.
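
In code, a rough sketch of how one dataset row would map to (X, Y) for each case. The column names below follow the demo datasets used in the diffusers training examples and are assumptions, not a fixed API:

```python
# Hypothetical helper: maps one dataset row to an (X, Y) training pair.
# Column names are assumptions based on the demo datasets in the training examples.
def to_training_pair(row, task):
    if task == "text2image":
        return {"prompt": row["text"]}, row["image"]
    if task == "pix2pix":
        return (
            {"prompt": row["edit_prompt"], "original_image": row["input_image"]},
            row["edited_image"],
        )
    if task == "controlnet":
        return (
            {"prompt": row["text"], "conditioning_image": row["conditioning_image"]},
            row["image"],
        )
    raise ValueError(f"unknown task: {task}")
```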

patrickvonplaten commented 1 year ago

You can also find much more info about this in the official docs: https://huggingface.co/docs/diffusers/training/overview

pure-rgb commented 1 year ago

@patrickvonplaten Thanks 💯