Closed — pure-rgb closed this issue 1 year ago
For pix2pix, ControlNet, and plain fine-tuning, one needs a prompt as input x and an image as output y.
@patrickvonplaten Understood, thanks.
For pix2pix, the doc uses this demo dataset. In this dataset, I can see that there are 3 items per row (input image, prompt, edited image). From this point I'm a bit confused: how could one train or fine-tune with x: prompt and y: edited_image? Sorry if I misunderstood and it already does this; could you please point me to the source code? The same question applies to https://huggingface.co/docs/diffusers/main/en/training/controlnet#training.
@patrickvonplaten please let me know if my question isn't clear.
Ok yes to clarify:
- Text-to-image — Input (X): prompt; Label (Y): image
- Pix2Pix — Input (X): prompt + original image; Label (Y): edited image
- ControlNet — Input (X): prompt + conditioning image (e.g. human pose); Label (Y): image
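The mapping above can be sketched in code. This is only an illustrative helper (the function name `make_pair` and the column names `prompt`, `input_image`, `edit_prompt`, `edited_image`, `conditioning_image` are hypothetical, not tied to any specific dataset or to the diffusers training scripts); it just shows which columns land on the X side and which on the Y side for each task:

```python
# Illustrative only: assemble one (X, Y) training pair per task.
# Column names are assumptions, not a real dataset schema.
def make_pair(task, row):
    if task == "text2image":
        # X: prompt only; Y: the image itself
        return {"x": {"prompt": row["prompt"]}, "y": row["image"]}
    if task == "pix2pix":
        # X: edit instruction + original image; Y: edited image
        return {"x": {"prompt": row["edit_prompt"], "image": row["input_image"]},
                "y": row["edited_image"]}
    if task == "controlnet":
        # X: prompt + conditioning image (e.g. pose map); Y: target image
        return {"x": {"prompt": row["prompt"], "cond": row["conditioning_image"]},
                "y": row["image"]}
    raise ValueError(f"unknown task: {task}")
```

So for pix2pix the edited image is the label, while the original image rides along on the input side next to the prompt.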
You can also find much more info about this in the official docs: https://huggingface.co/docs/diffusers/training/overview
@patrickvonplaten Thanks 💯
I went through the documentation about training diffusion models. There are now many types of diffusion models and different modalities. In the general case, training data would consist of image–prompt pairs. Now, for models like InstructPix2Pix and ControlNet, what would the training pairs be? And would they be the same for text-to-image and image-to-image translation? It is not clear in the documentation.
The official InstructPix2Pix work mentions custom data generation: for example, input, prompt, edited_input (output). When using the diffusers library, would it be possible to start training with only the input and prompt, without the edited_input (output)?
And about ControlNet: what would the training pairs be if I want to train the model in a text-to-image or image-to-image manner? How should the dataset be organized? I couldn't find any clear instructions in the docs.