Hi, thanks for asking. We train a diffusion model, and the edited image is the image being generated by the reverse diffusion process. During training, the edited image is encoded into the latent space of the autoencoder, noise is added to those latent features, and the model is trained to produce a denoised version of them. first_stage_key is used by the Stable Diffusion training code to indicate that this is what we pass to the encoder and denoise.
edit=dict(c_concat=image_0, c_crossattn=prompt)) is our conditioning: c_concat is encoded into the same latent space and concatenated with the noisy edited-image latents, and c_crossattn is the text conditioning that is passed through the CLIP encoder and used via cross-attention.
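Sketched in code, a single training step might look roughly like this. Everything here — autoencoder, clip_text_encoder, unet, and the add_noise schedule helper — is an illustrative placeholder for the corresponding pieces of the Stable Diffusion training code, not the repository's actual API:

```python
import torch
import torch.nn.functional as F

def add_noise(z0, noise, t, alphas_cumprod):
    # Standard DDPM forward process q(z_t | z_0) under a precomputed schedule:
    # z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise

def training_step(batch, autoencoder, clip_text_encoder, unet, alphas_cumprod):
    # Edited (target) image -> autoencoder latent space; this is the
    # first_stage_key input that gets noised and denoised.
    z_edited = autoencoder.encode(batch["edited"])

    # Input image -> the same latent space; this is c_concat.
    c_concat = autoencoder.encode(batch["edit"]["c_concat"])

    # Edit instruction -> CLIP text embeddings; this is c_crossattn.
    c_crossattn = clip_text_encoder(batch["edit"]["c_crossattn"])

    # Sample a random timestep per example and noise the target latents.
    t = torch.randint(0, len(alphas_cumprod), (z_edited.shape[0],),
                      device=z_edited.device)
    noise = torch.randn_like(z_edited)
    z_noisy = add_noise(z_edited, noise, t, alphas_cumprod)

    # The UNet sees the noisy target latents channel-concatenated with the
    # input-image latents, and attends to the text embeddings.
    pred_noise = unet(torch.cat([z_noisy, c_concat], dim=1), t,
                      context=c_crossattn)

    # Epsilon objective: predict the noise that was added.
    return F.mse_loss(pred_noise, noise)
```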
After training, when editing a new image (where no edited image is available), the edited-image latents start out as pure noise and are progressively denoised into the edited image.
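A correspondingly rough sketch of inference, where the edited-image latents begin as pure Gaussian noise. The sampler here is a placeholder for a real sampler (e.g. DDIM), and classifier-free guidance is omitted for brevity:

```python
import torch

@torch.no_grad()
def edit_image(input_image, prompt, autoencoder, clip_text_encoder, unet, sampler):
    # Conditioning comes only from the input image and the instruction;
    # no edited image exists at inference time.
    c_concat = autoencoder.encode(input_image)
    c_crossattn = clip_text_encoder([prompt])

    # The edited-image latents start out as pure noise.
    z = torch.randn_like(c_concat)

    # Progressively denoise, conditioning on the input-image latents
    # (concatenation) and the text embeddings (cross-attention).
    for t in sampler.timesteps:
        t_batch = torch.full((z.shape[0],), int(t), device=z.device, dtype=torch.long)
        pred_noise = unet(torch.cat([z, c_concat], dim=1), t_batch,
                          context=c_crossattn)
        z = sampler.step(pred_noise, t, z)  # placeholder: returns less-noisy latents

    # Decode the final latents into the edited image.
    return autoencoder.decode(z)
```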
@timothybrooks Thanks for the clear explanation!!
Hi, after reading the paper I couldn't understand the need for the edited image in the dataset. In the code, the edited image is passed in the dataset under the edited key, alongside the edit conditioning dict quoted above. Also, in the train.yaml file I could find that this edited key is used as the first_stage_key. Can you please explain how you are using the edited image for training, and how you perform inference when no edited image is available?
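For reference, the dataset item being discussed can be sketched roughly as follows. This is a hypothetical reconstruction built around the edit=dict(c_concat=image_0, c_crossattn=prompt) line quoted above; the class name and constructor are illustrative, not the repository's actual code:

```python
from torch.utils.data import Dataset

class EditDataset(Dataset):
    # Hypothetical sketch: each example is a tuple of
    # (input image tensor, edited image tensor, edit instruction string).
    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        image_0, image_1, prompt = self.examples[i]
        # "edited" is what train.yaml's first_stage_key points at: the target
        # image that is encoded, noised, and denoised during training.
        # "edit" is the conditioning dict consumed by the model.
        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
```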