timothybrooks / instruct-pix2pix


[Doubt] Use of Edited Image in dataset #81

Closed alphacoder01 closed 1 year ago

alphacoder01 commented 1 year ago

Hi, after reading the paper I couldn't understand the need for the edited image in the dataset. It is mentioned that

> For our task, the score network e_θ(z_t, c_I, c_T) has two conditionings: the input image c_I and the text instruction c_T.

In the dataset code, however, the edited image is passed as follows:

```python
return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
```

Also, in the train.yaml file I can see that this `edited` key is used as:

```yaml
model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm_edit.LatentDiffusion
  params:
    ckpt_path: stable_diffusion/models/ldm/stable-diffusion-v1/v1-5-pruned-emaonly.ckpt
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: edited
    cond_stage_key: edit
    image_size: 32
    channels: 4
    cond_stage_trainable: false   # Note: different from the one we trained before
    conditioning_key: hybrid
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: true
    load_ema: false
```

Can you please explain how you use the edited image for training, and how you perform inference when the edited image is not available?

timothybrooks commented 1 year ago

Hi, thanks for asking. We train a diffusion model, and the edited image is the image being generated by the reverse diffusion process. During training, the edited image is encoded into the latent space of the autoencoder, noise is added to those latent features, and the model is trained to produce a denoised version of them. `first_stage_key` is used by the Stable Diffusion training code to indicate that this is what we pass to the encoder and denoise.
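
Schematically, a training step looks something like this. It is only a simplified sketch with placeholder objects (`vae`, `unet`, `scheduler` stand in for the autoencoder, denoising UNet, and noise schedule), not the exact code in this repo, and the conditioning is omitted here and covered below:

```python
import torch
import torch.nn.functional as F

# Simplified sketch of the training objective described above.
# `vae`, `unet`, and `scheduler` are placeholder objects standing in for the
# autoencoder, denoising UNet, and noise schedule of the real training code.
def training_step(vae, unet, scheduler, batch, scale_factor=0.18215):
    edited = batch["edited"]  # first_stage_key selects the edited image from the batch
    # Encode the edited image into the autoencoder's latent space.
    z0 = vae.encode(edited).sample() * scale_factor
    # Pick a random timestep and add the corresponding amount of noise.
    t = torch.randint(0, scheduler.num_timesteps, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    zt = scheduler.add_noise(z0, noise, t)
    # The model learns to predict the noise, i.e. to produce a denoised version.
    noise_pred = unet(zt, t)  # conditioning omitted; see the next snippet
    return F.mse_loss(noise_pred, noise)
```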

`edit=dict(c_concat=image_0, c_crossattn=prompt)` is our conditioning: `c_concat` is encoded into the same latent space and concatenated with the noisy edited-image latents, and `c_crossattn` is the text conditioning, which is passed through the CLIP encoder and used via cross-attention.
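
In other words, the two conditionings enter the denoiser roughly like this (again a sketch with placeholder names; the real hybrid-conditioning logic lives in `ddpm_edit.py`):

```python
import torch

# Sketch of the hybrid conditioning: the input-image latents are concatenated
# with the noisy edited-image latents along the channel dimension, and the
# instruction embedding feeds the UNet's cross-attention. Placeholder names.
def denoise_with_conditioning(unet, vae, text_encoder, zt, t, input_image,
                              instruction_tokens, scale_factor=0.18215):
    # c_concat: encode the input (unedited) image into the same latent space.
    c_concat = vae.encode(input_image).mode() * scale_factor
    # Channel-concatenate with the noisy edited-image latents (4 + 4 = 8 channels).
    unet_input = torch.cat([zt, c_concat], dim=1)
    # c_crossattn: the text instruction embedded by the CLIP text encoder.
    c_crossattn = text_encoder(instruction_tokens)
    # The UNet attends to the text embedding via cross-attention.
    return unet(unet_input, t, context=c_crossattn)
```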

After training, when performing edits on a new image and no edited image is available, the edited-image latents start out as pure noise and are progressively denoised into the edited image.
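
A rough sketch of sampling at inference time (placeholder objects again; classifier-free guidance over both conditionings is omitted for brevity):

```python
import torch

# Sketch of inference: no edited image exists, so its latents start as pure
# Gaussian noise and are progressively denoised, conditioned on the input
# image and the instruction. Placeholder objects throughout.
@torch.no_grad()
def edit_image(unet, vae, text_encoder, scheduler, input_image,
               instruction_tokens, scale_factor=0.18215):
    # Conditioning: latents of the image to edit, plus the embedded instruction.
    c_concat = vae.encode(input_image).mode() * scale_factor
    c_crossattn = text_encoder(instruction_tokens)
    # Start from pure Gaussian noise in the latent space.
    zt = torch.randn_like(c_concat)
    for t in scheduler.timesteps:
        noise_pred = unet(torch.cat([zt, c_concat], dim=1), t, context=c_crossattn)
        zt = scheduler.step(noise_pred, t, zt)  # one reverse-diffusion step
    # Decode the final latents back to pixel space: this is the edited image.
    return vae.decode(zt / scale_factor)
```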

alphacoder01 commented 1 year ago

@timothybrooks Thanks for the clear explanation!!