Closed gracezhao1997 closed 1 year ago
Hi,
We're using Blended Latent Diffusion. Both Blended Latent Diffusion and our paper are based on LDM, not on Stable Diffusion, which did not exist at the time. There is no CLIP model involved anywhere in the process: LDM uses BERT as its text encoder, and the model itself is already text-conditioned, so that conditioning can be used to guide the inpainting process.
Closing this since I assume from the e-mails you got your answer. Feel free to re-open if you need additional help.
Dear Rinon, I was wondering whether you used the CLIP-based loss D_{CLIP} to guide the generation process of Stable Diffusion in the localized image editing experiment. If so, since the input of CLIP is in pixel space and Stable Diffusion operates in latent space, did you first feed the predicted z0 into the Stable Diffusion decoder to reconstruct the image in pixel space, and then feed that image to the CLIP model to compute D_{CLIP} (see the figure below)? I would be very appreciative if you could reply.
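For context, the pipeline the question describes hinges on the standard DDPM identity for estimating the clean latent z0 from a noisy latent and the model's noise prediction; that estimate is what would then be decoded to pixel space before any pixel-space scoring. A minimal numpy sketch of just that step (the decoder and CLIP stages are represented by a placeholder, since, as the reply notes, CLIP is not actually part of this method):

```python
import numpy as np

def predict_z0(z_t, eps_pred, alpha_bar_t):
    """Standard DDPM identity: estimate the clean latent z0 from the
    noisy latent z_t and the model's noise prediction eps_pred."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def decode_to_pixels(z0_hat):
    """Placeholder for the latent autoencoder's decoder; in a real
    latent-diffusion pipeline this would map latents to pixel space,
    after which a pixel-space loss could be computed."""
    return z0_hat  # illustrative stand-in only

# Sanity check: if z_t was built by the forward process from a known z0,
# predict_z0 with the true noise recovers that z0 exactly.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal((4, 8, 8))
alpha_bar_t = 0.7
z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
z0_hat = predict_z0(z_t, eps, alpha_bar_t)
assert np.allclose(z0_hat, z0)
```

The function names here are illustrative, not taken from the repository; the z0 identity itself is the standard one used whenever a pixel-space guidance loss is applied to a latent model.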