rinongal / textual_inversion


some questions about the localized image editing experiment using Blended Latent Diffusion #121

Closed: gracezhao1997 closed this issue 1 year ago

gracezhao1997 commented 1 year ago

Dear Rinon, I was wondering whether you used the CLIP-based loss D_{CLIP} to guide the generation process of Stable Diffusion in the localized image editing experiment. If so, since CLIP takes pixel-space inputs while Stable Diffusion operates in latent space, did you first feed the predicted z_0 into the Stable Diffusion decoder to reconstruct the image in pixel space, and then feed that image to the CLIP model to compute D_{CLIP} (see the figure below)? I would very much appreciate a reply.

[Figure: the proposed pipeline, feeding the predicted z_0 through the decoder into pixel space before computing D_{CLIP}]
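
For concreteness, here is a minimal sketch of the pipeline the question describes (this is what is being *asked about*, not the method actually used; see the reply below). All handles (`unet`, `vae`, `clip_model`, `text_emb`) are hypothetical placeholders, and the z_0 estimate uses the standard DDPM formula:

```python
import torch
import torch.nn.functional as F

def clip_guided_step(z_t, t, alpha_bar_t, unet, vae, clip_model, text_emb):
    """One denoising step with a pixel-space CLIP loss on the predicted z_0.

    A sketch of the hypothesized pipeline: estimate z_0 from the noise
    prediction, decode it to pixel space, then score it with CLIP.
    """
    z_t = z_t.detach().requires_grad_(True)
    eps = unet(z_t, t)  # predicted noise

    # DDPM estimate of the clean latent: z_0 = (z_t - sqrt(1-a_bar)*eps) / sqrt(a_bar)
    z0_pred = (z_t - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)

    x0_pred = vae.decode(z0_pred)            # latent -> pixel space
    img_emb = clip_model.encode_image(x0_pred)

    # D_CLIP as a cosine distance between image and text embeddings
    d_clip = 1 - F.cosine_similarity(img_emb, text_emb).mean()

    # Gradient w.r.t. z_t would be used to steer the sampler
    grad = torch.autograd.grad(d_clip, z_t)[0]
    return eps, grad
```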
rinongal commented 1 year ago

Hi,

We're using Blended Latent Diffusion. Both Blended Latent Diffusion and our paper are built on LDM, not on Stable Diffusion, which did not exist at the time. There is no CLIP model involved anywhere in the process: LDM uses BERT as its text encoder, and the model itself is already text-conditioned, so that conditioning can be used to guide the inpainting process.
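
For readers unfamiliar with Blended Latent Diffusion, the core of its inpainting loop is a per-step blend in latent space: the text-conditioned denoiser edits the latent, and the region outside the mask is overwritten with an appropriately noised latent of the source image. The sketch below is a rough illustration under assumed names (`sampler.denoise`, `sampler.add_noise` are hypothetical helpers, not this repo's actual API):

```python
import torch

def blended_step(z_t, t, z_src, mask, sampler):
    """One Blended Latent Diffusion inpainting step.

    z_src: latent encoding of the original image.
    mask:  1 inside the edit region, 0 outside; must already be
           downsampled to the latent resolution.
    """
    # Text-conditioned denoising step (no CLIP loss involved)
    z_next = sampler.denoise(z_t, t)

    # Noise the source latent to the matching timestep
    z_src_t = sampler.add_noise(z_src, t - 1)

    # Keep edited content inside the mask, original content outside it
    return mask * z_next + (1 - mask) * z_src_t
```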

rinongal commented 1 year ago

Closing this since I assume from the e-mails you got your answer. Feel free to re-open if you need additional help.