google / prompt-to-prompt

Apache License 2.0

Output source image is too different from the input image #47

Closed Daisy5296 closed 1 year ago

Daisy5296 commented 1 year ago

Hi, thanks for your work!

When I tried some real images, the null-text inversion output was fine, but the ptp editing output was totally different from the input, for both the source image and the edited image. Could you please help explain why this happens? Any advice on how to solve it?
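For context, the failing workflow looks roughly like this. It is a sketch using helper names from the repo's null-text inversion notebook (null_inversion.invert, run_and_display, controller); exact signatures may differ across versions, and the prompts are hypothetical:

```python
prompt = "a photo of a cat"          # prompt used for inversion (hypothetical)
edited_prompt = "a photo of a dog"   # edit target (hypothetical)

# Null-text inversion: the reconstruction here looks fine.
(image_gt, image_enc), x_t, uncond_embeddings = null_inversion.invert(
    image_path, prompt, verbose=True)

# ptp editing: images[0] is supposed to reproduce the inversion output,
# but it comes out entirely different, and so does images[1].
prompts = [prompt, edited_prompt]
images, _ = run_and_display(prompts, controller, latent=x_t,
                            uncond_embeddings=uncond_embeddings)
```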

Daisy5296 commented 1 year ago

In my understanding, the ptp-generated source image should be the same as the output image of the null-text inversion, is that correct? Any thoughts on this would be greatly appreciated!

MunchkinChen commented 1 year ago

> Hi, thanks for your work!
>
> When I tried some real images, the null-text inversion output was fine, but the ptp editing output was totally different from the input, for both the source image and the edited image. Could you please help explain why this happens? Any advice on how to solve it?

I had exactly the same problem: null-text inversion works perfectly well, but the first output of ptp editing isn't the same as the null-text inverted one (it's supposed to be the same). I managed to narrow down the problem: running with x_t, uncond_embeddings, and prompts = [prompt_used_for_inversion] gives the correct inverted result. However, once prompts has a batch size > 1, for example [prompt_used_for_inversion, new_prompt], it gives an entirely different result. And this is the case even with an empty controller...
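Concretely, a minimal repro sketch (EmptyControl and run_and_display are the helpers from the repo's notebooks; treat the exact signatures as assumptions):

```python
controller = EmptyControl()  # no attention editing at all

# batch size 1: faithfully reproduces the null-text inverted image
images_1, _ = run_and_display([prompt], controller, latent=x_t,
                              uncond_embeddings=uncond_embeddings)

# batch size 2: images_2[0] should be identical to images_1[0],
# but comes out entirely different
images_2, _ = run_and_display([prompt, new_prompt], controller, latent=x_t,
                              uncond_embeddings=uncond_embeddings)
```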

jingweim commented 1 year ago

I dug into this as well. It seems that whenever the batch size is different from what was used during inversion (bs=1 during inversion, bs=2 during ptp), the unet outputs different results. The first divergence I found was after a PyTorch linear layer. This explanation might be why it happens: https://github.com/pytorch/pytorch/issues/9146#issuecomment-409344822

It says that when the input shape is different, BLAS can use an entirely different operation ordering, which can lead to small numerical differences. These small differences barely matter for neural network training and the like, but in our case the first half of the ptp results needs to be exactly the same as the inversion results, so a tiny difference accumulated over many layers can end up being a big one.
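A quick standalone toy demo of the effect (not code from the repo; whether it actually prints False depends on your hardware and BLAS backend):

```python
import torch

torch.manual_seed(0)
linear = torch.nn.Linear(4096, 4096)
x = torch.randn(1, 4096)

with torch.no_grad():
    out_bs1 = linear(x)                   # batch size 1
    out_bs2 = linear(x.repeat(2, 1))[:1]  # same row, inside a batch of 2

# Depending on the backend, the two results may not be bit-identical,
# because a different input shape can trigger a different kernel or
# summation order in the underlying BLAS call.
print(torch.equal(out_bs1, out_bs2))
print((out_bs1 - out_bs2).abs().max())
```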

I think our best bet would be to execute the unet twice: the first time with the exact input we had for inversion, [base prompt uncond, base prompt cond], and the second time with [new prompt uncond, new prompt cond].
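Something along these lines, as a hypothetical sketch of the two-pass idea (the indexing assumes the context is stacked as [src_uncond, new_uncond, src_cond, new_cond], as in the repo's diffusion_step; not tested against the actual pipeline):

```python
import torch

@torch.no_grad()
def two_pass_unet(unet, latents, t, context):
    """Run the unet once per prompt, so each pass keeps the same
    bs=2 [uncond, cond] batch shape as inversion, instead of one bs=4 batch."""
    # latents: [2, 4, 64, 64] -> [source_latent, edit_latent]
    # context: [4, 77, 768]   -> [src_uncond, new_uncond, src_cond, new_cond]
    src_ctx, new_ctx = context[[0, 2]], context[[1, 3]]
    src_in = torch.cat([latents[:1]] * 2)   # exact batch shape of inversion
    new_in = torch.cat([latents[1:]] * 2)
    noise_src = unet(src_in, t, encoder_hidden_states=src_ctx)["sample"]
    noise_new = unet(new_in, t, encoder_hidden_states=new_ctx)["sample"]
    # reassemble in the [uncond..., cond...] order the rest of the step expects
    return torch.cat([noise_src[:1], noise_new[:1],
                      noise_src[1:], noise_new[1:]])
```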

g-jing commented 1 year ago

@jingweim Thanks for your suggestion! I found that directly executing the unet twice during the editing process leads to a bug, since the p2p script expects an even batch size. Did you modify the script and get it running successfully?

g-jing commented 1 year ago

Do you mean 1 works but 2 does not?

failbetter77 commented 1 year ago

> Do you mean 1 works but 2 does not?

Umm... I think I made a mistake earlier; now it runs well!!


g-jing commented 1 year ago

That looks great. Could you share which part you modified? Also, which versions of pytorch, transformers, and diffusers are you using? Thanks.

failbetter77 commented 1 year ago

> That looks great. Could you share which part you modified? Also, which versions of pytorch, transformers, and diffusers are you using? Thanks.

I just modified two things:

1) requirements.txt: diffusers==0.14.0

2) In ptp_utils.py, follow the issue below: replace the original def forward(x, context=None, mask=None) function in def register_attention_control(model, controller) with the code from this comment:

https://github.com/google/prompt-to-prompt/issues/44#issuecomment-1593284782

Plus pytorch: 1.13.1