Closed: Daisy5296 closed this issue 1 year ago.
In my understanding, the ptp-generated source image should be the same as the output image of the null-text inversion. Is that correct? Any thoughts on this would be much appreciated!
Hi, thanks for your work!
When I tried some real images, the null-text inversion output was fine, but the ptp editing output was totally different from the input, for both the source image and the edited image. Could you please help explain why this happens? And do you have any advice on how to solve it?
I had exactly the same problem: null-text inversion works perfectly well, but the first output of ptp editing isn't the same as the null-text inverted one (it's supposed to be identical). I managed to narrow down the problem: using x_t, uncond_embeddings, and prompts = [prompt_used_for_inversion] gives the correct inverted result. However, once prompts has a batch size > 1, for example [prompt_used_for_inversion, new_prompt], it gives an entirely different result. And this is the case even with an empty controller...
I dug into this as well. It seems that whenever the batch size differs from the one used during inversion (bs=1 during inversion, bs=2 during ptp), the unet outputs different results. The first divergence I found was after a pytorch linear layer. This explanation might be why it happens: https://github.com/pytorch/pytorch/issues/9146#issuecomment-409344822
It says that when the input shape differs, BLAS may choose an entirely different operation ordering, which can lead to small numerical differences. These small differences rarely matter for neural network training, but in our case the first half of the ptp results needs to be exactly the same as the inversion results, so a small difference propagated through many layers can end up being a big one.
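To see why operation ordering alone can change a result, here is a minimal illustration (not from the p2p codebase): floating-point addition is not associative, so summing the same numbers in a different order can give a different answer, and a BLAS kernel is free to reorder its accumulations depending on the input shape.

```python
# Floating-point addition is not associative: the same three numbers,
# summed in two different orders, give two different results.
a = [1e16, 1.0, -1e16]

left_to_right = (a[0] + a[1]) + a[2]  # 1.0 is absorbed into 1e16, lost to rounding
reordered = (a[0] + a[2]) + a[1]      # the two large terms cancel first

print(left_to_right)  # 0.0
print(reordered)      # 1.0
```

In a deep unet, a tiny discrepancy like this at one layer is then amplified through dozens of subsequent layers and 50 denoising steps.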
I think our best bet would be to execute the unet two times. First time the exact input we had for inversion, [base prompt uncond, base prompt cond]. And the second time run [new prompt uncond, new prompt cond].
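A rough sketch of that idea, with a stand-in stub for the unet (the function name and toy shapes here are hypothetical, not the real p2p code): each call uses the same [uncond, cond] batch layout as inversion, so the base branch reproduces the inversion numerics exactly.

```python
import numpy as np

# Hypothetical stub standing in for the diffusion model's unet;
# it just needs to be a function of (latents, embeddings) for this sketch.
def unet(latents, embeddings):
    return latents + embeddings.mean(axis=-1, keepdims=True)

latents = np.zeros((1, 4))        # toy latent for one image
base_uncond = np.ones((1, 8))     # toy text embeddings
base_cond = np.ones((1, 8)) * 2
new_uncond = np.ones((1, 8)) * 3
new_cond = np.ones((1, 8)) * 4

# First pass: exactly the [uncond, cond] batch used during inversion,
# so the base branch follows the inversion trajectory bit-for-bit.
noise_base = unet(np.concatenate([latents, latents]),
                  np.concatenate([base_uncond, base_cond]))

# Second pass: the edited prompt's [uncond, cond] batch, run separately
# instead of being stacked into one bs=4 batch with the base prompt.
noise_new = unet(np.concatenate([latents, latents]),
                 np.concatenate([new_uncond, new_cond]))

print(noise_base.shape, noise_new.shape)  # (2, 4) (2, 4)
```

The attention controller would then need to be fed the base pass's attention maps when running the second pass, since the two prompts no longer share a batch.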
@jingweim Thanks for your suggestion! I found that directly executing the unet twice during the editing process leads to a bug, since the p2p script expects an even batch size. Did you modify the script and get it running successfully?
Do you mean 1 works but 2 does not work?
Umm, I think I made a mistake earlier. It runs well now!
That looks great. Could you share what part did you modify? Also what version of pytorch, transformers, and diffusers are you using? Thanks.
I just modified two things:
1) requirements.txt: diffusers==0.14.0
2) In ptp_utils.py, follow the issue below:
replace the original `def forward(x, context=None, mask=None)` function inside `def register_attention_control(model, controller)` of ptp_utils.py with the code from
https://github.com/google/prompt-to-prompt/issues/44#issuecomment-1593284782
Also, pytorch: 1.13.1