google / prompt-to-prompt

Apache License 2.0
2.98k stars · 279 forks

Inconsistency in the generation of image #32

Open yasserben opened 1 year ago

yasserben commented 1 year ago

Hi ! :smiley:

I have a problem with the prompt-to-prompt notebook: the image of the squirrel changes slightly between the Cross-Attention Visualization cell:

g_cpu = torch.Generator().manual_seed(8888)
prompts = ["A painting of a squirrel eating a burger"]
controller = AttentionStore()
image, x_t = run_and_display(prompts, controller, latent=None, run_baseline=False, generator=g_cpu)
show_cross_attention(controller, res=16, from_where=("up", "down"))

and the Replacement edit cell:

prompts = ["A painting of a squirrel eating a burger",
           "A painting of a lion eating a burger"]

controller = AttentionReplace(prompts, NUM_DIFFUSION_STEPS, cross_replace_steps=.8, self_replace_steps=0.4)
_ = run_and_display(prompts, controller, latent=x_t, run_baseline=True)

[comparison image: the squirrel generated by each cell, differing regions circled in black]

The one on the left was generated by the Cross-Attention Visualization cell, and the one on the right by the **Replacement edit** cell. If you look closely at the two black circles, you'll see a difference between the two squirrels, which is not supposed to happen, I guess. I think you can reproduce the same error by running the following code:

controller = EmptyControl()

g_cpu = torch.Generator().manual_seed(8888)
prompts = ["A painting of a squirrel eating a burger"]
image_1, x_t = run_and_display(prompts, controller, latent=None, run_baseline=False, generator=g_cpu)

g_cpu = torch.Generator().manual_seed(8888)
prompts = ["A painting of a squirrel eating a burger","A painting of a squirrel eating a burger"]
image_2, x_t = run_and_display(prompts, controller, latent=None, run_baseline=False, generator=g_cpu)

The single squirrel in image_1 is different from the two squirrels generated in image_2.
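One way to quantify the discrepancy is a per-pixel comparison of the two outputs. A minimal sketch, assuming `image_1` and `image_2` are the `uint8` image arrays returned by `run_and_display` (the random arrays below are stand-ins so the snippet runs on its own):

```python
import numpy as np

# Stand-ins for the notebook outputs; in the notebook these would be
# image_1[0] and image_2[0] from run_and_display.
rng = np.random.default_rng(0)
image_1 = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
image_2 = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)

# Cast to a signed type before subtracting so uint8 wrap-around
# does not hide differences.
max_diff = np.abs(image_1.astype(np.int16) - image_2.astype(np.int16)).max()
print("max per-pixel difference:", max_diff)  # 0 would mean identical images
```

If the two runs were truly deterministic and batch-independent, `max_diff` would be 0.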

After going through the code a little, I suspect it comes from the size of the prompt list: it contains two sentences, so we have batch_size = 2 in the Replacement cell. I think that's why it doesn't generate exactly the same picture as when we use only one sentence. So the batch size seems to influence the generated image even when the prompt itself stays the same.

The same problem arises in the Null-text inversion notebook:

[image: difference_cat]

Thank you for your help !! :blush:

Sang-kyung commented 10 months ago

Has anyone solved this issue? I think generating one image at a time (i.e. iterating over the UNet twice, once per image) would be the solution. It would be really appreciated if someone could share the code modifications.
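The suggestion above could be sketched like this. Hypothetical: `generate_one` stands in for a single-prompt call into the pipeline (e.g. `run_and_display` with a one-element prompt list), so every image is produced with batch_size = 1 and the batch size can no longer influence the result:

```python
import numpy as np

def generate_one(prompt, seed):
    # Stand-in for one single-prompt diffusion run; the real notebook
    # would call run_and_display([prompt], ...) here.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((3, 64, 64))

prompts = ["A painting of a squirrel eating a burger",
           "A painting of a lion eating a burger"]

# Run the model once per prompt (batch_size = 1 each time), reusing the
# same seed so both runs start from the same initial latent.
images = [generate_one(p, seed=8888) for p in prompts]
```

Note that running prompts separately loses the shared attention maps that prompt-to-prompt relies on, so the attention controller would still need to be fed the stored maps from the first run.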

RahulSajnani commented 7 months ago

You can use the DDIM latents that are stored during the null-text optimization and initialize the first image from those latents when editing.
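A minimal sketch of that idea, with hypothetical names: `ddim_inversion_step` stands in for one step of the DDIM inversion loop in the null-text notebook (which would call the UNet), and the point is simply to keep every intermediate latent so editing can start from the stored `x_t` rather than fresh noise:

```python
import numpy as np

def ddim_inversion_step(latent, t):
    # Stand-in for one DDIM inversion step; the real code would run the
    # UNet and apply the inverted DDIM update here.
    return latent * 0.99 + t * 1e-3

# Run inversion once and keep every intermediate latent.
latent = np.zeros((4, 64, 64))
all_latents = [latent]
for t in range(50):
    latent = ddim_inversion_step(latent, t)
    all_latents.append(latent)

# When editing, start from the stored inverted latent instead of sampling
# new noise, e.g. pass it as latent=x_t to run_and_display.
x_t = all_latents[-1]
```

Reusing the stored latents guarantees both the visualization and the edit start from exactly the same `x_t`, which removes one source of the inconsistency.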