huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Running `diffusers/stable-diffusion-xl-1.0-inpainting-0.1` with `StableDiffusionXLInpaintPipeline` introduces weird noise when strength is set to 1.0 during inference #8450

Open jackylu0124 opened 1 month ago

jackylu0124 commented 1 month ago

Describe the bug

I am currently comparing inpainting results between the diffusers/stable-diffusion-xl-1.0-inpainting-0.1 model and the stabilityai/stable-diffusion-2-inpainting model, and I noticed that the strength parameter of the __call__() function in StableDiffusionInpaintPipeline defaults to 1.0, whereas the strength parameter of the __call__() function in StableDiffusionXLInpaintPipeline defaults to 0.9999.

What I want to achieve is to use strength=1.0 with StableDiffusionXLInpaintPipeline, because otherwise the original content of the image has a much larger impact on the generated result than the prompt does. For example, if the original image has a blue car and my prompt describes a pink car, using strength=0.9999 or even strength=0.99999999 still produces a blue car. The only way I can effectively avoid this behavior is by setting strength=1.0 in StableDiffusionXLInpaintPipeline. However, using strength=1.0 in StableDiffusionXLInpaintPipeline introduces a lot of noise in the generated image, and increasing the number of inference steps does not help remove it.

For example, the following are the original image (with white pixels added in the margin to better illustrate the weird noise), the original image's corresponding mask, and the inpainted results of the two pipelines. The result from the StableDiffusionXLInpaintPipeline has a lot of noise.

P.S. I also read something that sounds similar in https://github.com/huggingface/diffusers/issues/4392, but I am not sure if the noise I am seeing here is the same thing as what's discussed in that issue. In any case, I would like to know how I can resolve the noise issue when using StableDiffusionXLInpaintPipeline with the diffusers/stable-diffusion-xl-1.0-inpainting-0.1 model and strength=1.0.

Original Image: dog
Original Image's Corresponding Mask: dog_mask
Inpainted Result (StableDiffusionInpaintPipeline with strength=1.0 and the prompt "Furry lion sitting on a bench, high quality, 4k"): sd_2_gen_image
Inpainted Result (StableDiffusionXLInpaintPipeline with strength=1.0 and the prompt "Furry lion sitting on a bench, high quality, 4k"): sd_xl_gen_image

Reproduction

The following is the code I used to generate the inpainted result for both the StableDiffusionInpaintPipeline (with the stabilityai/stable-diffusion-2-inpainting model) and the StableDiffusionXLInpaintPipeline (with the diffusers/stable-diffusion-xl-1.0-inpainting-0.1 model). You can change the boolean value on the line USE_SDXL_INPAINT = True # <=== Change this to generate the inpainted result of the respective pipeline/model. I have also pasted the original image and its corresponding mask in the "Describe the bug" section above.

Code:

import os
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline, StableDiffusionXLInpaintPipeline

os.chdir(os.path.dirname(os.path.abspath(__file__)))

USE_SDXL_INPAINT = True # <=== Change this

def main():
    image_pil = Image.open("./dog.png")
    mask_pil = Image.open("./dog_mask.png")

    if USE_SDXL_INPAINT:
        image_pil = image_pil.resize((1024, 1024))
        mask_pil = mask_pil.resize((1024, 1024))

    image_np = np.array(image_pil)
    mask_np = np.array(mask_pil)

    # Convert the image to a (1, 3, H, W) float16 tensor in [0, 1] on the GPU.
    image_torch = torch.from_numpy(np.expand_dims(np.transpose(image_np / 255, (2, 0, 1)), 0).astype(np.float16)).cuda()
    print("image_torch.size():", image_torch.size())
    print("image_torch.dtype:", image_torch.dtype)
    # Convert the first channel of the mask to a (1, 1, H, W) float16 tensor in [0, 1] on the GPU.
    mask_torch = torch.from_numpy(np.expand_dims(np.transpose(np.expand_dims(mask_np[:, :, 0] / 255, -1), (2, 0, 1)), 0).astype(np.float16)).cuda()
    print("mask_torch.size():", mask_torch.size())
    print("mask_torch.dtype:", mask_torch.dtype)

    if USE_SDXL_INPAINT:
        pipe = StableDiffusionXLInpaintPipeline.from_pretrained("../stable-diffusion-xl-1.0-inpainting-0.1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
    else:
        pipe = StableDiffusionInpaintPipeline.from_pretrained("../stable-diffusion-2-inpainting", torch_dtype=torch.float16).to("cuda")
    pipe.enable_xformers_memory_efficient_attention()

    results = pipe(
        prompt="Furry lion sitting on a bench, high quality, 4k",
        image=image_torch,
        mask_image=mask_torch,
        strength=1.0,
        num_inference_steps=50,
        generator=torch.Generator("cuda").manual_seed(123),
        output_type="np"
    )
    gen_image = results.images[0]
    gen_image_pil = Image.fromarray((gen_image * 255).round().clip(0, 255).astype(np.uint8))  # clip before casting to uint8 to avoid wraparound
    if USE_SDXL_INPAINT:
        gen_image_pil.save("./sd_xl_gen_image.png")
    else:
        gen_image_pil.save("./sd_2_gen_image.png")

if __name__ == "__main__":
    main()

Logs

No response

System Info

System: Windows
GPU: RTX 3090

diffusers-cli env output

Who can help?

@yiyixuxu @sayakpaul

sayakpaul commented 1 month ago

Cc: @asomoza

asomoza commented 1 month ago

Hi, this is something that is mentioned in the model card:

When the strength parameter is set to 1 (i.e. starting in-painting from a fully masked image), the quality of the image is degraded. The model retains the non-masked contents of the image, but images look less sharp. We're investigating this and working on the next version.

There are some more experiments in this issue; in the end, the only method that works is to inpaint and then paste the generated part back over the original image. You'll also need to match the histogram.

IMO it's not a good idea to use a strength of 1.0, which, as you're saying, literally ignores the original image. What you can do is use a generative fill in the area where you want to inpaint; you can also look at using differential diffusion or an inpainting controlnet.

jackylu0124 commented 1 month ago

Hi @asomoza, thank you very much for your reply and insights!

I have the following follow-up questions:

  1. The problem with using a strength less than 1.0 is that even if I set strength=0.99999999, the context of the original image still has too much influence on the generated result. For example, if the original image contains a blue car, my prompt describes a pink car, and I run inference with the mask over the car, the generated image still shows a blue car instead of a pink one. The only way for me to achieve this is by setting strength=1.0, and the generated result looks as expected when I set strength=1.0 in StableDiffusionInpaintPipeline; in fact, strength=1.0 is the default value in StableDiffusionInpaintPipeline's __call__() function. Do you by chance know why using strength=1.0 in StableDiffusionInpaintPipeline works fine (it changes the color of objects in the image based on the prompt without introducing any weird noise), while using strength=1.0 in StableDiffusionXLInpaintPipeline can also change the color of objects based on the prompt but introduces a lot of weird noise? In other words, I would like to know how I can use strength=1.0 in StableDiffusionXLInpaintPipeline just like I did with StableDiffusionInpaintPipeline, but without all the noise in the generated result.
  2. Could you please elaborate on what you mean by "use a generative fill in the area"?

Thank you very much for your time again!

asomoza commented 1 month ago

1.- The reason is the difference in model architecture and training. As far as I know, the only trained inpainting model for SDXL is the one from the diffusers team; no one else has trained one, and it has this one problem when using a strength of 1.0. So for the time being, this is a common problem without a solution until someone else trains another model. Fooocus has one, but it's a black box: I don't know if it's a trained model or a merge, the author didn't provide any information about it, and it was only made for Fooocus with a lot of hard-coded stuff, so it's complicated to port to other solutions.

2.- Generative fill means that you fill the area with something that resembles what you want. There are a couple of methods for this, for example LaMa, PatchMatch, or the OpenCV ones. This is the best method for inpainting when you want to remove or change an object, and it also works if you paint and guide the generation yourself. Automatic1111 has some options for this too, for example filling the area with noise or painting it gray.
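As a minimal sketch of that kind of pre-fill, assuming OpenCV is installed and reusing the file names from the reproduction script above, the classical cv2.inpaint (Telea algorithm) can stand in for heavier options like LaMa or PatchMatch:

import cv2

# Load the original image and the mask (white = area to be replaced).
image = cv2.imread("./dog.png")                            # BGR, uint8
mask = cv2.imread("./dog_mask.png", cv2.IMREAD_GRAYSCALE)  # single channel, uint8

# Classical content-aware fill of the masked area (Telea algorithm).
prefilled = cv2.inpaint(image, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
cv2.imwrite("./dog_prefilled.png", prefilled)

# The pre-filled image can then be passed to the inpainting pipeline with strength < 1.0.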

I still owe some inpainting guides, but maybe you can learn something from the outpainting ones I did, where I show and apply some of these techniques.

If you provide me with some images and what you want to do (where you need a strength of 1.0), I can give you a quick guide on how to achieve it with some other techniques instead.

jackylu0124 commented 1 month ago

Hi @asomoza, thank you for your fast reply and insights!

I am mostly looking for a programmatic solution as opposed to UI tools. Thanks a lot for sharing the link to your outpainting guides! I will take a look at those first.

Regarding the SDXL inpainting model and its training, do you know if the training script used for the stabilityai/stable-diffusion-2-inpainting model is open sourced? Also is the training script for the diffusers/stable-diffusion-xl-1.0-inpainting-0.1 model that's trained by the diffusers team based on the training script that's used for the stabilityai/stable-diffusion-2-inpainting model? And is the training script for the diffusers/stable-diffusion-xl-1.0-inpainting-0.1 model open sourced?

Thanks a lot again!

asomoza commented 1 month ago

I mentioned Automatic1111 just so you know that sometimes filling the area with noise or a gray color can also work, since people use that in that UI.

The stable-diffusion-2-inpainting model was provided by Stability AI from the beginning. I didn't use SD2 that much and I don't really know if they released the training code; it's probably better to look in or ask in their repo about it.

And about the training code for stable-diffusion-xl-1.0-inpainting-0.1: no, there is no open-sourced code for training an SDXL inpainting model, not in diffusers and, as far as I know, not anywhere else.

jackylu0124 commented 1 month ago

I see, thank you very much for your detailed reply! So to confirm, the stable-diffusion-xl-1.0-inpainting-0.1 model was trained by the diffusers team, but it's not open sourced, right?

asomoza commented 1 month ago

The model weights have the same license as the original; this one in particular has an Open RAIL++-M License, and if you're asking whether you can use it commercially, yes.

jackylu0124 commented 1 month ago

If you provide me with some images and what you want to do (where you need a strength of 1.0), I can give you a quick guide on how to achieve it with some other techniques instead.

So here's an example to better illustrate the issue I mentioned above, and also the goal I want to achieve.

The following is the original image: black_jacket

The following is the original image's mask: black_jacket_mask

I used the same script and seed as in the code from my previous message, with the prompt "White dress shirt, high quality, 4k" and the following settings:

Result generated with StableDiffusionXLInpaintPipeline with the diffusers/stable-diffusion-xl-1.0-inpainting-0.1 model with strength=0.99999999: sd_xl_gen_image

Result generated with StableDiffusionXLInpaintPipeline with the diffusers/stable-diffusion-xl-1.0-inpainting-0.1 model with strength=1.0: sd_xl_gen_image_strength_1_0

Result generated with StableDiffusionInpaintPipeline with the stabilityai/stable-diffusion-2-inpainting model with strength=1.0: sd_2_gen_image

As you can see, unless I set strength=1.0 in StableDiffusionXLInpaintPipeline with the diffusers/stable-diffusion-xl-1.0-inpainting-0.1 model, the original image's context (the black color of the jacket) has a much heavier influence than the prompt (the white color in "White dress shirt, high quality, 4k"). What I would like is for the inpainting to be directed more by the prompt than by the original image's context, and so far I have only been able to achieve that by setting strength=1.0. But as you can also see, setting strength=1.0 in StableDiffusionXLInpaintPipeline with the diffusers/stable-diffusion-xl-1.0-inpainting-0.1 model also introduces a lot of noise in the generated result.

I would really appreciate any insights you have on how I could achieve this goal. Thanks a lot again!

asomoza commented 1 month ago

OK, first, you're using a 512px image with SDXL, which is really bad. That is the main reason you're getting those weird borders around the image; for example, with a 1024px image:

20240611145437 20240611152327_1485424270

But then we still see the discoloration and the noise over the white background; it's not that evident if you don't use a white background, though.

Since you're using an extreme case, where you want to inpaint something white over something black, you'll need to remove the black first. I suggest using LaMa for the best results, but since you're using a strength of 1.0 in your example, that means you don't care about the previous content of the image, so you can literally just erase what was there before, or maybe paint it gray or white.

If I have time later I'll give it a try to show you an example.
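In the meantime, a minimal sketch of the "erase and paint it gray" idea, assuming the jacket image and mask are saved as black_jacket.png and black_jacket_mask.png (those file names are just placeholders for the attachments above):

import numpy as np
from PIL import Image

image_pil = Image.open("./black_jacket.png").convert("RGB").resize((1024, 1024))
mask_pil = Image.open("./black_jacket_mask.png").convert("L").resize((1024, 1024))

image_np = np.array(image_pil)
mask_np = np.array(mask_pil)

# Overwrite the masked (white) region with neutral gray so the black jacket
# no longer biases the generation, then run the pipeline with strength < 1.0.
image_np[mask_np > 127] = 128
Image.fromarray(image_np).save("./black_jacket_gray_filled.png")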

jackylu0124 commented 1 month ago

Hi @asomoza , thank you very much for your detailed reply and experiment!

The reason I am using a 512x512 input image and mask is to compare the generated results from the StableDiffusionInpaintPipeline (with the stabilityai/stable-diffusion-2-inpainting model) and the StableDiffusionXLInpaintPipeline (with the diffusers/stable-diffusion-xl-1.0-inpainting-0.1 model), because the StableDiffusionInpaintPipeline (with the stabilityai/stable-diffusion-2-inpainting model) takes a 512x512 input. Also note that in the script I pasted in my previous message, I resize the image to 1024x1024 before passing it into the StableDiffusionXLInpaintPipeline (with the diffusers/stable-diffusion-xl-1.0-inpainting-0.1 model).

But regardless, my main concern is not the "bad weird borders" around the image, but rather the "discoloration and the noise" you mentioned. For example, the color of the person's face in the generated image from your experiment looks a lot more saturated than in the original image, and there is also noise covering the generated image. And to confirm: in order to resolve the issue (where the original image's context has much greater influence than the prompt), I can set strength to less than 1.0 and try replacing the area to be inpainted with gray or white, right?

Would replacing the area to be inpainted with pure black or with random pixel colors solve that issue as well?

Thanks for your help again!

jackylu0124 commented 1 month ago

I see, thank you very much for your detailed reply! So to confirm, the stable-diffusion-xl-1.0-inpainting-0.1 model was trained by the diffusers team, but it's not open sourced, right?

@asomoza Apologies for the confusion earlier, what I meant to ask is that whether the training code/script used by the diffusers team to train the stable-diffusion-xl-1.0-inpainting-0.1 model is open sourced, and if so where can I find it?

Thanks a lot again!

asomoza commented 1 month ago

the color of the person's face in the generated image from your experiment looks a lot more saturated than in the original image, and there is also noise covering the generated image

The difference in saturation and the noise in the background can be fixed by pasting just the inpainted area back into the original image and then matching the histogram; I did that in the post I linked before. That's one solution to this problem.
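A minimal sketch of that paste-back plus histogram matching, assuming scikit-image >= 0.19 is installed and reusing the (placeholder) file names from earlier in the thread:

import numpy as np
from PIL import Image
from skimage.exposure import match_histograms

original = np.array(Image.open("./black_jacket.png").convert("RGB").resize((1024, 1024)))
generated = np.array(Image.open("./sd_xl_gen_image.png").convert("RGB"))
mask = np.array(Image.open("./black_jacket_mask.png").convert("L").resize((1024, 1024)))

# Match the generated image's color distribution to the original image's.
matched = match_histograms(generated, original, channel_axis=-1).round().clip(0, 255).astype(np.uint8)

# Paste only the inpainted (white-masked) region back over the untouched original.
result = original.copy()
region = mask > 127
result[region] = matched[region]
Image.fromarray(result).save("./sd_xl_gen_image_pasted.png")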

in order to resolve the issue (where the original image's context has much greater influence than the prompt), I can set strength to less than 1.0 and try replacing the area to be inpainted with gray or white, right?

Yes, and depending on the use case, you can also use a generative fill for this.

Would replacing the area to be inpainted with pure black or with random pixel colors solve that issue as well?

In this case, no. Pure black would have the same problem as the original image: it's hard for the inpainting model to change something black into something white unless you set the strength to 1.0, which tells the model to completely ignore what was in that area before. Random pixels could work, but not that well, and since they're random, if you get too many dark pixels you'll have the same problem.

@asomoza Apologies for the confusion earlier, what I meant to ask is that whether the training code/script used by the diffusers team to train the stable-diffusion-xl-1.0-inpainting-0.1 model is open sourced, and if so where can I find it?

This code is not public and hasn't been shared. This library only shares training code as basic examples and encourages users to take that code, learn from it, and adapt it for their specific tasks. I don't think there is any training code available for any inpainting model, the same as for, say, multi-aspect-ratio or other more advanced training code.

jackylu0124 commented 3 weeks ago

Hi @asomoza, thank you for your detailed reply and explanation! Also, sorry about my late reply. Would you mind sharing the link to the post you mentioned in

The difference in saturation and the noise in the background can be fixed by pasting just the inpainted area back into the original image and then matching the histogram; I did that in the post I linked before. That's one solution to this problem.

where you demonstrate "pasting" and "histogram matching"?

Thanks a lot for the help again!