huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Much worse performance from StableDiffusionControlNetInpaintPipeline than sd-webui-controlnet #6101

Closed · brandonwsaw closed this issue 8 months ago

brandonwsaw commented 11 months ago

Hey folks, I'm getting much worse behavior with Diffusers than A1111 when using ControlNet Inpainting. I'm using the exact same model, seed, inputs, etc. but it's clear the inpainting behavior is very different. Below is one example but I have more if it's helpful. Lots of artifacts from Diffusers, A1111 essentially just recolors. Thanks for all your help, let me know how else I can be helpful.

Original Image: (image attached)

Diffusers Inpainting: (image attached)

A1111 Inpainting: (image attached)

Diffusers Script:

```python
from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, DDIMScheduler, AutoencoderKL, EulerAncestralDiscreteScheduler
from diffusers.utils import load_image
import numpy as np
import torch
from PIL import Image

init_image = load_image("image.png")
init_image = init_image.resize((1024, 1024))

generator = torch.Generator(device="cpu").manual_seed(478847657)

mask_image = load_image("mask.png")
mask_image = mask_image.resize((1024, 1024))

def make_inpaint_condition(image, image_mask):
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0

    assert image.shape[:2] == image_mask.shape[:2], "image and image_mask must have the same image size"
    image[image_mask > 0.5] = -1.0  # set as masked pixel
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    image = torch.from_numpy(image)
    return image

control_image = make_inpaint_condition(init_image, mask_image)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "stablediffusionapi/anything-v5", controlnet=controlnet, torch_dtype=torch.float16
)

pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# generate images
output_images = pipe(
    "red hair",
    num_inference_steps=20,
    generator=generator,
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
    guidance_scale=7,
    controlnet_conditioning_scale=0.5,
).images

# Save the images
for i, image in enumerate(output_images):
    image.save(f'output{i+0}.png')
```

A1111 Settings: red hair Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 478847657, Size: 1024x1024, Model hash: a1535d0a42, Denoising strength: 1, Mask blur: 4, ControlNet 0: "Module: none, Model: control_v11p_sd15_inpaint [ebff9138], Weight: 0.3, Resize Mode: Crop and Resize, Low Vram: False, Guidance Start: 0, Guidance End: 1, Pixel Perfect: False, Control Mode: Balanced", Version: v1.6.0-2-g4afaaf8a
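
For readers translating between the two setups, here is a rough mapping of those A1111 settings onto the diffusers arguments used in the script above (my own summary of the correspondence, not an exact equivalence):

```python
# Rough A1111-setting -> diffusers-argument correspondence for the run above.
# Note two mismatches in this first comparison: the sampler (Euler a vs DDIMScheduler)
# and the ControlNet weight (0.3 vs controlnet_conditioning_scale=0.5).
a1111_to_diffusers = {
    "Steps: 20": "num_inference_steps=20",
    "Sampler: Euler a": "EulerAncestralDiscreteScheduler (the script uses DDIMScheduler)",
    "CFG scale: 7": "guidance_scale=7",
    "Seed: 478847657": 'torch.Generator(device="cpu").manual_seed(478847657)',
    "Denoising strength: 1": "strength=1.0 (the diffusers default when not passed)",
    "Mask blur: 4": "no direct argument; pre-blur the mask before passing it",
    "ControlNet Weight: 0.3": "controlnet_conditioning_scale=0.5 in the script",
}
```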

--

Bonus Example (Top: Diffusers, Bottom: A1111): (images attached)

bghira commented 11 months ago

do you mind posting the image of the mask?

brandonwsaw commented 11 months ago

> do you mind posting the image of the mask?

Sure:

(mask images attached)

I don't suspect it's related to the mask. This one isn't great, but a similar one was used for the A1111 results, and I'm getting the same problem even with very simple masks, like the eyes one.

bghira commented 11 months ago

thank you. if you search the github issues you'll find one discussing inpainting in Diffusers vs A1111. there's some postprocessing you have to do, using the mask to actually composite the inpainted area into the original image. i wanted to see the mask so i could be more clear what the end result should be.
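
A minimal sketch of that compositing step with PIL (assuming `result` is the pipeline's output image and `init_image`/`mask_image` are the same-size PIL images passed to the pipeline, with white in the mask marking the inpainted region; the blur radius is just an example):

```python
from PIL import Image, ImageFilter

# Keep only the inpainted (masked) region from the result and paste it back
# over the untouched original, optionally feathering the seam a little.
mask_l = mask_image.convert("L").filter(ImageFilter.GaussianBlur(4))
composited = Image.composite(result, init_image, mask_l)  # result where mask is white, original elsewhere
composited.save("output_composited.png")
```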

brandonwsaw commented 11 months ago

> thank you. if you search the github issues you'll find one discussing inpainting in Diffusers vs A1111. there's some postprocessing you have to do, using the mask to actually composite the inpainted area into the original image. i wanted to see the mask so i could be more clear what the end result should be.

Sorry, to clarify: are you saying this is something I can solve myself with some postprocessing of the mask beforehand? I'm not sure I found the right issue you're referencing; do you mean this one? https://github.com/huggingface/diffusers/issues/5808

bghira commented 11 months ago

yes, currently it's done via post-processing.

https://github.com/huggingface/diffusers/issues/4782 https://github.com/huggingface/diffusers/issues/3880

bghira commented 11 months ago

https://github.com/huggingface/diffusers/pull/4536 might actually be what you need.

brandonwsaw commented 11 months ago

Thanks, I'll play around with this, but this issue seems different to me - I'm seeing very different inpainting behavior within the mask than I get from A1111, not issues outside the mask. (Although I actually have noticed that in some other projects, so this is good to know.)

bghira commented 11 months ago

well, DDIM in Diffusers has some issues (mostly #6068 comes to mind), so you might want to try Euler or even Euler A.
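
Swapping the scheduler is a one-line change; EulerDiscreteScheduler roughly corresponds to A1111's "Euler" and EulerAncestralDiscreteScheduler to "Euler a":

```python
from diffusers import EulerAncestralDiscreteScheduler

# reuse the existing scheduler config so the other timestep settings carry over
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
```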

tolgacangoz commented 11 months ago

Hi @brandonwsaw. It seems that you used DDIM in the code but Euler a in A1111. Also, diffusers does not yet support several A1111 features, such as mask blur.

kadirnar commented 11 months ago

They are adding mask_blur support. But the inpaint pipeline doesn't work well.

https://github.com/huggingface/diffusers/pull/6072

yiyixuxu commented 11 months ago

Hi @brandonwsaw

thanks for the issue!

Yeah I think there are lots of differences in settings, most have been summarized by @bghira and @standardAI:

  1. mask_blur: it is just a pre-processing step for the mask; you can use this line to create a blurred mask and use it instead:

         mask_b = mask.filter(ImageFilter.GaussianBlur(0.4))

  2. controlnet_conditioning_scale is different: 0.5 in diffusers vs 0.3 in auto1111

  3. schedulers are different

  4. image sizes are different: the auto1111 config says the output size is 1024; does this mean an upscaler is applied?

  5. post-processing is different: diffusers does not overlay the output onto the original image, and this should be responsible for the difference we see in the unmasked area.

  6. what is "pixel-perfect" in the auto1111 settings? Which option does it correspond to in the UI?

  7. what is the "masked_content" mode here? Is it "original"? If so, to achieve something similar in diffusers you would use a strength value slightly lower than 1.0, e.g. 0.999. In diffusers, when you pass strength == 1.0, it will use random noise as the initial latent, which is similar to the "latent_noise" mode in auto1111.

(See the sketch below for how these adjustments map onto the diffusers pipeline call.)
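
Pulling those points together, here is a rough sketch of what the adjustments look like in diffusers terms (illustrative only; it assumes `pipe`, `init_image`, `mask_image`, `generator`, and `make_inpaint_condition` are set up as in the script earlier in the thread):

```python
from PIL import ImageFilter
from diffusers import EulerAncestralDiscreteScheduler

# (1) emulate A1111's "Mask blur" by pre-blurring the mask; the radius is illustrative
blurred_mask = mask_image.filter(ImageFilter.GaussianBlur(4))

# (3) match the A1111 sampler ("Euler a")
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

result = pipe(
    "red hair",
    image=init_image,
    mask_image=blurred_mask,
    control_image=make_inpaint_condition(init_image, blurred_mask),
    num_inference_steps=20,
    guidance_scale=7,
    generator=generator,
    controlnet_conditioning_scale=0.3,  # (2) match the A1111 ControlNet weight
    strength=0.999,                     # (7) analogue of masked content "original"; 1.0 starts from pure noise
).images[0]

# (5) diffusers does not paste the result back over the original image; composite it
# yourself with the mask (see the PIL Image.composite sketch earlier in the thread)
```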

brandonwsaw commented 11 months ago

Thanks all for your input and help. I had some red herrings in there, my fault - I pasted A1111 settings from a run that didn't match, but I'm seeing the same behavior even when all settings are identical. Here's an example where the settings are identical. You can see A1111 essentially just recolors, while diffusers behaves quite differently inside the mask.

Both are: Euler A, 512x512, CFG 7, ControlNet Weight 0.5, Original Latent, Denoising 1, Mask Blur 0

A1111: (image attached) red hair Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 478847657, Size: 512x512, Model hash: a1535d0a42, Denoising strength: 1, Mask blur: 0, ControlNet 0: "Module: none, Model: control_v11p_sd15_inpaint [ebff9138], Weight: 0.5, Resize Mode: Crop and Resize, Low Vram: False, Guidance Start: 0, Guidance End: 1, Pixel Perfect: False, Control Mode: Balanced", Version: v1.6.0-2-g4afaaf8a

Diffusers: (image attached)


```python
from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, EulerAncestralDiscreteScheduler
from diffusers.utils import load_image
import numpy as np
import torch
from PIL import Image

init_image = load_image("image (1).png")
init_image = init_image.resize((512, 512))

generator = torch.Generator(device="cpu").manual_seed(478847657)

mask_image = load_image("hair-mask (1).png")
mask_image = mask_image.resize((512, 512))

def make_inpaint_condition(image, image_mask):
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0

    assert image.shape[:2] == image_mask.shape[:2], "image and image_mask must have the same image size"
    image[image_mask > 0.5] = -1.0  # set as masked pixel
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    image = torch.from_numpy(image)
    return image

control_image = make_inpaint_condition(init_image, mask_image)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "stablediffusionapi/anything-v5", controlnet=controlnet, torch_dtype=torch.float16
)

pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
pipe.safety_checker = None
pipe.requires_safety_checker = False

# generate images
output_images = pipe(
    "red hair",
    negative_prompt='',
    num_inference_steps=20,
    generator=generator,
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
    guidance_scale=7,
    controlnet_conditioning_scale=0.5,
    strength=0.999,
).images

# Save the images
for i, image in enumerate(output_images):
    image.save(f'output{i+3}.png')
```

brandonwsaw commented 11 months ago

> Yeah I think there are lots of differences in settings, most have been summarized by @bghira and @standardAI: [full list quoted from the comment above]

Thanks, interesting to know about mask blur, post processing, and especially the masked content, but I did play with those and they don't seem responsible. I turned off mask blur and used the 0.999 trick in the example above. A1111 also produces a similar result with mask_content set to latent noise.

I'm not exactly sure what Pixel Perfect is; here's the UI, default is False: (screenshot attached)

yiyixuxu commented 11 months ago

@brandonwsaw

Interesting... thanks a lot for these additional experiments! Can we set controlnet_conditioning_scale = 0 in both to compare? I just want to see whether the difference is coming from the controlnet part or the inpaint part.

brandonwsaw commented 11 months ago

Sure, here's with the control weight at 0:

A1111: (image attached) red hair Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 478847657, Size: 512x512, Model hash: a1535d0a42, Denoising strength: 1, Mask blur: 0, ControlNet 0: "Module: none, Model: control_v11p_sd15_inpaint [ebff9138], Weight: 0, Resize Mode: Crop and Resize, Low Vram: False, Guidance Start: 0, Guidance End: 1, Pixel Perfect: False, Control Mode: Balanced", Version: v1.6.0-2-g4afaaf8a

Diffusers: (image attached)

    "red hair",
    negative_prompt='',
    num_inference_steps=20,
    generator=generator,
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
    guidance_scale=7,
    controlnet_conditioning_scale=0.0,
    strength=0.999,
).images
yiyixuxu commented 11 months ago

@brandonwsaw thanks! Will look into it now :)

yiyixuxu commented 11 months ago

hi @brandonwsaw There are a few things I noticed here:

  1. the image and mask you provided have different aspect ratios: the image is 395 x 393 (not 1:1) and the mask is 572 x 572 (1:1), so simply running PIL.Image.resize() on both will cause the image and mask to slightly mismatch. In auto1111 you used "crop and resize", which crops the image to 393 x 393 first before resizing to 572 x 572 (a sketch of that cropping step follows after the attachments below).
  2. I don't think "simply recolor the hair" is the expected behavior, even for the inpaint controlnet in auto1111. Normally it would use the "masked image" as input for the controlnet, which would be the same as diffusers, i.e. the output of make_inpaint_condition. However, in this particular example, because the image and mask you provided have different sizes, it decided to use the "image" instead of the "masked image" as the control_image; here is an example output from auto1111 when your image and mask have the same size: (image attached)
  3. if you want to use this pipeline to only recolor hair, you can modify the make_inpaint_condition function. This script will generate the same result as auto1111:
    
```python
from diffusers.utils import load_image
import numpy as np
import torch
from PIL import Image
from diffusers import EulerAncestralDiscreteScheduler, ControlNetModel, StableDiffusionControlNetInpaintPipeline

init_image = load_image("yiyi_image_girl.png")

generator = torch.Generator(device="cpu").manual_seed(478847657)

mask_image = load_image("yiyi_image_mask_girl.png")

def make_inpaint_condition(image, image_mask):
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    image = torch.from_numpy(image)
    return image

control_image = make_inpaint_condition(init_image, mask_image)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "stablediffusionapi/anything-v5", controlnet=controlnet, torch_dtype=torch.float16
)

pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
pipe.safety_checker = None
pipe.requires_safety_checker = False

# generate images
output_images = pipe(
    "red hair",
    num_inference_steps=20,
    generator=generator,
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
    guidance_scale=7,
    controlnet_conditioning_scale=0.5,
    strength=0.999,
).images

# Save the images
for i, image in enumerate(output_images):
    image.save(f'test_5_output{i+3}.png')
```


image
![yiyi_image_girl](https://github.com/huggingface/diffusers/assets/12631849/89cf30a3-11c1-4053-84da-cf665d315642)
mask
![yiyi_image_mask_girl](https://github.com/huggingface/diffusers/assets/12631849/024d56df-05f5-474a-a699-c88ab358b68d)
output
![yiyi_test_5_output3](https://github.com/huggingface/diffusers/assets/12631849/4b1a88b9-1993-4b37-90b0-f80f52643149)
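
On point 1, if you want to reproduce auto1111's "Crop and Resize" behavior in diffusers instead of letting a plain PIL resize distort a non-square input, a center-crop-then-resize helper along these lines should work (a minimal sketch; the function name and crop math are my own, not a diffusers API, and the file names are the ones from the earlier script):

```python
from PIL import Image

def crop_and_resize(img: Image.Image, size: int = 512) -> Image.Image:
    # Center-crop to a square first, then resize, so a non-square input
    # is not stretched the way a plain resize() would stretch it.
    w, h = img.size
    side = min(w, h)
    left = (w - side) // 2
    top = (h - side) // 2
    return img.crop((left, top, left + side, top + side)).resize((size, size))

init_image = crop_and_resize(Image.open("image.png").convert("RGB"))
mask_image = crop_and_resize(Image.open("mask.png").convert("L"))
```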
kadirnar commented 11 months ago

@yiyixuxu ,

How can I do this for SDXL? Because there is no sdxl-controlnet-inpaint model. https://github.com/Mikubill/sd-webui-controlnet/discussions/2225

tolgacangoz commented 11 months ago

> @yiyixuxu ,
>
> How can I do this for SDXL? Because there is no sdxl-controlnet-inpaint model. Mikubill/sd-webui-controlnet#2225

Isn't this what you are looking for or did I understand something wrong?

kadirnar commented 11 months ago

>> @yiyixuxu , How can I do this for SDXL? Because there is no sdxl-controlnet-inpaint model. Mikubill/sd-webui-controlnet#2225
>
> Isn't this what you are looking for or did I understand something wrong?

No. I'm looking for the sdxl version of this model.

https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint

tolgacangoz commented 11 months ago

> @yiyixuxu , How can I do this for SDXL? Because there is no sdxl-controlnet-inpaint model. Mikubill/sd-webui-controlnet#2225
>
> Isn't this what you are looking for or did I understand something wrong?
>
> No. I'm looking for the sdxl version of this model.
>
> https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint

OK then, sry 😅.

brandonwsaw commented 11 months ago

@yiyixuxu thanks for looking into this. I don't think mask size is the issue here - I grabbed a quick screenshot with the snip tool to post here, which is why one of them has slightly different dimensions. But the image/mask I used in my script are both 512x512 (below). And in A1111, I'm using their native inpaint function to draw on top of the original image, so the image/mask must be identical.

Interesting, I'll give that modified inpaint condition a shot; it seems neat. But I do suspect there's something going on with the controlnet, since I'm getting worse results even outside of hair recoloring. Here's an example of changing the mouth; again, the results are pretty different. It's harder to see the differences because the region is smaller (that's why I picked the hair example to show), but Diffusers has more artifacts, blurry lines, and generally lower quality.

Don't want to take up more of your time if you don't think there's something underlying here, but after spending a lot of time trying to recreate A1111 results with Diffusers across different experiments, it feels like the controlnet for Diffusers isn't as effective for inpainting.

A1111: (image attached) open mouth, talking, laughing Negative prompt: closed mouth Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 478847657, Size: 1024x1024, Model hash: a1535d0a42, Denoising strength: 1, Mask blur: 0, ControlNet 0: "Module: none, Model: control_v11p_sd15_inpaint [ebff9138], Weight: 0.3, Resize Mode: Crop and Resize, Low Vram: False, Guidance Start: 0, Guidance End: 1, Pixel Perfect: False, Control Mode: Balanced", Version: v1.6.0-2-g4afaaf8a

Diffusers: (image attached)


```python
from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, EulerAncestralDiscreteScheduler
from diffusers.utils import load_image
import numpy as np
import torch
from PIL import Image

init_image = load_image("image.png")
init_image = init_image.resize((1024, 1024))

generator = torch.Generator(device="cpu").manual_seed(478847657)

mask_image = load_image("mouth-mask.png")
mask_image = mask_image.resize((1024, 1024))

def make_inpaint_condition(image, image_mask):
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0

    assert image.shape[:2] == image_mask.shape[:2], "image and image_mask must have the same image size"
    image[image_mask > 0.5] = -1.0  # set as masked pixel
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    image = torch.from_numpy(image)
    return image

control_image = make_inpaint_condition(init_image, mask_image)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "stablediffusionapi/anything-v5", controlnet=controlnet, torch_dtype=torch.float16
)

pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
pipe.safety_checker = None
pipe.requires_safety_checker = False

# generate images
output_images = pipe(
    "open mouth, talking, laughing",
    negative_prompt='closed mouth',
    num_inference_steps=20,
    generator=generator,
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
    guidance_scale=7,
    controlnet_conditioning_scale=0.3,
    strength=0.999,
).images

# Save the images
for i, image in enumerate(output_images):
    image.save(f'output{i+8}.png')
```

bghira commented 11 months ago

i think the difference comes down to seeds. although A1111's output has worse image compression artifacts.

the inpainted mouth looks bad there, too. some kind of image ghosting, lips where they don't belong or something?

(image attached)

as opposed to Diffusers...

(image attached)

but i don't think it's "much worse results" with Diffusers. am i missing it? i don't have the best eyes.
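
If it's hard to judge by eye, one way to make the comparison concrete is to measure how far each output drifts from the source image inside and outside the mask (a quick numpy sketch; the file names are placeholders and all three images are assumed to be the same size):

```python
import numpy as np
from PIL import Image

def masked_diff(original_path, output_path, mask_path):
    """Mean absolute pixel difference from the original, inside vs. outside the mask."""
    orig = np.asarray(Image.open(original_path).convert("RGB"), dtype=np.float32)
    out = np.asarray(Image.open(output_path).convert("RGB"), dtype=np.float32)
    mask = np.asarray(Image.open(mask_path).convert("L"), dtype=np.float32) > 127
    diff = np.abs(out - orig).mean(axis=-1)
    return diff[mask].mean(), diff[~mask].mean()

# placeholder file names; substitute the actual outputs being compared
print("diffusers:", masked_diff("image.png", "diffusers_output.png", "mouth-mask.png"))
print("a1111:", masked_diff("image.png", "a1111_output.png", "mouth-mask.png"))
```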

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

simbrams commented 9 months ago

Hey, I'm running into the same issue. Did you guys find a solution to this small quality difference?

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.