huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Unexpected behavior of `image_processor.apply_overlay` #9166

Open · cortwave opened this issue 4 weeks ago

cortwave commented 4 weeks ago

Describe the bug

The apply_overlay method of image_processor behaves unexpectedly if init_image and image have different shapes. For example, this can happen when using StableDiffusionXLInpaintPipeline with padding_mask_crop set.

According to the apply_overlay code, it resizes init_image to the image size and works with all images in those coordinates. So if crop_coords are in init_image coordinates (which is expected and intuitive; StableDiffusionXLInpaintPipeline uses exactly this coordinate system), it pastes image into the wrong place.
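
A minimal sketch of the behavior I'm describing (simplified, not the exact library code, and omitting the mask alpha-blend):

def apply_overlay_sketch(init_image, image, crop_coords):
    # the library first resizes init_image to the *generated* image's size...
    init_image = init_image.resize((image.width, image.height))
    # ...but crop_coords are still in init_image's original coordinate
    # system, so the paste below lands in the wrong place
    x0, y0, x1, y1 = crop_coords
    resized = image.resize((x1 - x0, y1 - y0))
    init_image.paste(resized, (x0, y0))
    return init_image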

Reproduction

This code corresponds to the example from the official documentation.

import torch
from diffusers import StableDiffusionXLInpaintPipeline

from diffusers.utils import load_image
import numpy as np
from PIL import Image

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

# rectangular mask over the region to inpaint, in init_image coordinates
inpainting_mask = np.zeros_like(init_image)  # the PIL image coerces to an array
inpainting_mask[210:555, 71:600] = 255
inpainting_map = Image.fromarray(inpainting_mask)

base_model_name: str = "diffusers/stable-diffusion-xl-1.0-inpainting-0.1"
pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe(
    "cat",
    image=init_image,
    mask_image=inpainting_map,
    strength=0.99,
    num_inference_steps=20,
    guidance_scale=8.5,
    padding_mask_crop=16,  # enables the mask-crop + apply_overlay path
).images[0]

image.save("output.png")

Output image: [attached]

As we can see, the crop was pasted into the wrong location: the crop_coords passed to the apply_overlay method are in the original image's coordinate system, but everything was resized to the generated crop's size.

Logs

No response

System Info

Who can help?

No response

asomoza commented 4 weeks ago

Hi, do you have a use case where this is important? I get that this can be considered a bug, but as I see it, it can be fixed in four ways:

  1. Change the default generation size (width and height) to match the init_image, like it's done with img2img.
  2. Throw an error when they don't match.
  3. Do 1 by default, and 2 when the width and height are explicitly provided.
  4. Stretch the mask to match the final generation and then apply the overlay.

IMO inpainting with a different aspect ratio generates a really bad image, and I don't see why anyone would use it. That's probably why no one noticed this until now.
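
For option 2, a minimal sketch of what that validation could look like (a hypothetical check, not existing pipeline code):

def _check_overlay_sizes(init_image, height, width):
    # hypothetical guard: fail fast when the generation size won't match
    # the init image, since apply_overlay can't paste back correctly then
    if (width, height) != (init_image.width, init_image.height):
        raise ValueError(
            f"`width`/`height` ({width}x{height}) must match `init_image` "
            f"({init_image.width}x{init_image.height}) when `padding_mask_crop` is used."
        )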

cortwave commented 4 weeks ago

@asomoza Hi, I don't understand why the inpainting region should have the same size as the initial image. For example, say I have a 2k x 4k image and want to inpaint only a 1024x1024 square in it; why should the initial image's size degrade the quality of that inpainted square? Also, this problem is not about aspect ratio: for example, if I use the height and width of init_image, everything works fine, but if I use height // 2 and width // 2 (the same aspect ratio), I still have the problem. So that means the original image must have exactly the same resolution as the inpainting area.

Here is a use case where this is an issue. If I have a big image (8k resolution) and want to inpaint some small area, then according to your suggestion I need to downscale my image to 1024x1024 and perform the inpainting. But that means that if I want the image back at its initial size, I have to upscale it again, so I simply lose image quality in an 8x downscale->upscale round trip.

P.S. I agree that inpainting where the mask crop's aspect ratio and the generated crop's aspect ratio differ can lead to bad results due to the anisotropic resize during the overlay operation, but that doesn't mean the original image and the generated crop must have the same size.

asomoza commented 4 weeks ago

I understand better now; I think you have a small misunderstanding of what the code does.

First, to clarify the aspect ratio point: I meant your example; if you don't pass padding_mask_crop=16, you'll see what I mean.

As for the rest: the cropping was added later to match the functionality of sd-webui, which is just this: to get a better-quality inpainting, it makes a square around the inpainting (masked) area, upscales it, does the inpaint, then downscales it again and pastes it over the original image.

For this to work, and to match the positions between the original image and the pasted-back generated image, you need to use a mask that is the same size as the original and generated images. If the mask is a different size than the generated image, the paste back won't line up and you'll get your result.

To get it to work in your case, you need to do these steps outside the pipeline, not inside. For example, if you have an 8k image and you want to inpaint a small square of it, you'll need to manually crop that square to, say, a 1024 square and then use that as the source image. diffusers doesn't do that part for you, and as far as I know, neither do the webuis, except for InvokeAI, which does what I'm describing in its Unified Canvas.
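
Something like this (the coordinates, crop size, and full_image / full_mask names are just for illustration):

# crop a 1024x1024 region out of a large image, inpaint only that
# region, then paste the result back yourself
x0, y0, size = 3000, 2000, 1024
box = (x0, y0, x0 + size, y0 + size)

crop_image = full_image.crop(box)   # 1024x1024 source for the pipeline
crop_mask = full_mask.crop(box)

inpainted = pipe(
    "cat",
    image=crop_image,
    mask_image=crop_mask,
    height=size,
    width=size,
).images[0]

result = full_image.copy()
result.paste(inpainted, (x0, y0))   # back at the original position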

Edit: Forgot about this:

> So that means the original image must have exactly the same resolution as the inpainting area

What is required for this to work is that the generation matches the original image and the mask, not the inpaint area. So to fix your example, you just need to set the width and height to match the other two; the inpainted area will then get upscaled and pasted back correctly.
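
For your reproduction, that just means pinning the generation size to the init image (same call, two extra parameters):

image = pipe(
    "cat",
    image=init_image,
    mask_image=inpainting_map,
    height=init_image.height,  # match the original image...
    width=init_image.width,    # ...so the paste-back coordinates line up
    strength=0.99,
    num_inference_steps=20,
    guidance_scale=8.5,
    padding_mask_crop=16,
).images[0]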

[result image]

cortwave commented 4 weeks ago

> For this to work, and to match the positions between the original image and the pasted-back generated image, you need to use a mask that is the same size as the original and generated images. If the mask is a different size than the generated image, the paste back won't line up and you'll get your result.

I agree that the mask should have the same size as the original image, but I don't agree that it also must be equal to the generated image size. If we cut a crop from the original image and the corresponding crop from the mask, and generate an image of the same size (not the size of the original mask, but the size of the crop), then we can paste that newly generated image back using the crop coordinates of the original image.

It's really possible to do, but the logic of apply_overlay is strange. We should paste the newly generated image into an empty image of init_image size and alpha-blend it with init_image using the non-cropped mask. But instead, we resize init_image to the size of the generated crop.
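
A minimal PIL sketch of the logic I mean (assuming a binary mask at init_image size; not the current library code):

import PIL.Image

def overlay_in_init_coords(mask, init_image, generated, crop_coords):
    x0, y0, x1, y1 = crop_coords                    # in init_image coordinates
    canvas = init_image.copy()
    canvas.paste(generated.resize((x1 - x0, y1 - y0)), (x0, y0))
    # keep the pasted content where the mask is white, init_image elsewhere
    return PIL.Image.composite(canvas, init_image, mask.convert("L"))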

Edit:

> What is required for this to work is that the generation matches the original image and the mask, not the inpaint area. So to fix your example, you just need to set the width and height to match the other two; the inpainted area will then get upscaled and pasted back correctly.

Does that mean that if I want to inpaint a 1024x1024 square on a 4096x4096 image, I have to pass height=4096, width=4096 to match the size?

asomoza commented 4 weeks ago

I fail to understand your logic here:

> If we cut a crop from the original image and the corresponding crop from the mask, and generate an image of the same size (not the size of the original mask, but the size of the crop), then we can paste that newly generated image back using the crop coordinates of the original image.

I'll do what I think you're suggesting:

This is what you're masking:

[masked image]

so the crop is this:

[cropped mask image]

And here is where I get lost: you're generating a 1024x1024 image, so where do you want to paste the generated part? Neither the original image nor the mask is square, and the final image looks like this:

[square result image]

So the original crop coordinates don't match the generated image, and I don't understand how you expect the code to know where to paste the image back, or why you assume the inpainting will match the generated square image.

> Does that mean that if I want to inpaint a 1024x1024 square on a 4096x4096 image, I have to pass height=4096, width=4096 to match the size?

Yes and no. If you have a model that supports that resolution and the VRAM to feed it an image that big, yes. With SDXL it's not possible on a consumer card, and the model doesn't support that resolution anyway.

cortwave commented 4 weeks ago

Ok, we have the following signature for the apply_overlay method:

    def apply_overlay(
        self,
        mask: PIL.Image.Image,
        init_image: PIL.Image.Image,
        image: PIL.Image.Image,
        crop_coords: Optional[Tuple[int, int, int, int]] = None,
    ) -> PIL.Image.Image:

For my example I dumped all input parameters:

mask: [mask image]
init_image: [init_image]
image: [generated image]
crop_coordinates: (55, 102, 616, 663)

Here is the default output: [output image]

Here is how I assume it should really work (I omit the alpha-blending code for simplicity; it doesn't change anything since the mask is binary):

import cv2  # assumes init_image and generated are NumPy arrays (H, W, C)

x0, y0, x1, y1 = (55, 102, 616, 663)  # crop coordinates, in init_image space
w = x1 - x0
h = y1 - y0
# cv2.resize takes dsize as (width, height)
crop_generated = cv2.resize(generated, (w, h), interpolation=cv2.INTER_LANCZOS4)
init_image[y0:y1, x0:x1] = crop_generated

Result of my code snippet: [result image]

As a result, we get the inpainted image at the original size and aspect ratio, not distorted as in your example.
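
For completeness, the alpha-blending step I omitted could look like this (assuming mask is a uint8 array at init_image size, 255 inside the masked region):

import numpy as np

pasted = init_image.copy()
pasted[y0:y1, x0:x1] = crop_generated                 # same paste as above
alpha = (mask.astype(np.float32) / 255.0)[..., None]  # per-pixel blend weight
result = (alpha * pasted + (1.0 - alpha) * init_image).astype(np.uint8)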

asomoza commented 4 weeks ago

What I don't understand in your example and logic is that the mask is not a square, yet you're somehow trying to generate a square inpainting. So, in short, how can this mask:

[mask image]

generate this image:

[generated image]

FYI, what gets generated in the inpainting pipeline is the white part of the mask, and when you use padding_mask_crop, this is the part that gets pasted back into the original image.

Also, as I already said, this is the default behavior in all the other webuis. Moreover, I've never seen anything like what you're suggesting, so if you know any other library/software where inpainting works the way you're suggesting, it would help if you provided a link to it.

cortwave commented 4 weeks ago

When we inpaint with padding_mask_crop, the generated area is not the white area but something like this: [mask image]

From the official doc:

> padding_mask_crop (int, optional, defaults to None) — The size of the margin in the crop to be applied to the image and masking. If None, no crop is applied to the image and mask_image. If padding_mask_crop is not None, it will first find a rectangular region with the same aspect ratio as the image that contains all masked areas, and then expand that area based on padding_mask_crop. The image and mask_image will then be cropped based on the expanded area before resizing to the original image size for inpainting. This is useful when the masked area is small while the image is large and contains information irrelevant to inpainting, such as background.

Pay attention to this part:

> it will first find a rectangular region with the same aspect ratio as the image that contains all masked areas, and then expand that area based on padding_mask_crop
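
Roughly, that documented step amounts to something like this (illustrative only, not the library's actual implementation):

import numpy as np

def padded_mask_box(mask, pad):
    # bounding box of the masked (white) pixels, expanded by `pad`;
    # the real code additionally grows the box to the image's aspect ratio
    ys, xs = np.nonzero(np.asarray(mask.convert("L")))
    x0, x1 = max(int(xs.min()) - pad, 0), min(int(xs.max()) + pad, mask.width)
    y0, y1 = max(int(ys.min()) - pad, 0), min(int(ys.max()) + pad, mask.height)
    return (x0, y0, x1, y1)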

Regarding this:

> Also, as I already said, this is the default behavior in all the other webuis. Moreover, I've never seen anything like what you're suggesting, so if you know any other library/software where inpainting works the way you're suggesting, it would help if you provided a link to it.

The inpainting pipeline could be much easier to use; all we need to do is paste the generated image into the right place.

asomoza commented 4 weeks ago

Oh yeah, you're right, thanks for the correction. I think we should paste just the white masked part and not the whole square but that's another issue.

So, if I finally understand correctly (sorry about that), what you want is this:

  1. The final image should be the same size as the original image.
  2. As it is right now, the masked area should be calculated automatically and be a square.
  3. Upscale the masked area and the corresponding original cropped content to the specified width and height.
  4. The generation should be the same size as the upscaled masked square area, so, for example, a 1024x1024 square inside a 4000x6000 image.
  5. The generated square image should automatically get pasted in the correct position over the original image, which can have a different size and aspect ratio.

I'll let @yiyixuxu decide about this, because to me, more than a bug (which I think just needs a validation check to fix), you're asking for a new feature that seems to be a breaking change.

cortwave commented 3 weeks ago

Yes, I think it could be a great feature for the inpainting pipeline. However, I believe that image_processor.apply_overlay lacks sufficient documentation, making it difficult to understand how it's supposed to work. While the inpainting pipeline itself seems fine, the logic for applying the overlay appears to be somewhat flawed.