huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

StableDiffusionControlNetImg2ImgPipeline can't follow pipeline parameters to control the size of output #6335

Closed: sunhaozhepy closed this issue 7 months ago

sunhaozhepy commented 8 months ago

Describe the bug

I want to turn an image of resolution 512x631 into a new image of the same resolution. It is vital that the two have the same resolution, otherwise I can't use them in downstream tasks. However, StableDiffusionControlNetImg2ImgPipeline keeps returning images of resolution 512x624, even when I explicitly ask for 512x631 in the pipeline call.

Reproduction

import torch
from PIL import Image
from diffusers import (
    AutoencoderKL,
    ControlNetModel,
    DPMSolverMultistepScheduler,
    StableDiffusionControlNetImg2ImgPipeline,
)

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda")
# I think the controlnet bit is irrelevant, but I still put the code here so that we could better figure out what is going wrong
controlnet = ControlNetModel.from_pretrained("controlnet_celeba_hq", torch_dtype=torch.float16).to("cuda")

pipeline = StableDiffusionControlNetImg2ImgPipeline.from_pretrained("SG161222/Realistic_Vision_V5.1_noVAE", torch_dtype=torch.float16, safety_checker=None, controlnet=controlnet).to("cuda")
pipeline.vae = vae
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

pipeline.load_lora_weights("./lkw-lora-sd-v1-5")

prompt = "a man with a blue shirt"
negative_prompt = "(deformed iris, deformed pupils, semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime:1.4), text, close up, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"

strength = 0.3

img = Image.open("test.png")

# get_landmarks and get_condition_image are project-specific helpers (not included here)
landmarks = get_landmarks(img)
condition_image = get_condition_image(landmarks, size=img.size)

output = pipeline(prompt, image=img, control_image=condition_image, strength=strength, negative_prompt=negative_prompt, num_inference_steps=30, guidance_scale=5, height=631, width=512, cross_attention_kwargs={"scale": 1}).images[0]

Logs

No response

System Info

Who can help?

@sayakpaul @patrickvonplaten

sayakpaul commented 8 months ago

Even if parts of the code don't matter for the bug, please provide a complete and minimal reproducible snippet. In your case there are multiple unknowns, such as get_landmarks() and ./lkw-lora-sd-v1-5.

So, I recommend providing something we can reproduce, starting from one of our official code examples such as this one: https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet#diffusers.StableDiffusionControlNetImg2ImgPipeline

Cc: @yiyixuxu

sunhaozhepy commented 8 months ago

I see. I tried the official example and managed to reproduce my error with it:

from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image
import numpy as np
import torch
from torchvision.transforms import Resize

import cv2
from PIL import Image

# download an image
image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)

transform = Resize((631, 512))
image = transform(image)

np_image = np.array(image)

# get canny image
np_image = cv2.Canny(np_image, 100, 200)
np_image = np_image[:, :, None]
np_image = np.concatenate([np_image, np_image, np_image], axis=2)
canny_image = Image.fromarray(np_image)

# load control net and stable diffusion v1-5
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)

# speed up diffusion process with faster scheduler and memory optimization
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# generate image
generator = torch.manual_seed(0)
image = pipe(
    "futuristic-looking woman",
    num_inference_steps=20,
    generator=generator,
    image=image,
    control_image=canny_image,
).images[0]

# for me output is (512, 624)
print(image.size)

sayakpaul commented 8 months ago

Hmm, from a quick glance it seems like we fix the height and width here:

https://github.com/huggingface/diffusers/blob/a3d31e3a3eed1465dd0eafef641a256118618d32/src/diffusers/image_processor.py#L307
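
For reference, here is a minimal sketch of what that rounding appears to do, assuming both dimensions are floored to an integer multiple of vae_scale_factor (8 for SD v1.5). This is a paraphrase based on the linked line, not the exact implementation; with a requested height of 631 it yields 624:

# Paraphrase of the rounding the image processor appears to apply
# (assumption: dimensions are floored to a multiple of vae_scale_factor)
vae_scale_factor = 8  # SD v1.5 latents are 8x smaller than the decoded image

def floor_to_multiple(dim, factor=vae_scale_factor):
    return dim - dim % factor

print(floor_to_multiple(631))  # 624
print(floor_to_multiple(512))  # 512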

sunhaozhepy commented 8 months ago

Hi! Is there any update on this?

By the way, I've tested the default pipeline and found that even if we specify the height and width, they are still ignored:

from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image
import numpy as np
import torch
from torchvision.transforms import Resize

import cv2
from PIL import Image

# download an image
image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)

transform = Resize((631, 512))
image = transform(image)

np_image = np.array(image)

# get canny image
np_image = cv2.Canny(np_image, 100, 200)
np_image = np_image[:, :, None]
np_image = np.concatenate([np_image, np_image, np_image], axis=2)
canny_image = Image.fromarray(np_image)

# load control net and stable diffusion v1-5
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)

# speed up diffusion process with faster scheduler and memory optimization
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# generate image
generator = torch.manual_seed(0)
image = pipe(
    "futuristic-looking woman",
    num_inference_steps=20,
    generator=generator,
    image=image,
    control_image=canny_image,
    height=631,
    width=512
).images[0]

# for me output is still (512, 624)
print(image.size)

andypotato commented 7 months ago

@sunhaozhepy correct me if I'm wrong, but the width and height of images generated with Stable Diffusion should always be multiples of 8. Your height of 631 is not, so I assume it falls back to 624.
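
If that is the case, one way to avoid the silent fallback is to round the requested height up to the next multiple of 8 before calling the pipeline. A small sketch (next_multiple_of_8 is a hypothetical helper, not a diffusers API):

def next_multiple_of_8(dim):
    # round up so the pipeline does not silently shrink the output
    return dim + (-dim) % 8

height, width = 631, 512
print(next_multiple_of_8(height), next_multiple_of_8(width))  # 632 512
# then call the pipeline with height=632, width=512 instead of 631x512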

hi-sushanta commented 7 months ago

Please follow this link to better understand what @andypotato is saying: solution

sunhaozhepy commented 7 months ago

That makes sense! Thank you all for your help, @andypotato @hi-sushanta @sayakpaul. I'll probably change the resolution of my original images.
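
If the downstream task strictly needs 512x631, another possible workaround (a sketch, assuming a plain PIL resize of the final image is acceptable) is to generate at the nearest valid resolution and resize the result back:

from PIL import Image

target_width, target_height = 512, 631   # resolution required downstream

# `generated` stands in for the pipeline output produced at 512x632, e.g.
# generated = pipe(prompt, image=image, control_image=canny_image,
#                  height=632, width=512).images[0]
generated = Image.new("RGB", (512, 632))

# resize back to the exact resolution needed downstream
restored = generated.resize((target_width, target_height), Image.LANCZOS)
print(restored.size)  # (512, 631)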