huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

ControlNet v1.1 #3095

Closed takuma104 closed 1 year ago

takuma104 commented 1 year ago

Today, ControlNet v1.1 was released. For now it seems to be positioned as a preview, and the authors are particularly working on improving the annotator (image preprocessing) code. Most of the model weights are said to be already production-ready.


Model weights:

https://huggingface.co/lllyasviel/ControlNet-v1-1

ControlNet 1.1 includes 14 models (11 production-ready models, 2 experimental models, and 1 unfinished model):

control_v11p_sd15_canny
control_v11p_sd15_mlsd
control_v11p_sd15_depth
control_v11p_sd15_normalbae
control_v11p_sd15_seg
control_v11p_sd15_inpaint
control_v11p_sd15_lineart
control_v11p_sd15s2_lineart_anime
control_v11p_sd15_openpose
control_v11p_sd15_scribble
control_v11p_sd15_softedge
control_v11e_sd15_shuffle
control_v11e_sd15_ip2p
control_v11u_sd15_tile

The weights have not been converted for Diffusers yet, but I think we can convert them using scripts/convert_original_controlnet_to_diffusers.py.

Addendum:

I have released the converted weights for testing purposes. To use them, pass the model name (the filename without the .pth extension) as the subfolder, like this:

from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained('takuma104/control_v11', subfolder='control_v11p_sd15_canny')
pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet).to("cuda")
image = pipe(prompt="...", image=conditional_image).images[0]

At the moment, I have confirmed normal operation for canny, depth, mlsd, normalbae, openpose, scribble, seg, softedge, lineart, and lineart_anime. For normalbae, the control images created for v1.0 are no longer compatible and the correct images are not generated; it seems necessary to recreate them with the new annotator.

Model architecture:

The neural network structure is expected to remain unchanged from v1.0 until v1.5, so although I haven't tested it yet, the models will most likely work almost as-is with the current StableDiffusionControlNetPipeline. However, some changes seem to be necessary for proper usage of some of them.

This was a quick report. I'm thinking of trying to proceed with testing and verification on my end as well.

Conobi commented 1 year ago

On my side, the v1.1 hough/mlsd model works successfully; however, for the normalbae model I'm unable to unpickle the weights.

python ../scripts/convert_original_controlnet_to_diffusers.py --checkpoint_path control_v11p_sd15_normalbae.pth --original_config_file control_v11p_sd15_normalbae.yaml --dump_path control_v11p_sd15_normalbae --device cpu:

Traceback (most recent call last):
  File "/tmp/controlnet-v11/diffusers/convert/../scripts/convert_original_controlnet_to_diffusers.py", line 80, in <module>
    controlnet = download_controlnet_from_original_ckpt(
  File "/tmp/controlnet-v11/diffusers/src/diffusers/pipelines/stable_diffusion/convert_from_ckpt.py", line 1346, in download_controlnet_from_original_ckpt
    checkpoint = torch.load(checkpoint_path, map_location=device)
  File "/tmp/controlnet-v11/lib/python3.10/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/tmp/controlnet-v11/lib/python3.10/site-packages/torch/serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

takuma104 commented 1 year ago

Hi @Donokami, hmm, it seems further investigation is needed. On my end, all the conversions appear to have been successful. For now, I have released the converted weights and added a description to the initial post. By specifying control_v11p_sd15_normalbae as the subfolder, you should be able to use normalbae.
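For example:

controlnet = ControlNetModel.from_pretrained('takuma104/control_v11', subfolder='control_v11p_sd15_normalbae')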

sayakpaul commented 1 year ago

@patrickvonplaten cc ^

@takuma104 thanks for your hard work as always! We're also working on it internally. Should be ready soon :)

Will keep this issue open until that's done.

ghpkishore commented 1 year ago

There seems to be an issue with using 5 ControlNets with Gradio in MultiControlNet with diffusers. I tried running it without Gradio and was able to fit even 6 ControlNets to generate output images. However, when I run it with Gradio, the code fails after completing the steps. It works with up to 4 controls, but fails when that increases to 5.

By failure, I mean the SSH connection to my server gets disconnected. It felt very weird.

sayakpaul commented 1 year ago

Could you open a separate issue for this? If the code runs in a non-gradio environment, then I suggest opening the issue in the Gradio repository.

takuma104 commented 1 year ago

I have created a comparison with the reference implementation, generating all the conditional images with the new gradio_annotator.py: https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare_v11/README.md

From version 1.1, ControlNet no longer includes the base model (such as SD1.5), so the results now match almost pixel-perfectly. The slight differences in brightness may be due to differences in rounding when converting float pixel values to integers.
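For example, truncation versus round-to-nearest when converting a float pixel back to an 8-bit value can differ by one level (illustrative values only):

import numpy as np

x = np.float32(0.999)           # a float pixel value in [0, 1]
print(int(x * 255))             # 254 (truncation)
print(int(np.round(x * 255)))   # 255 (round to nearest)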

I have checked the following: canny, depth, mlsd, normalbae, openpose, scribble, seg, softedge, lineart, lineart_anime

In terms of photorealism, the new softedge seems to perform quite well. The lineart and lineart_anime models also demonstrate impressive coloring performance from hand-drawn line art.

The remaining models to verify are: inpaint, shuffle, ip2p, tile.

sayakpaul commented 1 year ago

Immense thanks to @patrickvonplaten ❤️

New ControlNet v1.1 checkpoints have been released on the Hub! The release includes 14 new checkpoints, with some cool applications such as the Instruct-Pix2Pix ControlNet.

Model cards contain all the details you need to try it out 🌠 https://huggingface.co/models?sort=downloads&search=lllyasviel%2Fcontrol_v11
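For example, a minimal sketch for the openpose checkpoint (the conditioning image is assumed to be precomputed with the openpose annotator; the path and prompt below are placeholders):

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_openpose",
                                             torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5",
                                                         controlnet=controlnet,
                                                         torch_dtype=torch.float16).to("cuda")

pose_image = load_image("path/to/openpose_condition.png")  # placeholder: precomputed pose map
image = pipe(prompt="a dancer on stage", image=pose_image, num_inference_steps=20).images[0]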

Therefore, I am closing this issue. But feel free to reopen.

patrickvonplaten commented 1 year ago

@takuma104 please let me know if some checkpoints don't work as expected. I think the inpainting controlnet checkpoint still has some issues

takuma104 commented 1 year ago

@patrickvonplaten The status of my verification of the remaining four models is as follows.

inpaint (v1.1)

It is necessary to set the masked pixels to -1 in the condition tensor, which seems to be unique to this model. https://github.com/lllyasviel/ControlNet-v1-1-nightly/blob/main/gradio_inpaint.py#L35

There is no need to use the InpaintPipeline; it can be done with the usual StableDiffusionControlNetPipeline. A code example follows. I feel that the generated quality is slightly lower than the original, so I will write code that allows a full comparison with the original and investigate.

import numpy as np
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
import torch
from diffusers.utils import load_image

def make_inpaint_condition(image, image_mask):
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L"))
    assert image.shape[0:2] == image_mask.shape[0:2], "image and image_mask must have the same image size"
    image[image_mask < 128] = -1.0  # mark pixels to be inpainted with -1
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)  # HWC -> NCHW
    image = torch.from_numpy(image)
    return image

controlnet = ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_inpaint', 
                                             torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained('runwayml/stable-diffusion-v1-5', 
                                                         controlnet=controlnet, 
                                                         torch_dtype=torch.float16, 
                                                         safety_checker=None).to('cuda')

original_image = load_image('https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_v11/control_images/pexels-sound-on-3760767_512x512.png')
mask_image = load_image('https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_v11/control_images/mask_512x512.png')

pipe(prompt="best quality", 
     negative_prompt="lowres, bad anatomy, bad hands, cropped, worst quality", 
     generator=torch.manual_seed(2),
     num_inference_steps=20,
     guidance_scale=9.0,
     image=make_inpaint_condition(original_image, mask_image)).images[0]

The result comparison:

Original Image / Mask Image / Generated Image

shuffle

As I wrote in the first post, if we want to achieve compatibility with the original, we might need to modify ControlNetModel. I plan to write a patch for this over the weekend, and once the PoC is done, I intend to open a PR.

ip2p

Not yet tested. The code in gradio_ip2p.py does not seem to do anything particularly special, and based on the results from this page, it appears to be fine.

tile

Not yet tested. Since it is currently in Unfinished status, it would be wise to hold off on addressing it until it at least changes to Experimental status. gradio_tile.py performs quite different processing, and a dedicated pipeline specific to it may be necessary.

patrickvonplaten commented 1 year ago

That's a great summary! Would you like to open a PR to add the inpainting example to: https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint ?

Still need to find time to take a deeper look here though!

ghpkishore commented 1 year ago

@takuma104 with the ControlNet inpaint code you shared, the entire image gets disturbed. Given that only the masked area should be inpainted, how does this approach ensure that? Mikubill's repo gives guidelines on using the inpaint model with inpainting functionality: https://github.com/Mikubill/sd-webui-controlnet/issues/968. Therefore, do you think it might be necessary to use the inpaint pipeline?

takuma104 commented 1 year ago

@patrickvonplaten I just opened a PR for control_v11p_sd15_inpaint. Please make appropriate modifications to the wording as needed.

@ghpkishore Thanks for letting me know! I think it might be possible using the Inpaint Pipeline, so I'll give it a try.

patrickvonplaten commented 1 year ago

Very cool! I think "tile" is the only checkpoint that is not tested yet, but it's also unfinished, so I guess we can wait until it's ready? https://github.com/lllyasviel/ControlNet-v1-1-nightly#controlnet-11-tile-unfinished

takuma104 commented 1 year ago

@patrickvonplaten You might already know, but tile has happily been promoted to experimental status. It seems that adjustments in the code are necessary, so I'll think about the best approach. https://github.com/lllyasviel/ControlNet-v1-1-nightly#controlnet-11-tile

takuma104 commented 1 year ago

(new) tile

From the processing flow of the reference gradio_tile.py, it can be interpreted as Img2Img with ControlNet. It seems fine to enlarge the input image to the desired output size using LANCZOS or a similar method (general image resizing, not super-resolution), and use the result both as the condition_image for ControlNet and as the input for Img2Img. Code using the stable_diffusion_controlnet_img2img community pipeline follows. I will conduct a detailed verification, but subjectively it seems to be working fine.

import torch
from PIL import Image
from diffusers import ControlNetModel, DiffusionPipeline, DDIMScheduler
from diffusers.utils import load_image

def resize_for_condition_image(input_image: Image, resolution: int):
    input_image = input_image.convert("RGB")
    W, H = input_image.size
    k = float(resolution) / min(H, W)
    H *= k
    W *= k
    H = int(round(H / 64.0)) * 64
    W = int(round(W / 64.0)) * 64
    img = input_image.resize((W, H), resample=Image.LANCZOS)
    return img

controlnet = ControlNetModel.from_pretrained('takuma104/control_v11', 
                                             subfolder='control_v11f1e_sd15_tile',
                                             torch_dtype=torch.float16)
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="stable_diffusion_controlnet_img2img",
    controlnet=controlnet,
    torch_dtype=torch.float16).to('cuda')
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.enable_xformers_memory_efficient_attention()

source_image = load_image('https://github.com/lllyasviel/ControlNet-v1-1-nightly/raw/main/test_imgs/dog64.png')

condition_image = resize_for_condition_image(source_image, 1024)
pipe(prompt="best quality", 
     negative_prompt="blur, lowres, bad anatomy, bad hands, cropped, worst quality", 
     image=condition_image, 
     controlnet_conditioning_image=condition_image, 
     width=condition_image.size[0],
     height=condition_image.size[1],
     strength=1.0,
     generator=torch.manual_seed(0),
     num_inference_steps=32,
     ).images[0]
Input Image (64x64) / Output Image (1024x1024)

xhinker commented 1 year ago

Thanks for sharing the code! A small bug in your image resize code: H, W = input_image.size should be W, H = input_image.size.

:)
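For reference, PIL reports size as (width, height), which a quick check confirms:

from PIL import Image

img = Image.new("RGB", (640, 480))  # created with (width, height)
print(img.size)                     # (640, 480), so unpack as W, H = img.size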

takuma104 commented 1 year ago

@xhinker Thanks! That's right. I just fixed the code above.

patrickvonplaten commented 1 year ago

Amazing work @takuma104 ! Would you like to add your example here: https://huggingface.co/lllyasviel/control_v11u_sd15_tile ?

It seems to work very nicely :-)

adhikjoshi commented 1 year ago

Can we have it without any dependency on community pipelines?

patrickvonplaten commented 1 year ago

Yes, agreed, we should move this to src/diffusers/pipelines. I will allocate time for this today (hopefully :crossed_fingers:)

patrickvonplaten commented 1 year ago

First PR here: https://github.com/huggingface/diffusers/pull/3386 should be done by tomorrow.

hosseinsarshar commented 1 year ago

I second @ghpkishore's point. The HF example changes the unmasked parts of the image, and it also adds a green filter to the generated image. I noticed that the gradio_inpaint.py script has changed; the new logic seems to work much better than the previous example and keeps the unmasked parts of the image unchanged. @takuma104 could you kindly give it another try and hopefully fix the HF example?

patrickvonplaten commented 1 year ago

Hey @classicboyir,

Actually we could get this working by making use of the callback function: https://github.com/huggingface/diffusers/blob/886575ee43c3e7060d74e2feb2018111e0998013/src/diffusers/pipelines/controlnet/pipeline_controlnet.py#L750

Just make sure the passed callback function has access to the mask; then we can make sure not to change the corresponding part of the image.

ghpkishore commented 1 year ago

@patrickvonplaten can you elaborate on what you meant by the callback function having access to the mask? Can you provide an example of how to use it?

ghpkishore commented 1 year ago

@patrickvonplaten Also, shouldn't this be the default assumption for how masking should work? If a mask is given, the region outside it shouldn't be changed.

hosseinsarshar commented 1 year ago

@patrickvonplaten to @ghpkishore 's point, shouldn't this be the default behavior? You'd expect inpainting to keep the unmasked parts untouched.

adhikjoshi commented 1 year ago

How can we use tile with this?

https://github.com/huggingface/diffusers/pull/3386

Any examples?

patrickvonplaten commented 1 year ago

https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile#example

hosseinsarshar commented 1 year ago

@patrickvonplaten to @ghpkishore 's point, shouldn't this be the default behavior? You'd expect inpainting to keep the unmasked parts untouched.

@patrickvonplaten any thoughts on this? Also, do you have an example of how to achieve this with callbacks? I assume that at every step you need to restore the unmasked part of the latents; is this a correct high-level description of the workflow?
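For reference, here is a rough, untested sketch of that workflow using the existing step callback. It repeats the make_inpaint_condition helper and the images from the inpaint example above, assumes a 512x512 input (so the latent mask is 64x64), and the timestep handling is only approximate:

import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

def make_inpaint_condition(image, image_mask):
    # same helper as in the inpaint example above
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L"))
    image[image_mask < 128] = -1.0  # pixels to be inpainted are set to -1
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    return torch.from_numpy(image)

controlnet = ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_inpaint',
                                             torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained('runwayml/stable-diffusion-v1-5',
                                                         controlnet=controlnet,
                                                         torch_dtype=torch.float16,
                                                         safety_checker=None).to('cuda')

original_image = load_image('https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_v11/control_images/pexels-sound-on-3760767_512x512.png')
mask_image = load_image('https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_v11/control_images/mask_512x512.png')

# Encode the original 512x512 image once; the VAE expects pixels in [-1, 1].
with torch.no_grad():
    pixels = np.array(original_image.convert("RGB")).astype(np.float32) / 127.5 - 1.0
    pixels = torch.from_numpy(pixels).permute(2, 0, 1)[None].to('cuda', torch.float16)
    init_latents = pipe.vae.encode(pixels).latent_dist.sample() * pipe.vae.config.scaling_factor

# Mask at latent resolution (512 -> 64), same convention as make_inpaint_condition:
# 1.0 where we inpaint (mask value < 128), 0.0 where the original image is kept.
mask = (np.array(mask_image.convert("L").resize((64, 64))) < 128).astype(np.float32)
latent_mask = torch.from_numpy(mask)[None, None].to('cuda', torch.float16)

def keep_unmasked_region(step, timestep, latents):
    # Re-noise the original latents to (roughly) the current timestep and paste
    # them back outside the inpainted region. The in-place copy is what makes
    # the change visible to the pipeline on the next denoising step.
    noise = torch.randn_like(init_latents)
    noised = pipe.scheduler.add_noise(init_latents, noise, timestep.reshape(1))
    latents.copy_(latents * latent_mask + noised * (1.0 - latent_mask))

image = pipe(prompt="best quality",
             negative_prompt="lowres, bad anatomy, bad hands, cropped, worst quality",
             image=make_inpaint_condition(original_image, mask_image),
             num_inference_steps=20,
             generator=torch.manual_seed(2),
             callback=keep_unmasked_region,
             callback_steps=1).images[0]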

adhikjoshi commented 1 year ago

https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile#example

Without using custom_pipeline?

I have checked your ControlNet img2img pipeline.

Can we use it instead? That's the example I'm looking for.

patrickvonplaten commented 1 year ago

Yeah, actually I'll try to adapt the inpaint pipeline so that inpainting can be used natively with all checkpoint models. Will keep you updated here. Also related to: https://github.com/huggingface/diffusers/issues/3497#issuecomment-1557767030

patrickvonplaten commented 1 year ago

See: https://github.com/huggingface/diffusers/pull/3533

adhikjoshi commented 1 year ago

See: https://github.com/huggingface/diffusers/pull/3533

Works well :)

Will ControlNet tile work the same way?

patrickvonplaten commented 1 year ago

Yes, I think it should. Feel free to give it a try and let me know.

lmxhappy commented 1 year ago

What does make_inpaint_condition do?