huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[Pipelines] Add a ControlNet pipeline #2331

Closed apolinario closed 1 year ago

apolinario commented 1 year ago

Model/Pipeline/Scheduler description

ControlNet by @lllyasviel is a neural network structure to control diffusion models by adding extra conditions.

It has an integration with Stable Diffusion and 8 pre-trained models that condition the model on different attributes (such as edge detection, scribbles, depth maps, semantic segmentation and more).

It would be great if this were added to diffusers as a pipeline (probably with a method to load the different models).

[images: scribble-map condition model, human-pose condition model]

Open source status

Provide useful links for the implementation

Original code: https://github.com/lllyasviel/ControlNet
Pre-trained models: https://huggingface.co/lllyasviel/ControlNet/tree/main/models
Hugging Face Spaces demo: https://huggingface.co/spaces/RamAnanth1/ControlNet

takuma104 commented 1 year ago

Hi @apolinario,

ControlNet seems to offer great output quality and seems to be very versatile. I am very interested in getting involved.

I'm thinking of starting with a PoC over the next week or so, as outlined below. This is just my idea, so please let me know if you have any suggestions on how this should be done.

  1. Load and convert ckpt: needed for E2E comparison with reference implementation
    • add function to convert_from_ckpt.py
  2. ControlNet class equivalent: inherit from ~UNet2DConditionModel class~ Mixins
    • Porting input_hint_block and zero_conv
  3. ControlledUnetModel class equivalent: inherit from UNet2DConditionModel class
    • Override forward()
  4. Minimal pipeline (see the sketch after this list).
    • Do not implement the control embedding (c_concat) calculation in the pipeline, but make it an argument of pipe.__call__() to avoid abusing the pipeline.
  5. Testing
    • Ensure that all current unit tests pass
    • E2E comparison with the reference implementation, using the same sampler (DDIM) on both sides and checking for a pixel match.
    • Additional unit tests for new ControlNet and ControlledUnetModel equivalent classes and new pipeline.
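
As a rough illustration of step 4, the call from the user's side might look something like this (the pipeline class name, checkpoint path and the controlnet_hint keyword are placeholders at this stage, not a settled API):

import torch
from diffusers import StableDiffusionControlNetPipeline  # hypothetical name for now

pipe = StableDiffusionControlNetPipeline.from_pretrained("path/to/converted-controlnet-checkpoint").to("cuda")

# The control embedding (c_concat) is computed by the caller, outside the pipeline,
# and handed to __call__() as a plain tensor (e.g. a preprocessed canny-edge hint).
controlnet_hint = torch.randn(1, 3, 512, 512, device="cuda")

image = pipe(prompt="a photo of a cat", controlnet_hint=controlnet_hint).images[0]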
takuma104 commented 1 year ago

This is a WIP code for 1. https://github.com/huggingface/diffusers/compare/main...takuma104:diffusers:controlnet

It can now convert to Diffusers format without error messages. However, the ControlNetModel implementation is not yet complete, so the data is not correct. After writing all this, I wonder if I should write it as a Community Pipeline? I am concerned about the impact as this will be a fairly large modification.

apolinario commented 1 year ago

Very cool @takuma104, thanks a lot! cc @patrickvonplaten, @patil-suraj, @williamberman on best practices and trade-off of impact/modifications.

I think this technique shows how adding extra conditions to diffusion models (in general) has huge potential and how impactful the pre-trained models are.

So I think an official pipeline makes sense here, but the trade-off of the modifications it requires may be worth debating a bit.

takuma104 commented 1 year ago

@apolinario, Thanks! Sounds nice, I too think this model is worth making official.

@patrickvonplaten, @patil-suraj, @williamberman Here is an overview of my current code diff. My basic policy is to try to avoid interfering with the original code as much as possible, but I'm not sure if this is a good idea in terms of future maintainability, so I'd like your feedback.

takuma104 commented 1 year ago

About file placement: I think the file layout depends on how we view ControlNet: is it an extension of StableDiffusion like LoRA or Hypernetwork, or is it a different model since it uses two Unet equivalents and is quite special?

If the latter, we could put all the necessary dependency files in the src/diffusers/pipelines/control_net folder, including the model file. Would this be preferable?

xvjiarui commented 1 year ago

Wow, nicely done! Looking forward to the completed version!

Since one important application of ControlNet is to initialize from Stable Diffusion and fine-tune on a user's customized dataset, would it make sense to have some instructions and methods for this?

takuma104 commented 1 year ago

Hi @xvjiarui , thanks! I agree with you. I want to keep the PoC minimal, though, so I'm planning to support inference only for now. Once the PoC is complete, could you try writing some code for fine-tuning? :)

xvjiarui commented 1 year ago

Sure. Looking forward to it.

williamberman commented 1 year ago

Super cool model!

I've only skimmed the paper, so these are just preliminary thoughts. I think we can add this model cleanly, but there are some nuanced considerations. We definitely don't want to add a separate unet implementation if possible. I think we mainly have to consider inference here; for training and weight loading there is a whole independent unet that has to be trained/loaded (i.e. there are no LoRA-esque considerations for efficient weight training/loading, maybe modulo loading the original unet).

[image]

My basic understanding: controlnet uses two separate unets.

unet 1: Same architecture and weights as the original stable diffusion. The only difference here is that the outputs from unet 2 are added onto its residual connections. This means that our existing conditional unet has to take these "additional residuals" as input[^1].

unet 2: a stable diffusion unet with a zero-conv-modified architecture. This is the trained unet which takes in the additional conditioning. The only (?) architecture modification from the original unet is the zero conv blocks. It looks like the outputs of the decoder blocks are fed through the zero convs before being passed to the residual connections in unet 1.

We will have to return the outputs of the unet decoder for passing to the residuals of unet1[^2].

If the zero conv blocks are only used "between" unets i.e. unet2 doesn't use the zero conv output before passing to the next layer in the decoder, we should be able to avoid any changes to existing unet blocks. The zero conv layers can all be stored as a separate model independent of unet 2[^3].

If the zero conv blocks are used w/in unet2, we will either have to duplicate the unet blocks to separate unet blocks that include the zero conv layer, or we will have to add a flag to the relevant unet blocks for them to include an optional zero conv layer[^4].

Separately, there is an additional model that maps the additional conditioning into the same dimensionality as the latent images. It sounds straightforward to add.

[^1]: This is a small and ok change.
[^2]: This is also a small and ok change.
[^3]: This is just a helper model local to the pipeline, also ok.
[^4]: The flag is a small change but it is network specific. The separate blocks are isolated but will be more copy-paste. I'm leaning towards the flag but one of the two should be ok.
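
To make the data flow above concrete, here is a minimal conceptual sketch of a single controlled denoising step (not actual diffusers code; the callables and keyword names are illustrative):

def controlled_denoise_step(unet, controlnet, latents, t, text_emb, hint):
    # unet 2 (the trainable copy): takes the extra conditioning and returns one
    # residual per skip connection plus one for the mid block, each already passed
    # through its zero conv.
    down_residuals, mid_residual = controlnet(latents, t, text_emb, hint)

    # unet 1 (the frozen SD unet): runs as usual, except that the controlnet
    # residuals are added onto its skip connections / mid block ("additional residuals").
    noise_pred = unet(
        latents, t, text_emb,
        down_block_additional_residuals=down_residuals,
        mid_block_additional_residual=mid_residual,
    )
    return noise_pred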

takuma104 commented 1 year ago

Hi @williamberman ,

Thanks for pointing me in the right direction! I was aware that I was producing a lot of copy-paste and was concerned about maintainability. As you said, it should not be difficult to implement unet1 and unet2 by changing the behavior of UNet2DConditionModel via its __init__() and forward() arguments. I'll make a plan for the implementation.

I think I have implemented a PoC version of the forward path equivalent to unet1 and unet2, and the difference from the original code is as follows. FYI.

williamberman commented 1 year ago

Thanks! Feel free to keep me updated :)

takuma104 commented 1 year ago

I think I could minimize the diff by changing the behavior of UNet2DConditionModel with arguments. https://github.com/huggingface/diffusers/compare/main...takuma104:diffusers:controlnet

Here is a summary.

williamberman commented 1 year ago

@patrickvonplaten pointed out to me that unet 2 does vary more from our existing unet in that it doesn't have a traditional decoder; it looks like it only passes residuals directly to zero convs (at least from the diagram). It makes sense, then, to have a separate controlnet model.

@takuma104 love the progress you've made, I'm going to jump in here because we want to push this in since control nets are so cool. Can you send me your latest commit so I can pull in your current state? I'll absolutely make sure you get credit on the pr etc and would love your help/guidance on it :)

takuma104 commented 1 year ago

@williamberman @patrickvonplaten Thanks a lot! Thanks to your advice, I think the code is now very easy to understand and maintain. The only thing left to do is to implement the pipeline (and its minimum test), so I think a PoC will be ready in the next few days. I'll open a PR when it's ready.

takuma104 commented 1 year ago

PoC is now complete! For some unknown reason it didn't pass the pixel-match test completely, but it looks fine in my subjective evaluation. The top row of the image is the Diffusers version, compared to @lllyasviel's reference implementation.

The leftmost image is the control image, a Canny edge map of Girl with a Pearl Earring. It may not have been the best choice, since this original image seems to be over-learned, but I think every generated subject is wearing earrings.

[image: fig_diffusers_controlnet]

I'll open a PR soon.

Here is the code I used to generate this figure: https://gist.github.com/takuma104/c21f41b09ace36c3ae312383838a6969

SamPruden commented 1 year ago

Is it possible to combine multiple ControlNet conditions at once, and if so, should the Diffusers implementation be built to support that?

I've skimmed the ControlNet paper and can't see a mention of combining control models. However, it looks like it's simply applying additive changes to the latents at various blocks in the UNet. If these are additive nudges towards the constraint, can they be composed into joint constraints?

It seems to me that composability would be incredibly powerful here if it could be made to work. An API that allows the user to provide an arbitrary list of controllers would be a dream, but I don't know whether the signals combine in that way.

takuma104 commented 1 year ago

Hi @SamPruden ,

I don't know how effective the idea of using multiple ControlNets to synthesize a control embedding and apply it to a single UNet is, but I was personally interested in it and thought I would experiment with it (after finishing my PR work).

In principle it is very simple: add multiple ControlNets to the pipe and sum their outputs.

The relevant part of the UNet inference loop currently looks like this:

control = self.controlnet(latent_model_input, t, encoder_hidden_states=prompt_embeds, controlnet_hint=controlnet_hint)
noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds, cross_attention_kwargs=cross_attention_kwargs, control=control).sample

For example, it should be possible to achieve this by simply adding the ControlNet outputs together, like this. (Again, I am not sure of the validity of this.)

control1 = self.controlnet1(latent_model_input, t, encoder_hidden_states=prompt_embeds, controlnet_hint=controlnet_hint1)
control2 = self.controlnet2(latent_model_input, t, encoder_hidden_states=prompt_embeds, controlnet_hint=controlnet_hint2)
control = [c1 + c2 for c1, c2 in zip(control1, control2)]
noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds, cross_attention_kwargs=cross_attention_kwargs, control=control).sample
SamPruden commented 1 year ago

Very cool @takuma104, I'll be interested to see how well this works! If it's highly effective it would be cool to allow the controlnet param to be a list, or to have some CombinedControlNet concept, but obviously that depends on results.

My intuition says that taking the mean of control signals probably makes more sense than taking the sum, but I'm not sure on that.

takuma104 commented 1 year ago

@SamPruden I see, I did not have the idea of averaging. I have a feeling it will lead to better results.

If you don't mind, could you clone the code in my repository (still under development) and experiment with it? I have created a Colab notebook for basic pipeline usage, so please refer to that. I also have 2 ready-to-use models (canny edge, openpose). The quickest way to modify the pipeline is probably to rewrite the StableDiffusionControlNetPipeline class directly.

eeyrw commented 1 year ago

@takuma104 Tried Colab demo, great work!

takuma104 commented 1 year ago

Thanks @eeyrw ! Enjoy :)

SamPruden commented 1 year ago

If you don't mind, could you clone the code in my repository, which is still under development, and give it a try to experiment?

I've taken a preliminary pass at this. Code is a mess at the moment. Results are mixed.

The good news

First canny mask:

[image]

Second canny mask:

[image]

Image generated jointly using two separate canny ControlNets:

[image]

So combining multiple ControlNets of the same type seems to Just Work. (I haven't tested overlaps yet.)

Admittedly, it's not very useful in this case, as I could have simply merged the masks and used a single model.

The bad news

I've only gotten junk out of combining canny + pose so far. However, I've been getting pretty bad outputs from the pose model anyway. Running the pose model on its own with the prompt "A photograph of a man running in a field with a dog beside him, high quality" seems to completely ignore the dog. Perhaps there's an issue with the pose model not playing nicely with additional elements. If so, this may not be a problem with the concept of combining control models.

@takuma104 if you get some other control models ported it would be interesting to test whether we can combine canny + scribble, for instance.

SamPruden commented 1 year ago

Pose + canny multi control working!

[image]

DEMO: https://colab.research.google.com/gist/SamPruden/7f1ce489a9f04be4ad237e40ca4936ee/controlnetdiffusersdevelopment-multi-model-demo.ipynb

However, this required a prompt: "a man jumping in a field with a dog"

If we run it with only the default prompt, it seems to consistently pick up the canny edges and ignore the pose:

[image]

Perhaps this is fixable with some type of control weighting. More research needed.

@takuma104 You were right, I was wrong. Summing the signals works better than taking the mean, at least for the examples I've tried so far.

Hypothesis: When the signals are touching different parts of the image, they touch fairly different parts of the latent space. As such, they don't interfere with each other. Taking the mean effectively cuts the strength of each in half, and therefore is less effective.

This may cause issues when they do overlap more in the latent space, but perhaps a slightly clever way of merging the signals can deal with this, e.g. some type of max-clamped addition. I'll do some experiments at some point. I haven't empirically seen this causing a problem yet.

Overall, I think this is very promising! Details need to be worked out, but the basic idea of merging control models holds promise. I'll probably take another run at it tomorrow and see if I can move beyond an ugly prototype into something that might actually be useful.

takuma104 commented 1 year ago

@SamPruden Great result! Thanks! I'll have to figure out how to add multiple ControlNets to the pipeline for this. However, I will not include it in this PR; I'll open another PR once this one is merged.

The idea at the moment is as follows:

The code should look something like this:

from diffusers import StableDiffusionControlNetPipeline
from diffusers import UNet2DConditionModel

pipe = StableDiffusionControlNetPipeline.from_pretrained("takuma104/control_sd15_canny").to("cuda")
control_openpose = UNet2DConditionModel.from_pretrained("takuma104/control_sd15_openpose", subfolder="controlnet").cuda()
pipe.append_controlnet(control_openpose)
image = pipe(prompt="best quality, extremely detailed", 
             controlnet_hint=[canny_control_image, openpose_control_image]).images[0]
image.save("generated.png")
takuma104 commented 1 year ago

I will answer here what I received in the PR thread (because I want the PR thread to be about the implementation itself).

@Mystfit Thanks! I will convert all the models in lllyasviel/ControlNet and publish them all for testing after this.

@cian0 This is still possible at the moment. I haven't tested it yet, but I think it should work since the code looks like this:

from diffusers import StableDiffusionControlNetPipeline
from diffusers import UNet2DConditionModel

pipe = StableDiffusionControlNetPipeline.from_pretrained("takuma104/control_sd15_canny").to("cuda")
pipe.unet = UNet2DConditionModel.from_pretrained("gsdf/Counterfeit-V2.5", subfolder="unet").cuda()
image = pipe(prompt="best quality, extremely detailed", 
             controlnet_hint=canny_control_image).images[0]
image.save("generated.png")

The gsdf/Counterfeit-V2.5 here is just an example; I think other SD 1.x-series models would also work.

@Mystfit Thanks for the report! It is true that an RGB->BGR conversion is required when controlnet_hint is given as a PIL.Image. I will fix it.

SamPruden commented 1 year ago

I'll have to figure out how to add ControlNet to the pipeline for this. However, I will not include it in this PR, but will open another PR once this one is merged.

Cool!

Some disorganised thoughts after the prototype:

I haven't fully experimented with this yet, but it would be good to have a weighting parameter for each controller. This would cover the sum vs mean question - set all weights to 1/n and you have a mean. It's probably useful to control the relative strength of different controllers for artistic purposes, too.
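
A tiny sketch of what that per-controller weighting could look like when merging the per-block control tensors (the function and argument names are hypothetical):

def merge_controls(controls, weights):
    # controls: one list of per-block tensors per controller; weights: one float per controller.
    # weights = [1.0] * n gives the current sum; weights = [1 / n] * n gives the mean.
    return [sum(w * c for w, c in zip(weights, per_block)) for per_block in zip(*controls)]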


The pipeline doesn't actually need to load the main diffusion model from ControlNet. Plain runwayml/stable-diffusion-v1-5 works fine. It produces different results for the same seed for some reason, though - presumably it's a slightly different version.

Should takuma104/control_sd15_canny contain the main model at all, or should it only be the controller?

This raises a question - why is a separate ControlNet pipeline needed? If it works with the same base model, does it make more sense for the main pipeline to simply have optional controlnet support? This pattern would allow for future additions like append_classifier_guidance(), too. Separate pipelines for each control method go against a broader goal of composability.

It could theoretically look like this:

from diffusers import StableDiffusionPipeline
from diffusers import UNet2DConditionModel

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
control_canny = UNet2DConditionModel.from_pretrained("takuma104/control_sd15_canny").cuda()
control_openpose = UNet2DConditionModel.from_pretrained("takuma104/control_sd15_openpose").cuda()
pipe.append_controlnet(control_canny)
pipe.append_controlnet(control_openpose)
image = pipe(prompt="best quality, extremely detailed", 
             controlnet_hint=[canny_control_image, openpose_control_image]).images[0]
image.save("generated.png")

I still think that simply summing control signals may cause problems sometimes. The pose + canny controllers aren't ideal for testing this. A more typical scenario might be that somebody tries to combine canny + normal map control in order to better preserve detail. In this scenario, both controllers pushing in the same direction may "constructively interfere" and overshoot their target leading to bad results. We should do experiments with other accumulators. One that might be worth trying is clamp(sum(controllers), K * min(controllers), K * max(controllers)), i.e. limiting constructive interference by bounding it within the extremes of what any individual controller wants, up to some configurable factor.
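
A rough sketch of that clamped accumulator, purely as an illustration of the idea (the function name and K value are made up; it operates on one list of per-block tensors per controller):

import torch

def clamped_sum(control_signals, k=1.5):
    # control_signals: one list of per-block tensors per controller.
    merged = []
    for per_block in zip(*control_signals):
        stacked = torch.stack(per_block)            # (n_controllers, ...)
        total = stacked.sum(dim=0)                  # plain summation
        lo = k * stacked.min(dim=0).values          # bound constructive interference within
        hi = k * stacked.max(dim=0).values          # the per-controller extremes, scaled by k
        merged.append(torch.maximum(torch.minimum(total, hi), lo))
    return merged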


In principle, there's no reason that ControlNet hints must be images. I believe the ControlNet architecture should be able to extend to handling things like additional text prompts, and we should make sure the implementation here doesn't block that. I.e. we should be lenient about passing arbitrary hints through to the ControlNet in whatever format it wants them. Unfortunately, that clashes with your hint conversion system. Perhaps we actually want the ControlNet type to be something closer to (hint_preprocessor, UNet2DConditionModel) to cover this.


The existing pretrained ControlNets are probably not ideal for composition. For example, the Canny controller seems to assume that its hint contains all canny edges in the image, meaning that blank areas of the hint get pushed towards being empty in the result. It might be more useful to have a version of the canny controller trained with partially masked hints, so that we can use canny edges to enforce detail in a certain region without damaging the rest of the image.

I also foresee it being useful to have controllers trained with mask inputs, i.e. the hint passed in might be of the form (mask, hint) and the controllers can be trained to regionally restrict their influence. This technique might even be able to support regionally restricted additional text prompts, although I have no idea what the training dataset for that would look like.

takuma104 commented 1 year ago

As an idea for discussion-1 on the PR page, I have created a library called controlnet_hinter. This library focuses on converting images to controlnet_hint. This is the part I originally planned to include in diffusers, but due to its heavy dependencies on other libraries I think it would be difficult to ship inside diffusers, so I will (personally) release it as a separate library. I have already published it to PyPI, so I think users can use it right away as is.

As for the contents, I just cut and pasted the annotator & gradio part of lllyasviel/ControlNet.

The usage is as follows. The control_image here can be any image. hint_canny() handles the canny-edge case, and other methods are also supported, e.g. hint_openpose() for OpenPose.

from diffusers import StableDiffusionControlNetPipeline
import controlnet_hinter

pipe = StableDiffusionControlNetPipeline.from_pretrained("takuma104/control_sd15_canny").to("cuda")
controlnet_hint = controlnet_hinter.hint_canny(control_image)
image = pipe(prompt="best quality, extremely detailed", controlnet_hint=controlnet_hint).images[0]
image.save("generated.png")
takuma104 commented 1 year ago

@SamPruden Thanks for the suggestion! I think it is important to think about the future now.

This raises a question - why is the ControlNet pipeline needed? If it works with the same base model, does it make more sense for the main pipeline to simply have optional controlnet support?

Yes, as mentioned in discussion 2 of the PR, it is possible to integrate it into the StableDiffusionPipeline instead of creating a StableDiffusionControlNetPipeline.

I would like to discuss this point now as it relates to the current PR. I think there are several points of discussion on whether it is better to separate the pipelines or integrate them:

  1. Usability: What is more out-of-the-box? Ease of understanding?
  2. Maintainability: Which is better?
  3. Extensibility: Which is better?
  4. Data structure: Currently the pipeline is closely tied to the Diffusers data structure, and the arguments to self.register_modules() in __init__() map directly onto the folder hierarchy. How should the controlnet be saved, then? And what if there are multiple controlnets?

My current opinion is:

  1. I think the code written by @SamPruden is clear & clean enough, so I think either is fine on this point. I think the "out-of-the-box?" question comes down to the data structure point in 4.
  2. I think the code would be easier to maintain, in terms of code diffs, if it were integrated. I would especially like to hear from the Diffusers team on this.
  3. If we want to extend it more easily, I think it would be better to keep it separate.
  4. lllyasviel's original pre-trained models seem to have fine-tuned the SD UNet decoders (up blocks), so strictly speaking a ControlNet and the UNet it was trained with should be used as a set. However, in most use cases, vanilla SD or other fine-tuned SD models can be used for the UNet part.

So, I'm not sure if I can make a decision. Hmmm.

@williamberman @patrickvonplaten If you have any comments, please let us know.

takuma104 commented 1 year ago

@SamPruden

In principle, there's no reason that ControlNet hints must be images.

I agree with you on this point, and this is another area in which I am looking forward to ControlNet's future. I intentionally allow controlnet_hint to be specified as a torch.FloatTensor with almost no preprocessing. The preprocessing is handled separately for PIL.Image, np.ndarray (OpenCV-compatible BGR) and torch.FloatTensor inputs.
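
Roughly, that per-type dispatch looks like this (a simplified sketch, not the actual pipeline code; the helper name, resolution handling and device placement are assumptions):

import numpy as np
import PIL.Image
import torch

def prepare_controlnet_hint(hint, height=512, width=512, device="cuda"):
    if isinstance(hint, torch.Tensor):
        return hint.to(device)                                  # passed through almost untouched
    if isinstance(hint, PIL.Image.Image):
        arr = np.array(hint.convert("RGB").resize((width, height)))
        hint = arr[:, :, ::-1]                                  # RGB -> BGR, matching the reference implementation
    if isinstance(hint, np.ndarray):                            # OpenCV-compatible BGR, HWC, uint8
        tensor = torch.from_numpy(hint.copy()).float() / 255.0
        return tensor.permute(2, 0, 1).unsqueeze(0).to(device)
    raise TypeError(f"Unsupported controlnet_hint type: {type(hint)}")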

SamPruden commented 1 year ago

Thinking slightly out of the box for a minute, what if controllers were passed as arguments to __call__() instead of being added to the pipeline? Conceptually, this would make them additional parts of the prompt, instead of additional parts of the model.

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

canny_controller = CannyController(canny_control_image, weight = 0.8)

face_description_controller = FaceDescriptionController(
    "old and hagged skin with youthful eyes",
    mask = face_location_mask
)

image = pipe(prompt="best quality, extremely detailed", 
             controllers=[canny_controller, face_description_controller]).images[0]

This way, controllers can take any arbitrary arguments and process them however they want. The only standard interface they have to adhere to is returning correctly shaped latents to be accumulated into the main model. IMO this gives absolute maximum flexibility for different ControlNet designs, and also minimizes the burden on the pipeline - its only job is calling the controller and accumulating the latent, it no longer cares about managing hints.

It also allows for another level of flexibility. Say that somebody wanted to use their own latent accumulator, they could create a MultiController like controllers = [MeanController(canny_controller, face_description_controller)].

If we further wanted to simplify the pipeline, we could say that it always takes only a single controller. MultiControllers could be the standard way of combining controllers.

This also avoids the data structure questions @takuma104 raises in 4. No modules to register.

This might be too much of a departure from the standard library patterns, but I thought it was at least worth putting out there.
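
To illustrate the MultiController idea, a hypothetical sketch (every name here - Controller, MeanController, the per-block residual interface - is illustrative, not an existing diffusers API):

from typing import Dict
import torch

class Controller:
    def residuals(self, step: int) -> Dict[str, torch.Tensor]:
        """Return additive residuals keyed by block name for this denoising step."""
        raise NotImplementedError

class MeanController(Controller):
    """Combines child controllers by averaging their residuals block by block."""
    def __init__(self, *controllers: Controller):
        self.controllers = controllers

    def residuals(self, step: int) -> Dict[str, torch.Tensor]:
        merged: Dict[str, torch.Tensor] = {}
        counts: Dict[str, int] = {}
        for controller in self.controllers:
            for block, r in controller.residuals(step).items():
                merged[block] = merged[block] + r if block in merged else r
                counts[block] = counts.get(block, 0) + 1
        return {block: r / counts[block] for block, r in merged.items()}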

SamPruden commented 1 year ago

Revisiting this line of the discussion for a moment:

face_description_controller = FaceDescriptionController(
    "old and hagged skin with youthful eyes",
    mask = face_location_mask
)

I like the idea that what we're building here is more generic than ControlNet. ControlNet is a specific network architecture, but we could generically call this feature something like LatentNudging (pending better name). This generic framework would support any technique that involves adding some value to the latents, regardless of how the controller generated those values.

The ControlNet paper specifies exactly which blocks have their latents nudged. We could be more generic than that. A controller can return Optional[FloatTensor] for every block, and choose which ones it wants to change. Who knows which latents future control methods will care about?

takuma104 commented 1 year ago

@SamPruden It's certainly true that there's no need to add ControlNet to the pipeline at all, and that was an eye-opener for me. I think your suggestion is very promising, and it's also related to the T2I-adapter issue. I have a day off from work starting the day after tomorrow, so I plan to create a PoC and test it. I will leave the current PR branch as it is and clean up the unit tests first. After that, I plan to create a branch and move forward with the PoC.

SamPruden commented 1 year ago

@takuma104 Oh, T2I Adapter looks like exactly the type of thing it would be great to support with a general solution! Interestingly, it looks like it does the same trick of adding to the hidden states, but does so in the downward blocks instead of the upward blocks, so a general purpose solution that's able to add in both places seems well motivated.

Unfortunately, the adapter looks like it does something a little more complicated by adding before the downsample in the attention block, at least according to @HimariO's prototype code:

https://github.com/HimariO/diffusers-t2i-adapter/blob/fc899bd35c39e8ecdf793fc0a393001f4f21a2bd/src/diffusers/models/unet_2d_blocks.py#L851-L861

Working out an API that's able to add in all of the different places it may want to might prove a little tricky. I might have an idea, but I'll get back to you on that.

I've actually got a half built prototype of this already, so with luck maybe I'll get that published somewhere before you get back to working on this in a few days. It should be able to handle both ControlNet and T2I Adapter, as well as even more general concepts. It also pushes a huge amount of modular customisation into user code, which is important IMO.

HimariO commented 1 year ago

Hey @SamPruden, I just realized that I have a similar idea to you and @takuma104, and I recently put up a proof of concept for it. Would you mind taking a look at it and offering any suggestions or feedback? Thanks!

SamPruden commented 1 year ago

Very cool @HimariO! I think we'll definitely all be able to merge our efforts on this and come up with a very general solution.

I'm experimentally pursuing an approach that allows arbitrary black box manipulations of each hidden sample, rather than being restricted to adding a precomputed residual*. Both ControlNet and T2I only require residuals, but future methods may want to do something different.

_Of course, anything can be turned into a residual by new - old, so I may decide against this. However, this isn't computationally efficient in all cases. I'd rather not have scenarios that look like sample = old_sample + (new_sample - old_sample)_

I'm also pursuing modular customisation of how residuals are accumulated. This allows for custom user designs that do things like weighted summation, clamped summation, or anything else people may want.

I'm 100% pipeline and model agnostic, and adding support to any particular pipeline should be as simple as passing one argument through __call__.

I really like all of this modularity personally, but it may be too much complexity. My approach differs from @HimariO's quite a lot because of this different goal. I'll try to get some code posted relatively soon to demonstrate my thinking, then we can do a pros/cons thing with each.

* I'm not 100% sure whether residual is the right word in this context.

takuma104 commented 1 year ago

Hi @HimariO The combination of SideloadProcessor/SideloadMixin/Sideload is cool. I understand that this allows for interference with any module output of UNet. I had some trouble understanding the implementation of SideloadProcessor's processing, so I left a comment on that directly.

Once I finish writing my unit tests, I plan to create a PoC that shows how much ControlNet's code will change assuming that Sideload-related changes are merged into the main branch.

I feel that SideloadProcessor is promising. By default, SideloadProcessor has this addition behavior, but I think that in the future, users could create classes that inherit from SideloadProcessor and use them for methods other than addition.

@SamPruden I'm looking forward to seeing your code. I'd love to discuss it with you. Since 0.13.1 has just been released, I think we have some time until the next major release. Let's take our time and make it something good.

SamPruden commented 1 year ago

I've just put a controllers branch up with my prototype. It doesn't actually include ControlNet or T2I yet, just the infrastructure on top of which they can be implemented. I did a rewrite today and I've reduced my initially complex pass at the problem to something that's actually pretty small and simple. Probably underwhelming considering how long I took to get this public!

I've made a little bit of a mess by forking @takuma104's then branching from main, so I haven't actually incorporated the changes made to UNet2D to support ControlNet. Whoops. "Move fast and break things" is the best excuse for sloppiness.

https://github.com/SamPruden/diffusers/tree/208a1dc78735a181ff148887f2dc10f8f1fe6bdc

Diff: https://github.com/huggingface/diffusers/compare/main...SamPruden:diffusers:208a1dc78735a181ff148887f2dc10f8f1fe6bdc

The main idea is:

SamPruden commented 1 year ago

Notable differences between my approach and @HimariO's:

My approach allows patching of values (equivalent to "sideloading") to depend on the original value. Whilst ControlNet and T2I both compute their residuals at the start, some future approach may want to make use of the original sample value to compute the new one.

My approach lifts the accumulation of multiple controllers into user code. I've got ClampedAccumulatorController as an example of doing something non-trivial there. Another example use case would be implementing custom weighting of the different controllers, where one controller gets its weight turned down during the final steps, i.e. controller scheduling.

Pipelines need minimal modification to support Controllers. Controller is passed through Pipeline.__call__() and all the pipeline needs to do is forward it to Model.__call__(). Any UNet2D pipeline can have controller support added with two lines of changes. Support for other models is also easy.

@HimariO's is a far more complete and professional implementation than mine for now. I'm still Very Prototype.

takuma104 commented 1 year ago

@SamPruden Thanks for sharing the code. I think I roughly understand the concept after looking at the explanation and the code. This direction also seems promising! It may be particularly effective in scenarios such as merging the results of multiple ControlNets from your previous experiments as an application.

Is the remaining major part the processing of step_patcher in unet_2d_condition.py? I think @HimariO's set_sideload_processor() would be a good reference for this. Once this part is completed and the code is working, I think I can apply ControlNet on my side as well.

I thought of an example ControlNet implementation, would it look like this?

class ControlNetController(Controller[TControllerParams]):
    def __init__(self, controller: Controller, controlnet: UNet2DConditionModel, hint: torch.Tensor):
        super().__init__(controller)
        self.controlnet = controlnet
        self.controlnet_hint = hint

    def __call__(self, *args: TControllerParams.args, **kwargs: TControllerParams.kwargs) -> StepPatcher:
        controls = self.controlnet(controlnet_hint=self.controlnet_hint, **kwargs) 
        # Currently controlnet implementation is returned as a list, but will be changed to return 
        # in dictionary format like {"up_blocks.0.resnets.2": , ...}
        return DictResidualStepPatcher(self, dict=controls)

# usage
controlnet = UNet2DConditionModel.from_pretrained('aaa/bbb', subfolder='controlnet')
hint = controlnet_hinter.hint_canny('control_image.png')
controller = ControlNetController(controlnet=controlnet, hint=hint)

pipe = StableDiffusionPipeline.from_pretrained('ccc/ddd')
pipe(prompt='test', controller=controller).images[0]
SamPruden commented 1 year ago

@takuma104 Yep that looks very similar to what I was thinking for ControlNetController. It may make sense to have it load its own model and do ControlNetController.from_pretrained("a/b") or something similar, but I haven't thought about that much yet.

I would put hint preprocessing in the constructor too. We could further derive from ControlNetController to have specialised cases like FacePromptController that take different parameters such as text prompts.

The remaining processing would just be a liberal sprinkling of sample = step_patcher("up_blocks.0.resnets.2", sample) throughout the model. That can go in unet_2d_condition.py around the blocks, but it looks like T2I also wants some hooks inside some of the blocks, so they'd need to be added there too. @HimariO's mixin trick looks like a nice way to do this.
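
A toy example of what those hook points could look like (a stand-in decoder, not diffusers code; the block names and the step_patcher signature follow the prototype discussion and are assumptions):

import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Stand-in for the UNet up path, showing where step_patcher calls would be sprinkled."""
    def __init__(self):
        super().__init__()
        self.up_blocks = nn.ModuleList([nn.Conv2d(4, 4, 3, padding=1) for _ in range(3)])

    def forward(self, sample, step_patcher=None):
        for i, block in enumerate(self.up_blocks):
            sample = block(sample)
            if step_patcher is not None:
                sample = step_patcher(f"up_blocks.{i}", sample)   # controller may patch this block's output
        return sample

# Usage: a patcher that adds a precomputed residual to one named block.
residuals = {"up_blocks.2": torch.zeros(1, 4, 8, 8)}
patch = lambda name, s: s + residuals[name] if name in residuals else s
out = ToyDecoder()(torch.randn(1, 4, 8, 8), step_patcher=patch)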

SamPruden commented 1 year ago

It may be particularly effective in scenarios such as merging the results of multiple ControlNets from your previous experiments as an application.

I have some vague ideas about doing crazy things like learning ControlNet weight schedulers for particular controller combinations, e.g. learning the optimal weighting (as a function of step) to do Canny + Normal. I don't know whether that's a good idea, but I like the fact that the Controller/StepPatcher formulation allows me to build that in user code to do the experiment.

Maybe that type of experimental use case is extremely rare, but IMO it's a valuable thing to support.

SamPruden commented 1 year ago

Is the remaining major part the processing of step_patcher in unet_2d_condition.py? I think @HimariO's set_sideload_processor() would be a good reference for this.

I realised that I didn't address this very directly. I was actually planning on just manually putting all of the hooks into the blocks. Automating it with a mixin is a nice idea! I don't know if it can serve 100% of scenarios (what if somebody wants a hook in the middle of a block, which isn't completely inconceivable in the future), but it would be easy enough to add support for those things via the mixin.

The one thing that does feel a little weird about it to me is that it adds state into the model. It just doesn't feel right to me that the model holds onto a copy of the Controller after __call__. It may have implications for memory use, although that's not hugely likely to be a problem.

Overall, I think I really like @HimariO's SideloadMixin idea! But I would add an extra step at the end of the forward pass to unset_sideload_processor() so it's not keeping state around between calls.

geekyayush commented 1 year ago

Hey @takuma104 I have a question.

Can I use my DreamBooth fine-tuned model with this? If yes, how do I generate the ControlNet version (e.g. canny) from my diffusers DreamBooth model?

I would really appreciate it if you could answer.

Thank you!

takuma104 commented 1 year ago

Hi @geekyayush , I think this method can be used with a DreamBooth model derived from an SD 1.x variant.

hafriedlander commented 1 year ago

I have a modified version of @HimariO's T2I adapter that patches the unet using accelerate module hooks. It wraps the down_blocks to modify their results just during forward. I'm nearly positive the same technique could be used for ControlNet.

I'll post some code later, but having done it I think I'm kind of against any aspect-oriented cross-cutting like this - all these approaches rely on a very specific existing model structure, and any change to the unet itself could cause breakages that are hard to detect.
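
For reference, a rough sketch of that kind of hook-based patching using plain PyTorch forward hooks (the actual version uses accelerate's module hooks; the residual dict is hypothetical, and it assumes the down blocks return a (sample, skip_samples) tuple as diffusers' UNet blocks do):

def attach_down_block_residual_hooks(unet, residuals):
    # residuals: dict mapping a down-block index to a tensor added to that block's output
    handles = []
    for i, block in enumerate(unet.down_blocks):
        if i not in residuals:
            continue

        def hook(module, inputs, output, res=residuals[i]):
            hidden_states, skip_samples = output          # down blocks return (sample, res_samples)
            return hidden_states + res, skip_samples      # returning a value replaces the block's output

        handles.append(block.register_forward_hook(hook))
    return handles  # call handle.remove() on each to un-patch the unet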

Something like the recent CrossAttention Processor change where there's explicit interface would be my vote.

SamPruden commented 1 year ago

I've been dabbling with turning my broken prototype into a fully working PoC. I got it as far as fully running ControlNet with my proposed API; it just needs a little tidy-up to go public. However, it looks like there's significant work done in @takuma104's PR. Is it still worth pursuing alternative approaches to this, or have things settled on that PR for now?

For the record, I still believe that my proposal is the best for a few reasons:

t00350320 commented 7 months ago

hi @takuma104 and @SamPruden , I wanted to produce a picture with multiple ControlNets (two persons), including openpose and canny, but I found that "MultiControlNetModel" in the "multicontrolnet.py" file doesn't support multiple encoder_hidden_states (text prompts), so I modified it like this:

        for i, (hidden_state, image, scale, added_cond_kwarg, controlnet) in enumerate(
            zip(encoder_hidden_states, controlnet_cond, conditioning_scale, added_cond_kwargs, self.nets)
        ):
            down_samples, mid_sample = controlnet(
                sample=sample,
                timestep=timestep,
                encoder_hidden_states=hidden_state,
                controlnet_cond=image,
                conditioning_scale=scale,
                class_labels=class_labels,
                timestep_cond=timestep_cond,
                attention_mask=attention_mask,
                added_cond_kwargs=added_cond_kwarg,
                cross_attention_kwargs=cross_attention_kwargs,
                guess_mode=guess_mode,
                return_dict=return_dict,
            )

So different text prompts are sent to the corresponding ControlNets, and the UNet's hidden_states do not include any person-specific prompt. But in the final result, the two persons' ControlNet prompts did not seem to have any effect.

So what's wrong with this workflow? How can I precisely denoise the latents with multiple different prompts, without using an inpainting method?
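
(For reference, inside that loop the stock MultiControlNetModel merges the per-net outputs by plain summation before they reach the UNet, roughly like this:)

            # merge samples across controlnets
            if i == 0:
                down_block_res_samples, mid_block_res_sample = down_samples, mid_sample
            else:
                down_block_res_samples = [
                    samples_prev + samples_curr
                    for samples_prev, samples_curr in zip(down_block_res_samples, down_samples)
                ]
                mid_block_res_sample += mid_sample

(So each per-net prompt only shapes its own ControlNet's residuals; the UNet's own cross-attention still sees whatever single encoder_hidden_states the pipeline passes it.)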