
[Pipeline] AnimateDiff + SparseControl + ControlNet #9329

Open · aihopper opened this issue 2 months ago

aihopper commented 2 months ago

As the title suggests, we would like a pipeline that supports all three techniques. Currently, we have standalone pipelines for SparseCtrl and ControlNet. A combination of the two might be interesting to see!

Right now, video prediction is hard to control because the new frames depend heavily on the prompt; if we could condition on images, we would have better/finer control. This pipeline would enable apps like Blender to generate new frames based on past reference frames and a depth buffer.

Looking at the code, this seems doable, but before I try, I would like input and suggestions from more experienced people (@a-r-r-o-w or @DN6 :) ) on this possible approach:

  1. make pipeline_animatediff_sparsectrl.py and pipeline_animatediff_controlnet.py as similar as possible, so that diffing them shows as much common code as possible
  2. refactor the blocks of code that differ into functions
  3. have these functions work together in a new single pipeline (sketched below)
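
To make step 3 concrete, here is a rough sketch of what the combined call could look like. This is purely hypothetical: the pipeline class and the `controlnet_conditioning_frames` argument do not exist in diffusers, and only illustrate the idea. The sparse arguments mirror `AnimateDiffSparseControlNetPipeline`; the per-frame ones mirror the community `pipeline_animatediff_controlnet`.

```python
import torch
from PIL import Image

# Hypothetical combined pipeline - this class does not exist in diffusers yet;
# the sketch only shows how both conditionings could be passed side by side.
# pipe = AnimateDiffSparseControlNetControlNetPipeline.from_pretrained(...)

num_frames = 16
keyframes = [Image.new("RGB", (512, 512)) for _ in range(2)]            # sparse reference frames
depth_maps = [Image.new("RGB", (512, 512)) for _ in range(num_frames)]  # dense per-frame control

output = pipe(
    prompt="a panda surfing a wave, high quality",
    num_frames=num_frames,
    # SparseCtrl branch: a few keyframes pinned to specific frame indices
    conditioning_frames=keyframes,
    controlnet_frame_indices=[0, num_frames - 1],
    # Dense ControlNet branch (hypothetical argument): one control image
    # (e.g. a depth map) per output frame
    controlnet_conditioning_frames=depth_maps,
    generator=torch.Generator().manual_seed(42),
)
frames = output.frames[0]
```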

Does this make sense?

a-r-r-o-w commented 2 months ago

Thanks, this might be interesting, but I don't really see how it would play out. I'm not really sure it would work, because ControlNet can already use input videos as a condition, while SparseCtrl is more useful in cases where you want to interpolate between intermediate keyframes. It could still be something cool if you have any use-case ideas, so feel free to open a PR to the community folder with the pipeline.
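
For context, this is roughly how the SparseCtrl pipeline is driven today (adapted from the diffusers docs; double-check checkpoint names and arguments against the current documentation): you condition only a handful of keyframes and tell the pipeline where they land, whereas the ControlNet pipeline instead expects one control image per output frame.

```python
import torch
from PIL import Image
from diffusers import (
    AnimateDiffSparseControlNetPipeline,
    AutoencoderKL,
    MotionAdapter,
    SparseControlNetModel,
)

motion_adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16
)
controlnet = SparseControlNetModel.from_pretrained(
    "guoyww/animatediff-sparsectrl-scribble", torch_dtype=torch.float16
)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")

# Only a few keyframes are conditioned; the indices pin them to output frames.
keyframes = [Image.new("RGB", (512, 512)) for _ in range(3)]  # placeholders for real scribbles
video = pipe(
    prompt="an aerial view of a cyberpunk city, night time",
    conditioning_frames=keyframes,
    controlnet_frame_indices=[0, 8, 15],
    num_inference_steps=25,
).frames[0]
```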

I wouldn't recommend making any changes to the pipelines or modeling blocks themselves to support this - a new pipeline would be okay. We generally start with a community pipeline, and if there is high usage, we can move it into core diffusers. If you have ideas for refactoring existing pipelines, PRs are welcome as long as the scope of changes is minimal and backward compatibility is ensured :)

sam598 commented 2 months ago

I think this makes sense if you wanted to have finer control over the motion. Like if you used a skeletal controlnet to interpolate between images.

But I think SparseCtrl still needs to be fixed first; the RGB implementation is apparently broken.

a-r-r-o-w commented 2 months ago

Would love to see some examples of how this would work, and to learn from them.

RGB SparseCtrl is indeed broken, but unfortunately I haven't had the bandwidth to look into what the issue is. Implementation-wise, diffusers looks mostly similar to the original implementation, and I can't spot differences just by reading the code, which calls for some layerwise numerical debugging. I'll try to open an issue for community contributors to help look into it if I'm unable to find time for this soon, but feel free to report an issue with expected vs. current behaviour.
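
For anyone picking this up, here is a generic sketch of what that layerwise numerical debugging could look like, assuming the original SparseCtrl model and the diffusers port are loaded side by side; `reference_model`, `ported_model`, and `sample` are placeholders, and module names will likely need a mapping between the two implementations.

```python
import torch

def capture_activations(model, store):
    """Register forward hooks that record each submodule's tensor output."""
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            if torch.is_tensor(output):
                store[name] = output.detach().float().cpu()
        handles.append(module.register_forward_hook(hook))
    return handles

ref_acts, port_acts = {}, {}
handles = capture_activations(reference_model, ref_acts)
handles += capture_activations(ported_model, port_acts)

with torch.no_grad():
    reference_model(sample)  # feed the identical input through both models
    ported_model(sample)

# Diff layers that share a name; the first large deviation localizes the bug.
for name in sorted(ref_acts.keys() & port_acts.keys()):
    max_diff = (ref_acts[name] - port_acts[name]).abs().max().item()
    print(f"{name}: max abs diff {max_diff:.3e}")

for h in handles:
    h.remove()
```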

aihopper commented 2 months ago

> I think this makes sense if you wanted to have finer control over the motion. Like if you used a skeletal controlnet to interpolate between images.

yes, this is all about getting finer control :)

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

yiyixuxu commented 1 month ago

I think we can keep this open in case anyone wants to make a community pipeline out of it, no? @a-r-r-o-w

a-r-r-o-w commented 1 month ago

Sounds good!

aihopper commented 1 month ago

Note that ComfyUI allows doing this, so rather than writing a single pipeline, what would be more desirable is a similar level of flexibility to ComfyUI's.

a-r-r-o-w commented 1 month ago

I think the comparison you're making here is unfair in many ways. It's quite common to hear that "Diffusers does not have X but Comfy does", "Comfy is way more hackable than Diffusers and so it's the best thing in the world", etc. When you say that Comfy allows doing X but Diffusers doesn't, and that you'd like to have more flexibility, note that these things come with trade-offs - let me explain.

What most people get wrong when comparing Diffusers and Comfy is that it is not an apples-to-apples comparison. Comfy is an end-user tool/product that anyone can jump in and start using. Diffusers is a library geared towards providing the necessary building blocks for inference and training, with readability and understandability in mind (please read more about our philosophy if you're interested). If you want to make a comparison, compare an end product (ComfyUI) with an end product built atop Diffusers - InvokeAI and SD.Next for UIs, Kohya and SimpleTuner for training, and countless other projects for inference acceleration, etc.

Since I recently joined the team, I'm still learning about the long-term objectives, but here are some goals I'm aware of (I may be wrong):

...and more I could go on and on about. But I hope you get the gist.

What you're asking for here with SparseControl and ControlNet is maybe a 5-10 line change for someone to make, and they are free to, but it is not really what either was created for. SparseControl is a research work solving one set of problems; ControlNet is a research work solving another. We have great support for both individually. Composing them and creating something out of it is the job of an end user of the library. Many startups and companies compose different things and build end products atop the Diffusers library. "Building an end product/workflow/pipeline" is not what we're here for; the job of a library is only to provide the necessary tools to make that possible. If you think we should support this because it has a high amount of usage in the community (which I don't think it has, because I'm quite active with AnimateDiff), I'd be happy to work on it myself unless someone else does it first. If you really believe we should support this just because Comfy does, then you are missing the point of why Diffusers exists, and why it is so unfeasible for us to support all kinds of different permutations of things as pipelines.

TL;DR: the philosophy guide contains everything above, and is probably worded better. ComfyUI and Diffusers have different target user bases and are meant to do different things. If we don't support X, there is a good reason why it is not supported. What a "workflow" is to ComfyUI, a "pipeline" is to Diffusers. There are only so many pipelines that can be officially maintained. This is why we also support community pipelines, which can either be added to our community/ folder or be hosted on the Hub - the DiffusionPipeline class supports loading them as long as they are written in the expected format (see the example below). Our goal is not to support all kinds of tricks/hacks figured out by the community; it is to make research more accessible and easily understandable. Oftentimes, when a "trick/hack" is so good that without it most users building atop Diffusers would lag behind, we incorporate it.
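
For concreteness, loading a community pipeline already works through the `custom_pipeline` argument of `DiffusionPipeline.from_pretrained`; `lpw_stable_diffusion` below is just an existing community pipeline picked as an example.

```python
import torch
from diffusers import DiffusionPipeline

# `custom_pipeline` accepts the name of a file in diffusers' examples/community
# folder, a Hub repo id, or a local path, as long as the code follows the
# expected pipeline format.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="lpw_stable_diffusion",
    torch_dtype=torch.float16,
).to("cuda")
```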

a-r-r-o-w commented 1 month ago

I get where you're coming from, though, when you say you want more flexibility and an easier way to put together a new pipeline in a few lines of code. There are some ongoing discussions/plans within the team on how to make that possible. It will take time, but there's hope to get there eventually. @asomoza is working on something interesting for this, so it is only a matter of time before this flexibility is built into Diffusers for end-user stuff.

aihopper commented 1 month ago

I didn't say ComfyUI is superior or anything like that. I understand that both frameworks have different goals and made different trade-offs, so I never compared them in their totality. All I said is that extracting the adapters into their own modules is both achievable and valuable. To support this I pointed to ComfyUI, but in the same way papers point to other papers: as a reference and a source of inspiration.

asomoza commented 1 month ago

Hi, and thanks for your suggestions.

Support for remixing parts of pipelines will come eventually. We want to do it, but it's not easy, because we have a lot more constraints than ComfyUI, as @a-r-r-o-w explained in detail.

But I want to add that we support normal users and developers as much as researchers, and one goal we have is to achieve what you're suggesting without breaking the research and learning side. Web UIs only care about end users, so they can take a lot of liberties in breaking things; they only need to ensure that the latest version works.

I have used pretty much all the apps, and I can tell you that nothing gives you freedom like being able to break apart a pipeline and do whatever you want with it - no need for a UI, or to search for a workflow or a custom node. But not everyone can do this, just as not everyone can use ComfyUI; some prefer the linear UIs. This is something we don't want to lose by introducing layers and layers of abstractions, automatic optimizations, and all-in-one pipelines.

We want to support all kinds of users and use cases; that's our final goal as a library.

aihopper commented 1 month ago

> This is something we don't want to lose by introducing layers and layers of abstractions, automatic optimizations, and all-in-one pipelines.

I agree, that is the right way to develop frameworks, and that is why I never requested such a thing. I was just sharing my feedback to see if we share common goals, hoping that this would also be valuable for you.

> Support for remixing parts of pipelines will come eventually.

And this is exactly what I was requesting. OK, so the answer is yes - then we are good :)