I just noticed that in PR #549, the code was not duplicated exactly the same way: sometimes multiplying by `batch_size`, sometimes not. Proof again that duplicating code is never good and introduces bugs.
I'm new here, but I also noticed that most of the code is just duplicated, which makes it very hard when you want to make a change to all three :(
Good suggestion. I agree with you.
Hi @leszekhanusz, you raise an important question: the number of Stable Diffusion pipelines is rising, and we don't have a robust copy mechanism like the one in `transformers` yet.
Inviting @patil-suraj @patrickvonplaten @pcuenca to discuss possible solutions here :)
In the meantime, I'll address the mismatching pipeline issues manually. Thank you for a thorough compilation!
We can just add the copy mechanism we have in Transformers.
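For reference, the mechanism in `transformers` works through structured `# Copied from ...` comments that `make fix-copies` enforces in CI; roughly like this (illustrative example taken from `transformers`, not actual diffusers code):

```python
from torch import nn

# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->Marian
class MarianAttention(nn.Module):
    # `make fix-copies` locates the marker above, compares this class body
    # against the referenced source (applying the Bart->Marian rename), and
    # rewrites it automatically whenever the two drift apart.
    ...
```

So the code is still physically duplicated in each file, but CI guarantees the copies never diverge.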
I'm quite strongly against creating a unified pipeline, as it makes it impossible down the road to define classes like `AutoPipelineForText2Image`.
Also note that default values cannot be coherently defined anymore. In text2image we use 50 inference steps, in image2image just 30 => I don't want to clutter the code with `if type == ...` statements.
We will surely not merge the ONNX and non-ONNX pipelines, because one should not need ONNX installed to be able to use a pipeline, and optimizations might very well be pipeline-specific.
For more general arguments regarding this design see also here: https://huggingface.co/blog/transformers-design-philosophy
Overall: happy to add a "one-size-fits-it-all" community pipeline, which we'll also support in `pip install diffusers` in 1-2 weeks by mirroring community pipelines on the Hub. If people use this extensively, we can also add it to src/diffusers.
What do you think, @leszekhanusz?
I've read your blog post and you make some good points but some of it seems to contradict you:
> we decided to adopt a different design principle which we like to call the single model file policy. The single model file policy states that all code necessary for the forward pass of a model is in one and only one file.
Here we have a single model (Stable Diffusion) which has 3 different pipelines (ignoring ONNX for now) in different files, all doing more or less the same thing. We could argue that having all the code in a single pipeline is actually following your single model file policy.
Did you look at what the code looks like in the unified pipeline? You have `if init_image is None:` and `if mask_image is not None:` conditions, and it's really not complicated to follow those branches.
On the contrary, now it's getting easier to see the difference between img-to-img and inpainting than to try to compare code in different files.
For someone like me, who is quite new to diffusion models, what's complicated is understanding the model: what classifier-free guidance is, schedulers, torch broadcasting, and the shapes of tensors. Tuning out a block like `if mask_image is not None:` is a piece of cake in comparison, and it allows one to see more easily the difference between the different usages.
> We can just add the copy mechanism we have in Transformers.
To be honest, from an external eye, this seems like a hack to me.
> I'm quite strongly against creating a unified pipeline, as it makes it impossible down the road to define classes like `AutoPipelineForText2Image`.
I don't see why. If you set `init_image` and `mask_image` to None, then the pipeline is the same as the text-to-img pipeline.
> Also note that default values cannot be coherently defined anymore. In text2image we use 50 inference steps, in image2image just 30 => I don't want to clutter the code with `if type == ...` statements.
First of all, for most use cases the user will provide the number of inference steps themselves, as it is quite an important parameter they would probably like to make configurable.
But if you want to keep backward compatibility and allow custom defaults for each pipeline, I propose to create a `StableDiffusionCommonPipeline` which will contain the unified pipeline. Then the current `StableDiffusionPipeline`, used only for text-to-image, could be replaced by something like:
```python
class StableDiffusionPipeline(StableDiffusionCommonPipeline):
    def __call__(
        self,
        prompt: Union[str, List[str]],
        height: Optional[int] = 512,
        width: Optional[int] = 512,
        num_inference_steps: Optional[int] = 50,
        guidance_scale: Optional[float] = 7.5,
        eta: Optional[float] = 0.0,
        generator: Optional[torch.Generator] = None,
        latents: Optional[torch.FloatTensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = True,
        **kwargs,
    ):
        return super().__call__(
            prompt=prompt,
            height=height,
            ...
            init_image=None,
            mask_image=None,
            ...
        )
```
> We will surely not merge the ONNX and non-ONNX pipelines, because one should not need ONNX installed to be able to use a pipeline, and optimizations might very well be pipeline-specific.
Sure, I agree.
> Happy to add a "one-size-fits-it-all" community pipeline, which we'll also support in `pip install diffusers` in 1-2 weeks by mirroring community pipelines on the Hub. If people use this extensively, we can also add it to src/diffusers
That's great! From the user point of view that is really important. That "one-size-fits-all" pipeline should really be inside src/diffusers so that it can be integrated easily into other projects and automatically updated with an upgrade of diffusers.
> User experience is much more important for us than maintenance. We should not write code that is only concerned with improving our maintenance; we should mostly write code that is easy to use and won't have us run into big backwards-breaking problems in the future.
That's good to hear, but the user experience could be better right now.
Another aspect to this from a user-experience perspective: right now, naively loading all pipelines via `from_pretrained` takes three times the GPU memory. If you want to use all three in the same script or gradio interface, you have to either load the models separately and init the pipelines via the initializer manually, or juggle the pipelines in and out of your GPU memory.
@okalldal it's possible to merge all the existing pipelines into a single one. Available here.
Very good points here! Thanks a lot for writing all of this down. I still don't agree with most of it here though :sweat_smile: Writing my thoughts down a final time before giving in haha.
> We could argue that having all the code in a single pipeline is actually following your single model file policy.

Clear no. The single model file policy means exactly not to build one model that fits it all, but to have multiple modeling files. This was misunderstood. It was also written for `transformers`, so it cannot be applied one-to-one here.

> On the contrary, now it's getting easier to see the difference between img-to-img and inpainting than to try to compare code in different files.

Don't agree either. We now cannot set sensible defaults anymore, and we clutter the `__call__` with input args (like `strength`, `latents`, `mask_image`, ...). How would one know what guidance scale should be used for the pipelines, what number of inference steps to set? It's not true that most people understand `num_inference_steps` well enough that it doesn't require a good default. People will mix functionality for different tasks and we'll get issues about it, I believe. Let's see!
A point that is not taken into account a lot here, IMO: users of `diffusers` include both "end-users" and "packages".

"Packages" are the ones that build on top of `diffusers`, and many good "packages" are more important than direct "end-users". PyTorch is so highly used because every ML package can super easily build on it and it has an intuitive design, ... not because they started adding a stable diffusion model to their core library. As argued above, a unified pipeline also makes it impossible to define an `AutoText2ImagePipeline` class. E.g. having image-to-image, in-painting, and text2image all together is very stable-diffusion-specific; having everything in one pipeline makes us prone to future breaking changes. Most arguments here fall away for the "library" users:
```python
from diffusers import (
    DiffusionPipeline,
    StableDiffusionPipeline,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionInpaintPipeline,
)

stable_diffusion = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
sub_models = {k: v for k, v in vars(stable_diffusion).items() if not k.startswith("_")}

text2img = StableDiffusionPipeline(**sub_models)
img2img = StableDiffusionImg2ImgPipeline(**sub_models)
inpaint = StableDiffusionInpaintPipeline(**sub_models)

def all_in_one(type, *args, **kwargs):
    if type == "text2img":
        return text2img(*args, **kwargs)
    ...
```
**Note**: `{k: v for k, v in vars(stable_diffusion).items() if not k.startswith("_")}` is very hacky, but we could very well provide a nice function for this in the downstream `DiffusionPipeline` class (cc @patil-suraj @anton-l). I think this would be useful anyways.
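A minimal sketch of what such a helper could look like (the `components` property name here is hypothetical, just for illustration):

```python
class DiffusionPipeline:
    ...

    @property
    def components(self):
        # Expose the registered sub-models (unet, vae, text_encoder, ...),
        # skipping private attributes, so a second task pipeline can reuse
        # them without loading another copy of the weights.
        return {k: v for k, v in vars(self).items() if not k.startswith("_")}


# usage once such a helper exists:
# img2img = StableDiffusionImg2ImgPipeline(**stable_diffusion.components)
```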
- Why is this better? This way the `diffusers` repo can focus on making clean task-specific pipelines and doesn't have to put the burden of maintaining higher-level functionality on us. E.g. now imagine stable diffusion gets two more tasks that can be done with the same model => having an official one-fits-it-all pipeline then puts us in a tough spot: "where do we draw the line?", "can we add the new features without making the code unreadable? Do we have to break existing behavior?" => it becomes obvious that the one-size-fits-it-all will quickly break.
- I think a big problem is that we don't advertise well enough how `diffusers` is intended to be used. E.g. it's really not that hard to build something on top of `diffusers` that doesn't require 3x the memory usage IMO
Nevertheless, I do want to have this library be community driven, but I think the above aspects and the long-term vision for the library is not taken into account enough.
=> So if you want, happy to follow your advice, but I'd add it as an "experimental" pipeline. However, I won't maintain the pipeline and will ping you quite a bit on issues, I think, @leszekhanusz haha ;-). Thanks for taking the time to write everything down though - "ease-of-use" for such a young library might indeed be the winning argument at the end of the day.
Think this is a cool discussion, so I will also ask the Twitter community for their opinion :-)
Also cc @keturn @shirayu @johnowhitaker @jonatanklosko @joaoLages @pcuenca @patil-suraj @anton-l @kashif @sgrigory very interested in hearing your final thoughts!
Also here a twitter poll: https://twitter.com/PatrickPlaten/status/1572980566579941381
I think I'm on team 'three classes'. If you want to do a quick demo of one application, it's easy to just load, say, text2img. If you want multiple, your example is very helpful; it'd be nice to document what the `sub_models` are and how to avoid loading duplicates like you do there. I also think a lot of people trying more complex things won't be loading three pipelines; they'll be defining their own sampling process anyway, where they can add all the extra features they want. That being said, I'm not the one having to maintain the code ;)
I understand the '3 classes' arguments but, as @leszekhanusz mentioned, having the code duplicated for all of them is very error-prone.
IMO, the solution proposed by @leszekhanusz of having the backbone class is what I thought of.
To exemplify:
```python
class CoreStableDiffusionPipeline:
    ...
    def _forward(self, prompt, init_image: Optional = None, mask_image: Optional = None, ...):
        ...

class StableDiffusionPipeline(CoreStableDiffusionPipeline):
    ...
    def __call__(self, prompt, ...):
        return self._forward(prompt, ...)

class StableDiffusionImg2ImgPipeline(CoreStableDiffusionPipeline):
    ...
    def __call__(self, prompt, init_image, ...):
        assert init_image is not None  # avoid incorrect class usage
        return self._forward(prompt, init_image, ...)

class StableDiffusionInpaintPipeline(CoreStableDiffusionPipeline):
    ...
    def __call__(self, prompt, init_image, mask_image, ...):
        assert init_image is not None and mask_image is not None  # avoid incorrect class usage
        return self._forward(prompt, init_image, mask_image, ...)
```
This way we would still keep the 3 classes separated and the code would be way easier to maintain. Do you guys see any disadvantages in this solution?
Yes! People now need to switch between multiple files, `CoreStableDiffusionPipeline` and `StableDiffusionInpaintPipeline`. Maintaining different files is really not that big of a problem, and much less of a concern in OSS than in industry settings.
> Yes! People now need to switch between multiple files, `CoreStableDiffusionPipeline` and `StableDiffusionInpaintPipeline`.
If that's a problem, the `CoreStableDiffusionPipeline` subclasses could all be in the same file.
This solution is better at avoiding the case where functionality is added/removed in one of the pipelines but not in the others; that's the way I see it.
As you say @patrickvonplaten, it is not really difficult to get the pipelines to share the same models now either. The non-hack way that already works is to just load each model separately via its subdirectory instead of from a pipeline. Maybe a simple example in the documentation is enough for most end-users in this regard, and merging the pipelines is overkill. It's just that one tends to get spoiled by the "one liner that just works" huggingface experience that we all know and love 😉
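For reference, a sketch of that non-hack way (assuming the standard sub-model folder layout of the checkpoint; the scheduler, safety checker, etc. are omitted for brevity):

```python
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel

model_id = "CompVis/stable-diffusion-v1-4"

# Each sub-model is loaded exactly once from its own subdirectory...
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

# ...and each task pipeline is then constructed around the shared instances.
```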
> Maintaining different files is really not that big of a problem, and much less of a concern in OSS than in industry settings.
That hasn't been my experience.
There is a lot in this thread I could potentially ramble about; let me see if I can find a few things to focus on.
As for txt2img, img2img, and inpaint: I wouldn't want the interface for those three different tasks to all be squished into one function. But classes are collections of data and methods, and these three tasks use the same sub-models and largely the same sampling logic. That suggests to me that however we define the interface to those three tasks, they should share almost all the same data structures and implementations, leaving us better able to focus on what the distinguishing factors between the tasks really are.
Side note: this practice of having a class with basically one method (`__call__` in this case) is not a super common pattern in my experience. I don't hate it, not at all, but you know you are allowed to define more than one public method on a class, right? 😉
Your example here:
```python
stable_diffusion = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
sub_models = {k: v for k, v in vars(stable_diffusion).items() if not k.startswith("_")}
text2img = StableDiffusionPipeline(**sub_models)
```
is super important! This is the thing that most of us new to the library do not understand at first. By now I understand it probably works like that, but there are still some aspects of it I'm still wary of. How safe is it to share those models between tasks? Are they immutable? Do the "sub-models" depend on their siblings at all? e.g. if I want a different scheduler or feature extractor in my second pipeline, is anything going to be mis-configured or accidentally cross-referenced to the one from the first pipeline?
In any case, I think emphasizing that with a prominent example (with that function to replace the dict comprehension) would go a long way to resolving the "multiple pipeline instances are too expensive in load time + RAM" concerns.
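For instance, such an example could show how to override a single component when building the second pipeline; something like the sketch below, assuming the `sub_models` dict from above (whether this is actually safe is exactly what I'd want documented):

```python
from diffusers import DDIMScheduler, StableDiffusionImg2ImgPipeline

# Swap only the scheduler; the heavy sub-models (unet, vae, text_encoder)
# stay shared. Schedulers hold no learned weights, so in principle this
# should not cross-configure the first pipeline.
ddim = DDIMScheduler.from_config(stable_diffusion.scheduler.config)
img2img = StableDiffusionImg2ImgPipeline(**{**sub_models, "scheduler": ddim})
```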
> having an official one-fits-it-all pipeline then puts us in a tough spot: "where do we draw the line", […]
This is a big thing I'm not clear on too. What's the intended scope of 🤗🧨diffusers?
There's obvious demand for an image-synthesis engine to provide the implementation for web apps, other gRPC clients, desktop apps like Krita/GIMP/Blender/Photoshop, etc. There's a bunch of common tasks they all want to do, and they're all going to be built by people who don't know which tasks share which submodels with the pipelines for which other tasks, and they don't really care. They just want some memory-efficient implementation that gives them access to the whole toolbox, with maybe a couple knobs to turn if there are significant speed/memory trade-offs available.
(I'm talking about image synthesis here, but I expect the same will apply for voice and music as soon as those audio models drop.)
Does 🧨diffusers want to be that? Or should we plan for that to be a different library instead? (and who is hiring me to build it? 😄)
There are several topics being discussed here.
If we want to prevent the problem that people may load modules in inefficient ways, I think the best course of action would be to document how to do it with a snippet similar to @patrickvonplaten's (loved it lol). And fully address the very reasonable concerns that were raised about potential interdependencies. I think this would also make it clearer how pipelines are made up of pluggable components, which would be very helpful and instructive.
Another debate seems to revolve around code repetition, which is mostly about maintainability in my opinion. I agree that from a pure software-practice perspective it feels a little bit "cringy", but as Patrick explained, it's a conscious compromise. Maintainers take the hit of having to repeat portions of the code in exchange for code clarity and, very importantly, future evolution. I'd rather work towards something like `AutoPipelineForText2Image` than try to come up with perfect abstractions for the current code and have to revise them every time a new task, feature or model is introduced. Inconsistencies should surface in tests, and that's another place to put effort into.
There could exist an intermediate abstraction layer in the blocks and data structures used by pipelines, as @keturn pointed out, even if pipelines are still independent. This is a good point that is worth exploring, also keeping in mind that other modalities and tasks may exist in the future.
I may be mistaken, but I think it could be beneficial to advertise the existing pipelines as examples of how to use the building blocks, and work towards those "one liners that just work" that @okalldal mentioned. I'm not sure that a single complex pipeline object with a lot of parameters would help towards that goal.
+1 for factoring out shared functionality. Duplicate code is an issue, and it means there is work to be done in order to achieve good code quality here. DRY is a core best practice in software engineering.
I wanted to better understand where the current approach is coming from, so I went back and re-read the blog post in more detail. There is a reasonable motivation behind the original philosophy (making model code easily understandable for developers new to the library), but this case highlights two core issues:
1. The philosophy does not adequately emphasize the drawbacks of duplicate code. It reads like "we don't have to follow DRY at all", whereas it should at most say "we are breaking DRY under very specific circumstances, and otherwise our codebase must still follow this principle."
2. As a result, the philosophy is being applied far too broadly. The original article refers to shared logic for different models after development is complete, but here we see duplicate code for three different pipelines running the same model, which are still under development. This repo is not following the single model file policy anyway (e.g. the U-Net is spread across multiple files that include implementation details for all three of these pipelines); instead we just have duplicate pipeline code. The response that SD is not 1-to-1 with Transformers does not provide a concrete rationale for this duplicate code; if anything, that is another reason why the Transformers philosophy is inapplicable in this case.
If you have to keep the duplicate code in these pipelines, please at least add comments that clearly identify which blocks of code are duplicated, and which other files need to be updated when a developer changes one of them. The diffusers copy mechanism and better unit testing should be made priorities in order to support this, but in general, duplicate code is not the optimal solution here.
Overall, the primary benefit of this philosophy is being undermined in Stable Diffusion's case. This benefit is to improve the experience of a developer that is new to the library, by making it easier to understand how a model works. However, as we can see from this discussion, it is now being applied too broadly and actually degrading the development experience instead. As to package users, better code quality will always be preferable from the standpoint of a downstream package, because we need to be able to trust that new releases will not introduce new bugs (when we upgrade). Regarding the points raised about the API, this design decision should not be coupled with any aspect of the API. I'm building a downstream package currently and will probably rebuild all of this pipeline functionality to avoid these issues. That is why the decision to sacrifice code quality does not make sense in this case.
Very nice ideas in this thread here!
To me the main concern seems to be about not factoring out shared functionality and code. I also agree that this could help in better understanding the differences. Happy to factor out the code that all pipelines share into a function and then make sure it stays identical across the other pipelines.
Big +1 on @pcuenca's point that pipelines are rather supposed to be examples. We hope that most downstream applications directly make use of `schedulers` and `models` instead of the very high-level pipelines.
Will focus on two things from this PR:
- add a function to retrieve all components
- add better docs for this
- factor out a function that includes shared common code
- add a copy mechanism so that the function can stay in the pipeline file but we ensure identical code across pipelines
@patrickvonplaten, awesome! Sounds good to me. Thank you for making these updates, and I'll make sure to follow the advice regarding examples / `schedulers` and `models` as well. Looking forward to checking out the refactored pipelines.
I've been away from Python for a while so I hope these two questions aren't too painful to answer.
I see there is already a base class of sorts, which leaves me confused. It could just be that you needed to get to this point to see which code can actually be moved to the base class.
PR #700, submitted just before my comment, seems to be someone who has implemented a similar idea, albeit not passing tests quite yet.
Will wait with the refactor here until the schedulers API is updated: https://github.com/huggingface/diffusers/issues/336
It's odd seeing this entire discourse, having gone through the entire thing myself; this was one of the things that kept me away from ever using `diffusers`. Yes, the unified "Pipeline" makes sense. After having gyrated between `diffusers` and `CompVis`, frustrated by the difficulty of adding new samplers to `diffusers` and by the many SoTA features/optimizations coming out for `CompVis`, I just wrote my own unified "Pipeline" using `CompVis`.
It's great, it's easy to use, and it's easy to manipulate if I need to play with it. I don't know why there's so much push-back against it.
Hey @leszekhanusz , @keturn , @shirayu and everybody who is interested to review!
1.) - I've made a first draft of a "One-for-all-tasks Stable Diffusion" pipeline here: https://github.com/huggingface/diffusers/pull/821 . I'd be very happy if you could review the PR and give feedback on the naming (think "mega" is not a great name, but couldn't think of anything better).
Note: This is orthogonal to:
2.) - https://github.com/huggingface/diffusers/issues/551#issuecomment-1259415393, which I will try to tackle early next week.
So 2.) is more about more explicit and maintainable code, and 1.) is to satisfy the "all-in-one" pipeline request.
Keen to hear your thoughts :-)
@leszekhanusz I agree with you completely. It looks like the pipeline is divided and organized, but when you actually use it, the UX is obviously bad.
@kosukekurimoto thanks for the feedback, could you share some more detail on what is bad? Happy to try to make it cleaner next week!
@patrickvonplaten It uses extra memory if the classes are separated😇
e.g.
```python
import torch
from cog import BasePredictor  # assuming this example runs on Replicate's cog
from diffusers import (
    PNDMScheduler,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionInpaintPipeline,
)

# MODEL_CACHE is defined elsewhere in the original script


class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        print("Loading pipeline...")
        scheduler = PNDMScheduler(
            beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear"
        )
        # self.pipeText2Img is loaded the same way (elided here); each
        # from_pretrained call below loads its own copy of the weights
        self.pipeImg2Img = StableDiffusionImg2ImgPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4",
            scheduler=scheduler,
            revision="fp16",
            torch_dtype=torch.float16,
            cache_dir=MODEL_CACHE,
            local_files_only=True,
        ).to("cuda")
        self.pipeInPaint = StableDiffusionInpaintPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4",
            scheduler=scheduler,
            revision="fp16",
            torch_dtype=torch.float16,
            cache_dir=MODEL_CACHE,
            local_files_only=True,
        ).to("cuda")

    def predict(self, 〜):
        if predict_type == "text2img":
            # use stable-diffusion model
            self.pipeText2Img.scheduler = scheduler
            output = self.pipeText2Img(
                prompt=[prompt] * num_outputs if prompt is not None else None,
                width=width,
                height=height,
                guidance_scale=guidance_scale,
                generator=generator,
                num_inference_steps=num_inference_steps,
            )
        elif predict_type == "img2img":
            # use material-stable-diffusion model
            self.pipeImg2Img.scheduler = scheduler
            output = self.pipeImg2Img(
                prompt=[prompt] * num_outputs if prompt is not None else None,
                init_image=init_image,
                strength=prompt_strength,
                num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale,
                generator=generator,
            )
        〜
```
Hey everybody,
We've now merged the "all-in-one" community pipeline which doesn't require any additional memory to run all pipelines. Please have a look at: https://github.com/huggingface/diffusers/tree/main/examples/community#stable-diffusion-mega
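Usage looks roughly like this (see the linked README for the exact details):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="stable_diffusion_mega",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# one set of weights in memory, three tasks
images = pipe.text2img("An astronaut riding a horse").images
# pipe.img2img(...) and pipe.inpaint(...) reuse the same sub-models
```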
Will try to tackle the main "factor out functions" PR tomorrow
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
> Very nice ideas in this thread here!
>
> To me the main concern seems to be about not factoring out shared functionality and code. I also agree that this could help in better understanding the differences. Happy to factor out the code that all pipelines share into a function and then make sure it stays identical across the other pipelines.
>
> Big +1 on @pcuenca's point that pipelines are rather supposed to be examples. We hope that most downstream applications directly make use of `schedulers` and `models` instead of the very high-level pipelines. Will focus on two things from this PR:
>
> - add a function to retrieve all components
> - add better docs for this
> - factor out a function that includes shared common code
> - add a copy mechanism so that the function can stay in the pipeline file but we ensure identical code across pipelines
Closing this issue now as https://github.com/huggingface/diffusers/pull/1269 finishes off the 4 points above.
"make fix-copies"
to copy all the functionality that is shared. I believe this at least partly resolves this issue. Note that since we started this discussion the "official" in-painting checkpoint has come out which uses a different model architecture than CompVis/stable-diffusion-v1-4 (more latent channels) => so those models just need different code and IMO it's good that they are separated here.
Please leave a comment below if you would still want to have some major changes and don't think this issue is "solved" yet.
Is it possible to implement `enable_xformers_memory_efficient_attention` for Mega?
Hey @thealanreis, yes! It should look very similar to how it's done in the `StableDiffusionPipeline`. Would you like to take a shot at this? Happy to help :)
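For reference, the existing implementation is roughly this (simplified), so the Mega pipeline would mainly need the same delegation to its UNet:

```python
def enable_xformers_memory_efficient_attention(self):
    # Delegates to the UNet, which switches its attention blocks over to
    # xformers' memory-efficient attention kernels.
    self.unet.set_use_memory_efficient_attention_xformers(True)
```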
Hello @patrickvonplaten, thanks for the hard work for the community. I love diffusers! What about this PR, though? I would love to have full compatibility with diffusers-interpret in the latest version, and the changes do not seem that huge.
Hey @federicotorrielli,
Thanks for the nice words! Regarding "the changes do not seem that huge", I must disagree a bit. The intro of the PR is:
- Having the option to run DiffusionPipeline.call while calculating gradients;
- Having a output_latents flag (similar to output_scores/output_attentions/etc from transformers) that adds a latents attribute to the output;
- Deactivating safety checker; removed this option (https://github.com/huggingface/diffusers/commit/12ca96990b0e6d7ef34a0c4bbcd9819ffe2752bd)
- Passing text_embeddings directly instead of the text string;
- Gradient checkpointing (this was already a feature in transformers too).
In general, for each of these points it'd be nice to open a separate issue/PR.
In more detail:
@patrickvonplaten thanks for the explanation. How can I help? Is there a possibility that, if I open a feature request for points i and iv, they would be implemented?
I think iv., from my point of view, should definitely be feasible. i) is a big change that will take quite some time. So maybe iv) if you'd like :-)
We'll have to see with the other maintainers of this library though
After this PR changed 78 files just to fix a type error, I'm still pretty frustrated about this issue. It would be nice to see movement toward resolving it instead of citing the "philosophy" document, whose goal seems to be producing a large quantity of broken/duplicate code.
Following the Philosophy, it has been decided to keep different pipelines for Stable Diffusion for txt-to-img, img-to-img and inpainting.
Here is the result:
It's getting even worse! Now with onnx and the obvious request to have img-to-img and inpainting pipelines with onnx (#510), that would mean that every modification will need to be duplicated 6 times in total, with 6 times the number of tests.
This is untenable...