I just noticed that in PR #549, the code was not duplicated exactly the same way: sometimes multiplying by `batch_size`, sometimes not. Proof again that duplicating code is never good and introduces bugs.
I'm new here, but I also noticed that most of the code is just duplicated, which makes it very hard when you want to make a change to all three :(
Good suggestion. I agree with you.
Hi @leszekhanusz, you raise an important question: the number of Stable Diffusion pipelines is rising, and we don't have a robust copy mechanism like the one in `transformers` yet.
Inviting @patil-suraj @patrickvonplaten @pcuenca to discuss possible solutions here :)
In the meantime, I'll address the mismatching pipeline issues manually. Thank you for a thorough compilation!
We can just add the copy mechanism we have in Transformers.
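For reference, the mechanism in `transformers` works through structured `# Copied from ...` comments that `make fix-copies` enforces in CI; roughly like this (illustrative example taken from `transformers`, not actual diffusers code):

```python
from torch import nn

# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->Marian
class MarianAttention(nn.Module):
    # `make fix-copies` locates the marker above, compares this class body
    # against the referenced source (applying the Bart->Marian rename), and
    # rewrites it automatically whenever the two drift apart.
    ...
```

So the code is still physically duplicated in each file, but CI guarantees the copies never diverge.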
I'm quite strongly against creating a unified pipeline, as it makes it impossible down the road to define classes like `AutoPipelineForText2Image`.
Also note that default values cannot be coherently defined anymore. In text2image we use 50 inference steps, in image2image just 30 => I don't want to clutter the code with `if type == ...` statements.
We will surely not merge the ONNX and non-ONNX pipelines, because one should not need ONNX installed to be able to use a pipeline, and optimizations might very well be pipeline-specific.
For more general arguments regarding this design see also here: https://huggingface.co/blog/transformers-design-philosophy
Overall: happy to add a "one-size-fits-it-all" community pipeline, which we'll also support in `pip install diffusers` in 1-2 weeks by mirroring community pipelines on the Hub. If people use this extensively, we can also add it to src/diffusers.
What do you think, @leszekhanusz?
I've read your blog post and you make some good points but some of it seems to contradict you:
> we decided to adopt a different design principle which we like to call the single model file policy. The single model file policy states that all code necessary for the forward pass of a model is in one and only one file.
Here we have a single model (Stable Diffusion) which has 3 different pipelines (ignoring ONNX for now) in different files, all doing more or less the same thing. We could argue that having all the code in a single pipeline is actually following your single model file policy.
Did you look at what the code looks like in the unified pipeline? You have `if init_image is None:` and `if mask_image is not None:` conditions, and it's really not complicated to follow those branches.
On the contrary, now it's getting easier to see the difference between img-to-img and inpainting than to try to compare code in different files.
For someone like me, who is quite new to diffusion models, what's complicated is understanding the model: what classifier-free guidance is, schedulers, torch broadcasting, and the shapes of tensors. Tuning out a block like `if mask_image is not None:` is a piece of cake in comparison, and it allows one to see more easily the difference between the different usages.
> We can just add the copy mechanism we have in Transformers.
To be honest, from an external eye, this seems like a hack to me.
> I'm quite strongly against creating a unified pipeline, as it makes it impossible down the road to define classes like `AutoPipelineForText2Image`.
I don't see why. If you set `init_image` and `mask_image` to None, then the pipeline is the same as the text-to-img pipeline.
> Also note that default values cannot be coherently defined anymore. In text2image we use 50 inference steps, in image2image just 30 => I don't want to clutter the code with `if type == ...` statements.
First of all, for most use cases the user will provide the number of inference steps themselves, as it is quite an important parameter they would probably like to make configurable.
But if you want to keep backward compatibility and allow custom defaults for each pipeline, I propose to create a `StableDiffusionCommonPipeline` which will contain the unified pipeline. Then the current `StableDiffusionPipeline`, used only for text-to-image, could be replaced by something like:
```python
class StableDiffusionPipeline(StableDiffusionCommonPipeline):
    def __call__(
        self,
        prompt: Union[str, List[str]],
        height: Optional[int] = 512,
        width: Optional[int] = 512,
        num_inference_steps: Optional[int] = 50,
        guidance_scale: Optional[float] = 7.5,
        eta: Optional[float] = 0.0,
        generator: Optional[torch.Generator] = None,
        latents: Optional[torch.FloatTensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = True,
        **kwargs,
    ):
        return super().__call__(
            prompt=prompt,
            height=height,
            ...
            init_image=None,
            mask_image=None,
            ...
        )
```
> We will surely not merge the ONNX and non-ONNX pipelines, because one should not need ONNX installed to be able to use a pipeline, and optimizations might very well be pipeline-specific.
Sure, I agree.
> Happy to add a "one-size-fits-it-all" community pipeline, which we'll also support in `pip install diffusers` in 1-2 weeks by mirroring community pipelines on the Hub. If people use this extensively, we can also add it to src/diffusers
That's great! From the user point of view that is really important. That "one-size-fits-all" pipeline should really be inside src/diffusers so that it can be integrated easily into other projects and automatically updated with an upgrade of diffusers.
> User experience is much more important for us than maintenance. We should not write code that is only concerned with improving our maintenance; we should mostly write code that is easy to use and won't have us run into big backwards-breaking problems in the future.
That's good to hear, but the user experience could be better right now.
Another aspect to this from a user-experience perspective: right now, naively loading all pipelines via `from_pretrained` takes three times the GPU memory. If you want to use all three in the same script or gradio interface, you have to either load the models separately and init the pipelines via the initializer manually, or juggle the pipelines in and out of your GPU memory.
@okalldal it's possible to merge all the existing pipelines into a single one. Available here.
Very good points here! Thanks a lot for writing all of this down. I still don't agree with most of it here though :sweat_smile: Writing my thoughts down a final time before giving in haha.
> We could argue that having all the code in a single pipeline is actually following your single model file policy.

Clear no. The single model file policy means exactly not to build one model that fits it all, but to have multiple modeling files. This was misunderstood. It was also written for `transformers`, so it cannot be applied one-to-one here.

> On the contrary, now it's getting easier to see the difference between img-to-img and inpainting than to try to compare code in different files.

Don't agree either. We now cannot set sensible defaults anymore, and we clutter the `__call__` with input args (like `strength`, `latents`, `mask_image`, ...). How would one know what guidance scale should be used for the pipelines, what number of inference steps to set? It's not true that most people understand `num_inference_steps` well enough that it doesn't require a good default. People will mix functionality for different tasks and we'll get issues about it, I believe. Let's see!
A point that is not taken into account a lot here, IMO: users of `diffusers` include both "end-users" and "packages".

"Packages" are the ones that build on top of `diffusers`, and many good "packages" are more important than direct "end-users". PyTorch is so highly used because every ML package can super easily build on it and it has an intuitive design, ... not because they started adding a stable diffusion model to their core library. As argued above, a unified pipeline also makes it impossible to define an `AutoText2ImagePipeline` class. E.g. having image-to-image, in-painting, and text2image all together is very stable-diffusion-specific; having everything in one pipeline makes us prone to future breaking changes. Most arguments here fall away for the "library" users:
```python
from diffusers import (
    DiffusionPipeline,
    StableDiffusionPipeline,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionInpaintPipeline,
)

stable_diffusion = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
sub_models = {k: v for k, v in vars(stable_diffusion).items() if not k.startswith("_")}

text2img = StableDiffusionPipeline(**sub_models)
img2img = StableDiffusionImg2ImgPipeline(**sub_models)
inpaint = StableDiffusionInpaintPipeline(**sub_models)

def all_in_one(type, *args, **kwargs):
    if type == "text2img":
        return text2img(*args, **kwargs)
    ...
```
**Note**: `{k: v for k, v in vars(stable_diffusion).items() if not k.startswith("_")}` is very hacky, but we could very well provide a nice function for this in the downstream `DiffusionPipeline` class (cc @patil-suraj @anton-l). I think this would be useful anyways.
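A minimal sketch of what such a helper could look like (the `components` property name here is hypothetical, just for illustration):

```python
class DiffusionPipeline:
    ...

    @property
    def components(self):
        # Expose the registered sub-models (unet, vae, text_encoder, ...),
        # skipping private attributes, so a second task pipeline can reuse
        # them without loading another copy of the weights.
        return {k: v for k, v in vars(self).items() if not k.startswith("_")}


# usage once such a helper exists:
# img2img = StableDiffusionImg2ImgPipeline(**stable_diffusion.components)
```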
- Why is this better? This way the `diffusers` repo can focus on making clean task-specific pipelines and doesn't have to put the burden of maintaining higher-level functionality on us. E.g. now imagine stable diffusion gets two more tasks that can be done with the same model => having an official one-fits-it-all pipeline then puts us in a tough spot: "where do we draw the line?", "can we add the new features without making the code unreadable? Do we have to break existing behavior?" => it becomes obvious that the one-size-fits-it-all will quickly break.
- I think a big problem is that we don't advertise well enough how `diffusers` is intended to be used. E.g. it's really not that hard to build something on top of `diffusers` that doesn't require 3x the memory usage IMO
Nevertheless, I do want to have this library be community driven, but I think the above aspects and the long-term vision for the library is not taken into account enough.
=> So if you want, happy to follow your advice, but I'd add it as an "experimental" pipeline. However, I won't maintain the pipeline and will ping you quite a bit on issues, I think, @leszekhanusz haha ;-). Thanks for taking the time to write everything down though - "ease-of-use" for such a young library might indeed be the winning argument at the end of the day.
Think this is a cool discussion, so I will also ask the Twitter community for their opinion :-)
Also cc @keturn @shirayu @johnowhitaker @jonatanklosko @joaoLages @pcuenca @patil-suraj @anton-l @kashif @sgrigory very interested in hearing your final thoughts!
Also here a twitter poll: https://twitter.com/PatrickPlaten/status/1572980566579941381
I think I'm on team 'three classes'. If you want to do a quick demo of one application, it's easy to just load, say, text2img. If you want multiple, your example is very helpful; it'd be nice to document what the `sub_models` are and how to avoid loading duplicates like you do there. I also think a lot of people trying more complex things won't be loading three pipelines; they'll be defining their own sampling process anyway, where they can add all the extra features they want. That being said, I'm not the one having to maintain the code ;)
I understand the '3 classes' arguments but, as @leszekhanusz mentioned, having the code duplicated for all of them is very error-prone.
IMO, the solution proposed by @leszekhanusz of having the backbone class is what I thought of.
To exemplify:
```python
class CoreStableDiffusionPipeline:
    ...
    def _forward(self, prompt, init_image: Optional = None, mask_image: Optional = None, ...):
        ...

class StableDiffusionPipeline(CoreStableDiffusionPipeline):
    ...
    def __call__(self, prompt, ...):
        return self._forward(prompt, ...)

class StableDiffusionImg2ImgPipeline(CoreStableDiffusionPipeline):
    ...
    def __call__(self, prompt, init_image, ...):
        assert init_image is not None  # avoid incorrect class usage
        return self._forward(prompt, init_image, ...)

class StableDiffusionInpaintPipeline(CoreStableDiffusionPipeline):
    ...
    def __call__(self, prompt, init_image, mask_image, ...):
        assert init_image is not None and mask_image is not None  # avoid incorrect class usage
        return self._forward(prompt, init_image, mask_image, ...)
```
This way we would still keep the 3 classes separated and the code would be way easier to maintain. Do you guys see any disadvantages in this solution?
Yes! People now need to switch between multiple files, `CoreStableDiffusionPipeline` and `StableDiffusionInpaintPipeline`. Maintaining different files is really not that big of a problem, and much less of a concern in OSS than in industry settings.
> Yes! People now need to switch between multiple files, `CoreStableDiffusionPipeline` and `StableDiffusionInpaintPipeline`.
If that's a problem, the `CoreStableDiffusionPipeline` subclasses could all be in the same file.
This solution is better at avoiding the case where functionality is added/removed in one of the pipelines but not in the others; that's the way I see it.
As you say @patrickvonplaten, it is not really difficult to get the pipelines to share the same models now either. The non-hack way that already works is to just load each model separately via its subdirectory instead of from a pipeline. Maybe a simple example in the documentation is enough for most end-users in this regard, and merging the pipelines is overkill. It's just that one tends to get spoiled by the "one liner that just works" huggingface experience that we all know and love 😉
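For reference, a sketch of that non-hack way (assuming the standard sub-model folder layout of the checkpoint; the scheduler, safety checker, etc. are omitted for brevity):

```python
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel

model_id = "CompVis/stable-diffusion-v1-4"

# Each sub-model is loaded exactly once from its own subdirectory...
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

# ...and each task pipeline is then constructed around the shared instances.
```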
> Maintaining different files is really not that big of a problem, and much less of a concern in OSS than in industry settings.
That hasn't been my experience.
There is a lot in this thread I could potentially ramble about; let me see if I can find a few things to focus on.
As for txt2img, img2img, and inpaint: I wouldn't want the interface for those three different tasks to all be squished into one function. But classes are collections of data and methods, and these three tasks use the same sub-models and largely the same sampling logic. That suggests to me that however we define the interface to those three tasks, they should share almost all the same data structures and implementations, leaving us better able to focus on what the distinguishing factors between the tasks really are.
Side note: this practice of having a class with basically one method (`__call__` in this case) is not a super common pattern in my experience. I don't hate it, not at all, but you know you are allowed to define more than one public method on a class, right? 😉
Your example here:
```python
stable_diffusion = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
sub_models = {k: v for k, v in vars(stable_diffusion).items() if not k.startswith("_")}
text2img = StableDiffusionPipeline(**sub_models)
```
is super important! This is the thing that most of us new to the library do not understand at first. By now I understand it probably works like that, but there are still some aspects of it I'm still wary of. How safe is it to share those models between tasks? Are they immutable? Do the "sub-models" depend on their siblings at all? e.g. if I want a different scheduler or feature extractor in my second pipeline, is anything going to be mis-configured or accidentally cross-referenced to the one from the first pipeline?
In any case, I think emphasizing that with a prominent example (with that function to replace the dict comprehension) would go a long way to resolving the "multiple pipeline instances are too expensive in load time + RAM" concerns.
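For instance, such an example could show how to override a single component when building the second pipeline; something like the sketch below, assuming the `sub_models` dict from above (whether this is actually safe is exactly what I'd want documented):

```python
from diffusers import DDIMScheduler, StableDiffusionImg2ImgPipeline

# Swap only the scheduler; the heavy sub-models (unet, vae, text_encoder)
# stay shared. Schedulers hold no learned weights, so in principle this
# should not cross-configure the first pipeline.
ddim = DDIMScheduler.from_config(stable_diffusion.scheduler.config)
img2img = StableDiffusionImg2ImgPipeline(**{**sub_models, "scheduler": ddim})
```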
> having an official one-fits-it-all pipeline then puts us in a tough spot: "where do we draw the line", […]
This is a big thing I'm not clear on too. What's the intended scope of 🤗🧨diffusers?
There's obvious demand for an image-synthesis engine to provide the implementation for web apps, other gRPC clients, desktop apps like Krita/GIMP/Blender/Photoshop, etc. There's a bunch of common tasks they all want to do, and they're all going to be built by people who don't know which tasks share which submodels with the pipelines for which other tasks, and they don't really care. They just want some memory-efficient implementation that gives them access to the whole toolbox, with maybe a couple knobs to turn if there are significant speed/memory trade-offs available.
(I'm talking about image synthesis here, but I expect the same will apply for voice and music as soon as those audio models drop.)
Does 🧨diffusers want to be that? Or should we plan for that to be a different library instead? (and who is hiring me to build it? 😄)
There are several topics being discussed here.
If we want to prevent the problem that people may load modules in inefficient ways, I think the best course of action would be to document how to do it with a snippet similar to @patrickvonplaten's (loved it lol). And fully address the very reasonable concerns that were raised about potential interdependencies. I think this would also make it clearer how pipelines are made up of pluggable components, which would be very helpful and instructive.
Another debate seems to revolve around code repetition, which is mostly about maintainability in my opinion. I agree that from a pure software-practice perspective it feels a little bit "cringy", but as Patrick explained, it's a conscious compromise. Maintainers take the hit of having to repeat portions of the code in exchange for code clarity and, very importantly, future evolution. I'd rather work towards something like `AutoPipelineForText2Image` than try to come up with perfect abstractions for the current code and have to revise them every time a new task, feature or model is introduced. Inconsistencies should surface in tests, and that's another place to put effort into.
There could exist an intermediate abstraction layer in the blocks and data structures used by pipelines, as @keturn pointed out, even if pipelines are still independent. This is a good point that is worth exploring, also keeping in mind that other modalities and tasks may exist in the future.
I may be mistaken, but I think it could be beneficial to advertise the existing pipelines as examples of how to use the building blocks, and work towards those "one liners that just work" that @okalldal mentioned. I'm not sure that a single complex pipeline object with a lot of parameters would help towards that goal.
+1 for factoring out shared functionality. Duplicate code is an issue, and it means there is work to be done in order to achieve good code quality here. DRY is a core best practice in software engineering.
I wanted to better understand where the current approach is coming from, so I went back and re-read the blog post in more detail. There is a reasonable motivation behind the original philosophy (making model code easily understandable for developers new to the library), but this case highlights two core issues:
1. The philosophy does not adequately emphasize the drawbacks of duplicate code. It reads like "we don't have to follow DRY at all", whereas it should at most say "we are breaking DRY under very specific circumstances, and otherwise our codebase must still follow this principle."
2. As a result, the philosophy is being applied far too broadly. The original article refers to shared logic for different models after development is complete, but here we see duplicate code for three different pipelines running the same model, which are still under development. This repo is not following the single model file policy anyway (e.g. the U-Net is spread across multiple files that include implementation details for all three of these pipelines); instead we just have duplicate pipeline code. The response that SD is not 1-to-1 with Transformers does not provide a concrete rationale for this duplicate code; if anything, that is another reason why the Transformers philosophy is inapplicable in this case.
If you have to keep the duplicate code in these pipelines, please at least add comments that clearly identify which blocks of code are duplicated, and which other files need to be updated when a developer changes one of them. The diffusers copy mechanism and better unit testing should be made priorities in order to support this, but in general, duplicate code is not the optimal solution here.
Overall, the primary benefit of this philosophy is being undermined in Stable Diffusion's case. This benefit is to improve the experience of a developer that is new to the library, by making it easier to understand how a model works. However, as we can see from this discussion, it is now being applied too broadly and actually degrading the development experience instead. As to package users, better code quality will always be preferable from the standpoint of a downstream package, because we need to be able to trust that new releases will not introduce new bugs (when we upgrade). Regarding the points raised about the API, this design decision should not be coupled with any aspect of the API. I'm building a downstream package currently and will probably rebuild all of this pipeline functionality to avoid these issues. That is why the decision to sacrifice code quality does not make sense in this case.
Very nice ideas in this thread here!
To me the main concern seems to be about not factoring out shared functionality and code. I also agree that this could help in better understanding the differences. Happy to factor out the code that all pipelines share into a function and then make sure it stays identical across the other pipelines.
Big +1 on @pcuenca's point that pipelines are rather supposed to be examples. We hope that most downstream applications directly make use of `schedulers` and `models` instead of the very high-level pipelines.
Will focus on two things from this PR:
- add a function to retrieve all components
- add better docs for this
- factor out a function that includes shared common code
- add a copy mechanism so that the function can stay in the pipeline file but we ensure identical code across pipelines
@patrickvonplaten, awesome! Sounds good to me. Thank you for making these updates, and I'll make sure to follow the advice regarding examples / `schedulers` and `models` as well. Looking forward to checking out the refactored pipelines.
I've been away from Python for a while so I hope these two questions aren't too painful to answer.
I see there is already a base class of sorts, which leaves me confused. It could just be that you needed to get to this point to see which code can actually be moved to the base class.
PR #700, submitted just before my comment, seems to be someone who has implemented a similar idea, albeit not passing tests quite yet.
Will wait with the refactor here until the schedulers API is updated: https://github.com/huggingface/diffusers/issues/336
It's odd seeing this entire discourse, having gone through the entire thing myself; this was one of the things that kept me away from ever using `diffusers`. Yes, the unified "Pipeline" makes sense. After having gyrated between `diffusers` and `CompVis`, frustrated by the difficulty of adding new samplers to `diffusers` and by the many SoTA features/optimizations coming out for `CompVis`, I just wrote my own unified "Pipeline" using `CompVis`.
It's great, it's easy to use, and it's easy to manipulate if I need to play with it. I don't know why there's so much push-back against it.
Hey @leszekhanusz , @keturn , @shirayu and everybody who is interested to review!
1.) - I've made a first draft of a "One-for-all-tasks Stable Diffusion" pipeline here: https://github.com/huggingface/diffusers/pull/821 . I'd be very happy if you could review the PR and give feedback on the naming (think "mega" is not a great name, but couldn't think of anything better).
Note: This is orthogonal to:
2.) - https://github.com/huggingface/diffusers/issues/551#issuecomment-1259415393, which I will try to tackle early next week.
So 2.) is more about more explicit and maintainable code, and 1.) is to satisfy the "all-in-one" pipeline request.
Keen to hear your thoughts :-)
@leszekhanusz I agree with you completely. It looks like the pipeline is divided and organized, but when you actually use it, the UX is obviously bad.
@kosukekurimoto thanks for the feedback, could you share some more detail on what is bad? Happy to try to make it cleaner next week!
@patrickvonplaten It uses extra memory if the classes are separated😇
e.g.
```python
import torch
from cog import BasePredictor  # assuming this example runs on Replicate's cog
from diffusers import (
    PNDMScheduler,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionInpaintPipeline,
)

# MODEL_CACHE is defined elsewhere in the original script


class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        print("Loading pipeline...")
        scheduler = PNDMScheduler(
            beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear"
        )
        # self.pipeText2Img is loaded the same way (elided here); each
        # from_pretrained call below loads its own copy of the weights
        self.pipeImg2Img = StableDiffusionImg2ImgPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4",
            scheduler=scheduler,
            revision="fp16",
            torch_dtype=torch.float16,
            cache_dir=MODEL_CACHE,
            local_files_only=True,
        ).to("cuda")
        self.pipeInPaint = StableDiffusionInpaintPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4",
            scheduler=scheduler,
            revision="fp16",
            torch_dtype=torch.float16,
            cache_dir=MODEL_CACHE,
            local_files_only=True,
        ).to("cuda")

    def predict(self, 〜):
        if predict_type == "text2img":
            # use stable-diffusion model
            self.pipeText2Img.scheduler = scheduler
            output = self.pipeText2Img(
                prompt=[prompt] * num_outputs if prompt is not None else None,
                width=width,
                height=height,
                guidance_scale=guidance_scale,
                generator=generator,
                num_inference_steps=num_inference_steps,
            )
        elif predict_type == "img2img":
            # use material-stable-diffusion model
            self.pipeImg2Img.scheduler = scheduler
            output = self.pipeImg2Img(
                prompt=[prompt] * num_outputs if prompt is not None else None,
                init_image=init_image,
                strength=prompt_strength,
                num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale,
                generator=generator,
            )
        〜
```
Hey everybody,
We've now merged the "all-in-one" community pipeline which doesn't require any additional memory to run all pipelines. Please have a look at: https://github.com/huggingface/diffusers/tree/main/examples/community#stable-diffusion-mega
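Usage looks roughly like this (see the linked README for the exact details):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="stable_diffusion_mega",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# one set of weights in memory, three tasks
images = pipe.text2img("An astronaut riding a horse").images
# pipe.img2img(...) and pipe.inpaint(...) reuse the same sub-models
```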
Will try to tackle the main "factor out functions" PR tomorrow
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
> Very nice ideas in this thread here!
>
> To me the main concern seems to be about not factoring out shared functionality and code. I also agree that this could help in better understanding the differences. Happy to factor out the code that all pipelines share into a function and then make sure it stays identical across the other pipelines.
>
> Big +1 on @pcuenca's point that pipelines are rather supposed to be examples. We hope that most downstream applications directly make use of `schedulers` and `models` instead of the very high-level pipelines. Will focus on two things from this PR:
>
> - add a function to retrieve all components
> - add better docs for this
> - factor out a function that includes shared common code
> - add a copy mechanism so that the function can stay in the pipeline file but we ensure identical code across pipelines
Closing this issue now as https://github.com/huggingface/diffusers/pull/1269 finishes off the 4 points above.
"make fix-copies"
to copy all the functionality that is shared. I believe this at least partly resolves this issue. Note that since we started this discussion the "official" in-painting checkpoint has come out which uses a different model architecture than CompVis/stable-diffusion-v1-4 (more latent channels) => so those models just need different code and IMO it's good that they are separated here.
Please leave a comment below if you would still want to have some major changes and don't think this issue is "solved" yet.
Is it possible to implement `enable_xformers_memory_efficient_attention` for Mega?
Hey @thealanreis, yes! It should look very similar to how it's done in the `StableDiffusionPipeline`. Would you like to take a shot at this? Happy to help :)
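For reference, the existing implementation is roughly this (simplified), so the Mega pipeline would mainly need the same delegation to its UNet:

```python
def enable_xformers_memory_efficient_attention(self):
    # Delegates to the UNet, which switches its attention blocks over to
    # xformers' memory-efficient attention kernels.
    self.unet.set_use_memory_efficient_attention_xformers(True)
```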
Hello @patrickvonplaten, thanks for the hard work for the community. I love diffusers! What about this PR, though? I would love to have full compatibility with diffusers-interpret in the latest version, and the changes do not seem that huge.
Hey @federicotorrielli,
Thanks for the nice words! Regarding "the changes do not seem that huge", I must disagree a bit. The intro of the PR is:
- Having the option to run DiffusionPipeline.call while calculating gradients;
- Having a output_latents flag (similar to output_scores/output_attentions/etc from transformers) that adds a latents attribute to the output;
- Deactivating safety checker; removed this option (https://github.com/huggingface/diffusers/commit/12ca96990b0e6d7ef34a0c4bbcd9819ffe2752bd)
- Passing text_embeddings directly instead of the text string;
- Gradient checkpointing (this was already a feature in transformers too).
In general, for each of these points it'd be nice to open a separate issue/PR.
In more detail:
@patrickvonplaten thanks for the explanation. How can I help? Is there a possibility that, if I open a feature request for points i and iv, they would be implemented?
I think iv., from my point of view, should definitely be feasible. i) is a big change that will take quite some time. So maybe iv) if you'd like :-)
We'll have to see with the other maintainers of this library though
After this PR changed 78 files just to fix a type error, I'm still pretty frustrated about this issue. It would be nice to see movement toward resolving it instead of citing the "philosophy" document, whose goal seems to be producing a large quantity of broken/duplicate code.
Following the Philosophy, it has been decided to keep different pipelines for Stable Diffusion for txt-to-img, img-to-img and inpainting.
Here is the result:
It's getting even worse! Now with onnx and the obvious request to have img-to-img and inpainting pipelines with onnx (#510), that would mean that every modification will need to be duplicated 6 times in total, with 6 times the number of tests.
This is untenable...