huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[Design discussion] Diffusers schedulers/samplers & 2nd Order Schedulers #1308

Closed. patrickvonplaten closed this issue 1 year ago.

patrickvonplaten commented 1 year ago

This issue is just used to collect PRs, issues, progress on 2nd order schedulers for Stable Diffusion as the community has asked a lot about this. We'll try to allocate time this or next week to think about a fitting API and will propose it here!

Currently, I'm most excited about the idea of "lying" about the timesteps, in the sense described here: https://github.com/huggingface/diffusers/pull/636#issuecomment-1289654426 by hafriedlander, which could potentially require zero code changes to the stable diffusion pipeline. We'll have to evaluate how feasible this design is, though (I think it should be possible this way!).


keturn commented 1 year ago

Currently, I'm most excited about the idea of "lying" about the timesteps, in the sense described here:

My reaction to this is more "afraid" than "excited"!

Yes, it's an example of creative problem-solving to get around an external constraint. We could make it work. But how much easier would it be to read and document and test if we didn't have that constraint?

You own the API here, the project is still young, nothing is set in stone. I'll stick my neck out here and say it's much easier to design and maintain things that aren't built on a web of lies.

Let's back up a step here: hafriedlander's suggestion was in response to your statement

Regarding wrapping the model and the sampler, we have quite a [strong] requirement to not pass the model into a scheduler function

Where does that requirement come from?

More broadly: Why not use the design used by k-diffusion (and the paper that inspired it, https://github.com/nvlabs/edm ), passing the model (or at least its forward function) to a sampler?

The great insight of that paper -- that we can decouple the solver, schedule, and discretization methods into independent components -- is a great boon for the application of diffusion models.

I don't think we can properly take advantage of that if the solver can't make its own decisions about how to evaluate the model function, or if the schedule has to "lie" for certain solvers. That collusion breaks the nice decoupling offered by the design.

patrickvonplaten commented 1 year ago

diffusers is built on the premise that models and schedulers should not be passed into each other. It makes more sense this way: the model is responsible for predicting epsilon (eps = unet(sample, t)), while the scheduler is not a neural network - it's just an "algorithm" responsible for computing prev_sample from eps and sample. The scheduler doesn't need to know anything about the model.
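
In code, the contract looks roughly like this (a minimal, illustrative sketch of the usual pipeline loop, not the exact code of any one pipeline):

    import torch
    from diffusers import DDIMScheduler, UNet2DModel

    # Schematic denoising loop: the model only predicts noise; the scheduler
    # only turns that prediction into the previous sample. Neither object ever
    # holds a reference to the other.
    unet = UNet2DModel(                      # tiny untrained model, just for shapes
        sample_size=32, block_out_channels=(32, 64),
        down_block_types=("DownBlock2D", "DownBlock2D"),
        up_block_types=("UpBlock2D", "UpBlock2D"),
    )
    scheduler = DDIMScheduler()
    scheduler.set_timesteps(50)

    sample = torch.randn(1, unet.config.in_channels, 32, 32)
    with torch.no_grad():
        for t in scheduler.timesteps:
            eps = unet(sample, t).sample                          # model: predict epsilon
            sample = scheduler.step(eps, t, sample).prev_sample   # scheduler: pure algorithm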

More practically:

Hope this explains this a bit :-)

This design has so far been the most positive feedback we have received, we feel very strongly about it, so this won't change.

I think we have two options here:

  1. We stretch the time steps, to essentially just do:
    for t in [50, 50, 49, 49, 48, 48, ...]:
        epsilon = model(sample, t)
        sample = scheduler(epsilon, sample, t)

Advantage: no changes needed to pipelines such as Stable Diffusion.
Disadvantage: the reader doesn't see that we're working with a 2nd order scheduler here.
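
For illustration, the "stretching" in option 1 could be produced on the scheduler side along these lines (my sketch of the idea, not the implementation that later landed in #1336):

    import numpy as np

    # Sketch of option 1: every timestep except the final one appears twice, so
    # an unmodified pipeline loop ends up calling model + scheduler twice per
    # "real" 2nd order step. The scheduler's step() would internally alternate
    # between the first and second evaluation of the update.
    def stretched_timesteps(num_inference_steps: int) -> np.ndarray:
        base = np.arange(num_inference_steps - 1, -1, -1)  # e.g. [49, 48, ..., 0]
        return np.repeat(base, 2)[:-1]                     # [49, 49, 48, 48, ..., 0]

    print(stretched_timesteps(4))  # -> [3 3 2 2 1 1 0]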

  2. We don't stretch the time steps:

    for t in [50, 49, 48, ...]:
        epsilon = model(sample, t)
        sample = scheduler(epsilon, sample, t)

        if scheduler.order > 1:
            epsilon_2 = model(sample, t)
            sample = scheduler(epsilon_2, epsilon, sample)

Also cc @pcuenca @patil-suraj @anton-l - what do you think? I agree with @keturn that we don't want to build something on "lies", so maybe option 2 is better here?

keturn commented 1 year ago

the scheduler is not a neural network, it's just an "algorithm" responsible for computing prev_sample from eps and sample.

But what we've learned is that a single epsilon-prediction is not the most accurate or compute-efficient way to determine that prev_sample. More inputs can lead to better results. This seems well-supported in the literature now, from DEIS to EDM to DPM-Solver, and borne out in practice with the recent success of DPM-Solver++ in Stable Diffusion image synthesis applications.

if scheduler.order > 1:
    epsilon_2 = model(sample, t)

Not quite. The second evaluation is at some t_prime. For the algorithms covered in EDM, I think that might be as straightforward as the next value from timesteps, but I also think @LuChengTHU has demonstrated the advantage of having a third value somewhere between those.
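
For reference, Heun's method (Algorithm 1 in the EDM paper) makes that second evaluation at the next timestep; here's a minimal sketch on a toy ODE, independent of diffusers:

    # Heun's 2nd order method for a generic ODE x' = f(x, t). Note that the
    # second function evaluation happens at t_next (the "t_prime" above), not at t.
    def heun_step(f, x, t, t_next):
        d1 = f(x, t)                       # derivative estimate at t
        x_euler = x + (t_next - t) * d1    # Euler predictor
        d2 = f(x_euler, t_next)            # second evaluation, at t_next
        return x + (t_next - t) * 0.5 * (d1 + d2)  # trapezoidal correction

    # Example: integrate x' = -x from t=0 to t=1 in 10 steps.
    x, ts = 1.0, [i / 10 for i in range(11)]
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = heun_step(lambda x, t: -x, x, t, t_next)
    print(x)  # ~0.3685, close to exp(-1) ~= 0.3679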

keturn commented 1 year ago

We don't want to pass functions into other functions (hard to understand, harder to debug)

...wait what? transformers and diffusers have both very heavily adopted a convention of pipelines and models and other things being __call__able. Things get passed around to be called as functions all the time.

[later]

okay, I grant that yes, "function that takes tensors and returns tensors" is easier to deal with than any sort of "function that takes function."

but I also think function-that-takes-function is kinda integral to how higher-order solvers work? like, by definition, even.
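
A sketch of that contract in the k-diffusion style (illustrative, not k-diffusion's actual code): the sampler receives the model as a callable and decides for itself when and where to evaluate it:

    from typing import Callable

    # Euler sampler in the function-that-takes-function style: the sampler owns
    # the loop, so a higher-order variant could simply call model() a second
    # time per step wherever the solver needs it.
    def sample_euler(model: Callable, x, sigmas):
        for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
            denoised = model(x, sigma)          # sampler decides when to evaluate
            d = (x - denoised) / sigma          # derivative estimate from predicted x0
            x = x + (sigma_next - sigma) * d    # Euler step toward sigma_next
        return x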

patrickvonplaten commented 1 year ago

I just noticed that I wasn't super welcoming in my message above :sweat_smile: - sorry! I really value your opinion - you've already clearly shaped the design of the library for the better twice, and I should listen more :-)

Also, it's a very important topic & a fundamental design decision of the library, so it's good to discuss this from all angles. Thanks for linking the papers - those are also the ones we're looking at quite a bit as SOTA schedulers/samplers at the moment.

I thought about it again (trying to be objective) and skimmed through the algorithm sections of the papers again, but it's still very clear to me that the current design we have is the right one.

Just to throw in some more IMO important arguments (in addition to the ones we made above):

Library design for everybody

We don't want everybody that uses diffusers to have to understand the scheduler code (which order it is, etc...). Someone getting to know the library bit by bit, who looks at it more from an engineering point of view (memory, speed, function vs. nn.Module vs. tensor, and doesn't read papers), will always prefer the design we currently have, in my opinion.

For me it's really hard to imagine how someone who doesn't want to understand ODEs, but just wants to quickly iterate on different things in diffusers, would prefer any other design.

People coming from the ODE world

Now for someone who comes from the ODE world, I can see how the current design is a bit forced! You can indeed frame every scheduler as an ODE (like Karras et al. did), and then it makes mathematical sense to just say "OK, let's define an f(x)" and build different ODE solvers that take that function and solve the ODE. This makes sense from a mathematical point of view, but not necessarily from a design/UX perspective.

To me, ODEs are one way of framing the schedulers; the other way is to explain them like the DDIM paper does in the beginning, i.e. framing them as a generative process. (Note that DDIM also makes the comparison to ODEs later in the paper.) While the ODE framing is maybe more beautiful mathematically, decomposing the design as a generative process is also fine. Current SOTA schedulers are indeed pretty much always framed as ODEs, but we cannot be sure that this will stay that way IMO. New, different kinds of diffusion models with different kinds of schedulers might come out that have a different framing => the current design is more flexible to support this.

People coming from the transformers world

People coming from the transformers GPT-1/2/3 world know auto-regressive generation very well. If you know auto-regressive generation, you will be very happy with the current design: it makes diffusion models super easy to understand by analogy to auto-regressive models, also just from a mental-effort point of view. If you know auto-regressive generation code, this code will be very familiar to you. I'd also argue that this design is more flexible for future, not-yet-known model + scheduler designs.

Just general UX for a user of diffusers

Just from a practical point of view, let's say we go the ODE way. Let's look into the implications.

Other remarks

One point where I think we agree: as a researcher who has just written a DPM paper, it's tedious to have to force your scheduler's ODE design into the diffusers design, but IMO that's the price to pay to make the code readable. I often start from the assumption that 1 person will write the code, 100 people will debug and tweak it, and 10,000 people will skim and read it. For a long-term library, it's OK to make writing the code a bit tedious (that affects 1 person). For both the people tweaking and the people reading the code, the current design is better IMO.

Also, I just went through the DPM scheduler code that was integrated recently (here) and it's really readable. It's not like we have forced a completely unfitting design onto it. DPM is a higher-order scheduler and fits very nicely, so currently I really don't see the limitation of the current design.

Thanks a lot for starting this discussion - it's a really important one & I think it'll help people in the future understand why the library has the design it has - but I'm still very much against changing course.

hafriedlander commented 1 year ago

Passing the unet into the Diffusers Scheduler solves the specific problem of DPM2 & Heun not working properly, but it leaves other problems.

I'm currently working on expanding my UnifiedPipeline (which has been based purely on Diffusers) to work with both Diffusers Schedulers and k-diffusion samplers.

There are a few differences between the two:

Of the three, the first is the most fundamental difference. k-diffusion samplers make it trivial to build chains of nested models that wrap the unet. That's because the predicted X0 is a very useful tensor for adjusting the way SD runs (much more useful than epsilon or Xt).

With Diffusers Schedulers, you can only either get the epsilon, or the (noisy) Xt.
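
(For reference, the predicted X0 can be recovered from epsilon and the noisy Xt with the standard DDPM parameterization; a sketch, where alpha_prod_t is the cumulative alpha product the schedulers already store:)

    import torch

    # x_t = sqrt(alpha_prod_t) * x0 + sqrt(1 - alpha_prod_t) * eps, solved for x0.
    def predicted_x0(x_t: torch.Tensor, eps: torch.Tensor, alpha_prod_t: float) -> torch.Tensor:
        return (x_t - (1 - alpha_prod_t) ** 0.5 * eps) / alpha_prod_t ** 0.5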

That means I'm going to end up with one difference I can't fix (without changes to the Schedulers): for my Diffusers DPM2 and Heun implementations (which I've changed to accept a unet as an argument to step), I can't do some things on every unet call - they need Xt, and Xt isn't available until the whole Scheduler step has run.

I think the lying-about-timesteps approach is a bit ugly. If we called them something other than timesteps, that might help, but it's still pretty ugly. If, instead of accessing the timesteps property, we added a generator to the scheduler, that might help make it cleaner. But because it can expose Xt on every unet call, and not only once per Scheduler step, I still prefer it to passing the unet in.

pcuenca commented 1 year ago

I do appreciate abstracting away the complexities of ODE/SDE solving, and wouldn't be surprised if that was not the only paradigm that existed. Things like cold diffusion show that mathematical beauty is nice and convenient, but we are allowed to think in broader terms.

I was going to raise the issue of predict_epsilon too. To me those are the two points of contact between models and schedulers:

Personally, I liked the recent implementation of DPM-Solver++ a lot, but I understand that approach may not be feasible in all cases. For things such as Heun, I don't dislike the loop that was outlined a few comments above (repeatedly calling the scheduler as many times as its order). If we want to generalize it more, I suppose we could return a tuple (sample, t) from the scheduler and have the scheduler store the state it needs and control how t progresses. Something like this:

    t = 50
    while t > 0:
        epsilon = model(sample, t)
        sample, t = scheduler(epsilon, sample, t)

Regarding what the model predicts, I think that's a hard problem too. k-diffusion made the choice to always use x0, and diffusers made the choice to assume epsilon, but neither works in general because there are models that predict all sorts of things. Passing the model to the scheduler doesn't solve that problem, as @hafriedlander pointed out. I'd like to consider what alternatives we have in that regard because it may affect the interface between models and schedulers. As @patrickvonplaten said, it'd be ugly to have a lot of if statements inside each one of the schedulers, but that's essentially what diffusers' predict_epsilon boolean is doing, isn't it? What other alternatives do we have?
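
One hypothetical alternative (a sketch of the idea, not an existing diffusers API): generalize the boolean into a string-valued prediction type and normalize every model output to a common quantity once, at the top of step():

    # Hypothetical sketch: a prediction_type string instead of a predict_epsilon
    # boolean. Every output convention is converted to a predicted x0 in one
    # place, so the rest of step() never branches on what the model predicts.
    def to_predicted_x0(model_output, sample, alpha_prod_t, prediction_type="epsilon"):
        if prediction_type == "epsilon":   # model predicts the added noise
            return (sample - (1 - alpha_prod_t) ** 0.5 * model_output) / alpha_prod_t ** 0.5
        if prediction_type == "sample":    # model directly predicts x0
            return model_output
        raise ValueError(f"unknown prediction_type: {prediction_type}")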

patrickvonplaten commented 1 year ago

Here a design for the Heun scheduler without changing anything in the pipeline: https://github.com/huggingface/diffusers/pull/1336

I honestly think it's not that bad, and probably better than adding the loop-inside-a-loop logic. Very curious to hear feedback.

hafriedlander commented 1 year ago

Comments here or in the PR, @patrickvonplaten?

I'm pretty sure this specific implementation will break the img2img pipeline and other pipelines that index into the timesteps list, because they assume len(timesteps) == num_inference_steps. It's an easy fix (use len(timesteps) instead of num_inference_steps to calculate the start offset), but it does mean "no changes to pipelines" isn't true.
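
The fix would look something like this (a sketch with illustrative names, not the actual pipeline code):

    # Derive the img2img start offset from the real length of scheduler.timesteps,
    # which for 2nd order schedulers is longer than num_inference_steps because
    # of the duplicated entries.
    def img2img_timesteps(scheduler, strength: float):
        n = len(scheduler.timesteps)             # not num_inference_steps!
        t_start = max(n - int(n * strength), 0)
        return scheduler.timesteps[t_start:]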

I don't mind it personally (I know others will disagree). I do think an iterable or generator would be cleaner. Either would require a change to img2img, since they don't allow slice indexing, but I suspect needing to change img2img is unavoidable with anything that touches timesteps. They would potentially also allow adaptive samplers. The downside is slightly more magic, plus probably a small performance cost.

keturn commented 1 year ago

I spent yesterday afternoon experimenting with refactoring the NVlabs implementation of EDM to see what Python interfaces would fall out of it. It splits apart into a number of pieces in a way that lends itself fairly well to documenting the individual pieces -- e.g. "here's the Variance Exploding noise schedule, and here's the Variance Preserving version" -- but when it comes to understanding the result as a whole, many of your criticisms (nested function factories, etc.) absolutely do apply.

If I open it back up again, I'll probably discover I re-implemented a cruder version of k-diffusion. And I think I left a class named Norf.

I'm glad you've made that example of a Heun scheduler for us to look at. I often find design discussions go in circles without some concrete examples. I look forward to reading that soon.

Library design for everybody

This reminded me of something: Do you have user personas for these libraries?

I feel like the places where we clash come from having distinctly different backgrounds and concerns than some other segments of your audience.

keturn commented 1 year ago

After reading your HeunDiscreteScheduler with the benefit of having looked at another implementation of the same thing so recently, I think I've caught a few bugs in my understanding and am starting to come around.

keturn's bad assumptions

I still have some concerns about the pattern laid out by the HeunDiscreteScheduler implementation. Some of them are fixable, others might still come from wanting different trade-offs. But in any case, I'm no longer in the "omgwtfbbq this is fundamentally impossible with this API" position.

(Did someone try to correct me on that earlier? I've been quite emphatic about it for the last two months. That's a tad embarrassing.)

patrickvonplaten commented 1 year ago

Here's another implementation, from @pcuenca: #1356

Both #1356 and #1336 are now functional and should work.

Happy to discuss which design fits better (now & in near / mid-term future)

cc @patil-suraj @pcuenca @anton-l

(maybe also cc @keturn )

keturn commented 1 year ago

I like the design in #1356. Having some of my ideas about the scheduler shaken up (duplicate timestamps???) got me thinking: What is returned by Scheduler.step()?

It returns the next arguments to be evaluated by the model.

That's it, nothing more. The "steps" don't have to go in a straight line. We don't have to overload it with any more meaning than that.

This return value:

latents, t = self.scheduler.step(noise_pred, t, latents, return_dict=False, **extra_step_kwargs)

Makes that super clear.

It also removes the complication of having part of the loop's state live in an iterator over exposed timesteps that the pipeline owns, while other state is internal to the scheduler. #1336 requires those two things to stay in lockstep, but that's a difficult constraint to enforce and cognitive overhead I would rather do without.

I'm also happy to see it let go of the pipeline requiring a fixed-length list of timesteps. Sure, there will always be cases where you'd really like to know that up-front, but it was never a technical requirement for the process and an unnecessary constraint on scheduler design.

I think this design will also bring the PyTorch and Jax implementations closer together.

If this were a language that encouraged more use of closures -- which is, um, I think most things except Python at this point -- and we wanted to do this thing where the scheduler provides the arguments to the model, but the body of the loop lexically stays inline, we'd do something like

self.scheduler(initial_latents, num_inference_steps, **extra_step_args) { (latents, t) -> 
    pred_noise = model(latents, t, embeddings)
    // ...guidance and stuff...
    return pred_noise, latents
}

but I guess since Python has held firm in its distaste for multi-statement anonymous functions, we won't do that, and the inner loop will only get broken out into a named function when something like Jax demands it.
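
(In Python, the same hypothetical inversion of control would have to hoist the loop body into a named function, something like:)

    # Python rendition of the hypothetical closure-based API sketched above:
    # the loop body becomes a named callback that the scheduler drives with
    # (latents, t) pairs.
    def loop_body(latents, t):
        pred_noise = model(latents, t, embeddings)
        # ...guidance and stuff...
        return pred_noise, latents

    latents = scheduler(initial_latents, num_inference_steps, loop_body, **extra_step_args)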

alexjc commented 1 year ago

I've been using diffusers to experiment with various forms of texture synthesis, in particular trying to render at up to 4K by porting the best techniques from non-diffusion technology (see https://github.com/texturedesign/texturize). I like that in diffusers there's a bunch of blocks (models, schedulers) that I can reuse and play around with. Now I'm resorting to writing my own pipeline, because it's not much code (about 3 lines) and what's there didn't work for me.

With that as context, I believe you're trying to design & freeze an API that should not be frozen yet. If you do, it will become a historical landmark, a legacy of 2022 technology that is quickly forgotten, like StyleGANs have been.

Why do I think it's potentially premature?

As for my recommendation...

Think of diffusers as a library of modular blocks that can incrementally improve over time, alongside a set of interface specifications to connect these blocks together, with nice error messages when the blocks don't fit. If a scheduler has a new API, embrace it rather than forcing it into a "standard" so early - just write the corresponding pipeline that best fits it. This will make the library more likely to withstand the fast pace of innovation and to support a wide variety of data formats.

Discussions that revolve around 3 lines of code in a pipeline feel like the wrong place to focus to me; I threw those lines out after a few days of using diffusers (I'm still using the other blocks and the infrastructure) - and IMHO that's a good thing that should be encouraged.

(Side note: I don't see the library as beginner-friendly or helpful as an introduction. It's nice and modular, but it's a bit too enterprisey for a good introduction or tutorial. IMHO, an implementation that fits in a single file does that better.)

I hope that helps!

LuChengTHU commented 1 year ago

Hi guys,

Hope you are doing well. Happy to know that you all approve of my implementation of multi-step DPM-Solver & DPM-Solver++. I'm also thinking about the implementation of single-step high-order schedulers (such as single-step DPM-Solver) these days. Actually, the difference between single-step and multi-step solvers is not that big, and I think we can unify them!

Theoretically speaking, high-order solvers aim to approximate the high-order derivatives in the Taylor expansion of an unknown function (e.g., the noise prediction model in diffusion models). The difference between single-step and multi-step solvers is very simple:

  1. Single-step solvers approximate the derivatives by some "intermediate" points. e.g., 2nd order solvers need to compute an additional value at each "step".

  2. Multi-step solvers approximate the derivatives by previously computed values. e.g., 2nd order solvers need to use the model output at the previous "step".

However, the "step" meaning in Diffusers is different from that in the ODE literature. In Diffusers, we take the "step" as "a single function evaluation". i.e., total "steps" == total number of function evaluations. At each "step" in Diffusers:

This concept is true for both single-step and multi-step solvers. The only differences are:

  1. How many previously computed outputs we used at time t.
  2. The starting time s at each ODE-literature "step". i.e., we compute x_t from time s to time t.

For example, if we want to use 5 function evaluations for sampling and we have initialized a timestep list:

timesteps = [t1, t2, t3, t4, t5]

We can further initialize a list that contains the number of previously computed outputs we used. Specifically,

For multi-step 2nd order solver (such as 2nd order DPM-Solver++), the order list is:

orders = [1, 2, 2, 2, 2]

where we use a 1st order solver to initialize and the 2nd order solver at the remaining steps.

Instead, for single-step 2nd order solver, the list is:

orders = [1, 2, 1, 2, 1]

and you can consider it as [2, 2, 1]-order at each "step" in the ODE-literature sense, i.e., we take two steps of the 2nd order solver (which use 4 function evaluations) and one step of the 1st order solver (which uses 1 function evaluation).

Therefore, I think the following implementation can unify the single-step and multi-step solvers:

  1. Maintain a "state" variable that contains the previously computed model outputs and sample outputs, as in https://github.com/huggingface/diffusers/blob/6b02323a602a66841729c3a5d60844b24aa81ff2/src/diffusers/schedulers/scheduling_dpmsolver_multistep.py#L204-L206:
self.sample_outputs = [ ... ]
self.model_outputs = [ ... ]

where the length is the order of the solver, e.g., for 2nd order solvers it is something like

self.sample_outputs = [sample1, sample2]
self.model_outputs = [outputs1, outputs2]

where sample_outputs contains the previously computed x_t, and model_outputs contains the previously computed model(x_t, t) at previous times t.

  2. Initialize an order list:

    orders = [1, 2, 2, 2, 2] # For multi-step
    orders = [1, 2, 2, 2, 1] # For multi-step with "lower_order_final=True", such as my implementation in multi-step DPM-Solver
    orders = [1, 2, 1, 2, 1] # For single-step
  3. At each step, given an input x and model_output, we compute the x_t at time t = timesteps[step] by:
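
A rough sketch of how such a unified step could look (my illustration of the idea, not actual scheduler code; first_order_update and higher_order_update stand in for the concrete update equations):

    # The scheduler buffers the last `order` model outputs and samples, and the
    # per-step order comes from the precomputed orders list.
    def unified_step(scheduler, model_output, sample, step):
        # shift in the newest values, keeping at most `scheduler.order` of each
        scheduler.model_outputs = (scheduler.model_outputs + [model_output])[-scheduler.order:]
        scheduler.sample_outputs = (scheduler.sample_outputs + [sample])[-scheduler.order:]

        order = scheduler.orders[step]
        if order == 1:
            return first_order_update(scheduler, step)      # e.g. a DDIM-style step
        return higher_order_update(scheduler, step, order)  # uses the buffered values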

What do you think of the above implementation? If you think this design is good for Diffusers, I can help create a PR implementing single-step DPM-Solver (without changing anything in the pipelines).

LuChengTHU commented 1 year ago

Hi guys @keturn @hafriedlander @patrickvonplaten @pcuenca ,

I've implemented a pytorch version of single-step DPM-Solver in this PR: https://github.com/huggingface/diffusers/pull/1442

Is it good for you? Looking forward to further discussions!

LuChengTHU commented 1 year ago

In addition, I think we may even be able to unify the implementations of all the schedulers, because it seems that they can all be understood as the above process, and the API can be the same as the JAX version:

x, states = self.scheduler.step(model_output, x, t, states=states, return_dict=False, **extra_step_kwargs)

where states is a dict containing the intermediate values of the previous model_output and x, with a length equal to the order.

The only differences between schedulers are:

  1. The timesteps.
  2. The order_list.
  3. The updating equation at each step with the order order_list[step] and the intermediate values at state.

Moreover, such a design can also easily implement schedulers that need method combinations (such as PNDM, which combines single-step and multi-step methods), e.g., we can let the order list be

order_list = [1, 2, 3, 1, 2, 3, 3, 3, 3, 3]
method_list = ["3S", "3S", "3M", "3M", "3M", "3M"]

where order_list controls the number of intermediate values used at each step, and method_list controls the starting and ending times of each ODE-literature "step". It seems that all the previous deterministic schedulers can be abstracted into such a unified formulation.

Therefore, I'm also curious whether we should reimplement all the previous deterministic schedulers in this formulation to make the code more readable. What do you think?

patrickvonplaten commented 1 year ago

Hey @alexjc,

Thanks a lot for the feedback - so in more practical terms, your advice would be to just write new pipelines for different schedulers?

patrickvonplaten commented 1 year ago

Hi guys @keturn @hafriedlander @patrickvonplaten @pcuenca ,

I've implemented a pytorch version of single-step DPM-Solver in this PR: #1442

Is it good for you? Looking forward to further discussions!

The PR is very nice!

alexjc commented 1 year ago

@patrickvonplaten Yes, go broader rather than deep. Stay flexible more than specialized.

I noticed yesterday that k-diffusion has wrappers for many libraries, but not for diffusers. If I were you, I'd look at why that's the case and what could have made diffusers easier to wrap. https://github.com/crowsonkb/k-diffusion

Supporting the innovations, and having the flexibility to keep doing that as the technology evolves, is key IMHO!

LuChengTHU commented 1 year ago

I think one of the reasons is that k-diffusion uses a clean and easy wrapper for the whole solver. Maybe we can also wrap the schedulers into one file by using the above abstraction? E.g., we could provide a simple "sampling" pipeline built on the scheduler, and users could import that sampling pipeline to sample from their own diffusion models.

patrickvonplaten commented 1 year ago

Thanks for all the feedback!

For now, we would like to continue with the design we have by going with #1336. #1356 is also a nice design, but it has similar problems to #1336 and, in our opinion, doesn't offer enough advantages to justify changing all the pipeline code.

However, we'd really like to continue the discussion here and potentially go for a major re-design if we have a clearly better design idea than we have now.

patrickvonplaten commented 1 year ago

@alexjc note that we have wrapped k-diffusion in diffusers here: https://github.com/huggingface/diffusers/blob/main/examples/community/sd_text2img_k_diffusion.py

So this is essentially how it would look.

To me it all still boils down to the question that @keturn asked above:

My answer would still be: https://github.com/huggingface/diffusers/issues/1308#issuecomment-1318361179

So far I think diffusers stays pretty flexible by:

To me, wrapping functions into more functions and abstracting classes into more classes (which is done here IMO: https://github.com/crowsonkb/k-diffusion/commit/4314f9101a2f3bd7f11ba4290d2a7e2e64b4ceea) is exactly what makes code hard to understand and debug.

Overall, I see the disadvantages of the current design, but I simply don't have an idea for a better one. So if more schedulers/samplers come in and we see that they clearly break the current API, I'm more than happy to make a radical change.

Also one analogy I like is the way models and optimizers interact in PyTorch: https://pytorch.org/docs/stable/_modules/torch/optim/adamw.html#AdamW

Note that PyTorch optimizers do not take the model class or the model's forward function as inputs and then just do adam_w.solve() - instead, they store the model's parameters as inner state, leave the training loop to the user, and are driven via optimizer.step(), optimizer.zero_grad(), ... => PyTorch's optimizers essentially take the model's gradients as input to the step function and update the weights. In reality, the weights are referenced by both the model and the optimizer and improved in alternation. FLAX/JAX optimizers literally take the model weights (and gradients) as inputs and return the updated parameters.

The functional approach here would be to have an adam_w.solve(...) method that takes the model function as an input. If you look at the Adam paper (https://arxiv.org/pdf/1412.6980.pdf, Algorithm 1), that actually looks like it would make sense - but in coding practice it would have made training loops much more difficult to read and write.
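
The analogy in code (a minimal sketch; the loop is the real PyTorch convention, adam_w.solve() is the hypothetical functional alternative):

    import torch

    # PyTorch convention: the optimizer never receives the model's forward
    # function. The user owns the loop and drives it with step()/zero_grad().
    model = torch.nn.Linear(4, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for _ in range(3):
        loss = model(torch.randn(8, 4)).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # The functional alternative would be a hypothetical adam_w.solve(model_fn, ...)
    # that hides this loop: closer to Algorithm 1 of the Adam paper on paper,
    # but harder to tweak in practice.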

I know the schedulers/samplers aren't exactly the same, but there are strong similarities IMO and maybe this can give a new angle to view the problem.

Overall, I'm quite certain that we'll have to make some major API changes soon, but at the moment I don't see a clear need to do so.

alexjc commented 1 year ago

@patrickvonplaten Lots of things to think about, thanks for the thoughtful reply!

I'm going to try another prototype this week that pushes the limits in a different way, and see what comes out of it... I'll report back if I learn something new there.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.