I agree this would be helpful.
Cool idea @keturn indeed! I think we had similar ideas for transformers pipelines. @Narsil do you have some ideas here maybe? :-)
Also cc @anton-l @patil-suraj
cc @anton-l @patil-suraj again
@keturn - do you by any chance already have a design in mind that we could use to enable asynchronous pipelines?
Not completely.
My primary familiarity with asynchronous APIs in Python is through Twisted, and I get the feeling that is not a common API among your target audience in this day and age.
I imagine you want something framework-agnostic, but with an eye toward being convenient in Jupyter notebooks, so I'd look to see if there's established precedent for successful async APIs in notebooks.
My searches for async + PyTorch didn't find much, but there is at least one good example of an async PyTorch pipeline. @lantiga looks like a very busy guy these days, but I bet he could offer some valuable guidance here.
I also heard that your peers at @gradio-app have just released a version that does live updates for iterative outputs, so they might have some ideas too.
We might have an issue with returning intermediate results from StableDiffusionPipeline due to the need to run a safety checker, unfortunately.
But I'm open to trying at least a generator design (yield-based) for other pipelines! Not sure we'd be able to go fully async though, but maybe @Narsil has a better intuition for torch-based pipelines.
IMO we can be a bit more lenient here with the safety checker since it's intermediate results (also cc @natolambert).
However, I think such a pipeline would be best implemented as a community pipeline and not into the official pipeline :-)
Cancelable and asynchronous are not linked (as far as I understand) to generator and yielding.
Cancelable and asynchronous: Pytorch itself is not async, so diffusers or transformers for that matter shouldn't be async aware, and it's up to the user to figure out how to integrate best into an async environment.
Generator and yielding: This is more like DataLoader within the Pytorch ecosystem. This definitely belongs in this library IMO, mostly to make sure to remove any CPU computation when the model is on GPU so that the GPU can stay as busy as possible. Last I checked, the current library and code (at least for stable diffusion) was already doing a pretty good job at it (mostly because the model is really GPU intensive, but still, it works!). Another benefit of using generators here is when batching might be needed: https://huggingface.co/docs/transformers/v4.22.1/en/main_classes/pipelines#pipeline-batching
Feel free to correct me if I misunderstood anything !
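To make the batching point above concrete, this is roughly the generator-driven pattern described in the linked transformers docs; a sketch with a placeholder task and dummy inputs:
from transformers import pipeline

# Placeholder task/model; any transformers pipeline streams the same way.
pipe = pipeline("text-classification")

def data():
    # Inputs are produced lazily; nothing is materialized up front.
    for i in range(1000):
        yield f"example input {i}"

# Passing a generator lets the pipeline batch inputs (batch_size=8 here) and
# return results as an iterator, so the accelerator stays busy while Python
# consumes one output at a time.
for out in pipe(data(), batch_size=8):
    print(out)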
You are correct that we've muddied the discussion here with several different concepts with their own concerns.
The original request was for cancellation support: the ability to interrupt work on a task that's in-progress, allowing the process to move on to other work. Use cases might be things like:
async comes up because that's an interface that represents "task in progress," and async interfaces often provide a cancel method.
A generator API is different, though it's somewhat mixed together in Python's history of using yield for both its generators and coroutines. A generator would give us some of what we're looking for from cancellation, because a generator doesn't so much defer the allocation of memory — it's more a deferral of work.
That is, if you decide you are no longer interested in the results of a generator, you "cancel" it by simply not asking it for any more results. (and you might want generators anyway, for gradio reasons.)
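As a sketch of that idea, with stand-in helpers rather than real diffusers code:
import random

# Stand-ins for a real diffusion loop; these are not diffusers APIs.
def init_latents(prompt):
    return [random.random() for _ in range(4)]

def denoise_step(latents, i):
    return [x * 0.9 for x in latents]

def generate(prompt, num_steps=50):
    """Yield (step_index, latents) after every denoising step."""
    latents = init_latents(prompt)
    for i in range(num_steps):
        latents = denoise_step(latents, i)
        yield i, latents

# "Cancelling" is simply not asking for more results: once the caller stops
# iterating, no further steps ever run.
for i, latents in generate("an astronaut riding a horse"):
    if i >= 10:  # pretend the user clicked "cancel" after step 10
        break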
Pytorch itself is not async, so […] diffusers or transformers for that matter shouldn't be async aware and it's up to the user to figure out how to integrate best into an async environment.
As I understand it, CUDA is asynchronous (as are most calls to graphics drivers), but as you say, PyTorch is not.
IMHO nearly all libraries should be async, because APIs that fail to reflect the fact that they're waiting on some external event inevitably lead to programs that block far more than they need to. But that's a different rant. We're not going to rewrite PyTorch today. And PyTorch is at least kind enough to release the GIL, which allows us to run it in a thread efficiently.
So what does all that mean for a Python API? Again, this is where my Python knowledge is at least six versions out of date, leaving me feeling very rusty for designing an API going into 2023.
It might be that this doesn't call for an API change after all, and it's better addressed by documentation: a short example that shows how to dispatch a call to a pipeline to a thread and optionally cancel it before it completes, a test or two to ensure that cancelling a coroutine in that manner does actually prevent the pipeline from running its remaining steps, and some documentation on which things are not thread-safe or otherwise unsafe to use from parallel coroutines.
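For the thread-dispatch part, a minimal sketch assuming an already-loaded pipe (a StableDiffusionPipeline or similar) and Python 3.9+ for asyncio.to_thread:
import asyncio

async def generate(pipe, prompt):
    # pipe(prompt) is blocking, so run it in a worker thread; because PyTorch
    # releases the GIL, the event loop stays responsive while the GPU works.
    result = await asyncio.to_thread(pipe, prompt)
    return result.images[0]  # assuming a StableDiffusionPipeline-style output

async def main(pipe):
    task = asyncio.create_task(generate(pipe, "a watercolor fox"))
    # ... the event loop is free to serve other coroutines here ...
    image = await task
Note that cancelling the awaiting task does not interrupt work already running in the thread; actually stopping the remaining steps still needs a hook inside the pipeline, such as the callback-based approaches discussed later in this thread.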
doesn't so much defer the allocation of memory — it's more a deferral of work.
You are entirely correct! In my day-to-day I tend to see memory allocations being the issue more often than not, but you're right that it's lazy work in general.
PyTorch is not.
It's more that some calls within it (like printing, fetching values, or moving data from devices) ARE blocking, because they need to wait for the GPU to finish before delivering the final data.
IMHO nearly all libraries should be async
I tend to err on the other side, that NO library should be async and that ordering work should be left to the kernel, but I think we would agree that the function "color" problem is the biggest issue. (And realistically, async vs. not async is here to stay no matter our opinions.)
a short example that shows how to dispatch a call to a pipeline to a thread and optionally cancel it before it completes, a test or two to ensure that cancelling a coroutine in that manner does actually prevent the pipeline from running its remaining steps, and some documentation on which things are not thread-safe or otherwise unsafe to use from parallel coroutines.
Not a maintainer of this lib, so take my opinion with a grain of salt, but I would refrain from doing that here.
Parallelism, async, threading, multiprocessing are choices, and the best solution will probably be different for different use cases. Do you want to maximize CPU usage and hence use all cores? Do you want to minimize latency in a webserver context (hence cancellable work, with CPU parallelism used for the computations themselves)? Are you using multiple nodes to do some job on GPU? Do you want 25% of your cores doing video processing and only 75% doing model inference? How does that play with your GPU usage?
All these are very different contexts, and pushing users in one direction is likely to mislead some into suboptimal choices.
I think it's impossible to be exhaustive on the matter, and it isn't even a library's responsibility. The best a library can realistically do is explain what it's doing in terms of parallelism so users can adjust, and always provide a way to DISABLE parallelism of any kind so that users can do the parallelization upstream themselves. One example we had to implement: https://github.com/huggingface/transformers/issues/5486
That being said, having somewhere in the docs (or just this issue) to refer to users asking for help would definitely help.
Also cc @pcuenca here since it's similar to the PR that's about to be merged: https://github.com/huggingface/diffusers/pull/521
Bump to unstale.
I believe this is a very important feature; surely we're one step closer to solving this now that pipeline callbacks are here. On that topic, does interrupting the pipeline with a callback at each step incur any slowdown?
Hey @irgolic and @keturn,
I'm not sure if we want to support that much logic in the "native" pipelines - could we maybe try to make this feature a community pipeline (see: https://github.com/huggingface/diffusers/issues/841), and if it pans out nicely we could merge it into the native pipelines as a next step?
On that topic, does interrupting the pipeline with a callback at each step incur any slowdown?
@irgolic As for the current callback implementation, strictly and technically speaking, in terms of the number of sequential CPU instructions being executed, yes. At the very least, even with an empty callback function that does nothing, some Python bytecode will still be generated and executed.
For example, as of Python 3.10, let's say we have the following code:
def main():
    pass
It would be compiled to:
1 0 LOAD_CONST 0 (<code object main at 0x5569c2b59450, file "example.py", line 1>)
2 LOAD_CONST 1 ('main')
4 MAKE_FUNCTION 0
6 STORE_NAME 0 (main)
8 LOAD_CONST 2 (None)
10 RETURN_VALUE
Disassembly of <code object main at 0x5569c2b59450, file "example.py", line 1>:
2 0 LOAD_CONST 0 (None)
2 RETURN_VALUE
Meanwhile, if we have the following code instead:
def func():
    pass

def main():
    func()
It would be compiled to:
1 0 LOAD_CONST 0 (<code object func at 0x55b5cd0e4520, file "example.py", line 1>)
2 LOAD_CONST 1 ('func')
4 MAKE_FUNCTION 0
6 STORE_NAME 0 (func)
4 8 LOAD_CONST 2 (<code object main at 0x55b5cd10bf50, file "example.py", line 4>)
10 LOAD_CONST 3 ('main')
12 MAKE_FUNCTION 0
14 STORE_NAME 1 (main)
16 LOAD_CONST 4 (None)
18 RETURN_VALUE
Disassembly of <code object func at 0x55b5cd0e4520, file "example.py", line 1>:
2 0 LOAD_CONST 0 (None)
2 RETURN_VALUE
Disassembly of <code object main at 0x55b5cd10bf50, file "example.py", line 4>:
5 0 LOAD_GLOBAL 0 (func)
2 CALL_FUNCTION 0
4 POP_TOP
6 LOAD_CONST 0 (None)
8 RETURN_VALUE
Notice that at the very least, there are additional LOAD_GLOBAL, CALL_FUNCTION, and POP_TOP instructions being executed. The Python interpreter would also still need to construct a frame object for the func function, even though it technically does nothing. Furthermore, this also doesn't fully take into account the additional JMP instructions somewhere in the resulting assembly code that result from the if conditional statements checking whether there are any defined callbacks (as in the callback implementation in #521).
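The extra instructions are easy to verify with the standard dis module, e.g.:
import dis

def func():
    pass

def main():
    func()

# Prints the bytecode for main(), including the extra LOAD_GLOBAL / call /
# POP_TOP added by the function call (exact opcode names vary across
# CPython versions).
dis.dis(main)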
Of course, with empty callback functions, this is negligible with modern CPUs that can execute billions of instructions per second. Most of the time/work would instead be spent on the Stable Diffusion model itself. However, I would not recommend putting heavy computation/workload on your callback functions as it is ultimately still a synchronous operation.
Regarding the extra logic involved in implementing this feature, I have put some of my comments here about the backward compatibility aspect that might be relevant to whoever is looking into implementing this feature. Extra care might be needed if this feature is to be implemented in the "native" pipelines. That said, I can see GUI-heavy applications greatly benefitting from asynchronous pipelines.
How about something simple like https://github.com/huggingface/diffusers/pull/1053? Alternatively, I've found the fix for my use case: raising a custom exception through the callback and catching it outside the pipe() call.
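A minimal sketch of that exception-based pattern, assuming a loaded pipe and the callback/callback_steps arguments from #521 (the Cancelled exception and the cancel flag are illustrative, not diffusers APIs):
class Cancelled(Exception):
    """Raised from the step callback to abort generation."""

cancel_requested = False  # e.g. flipped by a UI button handler

def on_step(step, timestep, latents):
    if cancel_requested:
        raise Cancelled

try:
    image = pipe("a watercolor fox", callback=on_step, callback_steps=1).images[0]
except Cancelled:
    image = None  # generation was aborted part-way through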
On the topic of asynchronous pipelines, what I'd like to see is a way to yield (asyncio.sleep(0)) at callback time. IIRC, that will implicitly allow cancelling the pipeline in an async way. When pipeline_task yields to another coroutine, and the other coroutine calls pipeline_task.cancel(), an asyncio.CancelledError will be raised the next time control returns to pipeline_task.
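A sketch of that mechanism, with a toy blocking step standing in for the real pipeline work (not diffusers code):
import asyncio
import time

def do_one_step(i):
    # Stand-in for one blocking denoising step.
    time.sleep(0.02)

async def run_pipeline(num_steps=50):
    for i in range(num_steps):
        do_one_step(i)
        # Yield control to the event loop; if cancel() has been called,
        # asyncio.CancelledError is raised right here and the remaining
        # steps never run.
        await asyncio.sleep(0)

async def main():
    pipeline_task = asyncio.create_task(run_pipeline())
    await asyncio.sleep(0.1)  # some other coroutine decides to stop it
    pipeline_task.cancel()
    try:
        await pipeline_task
    except asyncio.CancelledError:
        print("pipeline cancelled after a few steps")

asyncio.run(main())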
All of our pipelines in Dream Textures use generators, and it works well for us with cancellation and step previews: https://github.com/carson-katri/dream-textures/blob/c3f3a3780c2229eb4cce7390f193b59b0569f64a/generator_process/actions/prompt_to_image.py#L509
It would be nice to have this option upstream, but I understand it may be difficult to maintain both the callback and generator methods.
Could the callbacks help for this or is this not really sufficient? https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.callback_steps
Also cc @williamberman
This is a really interesting issue but also possibly really complicated!
I think this very quickly gets into specific details of the Python and CUDA runtimes that we can't answer questions about or plan solutions for unless we have very specific problems we're trying to solve or inference benchmarks we're trying to hit. It also pretty quickly gets into questions around standards for productionizing model serving that I'd want to get some other, more experienced people to help answer. I'd be surprised if making model execution async from a main thread were custom-handled by every model library; I'd assume there are production runtimes with standard interfaces to webserving frameworks that handle this (does ONNX handle something like this?).
It sounds to me like for basic use cases, such as accessing intermediate outputs or interrupting pipeline execution (rather naively with exceptions; I'd assume there are better ways to do so in a production environment), callbacks are sufficient.
I think an open-ended discussion about how we can assist people productionizing diffusers (more broadly than just cancelable and async pipelines) might be a good fit for a forum discussion, but the issue as it currently stands is a little too broad to know how to help :)
Going to close the discussion for now, but if anyone feels the question isn't adequately answered, feel free to re-open!
In #developers-corner we have a request for cancellation support in the pipeline API.
That could also be something to consider in the context of a more asynchronous API in general; some applications want to avoid blocking the thread if all the slow parts are on a GPU device and the CPU is available to do other work.
I don't expect this is something to fit in the next release or two, but we can plant the seed to start thinking what sort of API could provide these features.