I agree this would be helpful.
Cool idea @keturn indeed! I think we had similar ideas for transformers pipelines. @Narsil do you have some ideas here maybe? :-)
Also cc @anton-l @patil-suraj
cc @anton-l @patil-suraj again
@keturn - do you by any chance already have a design in mind that we could use to enable asynchronous pipelines?
Not completely.
My primary familiarity with asynchronous APIs in Python is through Twisted, and I get the feeling that is not a common API among your target audience in this day and age.
I imagine you want something framework-agnostic, but with an eye toward being convenient in Jupyter notebooks, so I'd look to see if there's established precedent for successful async APIs in notebooks.
My searches for async + PyTorch didn't find much, but there is at least one good example of an async PyTorch pipeline. @lantiga looks like a very busy guy these days, but I bet he could offer some valuable guidance here.
I also heard that your peers at @gradio-app have just released a version that does live updates for iterative outputs, so they might have some ideas too.
We might have an issue with returning intermediate results from StableDiffusionPipeline due to the need to run a safety checker, unfortunately.
But I'm open to trying at least a generator design (yield-based) for other pipelines! Not sure we'd be able to go fully async though, but maybe @Narsil has a better intuition for torch-based pipelines.
IMO we can be a bit more lenient here with the safety checker since it's intermediate results (also cc @natolambert).
However, I think such a pipeline would be best implemented as a community pipeline and not into the official pipeline :-)
Cancelable and asynchronous are not linked (as far as I understand) to generator and yielding.
Cancelable and asynchronous: Pytorch itself is not async, so diffusers or transformers for that matter shouldn't be async aware, and it's up to the user to figure out how to integrate best into an async environment.
Generator and yielding: This is more like DataLoader within the Pytorch ecosystem. This definitely belongs in this library IMO, mostly to make sure to remove any CPU computation when the model is on GPU so that the GPU can stay as busy as possible. Last I checked, the current library and code (at least for stable diffusion) was already doing a pretty good job at it (mostly because the model is really GPU intensive, but still, it works!). Another benefit of using generators here is when batching might be needed: https://huggingface.co/docs/transformers/v4.22.1/en/main_classes/pipelines#pipeline-batching
Feel free to correct me if I misunderstood anything !
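To make the batching point above concrete, this is roughly the generator-driven pattern described in the linked transformers docs; a sketch with a placeholder task and dummy inputs:
from transformers import pipeline

# Placeholder task/model; any transformers pipeline streams the same way.
pipe = pipeline("text-classification")

def data():
    # Inputs are produced lazily; nothing is materialized up front.
    for i in range(1000):
        yield f"example input {i}"

# Passing a generator lets the pipeline batch inputs (batch_size=8 here) and
# return results as an iterator, so the accelerator stays busy while Python
# consumes one output at a time.
for out in pipe(data(), batch_size=8):
    print(out)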
You are correct that we've muddied the discussion here with several different concepts with their own concerns.
The original request was for cancellation support: the ability to interrupt work on a task that's in-progress, allowing the process to move on to other work. Use cases might be things like:
async comes up because that's an interface that represents "task in progress," and async interfaces often provide a cancel method.
A generator API is different, though it's somewhat mixed together in Python's history of using yield for both its generators and coroutines. A generator would give us some of what we're looking for from cancellation, because a generator doesn't so much defer the allocation of memory — it's more a deferral of work.
That is, if you decide you are no longer interested in the results of a generator, you "cancel" it by simply not asking it for any more results. (and you might want generators anyway, for gradio reasons.)
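As a sketch of that idea, with stand-in helpers rather than real diffusers code:
import random

# Stand-ins for a real diffusion loop; these are not diffusers APIs.
def init_latents(prompt):
    return [random.random() for _ in range(4)]

def denoise_step(latents, i):
    return [x * 0.9 for x in latents]

def generate(prompt, num_steps=50):
    """Yield (step_index, latents) after every denoising step."""
    latents = init_latents(prompt)
    for i in range(num_steps):
        latents = denoise_step(latents, i)
        yield i, latents

# "Cancelling" is simply not asking for more results: once the caller stops
# iterating, no further steps ever run.
for i, latents in generate("an astronaut riding a horse"):
    if i >= 10:  # pretend the user clicked "cancel" after step 10
        break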
Pytorch itself is not async, so […] diffusers or transformers for that matter shouldn't be async aware and it's up to the user to figure out how to integrate best into an async environment.
As I understand it, CUDA is asynchronous (as are most calls to graphics drivers), but as you say, PyTorch is not.
IMHO nearly all libraries should be async, because APIs that fail to reflect the fact that they're waiting on some external event inevitably lead to programs that block far more than they need to. But that's a different rant. We're not going to rewrite PyTorch today. And PyTorch is at least kind enough to release the GIL, which allows us to run it in a thread efficiently.
So what does all that mean for a Python API? Again, this is where my Python knowledge is at least six versions out of date, leaving me feeling very rusty for designing an API going into 2023.
It might be that this doesn't call for an API change after all, and it's better addressed by documentation: a short example that shows how to dispatch a call to a pipeline to a thread and optionally cancel it before it completes, a test or two to ensure that cancelling a coroutine in that manner does actually prevent the pipeline from running its remaining steps, and some documentation on which things are not thread-safe or otherwise unsafe to use from parallel coroutines.
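For the thread-dispatch part, a minimal sketch assuming an already-loaded pipe (a StableDiffusionPipeline or similar) and Python 3.9+ for asyncio.to_thread:
import asyncio

async def generate(pipe, prompt):
    # pipe(prompt) is blocking, so run it in a worker thread; because PyTorch
    # releases the GIL, the event loop stays responsive while the GPU works.
    result = await asyncio.to_thread(pipe, prompt)
    return result.images[0]  # assuming a StableDiffusionPipeline-style output

async def main(pipe):
    task = asyncio.create_task(generate(pipe, "a watercolor fox"))
    # ... the event loop is free to serve other coroutines here ...
    image = await task
Note that cancelling the awaiting task does not interrupt work already running in the thread; actually stopping the remaining steps still needs a hook inside the pipeline, such as the callback-based approaches discussed later in this thread.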
doesn't so much defer the allocation of memory — it's more a deferral of work.
You are entirely correct! In my day-to-day I tend to see memory allocations being the issue more often than not, but you're right that it's lazy work in general.
PyTorch is not.
It's more that some calls within it (like printing, fetching values, or moving data from devices) ARE blocking, because they need to wait for the GPU to finish before delivering the final data.
IMHO nearly all libraries should be async
I tend to err on the other side, that NO library should be async and that ordering work should be left to the kernel, but I think we would agree that the function "color" problem is the biggest issue. (And realistically, async vs. not async is here to stay no matter our opinions.)
a short example that shows how to dispatch a call to a pipeline to a thread and optionally cancel it before it completes, a test or two to ensure that cancelling a coroutine in that manner does actually prevent the pipeline from running its remaining steps, and some documentation on which things are not thread-safe or otherwise unsafe to use from parallel coroutines.
Not a maintainer of this lib, so take my opinion with a grain of salt, but I would refrain from doing that here.
Parallelism, async, threading, multiprocessing are choices, and the best solution will probably be different for different use cases. Do you want to maximize CPU usage and hence use all cores? Do you want to minimize latency in a webserver context (hence cancellable work, with CPU parallelism used for the computations themselves)? Are you using multiple nodes to do some job on GPU? Do you want 25% of your cores doing video processing and only 75% doing model inference? How does that play with your GPU usage?
All these are very different contexts, and pushing users in one direction is likely to mislead some into suboptimal choices.
I think it's impossible to be exhaustive on the matter, and it isn't even a library's responsibility. The best a library can realistically do is explain what it's doing in terms of parallelism so users can adjust, and always provide a way to DISABLE parallelism of any kind so that users can do the parallelization upstream themselves. One example we had to implement: https://github.com/huggingface/transformers/issues/5486
That being said, having somewhere in the docs (or just this issue) to refer to users asking for help would definitely help.
Also cc @pcuenca here since it's similar to the PR that's about to be merged: https://github.com/huggingface/diffusers/pull/521
Bump to unstale.
I believe this is a very important feature; surely we're one step closer to solving this now that pipeline callbacks are here. On that topic, does interrupting the pipeline with a callback at each step incur any slowdown?
Hey @irgolic and @keturn,
I'm not sure if we want to support that much logic in the "native" pipelines - could we maybe try to make this feature a community pipeline (see: https://github.com/huggingface/diffusers/issues/841), and if it pans out nicely we could merge it into the native pipelines as a next step?
On that topic, does interrupting the pipeline with a callback at each step incur any slowdown?
@irgolic As for the current callback implementation, strictly and technically speaking, in terms of the number of sequential CPU instructions being executed, yes. At the very least, even with an empty callback function that does nothing, some Python bytecode will still be generated and executed.
For example, as of Python 3.10, let's say we have the following code:
def main():
    pass
It would be compiled to:
1 0 LOAD_CONST 0 (<code object main at 0x5569c2b59450, file "example.py", line 1>)
2 LOAD_CONST 1 ('main')
4 MAKE_FUNCTION 0
6 STORE_NAME 0 (main)
8 LOAD_CONST 2 (None)
10 RETURN_VALUE
Disassembly of <code object main at 0x5569c2b59450, file "example.py", line 1>:
2 0 LOAD_CONST 0 (None)
2 RETURN_VALUE
Meanwhile, if we have the following code instead:
def func():
    pass

def main():
    func()
It would be compiled to:
1 0 LOAD_CONST 0 (<code object func at 0x55b5cd0e4520, file "example.py", line 1>)
2 LOAD_CONST 1 ('func')
4 MAKE_FUNCTION 0
6 STORE_NAME 0 (func)
4 8 LOAD_CONST 2 (<code object main at 0x55b5cd10bf50, file "example.py", line 4>)
10 LOAD_CONST 3 ('main')
12 MAKE_FUNCTION 0
14 STORE_NAME 1 (main)
16 LOAD_CONST 4 (None)
18 RETURN_VALUE
Disassembly of <code object func at 0x55b5cd0e4520, file "example.py", line 1>:
2 0 LOAD_CONST 0 (None)
2 RETURN_VALUE
Disassembly of <code object main at 0x55b5cd10bf50, file "example.py", line 4>:
5 0 LOAD_GLOBAL 0 (func)
2 CALL_FUNCTION 0
4 POP_TOP
6 LOAD_CONST 0 (None)
8 RETURN_VALUE
Notice that at the very least, there are additional LOAD_GLOBAL, CALL_FUNCTION, and POP_TOP instructions being executed. The Python interpreter would also still need to construct a frame object for the func function, even though it technically does nothing. Furthermore, this also doesn't fully take into account the additional JMP instructions somewhere in the resulting assembly code that result from the if conditional statements checking whether there are any defined callbacks (as in the callback implementation in #521).
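The extra instructions are easy to verify with the standard dis module, e.g.:
import dis

def func():
    pass

def main():
    func()

# Prints the bytecode for main(), including the extra LOAD_GLOBAL / call /
# POP_TOP added by the function call (exact opcode names vary across
# CPython versions).
dis.dis(main)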
Of course, with empty callback functions, this is negligible with modern CPUs that can execute billions of instructions per second. Most of the time/work would instead be spent on the Stable Diffusion model itself. However, I would not recommend putting heavy computation/workload on your callback functions as it is ultimately still a synchronous operation.
Regarding the extra logic involved in implementing this feature, I have put some of my comments here about the backward compatibility aspect that might be relevant to whoever is looking into implementing this feature. Extra care might be needed if this feature is to be implemented in the "native" pipelines. That said, I can see GUI-heavy applications greatly benefitting from asynchronous pipelines.
How about something simple like https://github.com/huggingface/diffusers/pull/1053? Alternatively, I've found the fix for my use case: raising a custom exception through the callback and catching it outside the pipe() call.
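A minimal sketch of that exception-based pattern, assuming a loaded pipe and the callback/callback_steps arguments from #521 (the Cancelled exception and the cancel flag are illustrative, not diffusers APIs):
class Cancelled(Exception):
    """Raised from the step callback to abort generation."""

cancel_requested = False  # e.g. flipped by a UI button handler

def on_step(step, timestep, latents):
    if cancel_requested:
        raise Cancelled

try:
    image = pipe("a watercolor fox", callback=on_step, callback_steps=1).images[0]
except Cancelled:
    image = None  # generation was aborted part-way through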
On the topic of asynchronous pipelines, what I'd like to see is a way to yield (asyncio.sleep(0)) at callback time. IIRC, that will implicitly allow cancelling the pipeline in an async way. When pipeline_task yields to another coroutine, and the other coroutine calls pipeline_task.cancel(), an asyncio.CancelledError will be raised the next time control returns to pipeline_task.
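A sketch of that mechanism, with a toy blocking step standing in for the real pipeline work (not diffusers code):
import asyncio
import time

def do_one_step(i):
    # Stand-in for one blocking denoising step.
    time.sleep(0.02)

async def run_pipeline(num_steps=50):
    for i in range(num_steps):
        do_one_step(i)
        # Yield control to the event loop; if cancel() has been called,
        # asyncio.CancelledError is raised right here and the remaining
        # steps never run.
        await asyncio.sleep(0)

async def main():
    pipeline_task = asyncio.create_task(run_pipeline())
    await asyncio.sleep(0.1)  # some other coroutine decides to stop it
    pipeline_task.cancel()
    try:
        await pipeline_task
    except asyncio.CancelledError:
        print("pipeline cancelled after a few steps")

asyncio.run(main())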
All of our pipelines in Dream Textures use generators, and it works well for us with cancellation and step previews: https://github.com/carson-katri/dream-textures/blob/c3f3a3780c2229eb4cce7390f193b59b0569f64a/generator_process/actions/prompt_to_image.py#L509
It would be nice to have this option upstream, but I understand it may be difficult to maintain both the callback and generator methods.
Could the callbacks help for this or is this not really sufficient? https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.callback_steps
Also cc @williamberman
This is a really interesting issue but also possibly really complicated!
I think this very quickly gets into specific details of the Python and CUDA runtimes that we can't answer questions about or plan solutions for unless we have very specific problems we're trying to solve or inference benchmarks we're trying to hit. It also pretty quickly gets into questions around standards for productionizing model serving that I'd want to get some other, more experienced people to help answer. I'd be surprised if making model execution async from a main thread were custom-handled by every model library; I'd assume there are production runtimes with standard interfaces to webserving frameworks that handle this (does ONNX handle something like this?).
It sounds to me like for basic use cases, such as accessing intermediate outputs or interrupting pipeline execution (rather naively with exceptions; I'd assume there are better ways to do so in a production environment), callbacks are sufficient.
I think an open-ended discussion about how we can assist people productionizing diffusers (more broadly than just cancelable and async pipelines) might be a good fit for a forum discussion, but the issue as it currently stands is a little too broad to know how to help :)
Going to close the discussion for now, but if anyone feels the question isn't adequately answered, feel free to re-open!
In #developers-corner we have a request for cancellation support in the pipeline API.
That could also be something to consider in the context of a more asynchronous API in general; some applications want to avoid blocking the thread if all the slow parts are on a GPU device and the CPU is available to do other work.
I don't expect this is something to fit in the next release or two, but we can plant the seed to start thinking what sort of API could provide these features.