python / cpython

The Python programming language
https://www.python.org

Add eager task creation API to asyncio #97696

Closed jbower-fb closed 1 year ago

jbower-fb commented 2 years ago

Feature or enhancement

We propose adding “eager” coroutine execution support to asyncio.TaskGroup via a new method enqueue() [1].

TaskGroup.enqueue() would have the same signature as TaskGroup.create_task() but eagerly perform the first step of the passed coroutine’s execution immediately. If the coroutine completes without yielding, the result of enqueue() would be an object which behaves like a completed asyncio.Task. Otherwise, enqueue() behaves the same as TaskGroup.create_task(), returning a pending asyncio.Task.
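The first-step mechanism can be sketched in plain Python (a simplified illustration, not the proposed implementation; run_first_step_eagerly is a made-up name, and a real version must also manage the Context and the pending yield):

```python
import asyncio

def run_first_step_eagerly(coro, loop):
    # Illustrative sketch (made-up helper): drive the coroutine to its
    # first suspension point synchronously. If it never suspends, return
    # an already-completed Future holding its result.
    try:
        coro.send(None)
    except StopIteration as exc:
        fut = loop.create_future()
        fut.set_result(exc.value)  # finished without yielding
        return fut
    # The coroutine suspended. A real implementation hands it (plus the
    # value it just yielded) to a Task; naively wrapping it in a new
    # Task here would mishandle the already-consumed first yield.
    raise NotImplementedError("fall back to normal Task scheduling")

async def cached_fetch():
    return "hit"  # e.g. a local cache elides the actual I/O

loop = asyncio.new_event_loop()
fut = run_first_step_eagerly(cached_fetch(), loop)
print(fut.done(), fut.result())  # True hit
loop.close()
```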

The reason for a new method, rather than changing the implementation of TaskGroup.create_task(), is that the new method introduces a small semantic difference. For example, in:

async def coro(): ...

async with TaskGroup() as tg:
  tg.enqueue(coro())
  raise Exception

The exception will cancel everything scheduled in tg, but if some or all of coro() completes eagerly, any side-effects of this will be observable in further execution. If tg.create_task() is used instead, no part of coro() will be executed.

Pitch

At Instagram we’ve observed that ~70% of coroutine instances passed to asyncio.gather() can run fully synchronously, i.e. without performing any I/O which would suspend execution. This typically happens when there is a local cache which can elide actual I/O. We exploit this in Cinder with a modified asyncio.gather() that eagerly executes coroutine args and skips scheduling an asyncio.Task object to an event loop if no yield occurs. Overall this optimization saved ~4% CPU on our Django webservers.

In a prototype implementation of this proposed feature [2] the overhead when scheduling TaskGroups with all fully-synchronous coroutines was decreased by ~8x. When scheduling a mixture of synchronous and asynchronous coroutines, performance is improved by ~1.4x, and when no coroutines can complete synchronously there is still a small improvement.

We anticipate code relying on any semantics which change between TaskGroup.create_task() and TaskGroup.enqueue() will be rare. So, as the TaskGroup interface is new in 3.11, we hope enqueue() and its performance benefits can be promoted as the preferred method for scheduling coroutines in 3.12+.

Previous discussion

This new API was discussed informally at PyCon 2022, in conversations involving at least @gvanrossum, @DinoV, @markshannon, and/or @jbower-fb.

[1] The name "enqueue" came out of a discussion between @gvanrossum and @DinoV.

[2] Prototype implementation (some features missing, e.g. specifying Context), and benchmark.

Linked PRs

gvanrossum commented 2 years ago

If cond is True and some or all of coro() completes eagerly, any side-effects of this will be observable in further execution. If tg.create_task() is used instead, no part of coro() will be executed if cond is True.

If I didn't already understand the changes to async functions you are proposing I wouldn't understand what this means (or the code fragment above it). What does cond refer to? I think what you are trying to say here is that, if the body of the with TaskGroup() block raises before the first await, it will cancel all created tasks before they have started executing. Right?

Separately -- Can you remind me of the reason to make this a TaskGroup method instead of a new flag on create_task()? And the enqueue() name feels odd, especially since it may immediately execute some of the coroutine, which seems the opposite of putting something in a queue.

To be clear, I have no objection to the feature, and I think making this an opt-in part of task creation avoids worries of changing semantics for something as simple as await coro().

itamaro commented 2 years ago

Separately -- Can you remind me of the reason to make this a TaskGroup method instead of a new flag on create_task()? And the enqueue() name feels odd, especially since it may immediately execute some of the coroutine, which seems the opposite of putting something in a queue.

This came up in the conversation we had with @DinoV during the Language Summit. We started out with a flag on create_task (e.g. eager=True), but we didn't like the fact that when setting this flag, the name of the method may now be misleading (because, in fact, it may not create a task). enqueue was the alternative (I don't remember who suggested it and why), but we can definitely bikeshed the naming :) If we take inspiration from Trio's nurseries, we can use start_soon. Other options are execute or run (although these names may be misleading too, since users may assume these APIs never create a task).

jbower-fb commented 2 years ago

If I didn't already understand the changes to async functions you are proposing I wouldn't understand what this means (or the code fragment above it). What does cond refer to? I think what you are trying to say here is that, if the body of the with TaskGroup() block raises before the first await, it will cancel all created tasks before they have started executing. Right?

Right. Yeah, this is more contrived than needed, I'll fix up the text.

This came up in the conversation we had with @DinoV during the Language Summit. We started out with a flag on create_task (e.g. eager=True), but we didn't like the fact that when setting this flag, the name of the method may now be misleading (because, in fact, it may not create a task).

I wasn't in this conversation, but when writing this issue it occurred to me that even with eager execution we could still return something Task-like. This would mitigate any issues with knowing the interface of the returned object (e.g. is cancel() available?). This would also mean the name create_task() would still make sense even with a keyword argument to alter the behavior. Unless I'm missing something that makes this not viable?

Having said this, I think eager execution should be the default behavior users reach for going forward. So it might still be better to have a new method.

gvanrossum commented 2 years ago

Thanks for fixing up the description, it's clear now.

I wasn't in this conversation, but when writing this issue it occurred to me that even with eager execution we could still return something Task-like. This would mitigate any issues with knowing the interface of the returned object (e.g. is cancel() available?). This would also mean the name create_task() would still make sense even with a keyword argument to alter the behavior. Unless I'm missing something that makes this not viable?

It could return a Future, which has many of the same methods. If the coroutine returns an immediate result that could be the Future's result.

However, if the result is generally not used, the need to return something with appropriate methods would just reduce the performance benefit.

Having said this, I think eager execution should be the default behavior users reach for going forward. So it might still be better to have a new method.

Yeah, that is definitely an advantage of a new method name. I find enqueue() rather easy to mistype though.

Have the performance results been repeated with TaskGroup.enqueue(), or are the quoted measurements based on the Cinder implementation using gather()? There could be surprises here.

jbower-fb commented 2 years ago

Have the performance results been repeated with TaskGroup.enqueue(), or are the quoted measurements based on the Cinder implementation using gather()? There could be surprises here.

The benchmark showing an 8x improvement (etc.) is with Dino's prototype of TaskGroup.enqueue() against some revision of 3.11. This is functional, but still needs some work. Notably it does not yet implement management of the current Context when eagerly executing. There will be some extra overhead from that but my hope is it'll be negligible.

It could return a Future, which has many of the same methods. If the coroutine returns an immediate result that could be the Future's result.

However, if the result is generally not used, the need to return something with appropriate methods would just reduce the performance benefit.

Dino's prototype currently returns a newly constructed Future if the eager execution fully completes, but returns a Task otherwise. While I can imagine it'll be less common to use the extra fields on Task, it might be annoying to have to think about whether a Future or a Task is returned. Again, unless I missed something, I don't think there should be extra overhead from returning some kind of eagerly-completed-Task value vs. a completed Future.

I find enqueue() rather easy to mistype though.

Indeed, I have misspelled it a number of times already. I like start_soon or start_immediate, which to me imply that execution start time may not be bound by the async with scope.

gvanrossum commented 2 years ago

For a new API, returning either a Future or a Task is totally fine -- Task is a subclass of Future, so we can just document it as returning a Future.
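Indeed, the class hierarchy supports documenting it this way:

```python
import asyncio

# Task subclasses Future, so an API documented as returning a Future may
# return either; done(), result(), and cancel() are available on both.
assert issubclass(asyncio.Task, asyncio.Future)
```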

I hadn't realized the prototype doesn't even need C changes -- I had assumed it depended on the changes to vectorcall to pass through a bit indicating it's being awaited immediately. Is that change even needed once we have this? (Probably we resolved that already during discussions at PyCon, it's just so long ago I can't remember anything.)

I think we need to bikeshed some more on the name, but you're right about the impression the name ought to give. Maybe start_task()?

jbower-fb commented 2 years ago

We can just document it as returning a Future.

I think it'd be better to document it as returning a Task? That way it's clear cancellation is still generally an option.

I had assumed it depended on the changes to vectorcall to pass through a bit indicating it's being awaited immediately. Is that change even needed once we have this? (Probably we resolved that already during discussions at PyCon, it's just so long ago I can't remember anything.)

That is an interesting question, and again I don't think I was around when that was discussed in detail. For this feature, that flag is not needed. If we want to add an optimization for the large body of existing code that uses asyncio.gather(), then it would help significantly. We are probably going to carry this feature forward in Cinder for a while at least, as it'll likely take quite some time (years?) before we can fully migrate to the new TaskGroup API. I'd definitely like to know what you, or other members of the community, feel about adding an optimization to help with the older/existing API.

Maybe start_task()?

Fine with me. Maybe we should discuss further once a PR is up.

gvanrossum commented 2 years ago

I think it'd be better to document it as returning a Task? That way it's clear cancellation is still generally an option.

From a static type POV it's always a Future, not always a Task. If you want to cancel it you take your chances -- cancel() will return a bool telling you whether it worked. The docs should just explain the situation without lying.

I will wait for the PR (that you're hopefully working on?). I recommend using start_task() as the method name until we come up with something better.

jbower-fb commented 2 years ago

From a static type POV it's always a Future, not always a Task.

Whoops, I just realized Future has a cancel() method. I mistakenly thought that was a feature of Task. Sorry for the confusion.

I will wait for the PR (that you're hopefully working on?)

We should have something up soon. Really wanted to get an issue open first for early feedback on the plan.

gvanrossum commented 2 years ago

We had a fruitful discussion on this topic at the core dev sprint. @1st1 and @carljm were present (and others).

Yury recommends not adding a new API like the one proposed here, but instead solving this globally (per event loop). We can have an optional task factory that people can install using loop.set_task_factory(), which does effectively what is being proposed here for all create_task() calls (not just for the TaskGroup method).

The new create_task() would do what you propose here (call coro.send(None) and analyze the outcome), and either create a Future and set its result (or exception), or create a Task from a coroutine and a yield_result. The latter operation could be done by an extension to the create_task() API and the Task constructor API -- if an optional keyword parameter yield_result=result is passed, the task constructor doesn't end by calling loop.call_soon(self.__step, context=self._context), but instead treats result as is done by __step, interpreting it as a future to be blocked on.

This requires an extension of the create_task() API, but that should be okay -- if you are using a loop whose create_task() doesn't yet support the yield_result keyword parameter, you just can't benefit from the new functionality yet. (I'm guessing the loop should advertise whether it supports this somehow.)

gvanrossum commented 2 years ago

I'm working on a prototype implementation in #98137.

gvanrossum commented 2 years ago

I'm hoping that one of the Instagram folks can work on this contribution? I don't have the bandwidth to finish up that PR -- it was just meant as a quick proof of concept of what we discussed at the sprint.

jbower-fb commented 2 years ago

Yep, we'll pick that up! Just need to find a moment to look in more detail.

gvanrossum commented 1 year ago

@jbower-fb Have you ever found the time to look into this more?

jbower-fb commented 1 year ago

@gvanrossum I have not but I think @itamaro is going to pick this up. Sorry for the confusion, it was a bit unclear last year who was going to have time to do what.

gvanrossum commented 1 year ago

Cool. Do I need to keep this PR open?

itamaro commented 1 year ago

Cool. Do I need to keep this PR open?

already pulled that PR locally, so you can close it!

probably going to take me a bit to learn enough asyncio internals to get this working and tested

gvanrossum commented 1 year ago

Let me know (e.g. by posting questions here) if you need any help!

itamaro commented 1 year ago

GH-101613 has a working prototype. Still much left to do. TODO items and additional discussion items are on the draft PR (let me know if you prefer to keep the discussion on the issue).

graingert commented 1 year ago

isn't this going to cause issues with structured concurrency objects (those that use asyncio.current_task(), e.g. timeout and TaskGroup)?

e.g. in

import asyncio

async def some_task():
    async with asyncio.timeout(1):
        await asyncio.Event().wait()  # never gets cancelled

here asyncio.timeout grabs the asyncio.current_task(), which would actually be some previous task

We anticipate code relying on any semantics which change between TaskGroup.create_task() and TaskGroup.enqueue() will be rare

opening a timeout or TaskGroup in the first step of an async function doesn't seem like a rare occurrence

itamaro commented 1 year ago

isn't this going to cause issues with structured concurrency objects (those that use asyncio.current_task(), e.g. timeout and TaskGroup)?

yes, this is a real problem.

one possible solution is to modify asyncio.current_task to materialize a task on-demand for the current coroutine if it detects it's being called in the context of an eagerly executing coroutine (would require the eager task factory to set something on the loop to communicate this to current_task).

would that resolve the issue and be an acceptable solution?

gvanrossum commented 1 year ago

@itamaro Is that how you resolved it?

jbower-fb commented 1 year ago

Hello, I'm back and have been assisting @itamaro with this issue. In the latest PR update I believe we have this addressed with minimal extra overhead.

The crux of the change is to always create a Task, even when eagerly executing. This means there's a Task available to swap in as the "current" task during eager execution.

In theory this shouldn't have been much more expensive, as eager execution was always creating either a Task or a Future anyway. In practice, a big difference was that Tasks were registered in the _all_tasks weak-set, which is quite expensive. I've mitigated this by splitting _all_tasks into _scheduled_tasks, which is still a weak-set, and a much cheaper regular set containing only eager tasks. We don't need a weak-set for eager tasks as we fully control their lifetime during eager execution, and if an await does happen we can graduate them to "scheduled task" status. all_tasks() now pulls from both sets so users still have a unified view.

To make the above work I've pushed most of the functionality that was in create_eager_task_factory.factory() into Task. Firstly, this allows us to set the result/exception of a completed eager execution, so there's no longer a need to make a Future. We couldn't easily do this from outside the class, as set_exception/result() are disallowed. I think this also makes more sense, as a Task now holds the logic both to schedule itself to the loop and to eagerly execute itself.

As well as the issue above, I fixed some other issues too:

I expect/hope the full Pyperformance results to be about the same as before, but we're still getting that data.

I also want to think some more about testing for some of the other issues resolved here, but wanted to get the new implementation up now for public comment.

itamaro commented 1 year ago

@jbower-fb provided the details :) I'm running pyperformance - should have results tomorrow

jbower-fb commented 1 year ago

One other curiosity worth mentioning: with the async_tree benchmark I found eager execution led to a lot more GC overhead (according to Linux perf). I cut this down by setting Task._coro to None if eager execution completes the task. I don't deeply understand what's going on here though.

Clearing the coroutine on any Task's completion may help a bit anyway. It is a semantic change though: if you queried a Task for its coroutine after it completed, you'd go from getting the coroutine to getting None. I guess nothing should really be relying on this? But maybe I'm missing something, so I've restricted this change to eager Tasks only.
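The mitigation amounts to the Task dropping its strong reference to the finished coroutine, which frees the coroutine's frame and everything that frame still references. A toy model (not asyncio's actual Task class):

```python
class EagerTaskSketch:
    # Toy model of the lifetime issue: a task normally keeps its
    # coroutine alive as long as the task object itself lives, and the
    # coroutine keeps its (finished) frame and that frame's locals alive.

    def __init__(self, coro):
        self._coro = coro
        self._result = None

    def finish(self, result):
        self._result = result
        # Dropping the reference lets the frame's locals be collected
        # immediately; the trade-off is that introspecting the task's
        # coroutine after completion now returns None.
        self._coro = None

    def get_coro(self):
        return self._coro
```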

itamaro commented 1 year ago

I commented on the PR - benchmark results are the same as the original implementation!

kumaraditya303 commented 1 year ago

There are two PRs referenced by the issue; which one is the most up to date? Should the other one be closed now?

itamaro commented 1 year ago

There are two PRs referenced by the issue; which one is the most up to date? Should the other one be closed now?

I closed the draft PR, thanks for noticing!

kumaraditya303 commented 1 year ago

See my comments on the PR, reopening.

kumaraditya303 commented 1 year ago

One other curiosity worth mentioning: with the async_tree benchmark I found eager execution led to a lot more GC overhead (according to Linux perf). I cut this down by setting Task._coro to None if eager execution completes the task. I don't deeply understand what's going on here though. Clearing the coroutine on any Task's completion may help a bit anyway. It is a semantic change though: if you queried a Task for its coroutine after it completed, you'd go from getting the coroutine to getting None. I guess nothing should really be relying on this? But maybe I'm missing something, so I've restricted this change to eager Tasks only.

I don't see this change documented anywhere and I am not sure anyone has agreed that this is correct. I am -1 on this change: it breaks introspection. If there is a performance bottleneck, it needs to be investigated rather than masked by an unrelated change in semantics.

jbower-fb commented 1 year ago

I don't see this change documented anywhere and I am not sure anyone has agreed that this is correct.

The way it's implemented, it isn't a change to any existing behavior. This only affects eagerly executed tasks, which are a new feature. What I was saying in the quoted text is that we could also apply this to all tasks.

Specifically for eagerly executing tasks, I would argue losing the coroutine reference after execution isn't a big deal. At the point the coroutine reference is removed, the task is also no longer available to external observers via the "current" or "all" task features. I think the only reasonable way you would have a reference to the Task at this point is if you already had it at a time when the coroutine was available. If you really wanted to keep the coroutine around, you could tweak your own code to capture it as part of adapting to the new feature of eagerly executed tasks.

Are there some specific use cases you have in mind where one might need access to a completed coroutine?

I suspect this isn't something you'd want to do as a matter-of-course as it could keep an arbitrary amount of data alive.

kumaraditya303 commented 1 year ago

Specifically for eagerly executing tasks, I would argue losing the coroutine reference after execution isn't a big deal. At the point the coroutine reference is removed, the task is also no longer available to external observers via the "current" or "all" task features. I think the only reasonable way you would have a reference to the Task at this point is if you already had it at a time when the coroutine was available.

From a user's POV, there is little difference whether the task completed eagerly or blocked and was completed over multiple cycles. If you can inspect the coroutine of a task which was completed over multiple cycles, then I expect the same for an eager task. There should be little to no difference between the two if we want eager tasks to be easily usable.

Did you investigate why there was so much overhead in the first place? How much of a difference did setting the coro to None make?

kumaraditya303 commented 1 year ago

I read through the documentation in the PR and I have no idea how one is expected to use create_eager_task_factory. Specifically, it talks about using a custom task implementation as an eager task factory, but there is no documentation on how one is expected to create such a task implementation. After poking around in the code, the task should accept eager_start: bool, which defaults to False, but this isn't documented. The new kwarg isn't even documented for asyncio.Task.

Is this correct or am I missing something?

jbower-fb commented 1 year ago

Did you investigate why there was so much overhead in the first place? How much of a difference did setting the coro to None make?

Not in detail. I think at some point during development I found eager tasks were executing more slowly than I expected. When I compared with/without eager execution for the same benchmark, I noticed GC routines popping up in Linux perf more often in the eager case. Then I took an educated guess that it was the coroutine and cleared it.

I just took another quick look at this for the async_tree benchmark from pyperformance and got these results:

So, clearing the coroutine is a significant improvement.

The reason I asked what applications you had in mind for the coroutines of finished tasks is that I did think about this a bit when I wrote the code. It occurred to me that the main things someone might want would be specific/limited metadata about the coroutine, e.g. its (qual)name. These could be captured and stored explicitly by a Task with much less system impact. Keeping around completed coroutines and their completed frames seems a bit sketchy in general: it could cause unbounded amounts of data to avoid GC for longer than expected.

From a user's POV, there is little difference whether the task completed eagerly or blocked and was completed over multiple cycles. If you can inspect the coroutine of a task which was completed over multiple cycles, then I expect the same for an eager task. There should be little to no difference between the two if we want eager tasks to be easily usable.

Today, eagerly completing tasks are an optional feature. The reason for this is that they can cause subtle differences in behavior. I would argue that coroutines being cleared on task completion can be one of these differences, and we can document it as such. After the feature has been in public use for a while, we might get some idea of what people want zombie coroutines for and come up with a more efficient way of providing targeted functionality.

kumaraditya303 commented 1 year ago

I just took another quick look at this for the async_tree benchmark from pyperformance and got these results:

(Using C-accelerated asyncio) async_tree_none: Mean +- std dev: 658 ms +- 43 ms
(With C version of eager execution as merged) async_tree_none: Mean +- std dev: 457 ms +- 36 ms
(With C version of eager execution without clearing the coroutine) async_tree_none: Mean +- std dev: 547 ms +- 80 ms

pyperf also includes a warning with this result:

The numbers are hard to trust considering that the stddev is 80 ms, as pyperf correctly warns that the benchmark is quite unstable, but I'll take it.

Anyway, since you feel so strongly about this I accept it, with proper documentation and provided that @gvanrossum agrees. I'll approve the PR and wait for Guido to approve.

itamaro commented 1 year ago

I read through the documentation in the PR and I have no idea how one is expected to use create_eager_task_factory. Specifically, it talks about using a custom task implementation as an eager task factory, but there is no documentation on how one is expected to create such a task implementation. After poking around in the code, the task should accept eager_start: bool, which defaults to False, but this isn't documented. The new kwarg isn't even documented for asyncio.Task.

I have proposed documentation improvements in https://github.com/itamaro/cpython/commit/ac1ee82a311b86d78f422364bf8ada660df21ecb (PR depends on gh-104251 first)

gvanrossum commented 1 year ago

What’s left? Is there a what’s new entry?

itamaro commented 1 year ago

What’s left? Is there a what’s new entry?

I think this is done now! Yes, we added a what's new entry.