alexhartl opened 2 years ago
You can use the new `asyncio.TaskGroup` to avoid this in 3.11+.
It looks to me like it would be a trivial code change (keeping all active tasks in a separate, non-weak set) to gain a large benefit here.
This behavior is inherited from Futures (where it makes sense). If a future is not being kept track of, it's a bug (the developer forgot to await it). However, tasks start executing on their own, if not immediately, so not awaiting a task is not as obviously a mistake.
Interrupting actively running code because someone didn't maintain a reference (on the GC's terms and timing) is a very different beast from declining to run code whose desired execution and timing are unclear (and for which an exception can be raised fairly promptly).
I'll have time this evening to open a PR, if nobody else wants to get to it.
For impact perspective, discord.py drops the task after calling create_task on it.
This means that (effectively) every Python-based Discord bot is affected by this, and every event dispatch is technically at the whims of the GC.
Rather than rhetoric about impact or how trivial the fix would be, we need a discussion on why we designed tasks this way in the first place. It doesn't look to me like it was inherited from Futures -- the variable is called `_all_tasks`, not `_all_futures`. Without understanding this, we risk making things worse with a hasty "fix".
That said, I'm not sure why we designed it this way, and (from private email) @1st1 doesn't seem to recall either. But it was definitely designed with some purpose in mind -- the code is quite complex and was updated every time an issue with it was discovered, and we've had many opportunities to change this behavior but instead chose to update the docs (gh-29163), even when asyncio itself suffered (gh-90467).
I also don't understand why the dict of all tasks is a global, since the only API that uses this (`all_tasks()`) filters out tasks belonging to a different event loop. Again, I assume there's a reason for the complexity, but I don't know what it is. Is it 3rd party event loops like uvloop? Or frameworks like aiohttp?
FWIW if we simply make `_all_tasks` a plain dict, we would have to arrange for tasks to remove themselves from the list when they exit, possibly in a done-callback (though there's a cost to that). There's already an API to do that (`_unregister_task`) but it doesn't seem to be called.
Agree that this should be approved by someone who is familiar with the code's design objectives, which is why I brought up two alternatives and did not create a PR right away.
I would not turn `_all_tasks` into a plain dict (or set), since this might break code that assumes that finished tasks (that have a user reference somewhere) are still returned by `all_tasks()`. The suggested change is to have an additional `_pending_tasks` set. Yes, we would need a done-callback, which would introduce a slight performance penalty.
In any case, imho the current behavior is somewhat annoying or misses some fundamental functionality.
Sorry -- I've done more reading into the current state and other related issues since I wrote that comment, and there is absolutely a lot more detail behind the scenes (like this comment, which I only got to today: https://github.com/python/cpython/issues/80788#issuecomment-1279608229), which mirrors some of what you've said here.
I was working with the idea that Tasks were inherited from Futures because that was how they were introduced in the original PEP-3156.
I think I have found the origin of why it was made a weakset in the first place: https://groups.google.com/g/python-tulip/c/13hfgbKrIyY/m/HpwWPGHKT6IJ
Specifically, this patch: https://codereview.appspot.com/14295043/diff/1/tulip/tasks.py
So it was originally intended to be a registry to help recover stuck tasks and get stack frames for?
But apparently at that point it was already the case that tasks had to be kept alive by their owner -- ISTM a weakset was used specifically to avoid keeping tasks alive longer than necessary.
So this has always been part of the design, it just wasn't made explicit in the docs. I'm still curious why we originally designed it that way. It's possible that we never consciously realized this constraint. It's also possible that, since we did make Task a subclass of Future, we assumed that tasks would always be awaited.
I am still not convinced, despite this being a common stumbling point, that we can fix this without consequences for user code.
Tasks can't be dropped if they're actively executing or scheduled to execute (`_ready`) because there are strong references held by the event loop; a comment at the start of the Task class notes a bit about this, although I'm not sure if it's intended to apply to execution only or to reference management.
Tasks only appear to be GCable if they're blocking on another future which ends up being GCable itself. If the blocking future is _ready/executing, then it, also, is kept alive through the above invariant, and there's no chance of losing the dependent task.
It makes sense to error and garbage collect if the chain is broken, because we've got a "proven" memory leak (the task cannot be woken again if its dependent future is unreachable).
Yes, it is possible to fix this by making sure there's always a chain of strong references from the most basic futures to the running tasks awaiting something. But I have a feeling that this would be a somewhat fragile approach, both considering future changes in the stdlib and futures that a user creates. A year ago I grepped asyncio's source and found two spots where weak references were used that might cause problems. I don't know if those have been fixed or modified in the meantime.
Haven't thought of it like causing a provable memory leak, though. Good point.
So how about adding a keyword argument to `create_task` that allows adding the task to a `_pending_tasks` set? We could add a note in the documentation warning about potential memory leaks.
> So it was originally intended to be a registry to help recover stuck tasks and get stack frames for? But apparently at that point it was already the case that tasks had to be kept alive by their owner -- ISTM a weakset was used specifically to avoid keeping tasks alive longer than necessary.
It was already the case that tasks had to be kept alive by their owner, but I'm not sure this was understood to be the case. If you look back at the diff that @bast0006 found, there were additional methods `all_done_tasks` and `all_failed_tasks` that were present in Tulip but didn't make the transition to asyncio. One reason this may have been done was to diagnose reference cycle complexities with the coroutine runner (I saw a lot of this in Tornado's coroutines). If that was the intention, a weak set would have been necessary to make the "done" and "failed" tasks iterable while they're waiting for GC without keeping them alive unnecessarily.
> I also don't understand why the dict of all tasks is a global, since the only API that uses this (`all_tasks()`) filters out tasks belonging to a different event loop. Again, I assume there's a reason for the complexity, but I don't know what it is. Is it 3rd party event loops like uvloop? Or frameworks like aiohttp?
It could just be a historical accident. IIRC, early on the event loop didn't have any knowledge of tasks, so `all_tasks` was tracked in class attributes of `Task` instead of introducing such a link.
> I am still not convinced, despite this being a common stumbling point, that we can fix this without consequences for user code.
I appreciate your caution here, but I'll point out that the docs for `create_task` encourage users to create strong references from global variables to all of their Tasks, so if there are risks to doing this in `create_task` itself, the docs should reflect the same considerations we're discussing here. (That's how I came to this issue. I was thinking about making a global background_task set as documented, but realized that `create_task` could be doing that for me but it's not, and I don't understand why not.)
Arguments for having `create_task` manage a set of strong references to all active tasks:
Arguments against:
The worst-case scenario that I can see if we make this change would be in a program that calls `asyncio.run` in a loop, and each time it leaves some tasks running (maybe waiting on a sleep) without cancelling them (or they catch the asyncio.CancelledError that is thrown into them when the loop shuts down). That's a case where the weak set really would save us from a memory leak. But the risk goes away if the task set were moved from a global to an attribute of the event loop, since the loop and all pending tasks could go away at once. That feels like a fairly safe solution to me, except for the fact that it's now two changes instead of one, to code that is already poorly understood :)
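The worst-case pattern described above can be sketched roughly like this (a toy reproduction using a manually managed loop; not code from the thread):

```python
import asyncio

async def lingering():
    await asyncio.sleep(3600)  # effectively pending forever

def run_once():
    # Each call abandons a still-pending task when its loop is discarded.
    loop = asyncio.new_event_loop()
    task = loop.create_task(lingering())
    loop.run_until_complete(asyncio.sleep(0))  # returns with `task` pending
    loop.close()
    return task

# With today's global WeakSet, each abandoned task (and its loop) can be
# garbage collected. With a *global* strong set they would accumulate
# forever; with a *per-loop* strong set they would die with their loop.
leftover = run_once()
```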
I would be interested in seeing "all_tasks" going from a state-global weakset of "alive tasks" to a per-loop strong-set of pending tasks.
interestingly (maybe), my motivation is not enabling free-flying tasks - it is performance! adding tasks to the all_tasks weakset is pretty slow (because WeakSet is pretty slow) - using a regular set instead makes it faster. (10-20% faster across the async benchmarks based on recent experimentation by @jbower-fb and reproduced by me)
some observations from trying to dig through the evolution of all_tasks from tulip to today:

- the `register_task` and `unregister_task` APIs are relatively recent additions, and asyncio itself doesn't really use `unregister_task` to keep the registry tidy - it relies on the weakness of the set.
- a strong set would require consistent use of `unregister_task` in asyncio, and probably a reasonable expectation from third-party loop/task providers to use these APIs properly.
- the cost of calling `unregister_task` would be more than offset by the gain from a set being much more efficient than a WeakSet.
- I'm not sure though it would be easy (or possible) to solve the problem of "stuck tasks" being kept alive forever by the strong reference. I think the `test_log_destroyed_pending_task` test demonstrates this. maybe this is also ok?
- the `all_tasks()` API itself (in CPython asyncio, not tulip) never really returned all tasks (only briefly when introduced, but it was changed to pending tasks before it was released, in 3.7 I think).

so... considering there's some convoluted history behind all_tasks in general, and it being a WeakSet in particular, and changing this would make some users happy (free-flying tasks! perf!) and others unhappy (backwards incompatible change!) -- is there a realistic path to changing this?
> so... considering there's some convoluted history behind all_tasks in general, and it being a WeakSet in particular, and changing this would make some users happy (free-flying tasks! perf!) and others unhappy (backwards incompatible change!) -- is there a realistic path to changing this?
Give it a try, but aim for 3.13. (Just sitting on the PR for 2.5 weeks should do it.)
> I'm not sure though it would be easy (or possible) to solve the problem of "stuck tasks" being kept alive forever by the strong reference. I think the `test_log_destroyed_pending_task` test demonstrates this. maybe this is also ok?
The example in that test can be reduced to:
```python
async def c():
    fut = asyncio.get_event_loop().create_future()
    await fut
```
With a strong-set this will be kept alive, but with a weak-set it will be GC'd.
I find this pretty unintuitive, but it happens because there are no references to the `Task`/coroutine in the event loop. What is supposed to happen is that someone calls `set_result()` or `set_exception()` on the `Future`. This in turn would cause the `Future` to schedule a callback onto the event loop to re-schedule the Task. This will of course never happen, as the coroutine is blocked and is the only thing with a reference to the `Future`. With `all_tasks` as a weak-set this free-standing reference cycle is GC'able.
In my mind, anyone writing code like this deserves a memory leak. They might even want a leak as this might be one way of spotting a programming error here.
@itamaro Your comment seems similar to what I have planned in https://github.com/python/cpython/issues/80788#issuecomment-1276443890 to do in 3.13.
it seems that there was an intentional effort over the years to keep the task and the loop decoupled. maybe this helps third party loop providers and task providers? moving the registry to the loop would introduce such coupling (maybe that's ok - I don't really know).
One way to avoid expanding the AbstractEventLoop interface would be to make `all_tasks` become a `WeakKeyDictionary[EventLoop, Set[Task]]`, so that everything still becomes eligible for GC when the event loop is discarded, but the event loop does not need to explicitly know what's going on. But I think it would be cleaner to make it a responsibility of the event loop if we can tolerate the backwards incompatibility (I suspect uvloop might actually prefer it this way - I'm sure they'd rather avoid an indirection through a WeakKeyDictionary).
> This will of course never happen as the coroutine is blocked and is the only thing with a reference to the Future.
The Future will never resolve, but there is still one way the Task can be rescheduled: if someone cancels the task via the reference in `all_tasks` (which is done automatically by `asyncio.run()`). This was suggested in https://github.com/tornadoweb/tornado/issues/3173 as a kind of hacky "shutdown hook" to allow for some cleanup operations. One of the surprising quirks is that it works with an arbitrarily long `await asyncio.sleep` (because the event loop's timer queue keeps a strong reference) but it doesn't work to await a plain `Future` that never resolves (or equivalently an `asyncio.Event`, etc.).
> In my mind, anyone writing code like this deserves a memory leak. They might even want a leak as this might be one way of spotting a programming error here.
I've made the same argument in the other direction: you might want the "task was destroyed but it is pending" log message because it's a more legible indication of a problem than a memory leak (which needs to either be large or occur frequently to be noticeable). But I think that overall, if someone writes a task that waits forever, they should get a task that lives forever (with the memory consumption that implies).
> you might want the "task was destroyed but it is pending" log message because it's a more legible indication of a problem than a memory leak
Ah! but this is less likely to happen than you might think. Consider this runnable example:
```python
import asyncio

async def x():
    await asyncio.get_event_loop().create_future()

async def y():
    asyncio.create_task(x())

asyncio.run(y())
```
This actually completes silently because the manually created task ends up being cancelled.
> One way to avoid expanding the AbstractEventLoop interface would be to make `all_tasks` become a `WeakKeyDictionary[EventLoop, Set[Task]]`
But this will have the downside @itamaro and I want to avoid: `WeakKeyDictionary` (implemented in Python) is likely slower than a native strong set.
> The Future will never resolve, but there is still one way the Task can be rescheduled: if someone cancels the task via the reference in all_tasks (which is done automatically by asyncio.run()). This was suggested in https://github.com/tornadoweb/tornado/issues/3173 as a kind of hacky "shutdown hook" to allow for some cleanup operations. One of the surprising quirks is that it works with an arbitrarily long await asyncio.sleep (because the event loop's timer queue keeps a strong reference) but it doesn't work to await a plain Future that never resolves (or equivalently an asyncio.Event, etc).
Interesting, I hadn't thought of cancellation waking up the task. Although overall this sounds like an even stronger argument for a strong set, as it will at least make things consistent.
> @itamaro Your comment seems similar to what I have planned in https://github.com/python/cpython/issues/80788#issuecomment-1276443890 to do in 3.13.
@kumaraditya303 what you describe there sounds great if you're okay with using a strong-set. Is that the case?
FYI I've implemented a Tornado-specific version of the change proposed here (a strong set of references to pending tasks as an attribute of the event loop, cleared via Future.add_done_callback) in https://github.com/tornadoweb/tornado/pull/3269. Assuming some version of this idea makes it into a future version of python, I'll revert tornado's version.
Maybe I missed something in the thread, but as far as I can see, everyone agrees that this only happens to tasks that have got stuck on an abandoned Future?
If so, then I'm with @bdarnell that the garbage collection can be considered a feature, not a bug.
There is obviously a bug in such code, as the task will never do anything useful from that point. As a developer, getting a warning that a pending task got killed is very helpful in finding and addressing that bug.
So I would prefer not to change any code in asyncio.
I'd very much like the documentation to change, though. The current warning for `create_task()` is just incorrect given the above understanding. You would only need to keep strong references if you want to force hung tasks to stay in memory. Which doesn't sound terribly useful.
> Maybe I missed something in the thread, but as far as I can see, everyone agrees that this only happens to tasks that have got stuck on an abandoned Future?
I don't think that's true. I'm not sure about the specifics but I have multiple reports from people who say that making strong references here fixed their problem. It seems unlikely that this was all about Futures that were truly abandoned, as opposed to Futures that were part of a subgraph that was collectible but only because the pattern of strong vs weak references was unexpected.
> If so, then I'm with @bdarnell that the garbage collection can be considered a feature, not a bug.
That is no longer my position. I used to believe that but I now think that using strong references to prevent GC is the better approach.
Perhaps I'm misunderstanding you, but that still sounds like abandoned Futures? Perhaps in very complex ways that are difficult to anticipate, but abandoned nonetheless?
So there must be something more lurking here. If those Futures were part of something collectible, then how did adding a strong reference to the task fix anything? It should have just kept them in memory, not affected what gets executed. Right?
Does any of those reports have something somewhat reproducible that can be analysed and experimented with?
I don't have a complete reproducible example since the introduction of native coroutines, but I studied an older example of this problem in https://github.com/tornadoweb/tornado/issues/1429
The thing that's confusing here is that the direction of references in coroutines is the opposite of that in typical control flow: when a synchronous function calls another, the inner stack frame holds a reference to the parent stack frame, and the innermost stack frame of each thread is a GC root that keeps the whole stack alive. In `Future`-based async code it's more complicated, because both parent and child "stack frames" (which are not just normal stack frames but also include the Task objects) refer to the same `Future` object, and the only references up the stack are indirect. When the system is at rest, the Future object has a done_callback that refers to the parent and keeps the whole stack alive. But when a Future completes, it clears its callback references, and in this transition there is a window in which an entire subgraph of async tasks can be garbage collected. (This seems to require something more than a chain of `await f()` calls - the tornadoweb/tornado#1429 example used `tornado.gen.WaitIterator`, which is analogous to `asyncio.gather`.)
It would be awesome if someone could come up with a simple repro of the scenario that @bdarnell describes here. From there it might be possible to decide how to resolve the problem -- I don't think that window (where the subgraph can be GC'ed) should exist, so I consider it a bug.
The test you referred to, @bdarnell, doesn't seem to have any asyncio in it? If I run that test with Tornado 4.1, I can reproduce the issue, even with Python 3.11. But if I convert the test case to asyncio, then there is no problem any more.
So I'm not sure if that test case is relevant here? It seems to rather indicate some bug in Tornado?
Ben is using it as an example of the general principle he explains. A task may be GC’ed when a future it waits on completes, before it continues running.
Yeah, it's an example of the general pattern and not necessarily something that would translate directly to asyncio. I've tried doing this translation myself and making some other variations on the idea without reproducing the problem in asyncio.
It is possible that the problem is Tornado-specific. The most relevant difference that I see is that in Tornado each coroutine has two associated objects: a Future to hold the result and a Runner to drive the next callback. In asyncio, these two objects are replaced with a single Task (asyncio.Task is a subclass of Future that adds the functions analogous to Tornado's Runner). And since we don't have bidirectional references between the Future and Runner, maybe the object graph can have disconnected subgraphs that just aren't possible in asyncio.
But the tornado Runner object is legacy code - in modern apps that use `async def` it isn't used any more; we use asyncio.Task instead. We have enough anecdotal reports of a problem in recent years that it seems likely that some of them are only using asyncio.Task, but the only reproducible examples we have involve tornado.gen.Runner.
And there have definitely been reports of situations in asyncio that cause similar problems, e.g. gh-90467.
Okay, I found the issue with gh-90467. It was caused by the fix in gh-78819. The weak link between `StreamReaderProtocol` and `StreamReader` breaks the chain between the event loop and the coroutine waiting for the `StreamReader` object.
The chain, as far as I understand it:
`BaseSelectorEventLoop => EPollSelector => SelectorKey => Handle => _SelectorSocketTransport => StreamReaderProtocol =×=> StreamReader => Future => Task`
If I convert that weak reference to a strong one, then the bug goes away.
Yes, that's a familiar pattern from Tornado too - see a circular reference, try to break it by making one direction a weak reference, but it turns out that reference was the one holding the whole structure together. This is one of the consequences of my comment about the direction of the object graph being the opposite of what you expect - it's easy to put a weak reference in the wrong place.
It's possible that this is the whole problem - if we never used weak references and just allowed the strong reference cycles to exist, maybe there wouldn't be any way for "free-flying tasks" to get prematurely garbage collected. But then we'd have a different problem, of the memory and CPU cost of relying more heavily on the full GC to free up these object cycles.
The global set of all pending tasks avoids this cycle GC problem by clearing the map entry when the Future resolves. Maybe there's a way to use callbacks on the future to break cycles manually instead of using weak references? But that's difficult to apply systematically; the nice thing about the global set is that even though it's inelegant, it does solve the problem for all tasks without having to fix all the individual instances of the problem.
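The "callbacks instead of weak references" idea might look roughly like this (a hypothetical sketch, not the streams code itself): hold a strong reference, and clear it explicitly in a done-callback when the future resolves.

```python
import asyncio

class Reader:
    """Hypothetical object that breaks its own cycle via a done-callback."""

    def __init__(self, loop):
        # Strong cycle: self -> waiter -> done-callback -> self.
        # The callback severs the link deterministically when the future
        # resolves, so no weak reference is needed and nothing can be
        # collected too early while the future is pending.
        self.waiter = loop.create_future()
        self.waiter.add_done_callback(self._clear)

    def _clear(self, fut):
        self.waiter = None
```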
> Okay, I found the issue with gh-90467. It was caused by the fix in gh-78819. The weak link between `StreamReaderProtocol` and `StreamReader` breaks the chain between the event loop and the coroutine waiting for the `StreamReader` object.
>
> The chain, as far as I understand it:
>
> `BaseSelectorEventLoop => EPollSelector => SelectorKey => Handle => _SelectorSocketTransport => StreamReaderProtocol =×=> StreamReader => Future => Task`
>
> If I convert that weak reference to a strong one, then the bug goes away.
This was also the issue that I encountered, afair. In my top post I mentioned another weak reference that might cause issues (not tested, though). It might be possible to avoid such bugs by making these references strong ones. But then this behaviour has to be documented somewhere to make future developers aware of the problem. Also, similar to `_all_tasks`, someone had reasons to choose weak references there (as @CendioOssman found).

All in all, I would prefer solving the problem on the task level (i.e. introducing a `_pending_tasks` set or keeping strong references in `_all_tasks`), as this seems to be a much more robust solution and a less convoluted modification than changing those weak references.
I can understand the concern, and I'm growing more open to the conclusion that some extra references to tasks is the lesser evil.
I can be a bit concerned that this is a broader problem than just tasks, though, as this fundamentally has to do with how the event loop works rather than specifically tasks? Perhaps weak references should be viewed as something very dangerous and the barrier for accepting usage of them should be much higher?
What I'd like to avoid is also throwing out the good effects the current system has. Specifically, that the test `test_tasks.py:BaseTaskTests.test_log_destroyed_pending_task()` continues to work.

The discussion here so far has been very long-term, future-proofing. However, this test case is currently broken:
```python
import asyncio
import gc

async def background():
    remote_read, remote_write = await asyncio.open_connection("example.com", 443, ssl=False)
    await remote_read.read()

async def cleanup():
    while True:
        gc.collect()
        await asyncio.sleep(3)

async def main():
    asyncio.create_task(background())
    await cleanup()

asyncio.run(main())
```
Should I open a new issue for that, or do we deal with it here?
@CendioOssman If you have a solution in mind, you can submit a PR and link it to this issue in the PR title.
I have an idea at least. I'll open a PR once I have something that seems to work.
However, after looking more at this, the active task list unfortunately doesn't completely solve this. So I'm back towards preferring some ban on weak references. You can hit this issue even without coroutines:
```python
import asyncio
import functools
import gc
import weakref

class Proto:
    def __init__(self, reader):
        self.reader_wr = weakref.ref(reader)
        asyncio.get_event_loop().call_later(20, self._wake_reader)

    def _wake_reader(self):
        self.reader_wr().wake()

class Reader:
    def __init__(self):
        self.proto = None

    def set_proto(self, proto):
        self.proto = proto

    def wait(self):
        self.waiter = asyncio.get_event_loop().create_future()
        return self.waiter

    def wake(self):
        self.waiter.set_result(None)

class LogFunc:
    def __init__(self):
        self._done = False

    def __del__(self):
        if not self._done:
            context = {
                'message': 'Callback was destroyed but it is pending!',
            }
            asyncio.get_event_loop().call_exception_handler(context)

    def __call__(self, *args, **kwargs):
        self._done = True
        print("Foo", args, kwargs)

def start():
    reader = Reader()
    proto = Proto(reader)
    reader.set_proto(proto)
    fut = reader.wait()
    fut.add_done_callback(functools.partial(LogFunc(), reader, proto))

async def cleanup():
    while True:
        gc.collect()
        await asyncio.sleep(3)

async def main():
    start()
    await cleanup()

asyncio.run(main())
```
The above example constructs the same weak reference structure as the streams API, but there are no tasks involved that could be used to anchor things.
That example just shows what weakrefs do: they don't keep an object alive. That's a feature of weakrefs. If you want the object to stay alive, don't use a weakref. So what's the point of your example?
Indeed. But the example uses the same design as `StreamReader` and `StreamReaderProtocol` currently use (from gh-78819).

So this is the type of code we are trying to make "safe" by providing a strong reference elsewhere. The point of the example was to show that the suggested fix of having a list of tasks will be insufficient for that goal. So something more would be needed.
@CendioOssman I'm afraid I've lost track of what you're arguing. Likely you are in violent agreement with the other participants here. You speak of a ban on weak refs. Do you mean everywhere, or everywhere in asyncio, or in a specific place in asyncio? Or do you mean this as a general recommendation to asyncio users? And you claim that a list of active tasks isn't sufficient. I'd think that would be sufficient for preventing active tasks from being lost -- are there other things in asyncio that aren't kept alive? (Handles or callbacks?)
@CendioOssman Thank you for the example in https://github.com/python/cpython/issues/91887#issuecomment-1805959882, this is the simple non-Tornado example we're looking for. No weakrefs here (unless there are some in asyncio itself). I'm a little confused by your most recent example, but the first one works to illustrate the problem we're discussing in this issue.
For those following along who haven't run the code, the example I linked will log a "Task was destroyed but it is pending" message for the `background()` coroutine. In this particular example the coroutine doesn't actually do anything upon completion, but you could add a `print()` call to the end of the function and see that it never gets called. Instead, a `GeneratorExit` exception is raised by one of the `await`s when GC runs.

So this is an illustration of the guideline in the docs to save a reference to the result of `create_task`. If `create_task` added all pending tasks to a global set, this code would work as expected. Separately, you say that the second example shows that there are unexpected results even without `create_task`, but given the use of `weakref` I'm not sure whether that's a real issue or not. If you want to argue for the relevance of the second example, could you say more about why this particular use of weakref is expected and the GC behavior is at fault?
I agree with @gvanrossum that just because it's possible to construct a scenario where a weak reference's object is destroyed, that doesn't necessarily mean that that's a problem.
As I see it, the entire idea behind garbage collection is that objects that are no longer reachable by user code can be destructed. Since user code is executed by tasks in asyncio, the natural assumption is that strong references are held from the side that executes user code, i.e. by tasks. Apparently the asyncio standard library was also designed with this intuition in mind. Introducing a `_pending_tasks` set would live up to this intuition and fix the issue. Not sure if anyone is working on this at the moment. If not, I can suggest a PR.
Is there still an action item here? If not, I can close it (either as "won't fix" or "fixed" depending on the mood).
Yes, there's still an issue here. There's some discussion of weak references and whether we should care, but that's kind of a distraction and we have examples without weak references and only using core asyncio code.
This minimal example was provided by @CendioOssman in https://github.com/python/cpython/issues/91887#issuecomment-1805959882 and I'll repeat it here:
```python
import asyncio
import gc

async def background():
    remote_read, remote_write = await asyncio.open_connection("example.com", 443, ssl=False)
    await remote_read.read()

async def cleanup():
    while True:
        gc.collect()
        await asyncio.sleep(3)

async def main():
    asyncio.create_task(background())
    await cleanup()

asyncio.run(main())
```
This is a background task which does not follow the docs' guidance to hold a strong reference to the result of create_task
. The task gets garbage collected and its side effects (if any) won't happen.
If this were the only instance, perhaps we could say that asyncio is working as documented and nothing needs to change. But going back to the original message in this issue, asyncio.shield()
and asyncio.gather(..., return_exceptions=False)
create very similar situations, and in those cases there is no natural (non-global) place to store any strong references.
I continue to think that a global `_pending_tasks` set is the best solution here (this is what I implemented in Tornado, where we see a similar problem). An alternative would be to introduce an explicit notion of "background tasks" and only store global references to those, but this would remain error-prone.
I've been meaning to come back to this issue and provide some tangible change suggestions. So I would appreciate it if things could be kept open for a while longer. :)
Let's call the example with classes `Proto`, `Reader` and `LogFunc` a red herring. Yes, it shows that such a structure is possible without tasks, and asyncio's stream reader design has a similar structure, but I believe that if someone is using a stream reader without tasks they have plenty of opportunity to put in hard references, and they should.

So let's focus on the shorter example with coroutines `background()` and `cleanup()`, which demonstrates that there's a more general issue when tasks are not kept alive by the user (regardless of whether they are using stream readers).
Like Ben, I don't immediately see a downside to having a global set of pending tasks either, as long as tasks are guaranteed to remove themselves from it when they complete (and without the need for `__del__`).
That doesn't mean there isn't a downside, but we'll never find out until we try. We'll probably have to ask @1st1 to think about this -- there might be a reason that involves uvloop or Edgestore. Possibly the fact that uvloop has its own task factory makes things more complicated -- we should definitely think about that some more. Perhaps we can use a done callback that unlinks the task when it completes?
@CendioOssman, are you interested in coming up with a PR? Or do you continue to feel that the other example must also be fixed? In that case we may have to agree to disagree, and someone else can create a PR.
Here's a concrete example of a hack to work around this issue: https://github.com/Falmarri/podman-compose/blob/8d8fa54855ce7eb73e802ef06e9f48645a30e2ac/podman_compose.py#L1229-L1237
It's possible there's a better way of structuring this, but IMO not having to care about this should just be the default: I should be able to start these tasks and not have to worry.
I have a feeling we need a new champion to drive a PR here. I think a global or per-loop (non-weak) set of active tasks should solve the issue, but there are a bunch of details that need sorting through -- notably how we ensure that 3rd party tasks (e.g. from uvloop) are properly inserted into and removed from the set, without requiring changes to the 3rd party library.
> I have a feeling we need a new champion to drive a PR here. I think a global or per-loop (non-weak) set of active tasks should solve the issue
cc'ing @kumaraditya303, in relation to gh-104787 and https://github.com/python/cpython/issues/80788#issuecomment-1276443890 - are you still planning to tackle this for 3.13?
@itamaro @kumaraditya303 Unless one of you is actively working on a PR for this, I would recommend we move this item back to TO DO. Thoughts?
> @itamaro @kumaraditya303 Unless one of you is actively working on a PR for this, I would recommend we move this item back to TO DO. Thoughts?
Agreed. I haven't worked on this.
Moving back to To Do for now.
In #88831 @vincentbernat pointed out that CPython only keeps weak references in `_all_tasks`, so a reference to a `Task` returned by `loop.create_task` has to be kept to be sure the task will not be killed with a "Task was destroyed but it is pending!" at some random point in time.

When shielding a task from cancellation with `await shield(something())`, `something` continues to run when the containing coroutine is cancelled. As soon as that happens, `something()` is free-flying, i.e. there's no reference from user code anymore. `shield` itself has a bunch of circular strong references, but these shouldn't keep CPython from garbage-collecting the task. Hence, here the same problem occurs and the task might be killed unpredictably.

Additionally, when running coroutines in parallel with `gather` and `return_exceptions=False`, an exception in one of the coroutines will leave the remaining tasks free-flying. Also in this case, the remaining tasks might be killed unpredictably.

Hence, a warning in the documentation for `create_task` unfortunately does not suffice to solve the problem. Additionally, it has been brought up in #88831 that an API for fire-and-forget tasks (i.e. for when the user doesn't want to keep a reference) would be nice.

As a solution, I suggest to either

(1) introduce a further `_pending_tasks` set to keep strong references to all pending tasks. This would be the simplest solution, also with respect to the API. In fact, a lot of discussions on Stack Overflow (e.g., here, here, here) already rely on this behavior (throwing away the reference returned by `create_task`), although it's currently wrong. Since the behaviour of free-flying tasks is currently unpredictable, making it predictable by preventing them from being garbage-collected should not introduce any compatibility issues.

(2) make sure there's always a chain of strong references from the most basic futures to the running tasks awaiting something. A quick `grep` turned up potential problems, e.g., here and here. This does not seem like a very robust approach, though.

(3) introduce the concept of background tasks, i.e., tasks the user does not want to hold references to. The interface could look like suggested in https://github.com/python/cpython/issues/88831#issuecomment-1105619239 . Tasks from `shield` and `gather` could be automatically converted to such background tasks. Clearly, it would add complexity to the API, but the distinction between normal tasks and background tasks might also be beneficial for other purposes; e.g., one might add an API call that waits for all background tasks to complete.

My preferred solution would be (1).
Linked PRs