Async Call-Stack Reconstruction

f820ac6c-889b-43b7-b077-21b10aeb370a commented 2 years ago

BPO	46892
Nosy	@gvanrossum, @DinoV, @asvetlov, @1st1, @kumaraditya303, @itamaro, @mpage

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['interpreter-core', '3.11', 'type-feature', 'expert-asyncio'] title = 'Async Call-Stack Reconstruction' updated_at = user = 'https://github.com/mpage' ``` bugs.python.org fields: ```python activity = actor = 'asvetlov' assignee = 'none' closed = False closed_date = None closer = None components = ['Interpreter Core', 'asyncio'] creation = creator = 'mpage' dependencies = [] files = [] hgrepos = [] issue_num = 46892 keywords = [] message_count = 5.0 messages = ['414301', '414884', '414930', '414948', '414985'] nosy_count = 7.0 nosy_names = ['gvanrossum', 'dino.viehland', 'asvetlov', 'yselivanov', 'kumaraditya', 'itamaro', 'mpage'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue46892' versions = ['Python 3.11'] ```

Linked PRs

gh-103976

f820ac6c-889b-43b7-b077-21b10aeb370a commented 2 years ago

Profiling tools that use the call-stack (i.e. all of them) paint an incomplete picture of what’s really going on in async-heavy codebases. They can only show the stack of the currently executing task; they miss the chain of awaitables that are transitively waiting on the current task. To remedy this, we have added support in Cinder to expose the async call-stack. This consists of the call stack for the currently executing task, followed by the chain of awaitables that are transitively reachable from the currently executing task. See below for a clarifying example.

async def f1():
  return await f2()

async def f2():
  return await asyncio.ensure_future(f3())

async def f3():
  return await f4()

async def f4():
  await asyncio.sleep(10)
  return 42

When retrieved from f4, the two different stacks (top-of-stack last) are: sync - [f3, f4] async - [f1, f2, f3, f4]

We’d like to merge our implementation into CPython so that other heavy users of asyncio can benefit. This will consist of a few parts:

A new limited C-API to set and retrieve the “awaiter” of an awaitable object.
Additions to PyAsyncMethods to store function pointers for setting and retrieving the awaiter on instances.
An API in managed code to retrieve the async call stack as a list of fully qualified names (i.e. \<module>:\<class>.\<function>).

asvetlov commented 2 years ago

We’d like to merge our implementation into CPython

Could you provide a link first, please?

f820ac6c-889b-43b7-b077-21b10aeb370a commented 2 years ago

Sorry for the confusion, I'm working on a PR. I filed the BPO to gauge interest in the feature.

gvanrossum commented 2 years ago

I've recently dabbled a bit in some new primitives for asyncio, and based on that experience I think this would be very useful.

IIRC Trio does this (presumably at considerable cost) in userland.

asvetlov commented 2 years ago

The idea looks interesting. The devil in the details I guess. I'm curious what is the memory and performance penalty. Waiting for the PR as the discussion starting point.

itamaro commented 2 years ago

somewhat related discussion (where this feature might have been helpful) - https://discuss.python.org/t/can-i-get-a-second-opinion-on-an-asyncio-issue/18471

this is the cinder 3.10 implementation of the cr_awaiter field

gvanrossum commented 2 years ago

If someone wants to move this forward let them propose on a design here, and (if they're confident enough) submit a PR.

gvanrossum commented 2 years ago

@mpage Are you still interested in working on this? I am still interested in having this as a feature in CPython!

mpage commented 1 year ago

@gvanrossum Yes, still interested! Just haven't found the time yet to start working on it.

gvanrossum commented 1 year ago

You might be interested in the existence of Task.get_stack(). Apparently it was part of the original asyncio code (I'd forgotten about it). I'm not sure if it addresses this problem in general (it's part of asyncio) or if it is fast enough or if it even works.

mpage commented 1 year ago

Task.get_stack() will only give you a call stack for that task. The interesting/useful part of this proposal is that it allows you to construct an "await stack" across Task boundaries. This produces a clearer/more complete picture of the control flow in many cases. Maybe there's a way to stitch the results of Task.get_stack() for multiple tasks that I'm not aware of.

For example, the code from the initial comment is codified in https://gist.github.com/mpage/584a02fc986d32b11b290c7032700369. Unfortunately you need Cinder in order to run it. When using Task.get_stack() we see the synchronous call stack for the task running the f3 and f4 coroutines. When using the new approach, we see the full "await stack," f1, f2, f3:

> ./python ~/local/scratch/trace_async_stacks.py
Stack retrieved from Task.get_stack():
  <module>
  run
  run_until_complete
  run_forever
  _run_once
  _run
  f3
Stack retrieved using awaiters:
  f1
  f2
  f3
>

gvanrossum commented 1 year ago

Oh, you're right. Task.get_stack() is disappointing -- it knows a bit about how to get to the the stack frame of the coroutine running in the task, and then it just walks the call stack following f_back, which is uninteresting. I'm sorry I even mentioned it!

Looking through the Cinder code it seems that this requires a fair amount of C code (which is maybe why you haven't submitted your PR yet?). Is that fundamental or an optimization? How would pure Python code in 3.11 go about finding the awaiter of a suspended coroutine? Is there just no way? What if we know it's an asyncio task?

gvanrossum commented 1 year ago

I just created a little proof-of-concept that gets the await chain for a task, unless that task is the current task. To get around that, you can create a dummy task and call it from there. Here's the basic code (sans dummy task hack):

def get_stack(task):
    coro = task.get_coro()
    frames = []
    while coro is not None and hasattr(coro, "cr_frame"):
        frame = coro.cr_frame
        frames.append(frame)
        coro = coro.cr_await
    return frames

It probably needs improvements (I see in Tasks.get_stack() that there are complications for legacy coroutines and async generators) but this seems to get the job done for suspended tasks.

Maybe that's enough if you need this in a debugging context? Or is this something where performance is important? Or am I missing something? (Maybe you need this when an uncaught exception is raised?)

gvanrossum commented 1 year ago

Talking to myself here, my above code doesn't appear to be able to cross task boundaries -- I can get it to produce [f3, f4] for your example program (by calling it from a dummy helper task), but it's missing [f1, f2]. It looks like the iterator used by the Future object (either the Python or the C version) is impenetrable. Thoughts? (Gist here.)

gvanrossum commented 1 year ago

I have a better version now. It relies on asyncio.Task, in particular the (public) asyncio.all_tasks() function and the (private) Task._fut_waiter attribute. Output for a program similar to yours:

Task stack:
  Task: Task-1
    Coro: main
  Task: F1
    Coro: f1
    Coro: f2
  Task: F3
    Coro: f3

This leaves a lot to be desired if you're in a coroutine world other than asyncio, and it's pretty inefficient, since it must traverse the _fut_waiter links backwards, and the only way I can think of doing that (without changing asyncio itself) is to create a complete reverse map by walking all tasks. If you occasionally need this for serious debugging it's probably fine; if you routinely log it at various points "just in case" this sounds expensive. (If we decide to make this an official API in 3.12 we can do something about this of course.)

My question for you at this point is, do you really need to do this purely at the coroutine level, or is it acceptable for your use case(s) if this is essentially a Task API?

gvanrossum commented 1 year ago

I just realized that f4 and f5 were missing from the output. A little tweak to the logic handles that -- the coro stack should be collected from the helper task as well. I've updated the gist, and the output is now:

Task: Task-1
  Coro: main stack.py:108
Task: F1
  Coro: f1 stack.py:82
  Coro: f2 stack.py:85
Task: F3
  Coro: f3 stack.py:88
Task: F4
  Coro: f4 stack.py:91
  Coro: f5 stack.py:94
  Coro: task_stack stack.py:69

mpage commented 1 year ago

Our primary use case for this feature is an always-on sampling profiler (roughly equivalent to GWP at Google). Periodically, for each machine in the fleet the profiler will wake up, grab the await stack for the currently running task, and finally send the await stack to a service that performs aggregation. The primary use of this data is to provide CPU-time profiling for application.

Performance is important for us. Since this is always on and pauses request processing we don't want the implementation to negatively impact performance of in-flight requests. We typically have a large number of active tasks (I'm somewhat embarrassed to admit that I don't have a number handy here), so I'm not sure the approach you've taken here would work for us. The logic for collecting the await stack is currently implemented as an eBPF probe and the awaiter implementation in Cinder simplifies that. The probe "only" has to walk the await stack backwards from the coroutine for the current task.

The approach you've taken is neat but unfortunately I don't think it can cross gather boundaries. I modified your example slightly to print the await stack of two gathered coroutines.

Your version:

> python3 ~/local/scratch/print_await_stack_upstream.py
Task: F4_0
  Coro: f4 ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:95
  Coro: f5 ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:98
  Coro: task_stack ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:69
Task: F4_1
  Coro: f4 ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:95
  Coro: f5 ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:98
  Coro: task_stack ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:69
>

Cinder version:

> ./python ~/local/scratch/print_await_stack_cinder.py
Stack retrieved using awaiters:
  f1
  f2
  f3
  f4
  f5
Stack retrieved using awaiters:
  f1
  f2
  f3
  f4
  f5
>

gvanrossum commented 1 year ago

Yeah, I can see that a sampling profiler would need some support from the C-level data structures, both for speed and to make the eBPF code for collecting the await stack from simpler (or possible).

The problem with GatheringFuture seems to be shared with anything that doesn't wait for a task -- using a TaskGroup has the same problem. I'm sure I could work around it (though I'd probably have to modify a few spots in asyncio) but I agree that as a general approach this isn't going to work.

So I'm looking forward to the PR you promised!

mpage commented 1 year ago

So I'm looking forward to the PR you promised!

Hoping to have some time in the next couple of weeks to put an initial PR up :)

bigfootjon commented 1 year ago

(Maybe you need this when an uncaught exception is raised?)

I know the Cinder folks haven't flagged this, but having an easier way to walk "async frames" when handling an exception would make some things easier in Django's @sensitive_variables decorator.

The decorator works by attaching a bunch of locals to a wrapper function:

https://github.com/django/django/blob/aff649a3bd0c4fd8f7d859464f0eb907e877443f/django/views/decorators/debug.py#L36-L44

Then looking for that wrapper function in f_back in the traceback:

https://github.com/django/django/blob/85b52d22fd2841c34e95b3a80d6f2b668ce2f160/django/views/debug.py#L244-L256

If we can find the wrapper function then the error reporter hides the listed sensitive variables when generating the debug 500 page. (I apologize that requires a bit of reading and isn't in Gist form, but I thought it might be useful to see the utility for this in a practical example)

This doesn't work for async functions because f_back is None for a coroutine frame (sorry if I'm mangling the terminology here, I'm at the edges of my knowledge of how all this works), and I'm not sure its possible to "reconstruct" the frame (though this is probably a lack of knowledge on my part).

I've tried reading https://docs.python.org/3/library/inspect.html but the docs are a little sparse on how these pieces fit together and again I don't really know what I'm talking about / what most of the discussion in this issue is about.

Anyway, I'm not entirely sure if adding cr_awaiter fixes the limitation for Django, so I wanted to flag this additional consideration (if this moves forward).

I can file a separate Issue / post on a discussion forum if that would be more appropriate for this

carljm commented 1 year ago

@njsmith I'm curious if you have any feedback on this proposed feature from the perspective of Trio. Guido mentioned above that maybe Trio goes to some lengths to make complete logical async stacks available?

See also the PR and @markshannon 's comment there proposing that instead of building this feature into coroutines, instead the async frameworks should all agree on how Tasks-or-equivalent can link to their awaiting Task-or-equivalent, allowing stitching their stacks together. (I'm not sure how feasible it is for different frameworks to represent this in the same way so that profilers don't have to treat them all as special cases, or how feasible it is for an out-of-process profiler to find the Task objects and do this stitching regardless.)

njsmith commented 1 year ago

If I'm understanding correctly, this is solving a problem that Trio never has in the first place, because we don't have a concept of "awaitable objects".

We do struggle with call stack introspection in a few ways. I think the main ones are:

Reconstructing the stack of a sleeping task is problematic. I don't remember the exact details but something like -- for a regular coroutine you can reconstruct the stack by starting at the root coroutine in the task and following cr_await links, but for some objects like the pseudo-coroutines created by calling async_generator.__next__ they don't expose a cr_await field? maybe?
- More minor but while I'm here: even for regular coroutines the way the stack information is encoded while sleeping is rather weird, where accessing the field involves poking around in the bytecode stack -- see _PyGen_yf. This is fine from Python, because you just type .cr_await and it handles the tricky bits for you, but it's awkward for out-of-process profilers like py-spy.
The big one: in order to reconstruct a complete stack trace, that crosses tasks, we can go from a task -> the nursery ("task group") that spawned it. But then to keep walking the stack up into the parent task, we need to figure out which frame in the parent task logically opened the nursery/task group. This is quite non-trivial, because:
- It's often not the leaf frame in that task (after opening the nursery/task group, the parent task may have done await some_other_function())
- While there is always a unique frame in the parent task that contains the async with that created the nursery/task group, it might not have created it directly -- it might be async with some_wrapper(): ..., where some_wrapper is an @asynccontextmanager that internally opens the nursery/task-group. Or it might be async with AsyncExitStack(): .... So we can't just have the nursery/task-group construct capture the parent frame when it's created -- the actual frame we care about might be several levels up. But capturing an entire stack trace every time we open a nursery/task-group is way too expensive, esp. in terms of all the objects it keeps pinned in memory...

Currently we work around these issues with deeply messy hacks involving calling gc.get_referents() on frame objects; see https://github.com/oremanj/stackscope for details. Also CC'ing @oremanj as he'll certainly remember the details here better than I do :-)

For the problem of attributing nursery/task-groups to parent task frames, I think there actually is a neat solution we can implement. You may recall that we want to add some feature to prevent yielding out of a nursery/task-group (cf https://discuss.python.org/t/preventing-yield-inside-certain-context-managers/1091); no-one had time to get it done for 3.11 but I'm actually hoping to get the PEP posted within the next day or two. One thing I realized while writing it, though, is that for this feature we need to track which frame is logically responsible for opening a nursery/task-group, and I think we have a simple algorithm for that. And once we have the information it ought to be cheap and easy to use it for call stack reconstruction too.

carljm commented 1 year ago

@njsmith I think that this feature would solve the problem you describe as "the big one"? So it seems like it does solve a problem that Trio has? (EDIT: on second thought I'm not sure it does; need more study. Thanks for the summary of Trio's issues in this area.)

njsmith commented 1 year ago

I don't think it does, but I could be wrong -- here's a concrete example to make sure we're on the same page :-)

async def parent1():
    await parent2()

async def parent2():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(child1)
        await parent3()

async def parent3():
    await trio.sleep(10)

async def child1():
    await child2()

async def child2():
    await trio.sleep(10)

From this, we want to reconstruct the child's call stack as: child2 -> child1 -> parent2 -> parent1, and the parent's call stack as parent3 -> parent2 -> parent1. So notice that the child and parent only share a suffix of the parent's stack: parent3 is not present in the child's stack, because it was called after the child "forked off" from parent2. Also, the parent is not waiting on the child.

(And to clarify a few edge cases: (1) it doesn't matter where nursery.start_soon is called; just where the async with trio.open_nursery() happens. (2) instead of async with trio.open_nursery():, people often write async with wrapper_around_open_nursery():, and we want to handle that case too.)

gvanrossum commented 1 year ago

Per https://github.com/python/cpython/pull/103976#issuecomment-1539162460, can this wait for 3.13?

oremanj commented 1 year ago

My understanding is that this feature would support tracing async stacks "inside-out": starting with a particular task or coroutine, trace outward to find who's waiting on it. There is no way to do that currently, except maybe via gc.get_referrers() which is incredibly slow. I agree that it would potentially be useful, especially for out-of-process sampling profilers.

The semantics are decidedly funky, though. If you do

async def outer():
    t = asyncio.create_task(inner())
    await some_other_operation()
    await t

then the stack of inner() effectively gets grafted onto the stack of outer() only after some_other_operation() completes. That's not very stack-like. I'm also not sure how things like asyncio.gather() and asyncio.TaskGroup would interact with this feature.

Trio doesn't allow a task to directly wait on another task (except indirectly via synchronization primitives like Event). Every task is managed by a nursery (basically asyncio.TaskGroup) and that nursery is bound to the stack of some parent task using an async with block. Given the cr_awaiter feature as proposed, Trio would probably want to set a new task's root coroutine's cr_awaiter to the coroutine object that did the corresponding async with trio.open_nursery():, to get a stack that looks like the one @njsmith described. That's actually pretty expensive to determine currently: a frame object doesn't know the coroutine that it belongs to, so you'd have to start at the root coroutine for the task and trace forward along cr_await links. Even worse, these cr_await links are not set while the task is running; you'd have to wait for it to be suspended and then fix up the cr_awaiter of the new task. If it were possible to go directly from a running frame to its owning coroutine object, then the cr_awaiter interface would be more feasible (though still challenging due to the issue of nurseries hidden inside other async context manager). asyncio.TaskGroup is going to have an analogous set of problems.

Not everything in an async call stack is a coroutine object. Some of the challenging ones I ran into:

coroutine_wrapper objects which tend to show up when using awaitable objects (this is what gets returned by coro.__await__()). The coroutine they wrap is not directly exposed, but is needed to continue tracing inward. Solution: gc.get_referents()
async_generator_asend and async_generator_athrow objects are the awaitable objects returned by async generator asend() and athrow(). In order to continue tracing inward you need to know the async generator they wrap. This is also not exposed and stackscope also uses gc.get_referents() for it.

There are also regular generators (for yield from based coroutines) and async generators. Those do expose enough introspection attributes, but they're differently named, so you can't just blindly trace via cr_await.

The logic in stackscope for locating the context managers on a certain frame's stack (which is a prerequisite for drawing the nursery/task tree accurately) is by far the most convoluted/brittle part. There are two implementations, one that uses gc.get_referents() and one that uses ctypes-based frame object introspection + bytecode introspection. The gc.get_referents approach gets limited information, can't inspect running frames (only suspended ones), and can be fooled by code that holds references to __exit__ methods. The ctypes-based approach works well enough but needs sometimes-major changes with each new Python minor release.

I would recommend to anyone working on async call stack introspection to review stackscope's implementation; it's a pretty good synthesis of all the dark corners I've found in ~3 years of poking at this. There are a lot of dark corners. It would be great if we could make CPython changes that result in fewer dark corners. I don't know how much cr_awaiter is going to be able to do in getting us there, though.

jbower-fb commented 1 year ago

Unfortunately @mpage has not had time to look at this recently, so I'm having a try at addressing the comments on his original PR https://github.com/python/cpython/pull/103976, and this issue.

To start with I've made a completely new implementation here: https://github.com/jbower-fb/cpython-jbowerfb/commit/391837909e388aa8fccc5a8ce9867eb52732771b. This isn't PR ready yet - it's only got a Python implementation, is lacking proper tests, and almost certainly needs some public iteration. So, I wanted to get it up early for comment.

My version follows @markshannon and @njsmith's suggestions on https://github.com/python/cpython/pull/103976, and rather than linking coroutines instead links Tasks. In fact I've gone further and introduced a new abstract base class AsyncGraphAwaitable which captures general await-dependencies between awaitables. Both Future and TaskGroup implement this and in theory other things could too. AsyncGraphAwaitable contains a set of dependent awaiters which are updated via an add_awaiters() method. The add_awaiters() method should be called by any awaitable that awaits on another awaitable. In practice this mostly means Task , _GatheringFuture, and TaskGroup. Anything third-party will need to be updated to conform to this but the worst that happens if they don't is they won't get included in a graph capture - not ideal but doesn't break anything existing.

In addition to new await dependency data, I’ve added a function get_async_graph() to build logical call-graphs covering everything from the currently executing function to the entry-point of the program via awaitables. This works by walking frames and the links between awaitables, starting from the currently executing frame and Task. This produces a graph of AsyncGraphNode-derived objects which are inserted by an abstract method AsyncGraphAwaitable.makeAsyncGraphNodes().

Finally there is a straight-forward async_graph_to_dot() function which can turn a graph of AsyncGraphNodes into a GraphiViz Dot string for visualization. Example output is in asyncgraph_examples/. A PR-ready version probably also needs something which will turn this into a text-based graph - something like we get from git log --graph.

Most of the implementation is now in the asyncio library rather than the runtime. However, I have made some small tweaks there too which are hopefully fairly general and zero overhead. Firstly, I’ve altered _PyGen_yf() so it works for active coroutines (I think it already did this before YIELD_FROM was broken up). I’ve also taken on board @oremanj’s comments about cr_await and added a new yield_from() method to all generators in genobject.c, enabling fully walking the coroutine stack from Tasks (again, modulo third-party generators).

Note the problems @njsmith mentioned with awaits in a TaskGroup remain, but I've added hopefully clearer messaging in this case. A logical call-graph with an unawaited TaskGroup will end up with a node linking the dangling tasks to the parent task along with text showing a trace for the await which occurred within the open group. See the taskgroup_with_await.py example.

Looking forward to feedback.

python / cpython

Async Call-Stack Reconstruction #91048

Linked PRs