Make it easier to traverse the frame stack for third party tools.

markshannon commented 1 year ago

Profilers and debuggers need to traverse the frame stack, but the layout of the stack is an internal implementation detail. However can make some limited promises to make porting tools between Python versions a bit easier.

In order to traverse the stack, the offset of the previous pointer needs to be known. To understand the frame, more information is needed.

@pablogsal @Yhg1s expressed interest in this.

Linked PRs

gh-100988
gh-102640
gh-103010
gh-105727

markshannon commented 1 year ago

Initially, I propose to refactor the PyInterpreterFrame struct such that it starts:

typedef struct _PyInterpreterFrame {
    PyCodeObject *f_code;
    struct _PyInterpreterFrame *previous;
    ...

Currently f_code must be a code object, but we could generalize it to allow other objects. For example, the shim frame inserted on entry to _PyEval_EvalFrameDefault could have that field set to None indicating it should be skipped in tracebacks, etc.

The order of f_code and previous doesn't really matter, but have f_code first makes https://github.com/python/cpython/issues/100719 a bit simpler

pablogsal commented 1 year ago

Let me collect some feedback from maintainers of debuggers and profilers and will comment here the requirements so we can think of solutions.

markshannon commented 1 year ago

@pablogsal Any feedback?

markshannon commented 1 year ago

We can further improve traversal of the _PyInterpreterFrame for debugging and introspection by allowing C extensions to create frames without the rigmarole of creating a code object.

We should rename the f_code field to f_executable, and allow any object.

typedef struct _PyVMFrame {
    PyObject *f_executable;
    struct _PyVMFrame *previous;
} PyVMFrame;

Although tools and the VM should tolerate any object, we should in practice only allow a few classes:

CodeObject: Implies that the PyVMFrame is a full _PyInterpreterFrame. Only the VM should make this kind of frame
Builtin function, method descriptor, slot wrapper, etc. The frame represents a call to the given object.
None: Internal shim. Tools should skip this frame.
Tuple: First three items should be name, filename, flags where flags determine the meaning of additional entries.

The tuple form is for tools like Cython, Nanobind, etc. Creating a tuple of strs and ints is much simpler and faster than creating a fake code object.

C extension can link themselves into the frame stack at the cost of about 4 memory writes, and 3 reads:

    PyVMFrame frame;
    frame.previous = tstate->current_frame.frame;
    frame.f_executable = &EXECUTABLE_OBJECT;
    tstate->current_frame.frame = &frame;
    /* body of function goes here */
    tstate->current_frame.frame = frame.previous;

We can do this for builtins functions by modifying the vectorcall function assigned to the builtin function/method descriptor. We would need to benchmark this to see the performance impact, but it will be much cheaper than sys.activate_stack_trampoline()

pablogsal commented 1 year ago

@pablogsal Any feedback?

I have reached out again to tool authors, give me a couple of days to gather comments. Apologies for the delay

markshannon commented 1 year ago

No problem.

itamarst commented 1 year ago

@benfred -^

markshannon commented 1 year ago

I've made a branch that adds "lightweight" frames (just a pointer to a "code" object and a link pointer), and inserts one for each call to a builtin function in the interpreter. The performance impact is negligible and all builtin function and class calls are present in the frame stack.

Branch: https://github.com/python/cpython/compare/main...faster-cpython:cpython:allow-non-python-frames?expand=1

Performance: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20230315-3.12.0a6%2B-3e2c3ab/bm-20230315-linux-x86_64-faster%252dcpython-allow_non_python_fra-3.12.0a6%2B-3e2c3ab-vs-base.png

pablogsal commented 1 year ago

I've made a branch that adds "lightweight" frames (just a pointer to a "code" object and a link pointer), and inserts one for each call to a builtin function in the interpreter. The performance impact is negligible and all builtin function and class calls are present in the frame stack.

Branch: https://github.com/python/cpython/compare/main...faster-cpython:cpython:allow-non-python-frames?expand=1

Performance: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20230315-3.12.0a6%2B-3e2c3ab/bm-20230315-linux-x86_64-faster%252dcpython-allow_non_python_fra-3.12.0a6%2B-3e2c3ab-vs-base.png

We still need the concept of entry frames for tools that merge native and python stacks. Why do you removed _PyFrame_IsEntryFrame in your branch?

markshannon commented 1 year ago

It's a proof of concept, it was easier to remove _PyFrame_IsEntryFrame than re-implement it. _PyFrame_IsEntryFrame can be added back, if necessary.

pablogsal commented 1 year ago

It's a proof of concept, it was easier to remove _PyFrame_IsEntryFrame than re-implement it. _PyFrame_IsEntryFrame can be added back, if necessary.

👍

carljm commented 1 year ago

We are also very interested in this proposal from the Cinder JIT perspective.

One difference I see with our use case compared to what the draft PR so far aims to support is that we would like to be able to link in "minimal frames" that are still considered "complete": they are fetched by _PyFrame_GetFirstComplete() and can be materialized into a full PyFrameObject. (In the draft PR here, only _PyInterpreterFrame frames are considered "complete".) We don't want to constantly keep a _PyInterpreterFrame (localsplus contents) up to date while the JIT is running (this is expensive), so we'd need to get some kind of callback to reify our minimal frame into a PyFrameObject on-demand (i.e. some hook into _PyFrame_GetFrameObject).

In the tuple form of f_executable, could there be a well-known bit-flag in the third element that tells the interpreter "this should be considered a 'complete' frame, and the next word in its struct is a function pointer that will take the VMFrame struct and return a PyFrameObject."?

pablogsal commented 1 year ago

We should rename the f_code field to f_executable, and allow any object.

In general the feedback is that this will make introspection tool much harder to implement. Currently the fact that this can only be a code object makes it very easy to traverse the frame stack. If you allow any Python object it makes it harder or in some cases even impossible.

If we restrict this to a finite set of possibilities, it still makes it much harder but if we add some kind of enumeration to the frame that tells the tool what's going to be there it makes it a bit easier.

In general I don't think that this proposal helps introspection tools, it actually makes the implementation harder and less efficient because more pointers need to be copied and more logic needs to be included.

pablogsal commented 1 year ago

Some comments from authors:

I feel it won't be too easy to decipher the type of the object remotely. This would likely increase the number of private structures that we need to copy over from Python headers to parse this information (e.g. tuples), making things more complex. Of course one could just try treating the object as a PyCodeObject and check for failures, but this would now imply a potential loss of captured information, unless all the other object types that can appear here are also handled. Perhaps an extra int field that specifies the type of the object being passed with f_executable might help in this direction, to some extent. But perhaps one simplification that depends on a positive answer to the following question could be adopted: is the value f_executable crucial for the actual execution, or is it just added to carry the frame's metadata (e.g. filename, function name, line number, ...)? If that is added just for the metadata, perhaps that could be added directly to the _PyVMFrame structure in the form of extra fields? There could be a core set of fields that are common to all object types (filename, function qualname, location data), plus a generic PyObject reference that can be consumed easily by in-process tools. However, I can see the downside being that the cost would probably end up being slightly more than just 4 W and 3 R operations in general.

pablogsal commented 1 year ago

In general the sentiment is that the more regular the structure is, the easier is for profilers and debuggers to traverse the stack. The more variations and python-isms (as in, using PyObject* instead of C structs) the harder it makes for these tools to properly traverse the stack, which goes against (partially) what we are trying to do here

markshannon commented 1 year ago

Feedback from who? Which operations for which tools become harder? It is hard to take vague and anonymous feedback seriously.

No one is forcing tools to handle all possible frames. They can skip frames that have "executable" objects other than code objects. The presence of additional information that tools ignore cannot be worse than that information not being present in the first place. Add not all tools will ignore it; the PR already give better tracebacks in the faulthandler module.

In general the sentiment is that the more regular the structure is, the easier is for profilers and debuggers to traverse the stack. The more variations and python-isms (as in, using PyObject* instead of C structs) the harder it makes for these tools to properly traverse the stack, which goes against (partially) what we are trying to do here

Two fields, one pointing to the next frame, and one pointing to the executable object, seems quite regular to me. Traversal of the stack is trivial. Just follow the previous pointer.

markshannon commented 1 year ago

It might be informative to compare this with adding perf support:

Adding perf frames causes a slowdown of 8%. The PR above has negligible performance impact.
This works on Windows and any machine that does not have perf installed.
It works with PEP 523 or PEP 669.

pablogsal commented 1 year ago

Feedback from who?

Authors of profilers and debuggers. For now authors of Austin, py-spy, scygraph and fil, and myself (memray/pystack). I collected feedback from them but if you prefer that they comment here directly individually I can also ask for that.

Which operations for which tools become harder?

Getting the Python stack from a remote process reading memory.

It might be informative to compare this with adding perf support:

Informative how? perf support is optional, it doesn't affect profiling or debugging tools other than allowing perf to work and it doesn't collide with this work. I am failing to see the argument here. This issue is called "Make it easier to traverse the frame stack for third party tools" and we are literally discussing changes that will achieve the opposite for some tools, not sure how the perf support is involved.

markshannon commented 1 year ago

if you prefer that they comment here directly individually I can also ask for that.

Yes, please.

perf support is optional

Is it really? In that case let's drop it now before it causes trouble for 3.13. We aren't going to support perf in any future JIT compiler.

markshannon commented 1 year ago

Let's make it clear. The choice isn't between the proposed ABI and the status quo. The choice is between a well defined, if minimal, ABI and no ABI guarantees whatsoever. The _PyInterpreterFrame struct is internal and will change.

For example, we need to insert frames for shims at the exit from __init__ functions in order to inline them. We might want to do the same for calls to __setitem__ and __setattr__ as well. Create code objects for these shims is a waste of effort. We might want to replace calls to tuple with a surrogate that constructs the tuple in bytecode. We will want to have the tuple object as the "executable", so that the frame stack looks the same.

And don't forget the producers of frames, as well as the consumers. JIT compilers may want to insert frames. Cinder does. I suspect we won't, but we might. Cython and pybind11 may want to insert frames, if not all the time, at least if an error occurs. Creating fake code objects is unnecessary overhead.

If the ABI I'm suggesting is not a good one, then you need to suggest a better one.

I am failing to see the argument here (Contrasting with perf support)

The purpose of adding perf support is that tools can see the mixed C/Python stack. perf support does that by faking Python frames on the C stack. I propose adding frames for C callables (and other things) to the Python stack. Adding it to the Python stack means that it is easily accessible to in-process tools, and with very low overhead. Adding to the C stack requires support for the native debugging format and ABI, and has a large cost.

markshannon commented 1 year ago

The question (as far as I am concerned) is this: What restrictions should be placed on the "executable" object?

Should those be structural, i.e. the start of all "executable" object must have the same layout, or should they be nominal, only a fixed set of classes would be allowed as the type of the "executable"?

The tighter the restrictions, the easier it is for introspection tools. The looser the restrictions, the easier it is for producers of frame.

pablogsal commented 1 year ago

We aren't going to support perf in any future JIT compiler.

I am sorry Mark, but the support is already there and I am failing to see who and how has made the decision that we won't be supporting perf in 3.13.

Is it really? In that case let's drop it now before it causes trouble for 3.13.

I am quite surprised about this request. To be clear, you are asking to drop a feature we have already worked on for 3.12, that already appears in the what's new and that it has already been released in several alphas and tested by people? Sorry Mark, but I think your request is unrealistic.

I have to say that I am a bit sad that you think is ok to dismiss the work of others this way.

pablogsal commented 1 year ago

The purpose of adding perf support is that tools can see the mixed C/Python stack.

No, the purpose is that users can use perf with both C and Python frames.

markshannon commented 1 year ago

Isn't that the same thing?

pablogsal commented 1 year ago

Isn't that the same thing?

No. One thing is allowing users to use the perf tool, which is a VERY SPECIFIC profiler and another thing (what you said) is to allow TOOLS (generic word here, so assuming you refer to OTHER tools) to see the mixed C/Python stack

pablogsal commented 1 year ago

Also, to clarify what I meant with optional: this means that users get to decide whether to pay the cost on process startup, and unless specifically requested via the APIs we added, users won't pay the cost. In short: is optional for users, not for us.

markshannon commented 1 year ago

So it doesn't mean optional for us and we are stuck with it? In that case it needs a PEP.

Supporting perf in a JIT compiler is going to be lot of extra work, with a benefit for a few large corps that have the infrastructure to support perf profiling in production, and a cost (worse performance and/or reliability) for many.

pablogsal commented 1 year ago

So it doesn't mean optional for us and we are stuck with it? In that case it needs a PEP.

The feature has landed already in 3.12 and is already released in alpha. I respect your position but I disagree with it. I suggest that if you want to discuss this, we can do it in a more real-time channel other than a GitHub issue.

pablogsal commented 1 year ago

tighter the restrictions, the easier it is for introspection tools. The looser the restrictions, the easier it is for producers of frame.

Refocusing the discussion on the original issue at hand. I think that if we add some kind of metadata to the frame that tells the tool what kind of frame is this (so basically what's going to be in the "f_executable" field, that's already a win. As you mention, tools may want to skip some of these frames if it cannot be handled.

On the other hand if we support simple structs or simple Python objects (as opposed to custom classes or even dictionaries) in f_executable that's also a win.

Additionally, I would like if we keep the current structure as preserved as possible (with this I mean that most frames will have a f_executable that points to a well-defined code object.

Also, I think having these fields as you propose:

typedef struct _PyInterpreterFrame {
    PyCodeObject *f_code;
    struct _PyInterpreterFrame *previous;
    ...

is a big win as tools don't need to update these definitions every single time.

markshannon commented 1 year ago

The "f_executable" field is its own metadata, as Python objects are self-describing. Additional data adds bulk to the frame, and slows down frame creation.

We can restrict the number of classes that are officially supported. If "f_executable" object is one of those, then tools can use that information. It is something else, they can just ignore it.

In the minimal case of just supporting code objects:

while (frame);
   if (frame->f_executable->ob_type == &PyCode_Type) {
       do_stuff_with_frame(frame);
   }
   frame = frame->previous;
}

As for what should be supported:

Code objects
Classes
Builtin functions
Method descriptors
(name, filename, lineno) tuples.
(Maybe other C callables, like slot wrappers)*

I'd like to get rid of slot wrappers and other oddities and merge them into builtin functions, but that's another issue.

The VM might create frames for other objects, but tools should ignore them.

pablogsal commented 1 year ago

The "f_executable" field is its own metadata, as Python objects are self-describing. Additional data adds bulk to the frame, and slows down frame creation.

We can restrict the number of classes that are officially supported. If "f_executable" object is one of those, then tools can use that information. It is something else, they can just ignore it.

In the minimal case of just supporting code objects:

I understand, but this makes life for inspection tools harder because it forces to copy much more stuff (the class and the name at least) instead of inspecting an enum. I would like to have what is in f_executable in the frame object. I understand that you don't but I want to state that I do :)

As for what should be supported:

Code objects

Classes

Builtin functions

Method descriptors

(name, filename, lineno) tuples.

(Maybe other C callables, like slot wrappers)*

I see what you are coming from, but allowing all these things is going to make implementing these tools a nightmare because supporting all these possibilities is a lot. I would like to restrict this list to just simple stuff like tuples, code objects and maybe some c-like struct that can be used for more exotic stuff. This is basically what you said here:

The tighter the restrictions, the easier it is for introspection tools. The looser the restrictions, the easier it is for producers of frame.

I am advocating for much tighter restrictions, but I understand that's not the direction that you want to go and I respect that.

carljm commented 1 year ago

We could use a tagged pointer in f_executable to provide an easy-to-read "flag" indicating the frame type without adding any additional bulk to frames, and not much extra cost to frame creation. This is what Cinder shadowframes does today: https://github.com/facebookincubator/cinder/blob/cinder/3.10/Include/internal/pycore_shadow_frame_struct.h#L42-L60

In the minimal form, we can leave the low bit 0 to indicate "normal frame, pointer to code object" (then there is also zero overhead in normal interpreter frame creation) and set it to 1 to mean "pointer to something new and different." Then existing tools that want to just handle normal code objects like they already do only need a single bit test to discard frames they don't want to deal with.

There are another two bits we could play with if we want to provide streamlined indication of any other common cases (builtin function, tuple form, maybe?).

I hope we can make life easier for existing inspection tools by making it really easy to detect the common cases they want to care about, but I also hope (from the Cinder JIT perspective) that at least one of the valid options for f_executable is "extensible". E.g. if(name, filename, lineo) tuple is allowed, that it's also valid to have a longer tuple carrying additional payload, with the first three elements interpreted as name, filename, lineo.

carljm commented 1 year ago

Classes

I'm curious, why would we want frames to hold a pointer to a class (I assume while executing the class body?) rather than to the code object of the class body?

markshannon commented 1 year ago

I'm curious, why would we want frames to hold a pointer to a class

class C: pass

c = C()  # This is a call to a class

markshannon commented 1 year ago

@pablogsal Is checking for five or six distinct values, rather than two or three that big a deal? Also, why is comparing to an address a problem, whereas comparing to an int is not? frame->f_kind == CODE+OBJECT_KIND is a 32 bit comparison. frame->f_executable == &PyCode_Type is a 64 bit comparison. Fetching the address of PyCode_Type will need the symbol table, but you'll need that anyway.

markshannon commented 1 year ago

@carlmjm I don't see the value in tagging bits. The tag you propose holds no additional information, as the same value can be got with the simple comparison f_executable == &PyCodeObject

pablogsal commented 1 year ago

Fetching the address of PyCode_Type will need the symbol table, but you'll need that anyway.

Not necessarily. These tools need to work sometimes with stripped binaries or core files and requiring the symbol table there can be a huge pain compared with just checking against an integer, as we are vendoring the headers anyway. In particular, as an example (please don't focus too much on this) in core files is a huge pain because the address may not be in the core if is in the .rodata segment.

Currently, once you get the frame, you access the f_code pointer and you KNOW is a code object so once you have the layout for it (because we vendor the headers) you know how to extract the function name and the filename.

If we have an integer in the frame that tells what f_executable will have, then we can compare against it directly and know what we are going to find. No extra information or copies are required.

If we need to compare with something now we require:

Copy the pointers/structures of all the possible types that can be (that is PyCode_Type and the same for tuples, functions...). This is already problematic because they may not be in the core OR we may not have symbols so we may be unable to locate them even in a live process.
Once you find the address of PyCode_Type and friends, you need to relocate to find the real address. Quite simple to do, but is more operations.
If we allow random classes then the tools are unable to even compare against pointers because we don't even know where they are
But an enum describing what it contains allows us to KNOW what the pointer will contain and for instance be super sure that the pointer is some custom stuff that we won't be able to understand instead of having to "guess" based on "oh, this pointer is not one of the ones we know about (code, tuples...))

markshannon commented 1 year ago

I appreciate that having extra information will make life easier for a few tool authors, but it might make things a little bit slower for very many Python users.

How do you get the frame without any symbols?

If we allow random classes...

Any class outside of the fixed set (whatever that ends up being) should be ignored, so no "random" classes.

pablogsal commented 1 year ago

How do you get the frame without any symbols?

Find the interpreter state and having the headers vendor so we know the offsets to the pointers in every struct and we know what we are going to find because at the moment is fully determined. The interpreter state can be found because we (cpython) place the runtime structure in its own section so it can be found without symbols:

https://github.com/python/cpython/blob/7703def37e4fa7d25c3d23756de8f527daa4e165/Python/pylifecycle.c#L100

Although this is technically not needed because it can be found by finding the cycle interpreter state <-> thread state by scanning the bss which is what py-spy does.

carljm commented 1 year ago

I don't see the value in tagging bits. The tag you propose holds no additional information, as the same value can be got with the simple comparison f_executable == &PyCodeObject

Sure, if you have &PyCodeObject available.

Tagging bits could "make life easier for a few tool authors" in the scenarios @pablogsal is mentioning without "making things slower for very many Python users."

EDIT: also, it's not f_executable == &PyCodeObject, it's f_executable->ob_type == &PyCodeObject, so it's adding an extra pointer chase for every frame also.

markshannon commented 1 year ago

Tagging might solve the performance issue. But we need to support 32bit machines, so we only have 2 bits to play with, which is not enough.

markshannon commented 1 year ago

If tools can find the runtime, then we can put an array of pointers there. No runtime overhead at the cost of ~40 bytes.

PyObject *callable_types[] = {
   &PyCode_Type,
   ...
};

pablogsal commented 1 year ago

If tools can find the runtime, then we can put an array of pointers there.

That would be an acceptable compromise I think.

markshannon commented 1 year ago

OK, let's go with that then.

FTR, one other reason not to use an enum is this: what happens when the enumeration and the executable don't match? We can be fairly sure it won't happen in our code, but it would be an easy mistake to make in third-party code.

By allowing objects of any class, but designating a small set of "approved" classes, the system is much more robust.

markshannon commented 1 year ago

@pablogsal Where should the array go, exactly?

markshannon commented 1 year ago

@carljm

I hope we can make life easier for existing inspection tools by making it really easy to detect the common cases they want to care about, but I also hope (from the Cinder JIT perspective) that at least one of the valid options for f_executable is "extensible". E.g. if(name, filename, lineo) tuple is allowed, that it's also valid to have a longer tuple carrying additional payload, with the first three elements interpreted as name, filename, lineo.

I don't see a problem with that. Tools should check the length of the tuple before extracting the contents, for safety. We could allow any length array, specifying only that the first three elements, if they exist, should be the name, filename and line number. ("foo",) and ("foo", "foo.py", 121, "special-data-34.8") should both be acceptable.

@pablogsal Would this be OK, or is this too complex for your tastes?

P403n1x87 commented 1 year ago

Some comments from authors:

I feel it won't be too easy to decipher the type of the object remotely. This would likely increase the number of private structures that we need to copy over from Python headers to parse this information (e.g. tuples), making things more complex. Of course one could just try treating the object as a PyCodeObject and check for failures, but this would now imply a potential loss of captured information, unless all the other object types that can appear here are also handled. Perhaps an extra int field that specifies the type of the object being passed with f_executable might help in this direction, to some extent. But perhaps one simplification that depends on a positive answer to the following question could be adopted: is the value f_executable crucial for the actual execution, or is it just added to carry the frame's metadata (e.g. filename, function name, line number, ...)? If that is added just for the metadata, perhaps that could be added directly to the _PyVMFrame structure in the form of extra fields? There could be a core set of fields that are common to all object types (filename, function qualname, location data), plus a generic PyObject reference that can be consumed easily by in-process tools. However, I can see the downside being that the cost would probably end up being slightly more than just 4 W and 3 R operations in general.

This would be me, maintainer of Austin. For context, Austin uses system calls like process_vm_readv to read memory out of process.

P403n1x87 commented 1 year ago

How do you get the frame without any symbols?

Austin uses symbols to locate _PyRuntime, but if those are not available, there is a fallback on BSS scan to locate something that looks like _PyRuntime or an interpreter state. So symbols are not strictly required (but good to have of course!).

P403n1x87 commented 1 year ago

Apologies if I slightly derail the conversation, but I wanted to express the following thought. Based on my experience with Austin, I would regard frame stack unwinding as just one aspect of the more general topic of observability into the Python VM. For example, one other thing that Austin tries to do is to sample the GC state to give an idea of how much CPU time is being spent on GC. Or detect who is holding the GIL to give a better estimate of RSS allocations. Therefore, I would tend to view frame stacks as just a part of what can be observed out of process. So the way I see a tool like Austin extracting this information in the future is by looking into an "observability entry point", much like _PyRuntime, but specifically engineered for out-of-process tools. From there one can rely on an ever growing (in an ideally backwards-compatible fashion) list of things one can observe, e.g.

.section _PyRuntimeStateABI

runtime_state {
  interpreter_state {
    thread_count: int,
    threads: [{
      top_frame: { ... },
      ....,
    ]
  },
  gc_state: ...,
  gil_state: ...,
}

python / cpython

Make it easier to traverse the frame stack for third party tools. #100987

Linked PRs