markshannon opened this issue 1 year ago
Initially, I propose to refactor the `_PyInterpreterFrame` struct so that it starts:

```c
typedef struct _PyInterpreterFrame {
    PyCodeObject *f_code;
    struct _PyInterpreterFrame *previous;
    ...
```
Currently `f_code` must be a code object, but we could generalize it to allow other objects. For example, the shim frame inserted on entry to `_PyEval_EvalFrameDefault` could have that field set to `None`, indicating that it should be skipped in tracebacks, etc.

The order of `f_code` and `previous` doesn't really matter, but having `f_code` first makes https://github.com/python/cpython/issues/100719 a bit simpler.
Let me collect some feedback from maintainers of debuggers and profilers, and I will comment here with the requirements so we can think of solutions.
@pablogsal Any feedback?
We can further improve traversal of the `_PyInterpreterFrame` stack for debugging and introspection by allowing C extensions to create frames without the rigmarole of creating a code object.

We should rename the `f_code` field to `f_executable`, and allow any object.

```c
typedef struct _PyVMFrame {
    PyObject *f_executable;
    struct _PyVMFrame *previous;
} PyVMFrame;
```
Although tools and the VM should tolerate any object, we should in practice only allow a few classes:

- A code object: the `PyVMFrame` is a full `_PyInterpreterFrame`. Only the VM should make this kind of frame.
- `None`: an internal shim. Tools should skip this frame.
- A tuple of (`name`, `filename`, `flags`), where `flags` determines the meaning of additional entries.

The tuple form is for tools like Cython, Nanobind, etc. Creating a tuple of `str`s and `int`s is much simpler and faster than creating a fake code object.
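A minimal sketch of how an in-process tool might dispatch on these cases, assuming the `PyVMFrame` struct above; `classify_frame` is an illustrative name, not part of the proposal:

```c
#include <Python.h>

/* Hypothetical helper: decide what kind of frame this is by inspecting
 * the proposed f_executable field. */
static void
classify_frame(PyVMFrame *frame)
{
    PyObject *exe = frame->f_executable;
    if (exe == Py_None) {
        /* Internal shim: skip it in tracebacks. */
    }
    else if (PyCode_Check(exe)) {
        /* Full interpreter frame: name/filename come from the code object. */
    }
    else if (PyTuple_Check(exe)) {
        /* Lightweight frame from a C extension: (name, filename, flags, ...). */
    }
    else {
        /* Anything else: tolerate but ignore, per the proposal. */
    }
}
```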
C extensions can link themselves into the frame stack at the cost of about 4 memory writes and 3 reads:

```c
PyVMFrame frame;
frame.previous = tstate->current_frame.frame;   /* save the current top of stack */
frame.f_executable = &EXECUTABLE_OBJECT;        /* whatever describes this call */
tstate->current_frame.frame = &frame;           /* push */
/* body of function goes here */
tstate->current_frame.frame = frame.previous;   /* pop */
```
We can do this for builtin functions by modifying the vectorcall function assigned to the builtin function/method descriptor (a sketch of such a wrapper follows below).
We would need to benchmark this to see the performance impact, but it will be much cheaper than `sys.activate_stack_trampoline()`.
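A hedged sketch of what such a wrapper could look like. `PyVMFrame`, the accessors for the per-thread frame pointer, and `real_impl` follow the proposal in this thread; none of them are existing CPython API:

```c
#include <Python.h>

typedef struct PyVMFrame {
    PyObject *f_executable;
    struct PyVMFrame *previous;
} PyVMFrame;

/* Assumed accessors for the proposed per-thread frame pointer
 * (spelled tstate->current_frame.frame in the snippet above). */
extern PyVMFrame *get_top_vm_frame(PyThreadState *tstate);
extern void set_top_vm_frame(PyThreadState *tstate, PyVMFrame *frame);

/* Wrap a builtin's original vectorcall (real_impl) so the call shows
 * up on the Python frame stack. */
static PyObject *
vectorcall_with_frame(PyObject *callable, vectorcallfunc real_impl,
                      PyObject *const *args, size_t nargsf,
                      PyObject *kwnames)
{
    PyThreadState *tstate = PyThreadState_Get();
    PyVMFrame frame;
    frame.f_executable = callable;              /* the builtin itself */
    frame.previous = get_top_vm_frame(tstate);  /* link into the stack */
    set_top_vm_frame(tstate, &frame);           /* push */
    PyObject *res = real_impl(callable, args, nargsf, kwnames);
    set_top_vm_frame(tstate, frame.previous);   /* pop, even on error */
    return res;
}
```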
@pablogsal Any feedback?
I have reached out again to tool authors; give me a couple of days to gather comments. Apologies for the delay.
No problem.
@benfred -^
I've made a branch that adds "lightweight" frames (just a pointer to a "code" object and a link pointer), and inserts one for each call to a builtin function in the interpreter. The performance impact is negligible and all builtin function and class calls are present in the frame stack.
We still need the concept of entry frames for tools that merge native and Python stacks. Why did you remove `_PyFrame_IsEntryFrame` in your branch?
It's a proof of concept; it was easier to remove `_PyFrame_IsEntryFrame` than re-implement it. `_PyFrame_IsEntryFrame` can be added back, if necessary.
👍
We are also very interested in this proposal from the Cinder JIT perspective.
One difference I see with our use case compared to what the draft PR so far aims to support is that we would like to be able to link in "minimal frames" that are still considered "complete": they are fetched by `_PyFrame_GetFirstComplete()` and can be materialized into a full `PyFrameObject`. (In the draft PR here, only `_PyInterpreterFrame` frames are considered "complete".) We don't want to constantly keep a `_PyInterpreterFrame` (`localsplus` contents) up to date while the JIT is running (this is expensive), so we'd need to get some kind of callback to reify our minimal frame into a `PyFrameObject` on-demand (i.e. some hook into `_PyFrame_GetFrameObject`).

In the tuple form of `f_executable`, could there be a well-known bit-flag in the third element that tells the interpreter "this should be considered a 'complete' frame, and the next word in its struct is a function pointer that will take the VMFrame struct and return a `PyFrameObject`"?
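A rough sketch of the layout this suggests, assuming the proposed tuple form and a hypothetical `COMPLETE` flag bit (all names here are illustrative):

```c
#include <Python.h>

/* Hypothetical flag bit in the third tuple element marking the frame
 * as "complete" in the _PyFrame_GetFirstComplete() sense. */
#define VM_FRAME_FLAG_COMPLETE 0x1

typedef struct MinimalFrame {
    PyObject *f_executable;         /* tuple with VM_FRAME_FLAG_COMPLETE set */
    struct MinimalFrame *previous;
    /* "the next word in its struct": callback to materialize a full
     * PyFrameObject on demand, e.g. from _PyFrame_GetFrameObject. */
    PyFrameObject *(*reify)(struct MinimalFrame *self);
} MinimalFrame;
```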
> We should rename the `f_code` field to `f_executable`, and allow any object.
In general, the feedback is that this will make introspection tools much harder to implement. Currently, the fact that this can only be a code object makes it very easy to traverse the frame stack. If you allow any Python object, it becomes harder or in some cases even impossible.

If we restrict this to a finite set of possibilities, it is still much harder, but if we add some kind of enumeration to the frame that tells the tool what's going to be there, it becomes a bit easier.

In general, I don't think that this proposal helps introspection tools; it actually makes the implementation harder and less efficient, because more pointers need to be copied and more logic needs to be included.
Some comments from authors:

> I feel it won't be too easy to decipher the type of the object remotely. This would likely increase the number of private structures that we need to copy over from Python headers to parse this information (e.g. tuples), making things more complex. Of course one could just try treating the object as a PyCodeObject and check for failures, but this would now imply a potential loss of captured information, unless all the other object types that can appear here are also handled. Perhaps an extra int field that specifies the type of the object being passed with f_executable might help in this direction, to some extent. But perhaps one simplification that depends on a positive answer to the following question could be adopted: is the value f_executable crucial for the actual execution, or is it just added to carry the frame's metadata (e.g. filename, function name, line number, ...)? If that is added just for the metadata, perhaps that could be added directly to the _PyVMFrame structure in the form of extra fields? There could be a core set of fields that are common to all object types (filename, function qualname, location data), plus a generic PyObject reference that can be consumed easily by in-process tools. However, I can see the downside being that the cost would probably end up being slightly more than just 4 W and 3 R operations in general.
In general, the sentiment is that the more regular the structure is, the easier it is for profilers and debuggers to traverse the stack. The more variations and Python-isms (as in, using `PyObject*` instead of C structs), the harder it is for these tools to properly traverse the stack, which goes against (partially) what we are trying to do here.
Feedback from whom? Which operations for which tools become harder? It is hard to take vague and anonymous feedback seriously.

No one is forcing tools to handle all possible frames. They can skip frames that have "executable" objects other than code objects. The presence of additional information that tools ignore cannot be worse than that information not being present in the first place. And not all tools will ignore it; the PR already gives better tracebacks in the faulthandler module.
> In general, the sentiment is that the more regular the structure is, the easier it is for profilers and debuggers to traverse the stack. The more variations and Python-isms (as in, using `PyObject*` instead of C structs), the harder it is for these tools to properly traverse the stack, which goes against (partially) what we are trying to do here.
Two fields, one pointing to the next frame and one pointing to the executable object, seems quite regular to me.
Traversal of the stack is trivial: just follow the `previous` pointer.
It might be informative to compare this with adding `perf` support, which requires `perf` installed.

> Feedback from whom?
Authors of profilers and debuggers. For now, the authors of Austin, py-spy, scygraph and fil, and myself (memray/pystack). I collected feedback from them, but if you prefer that they comment here directly individually, I can also ask for that.
> Which operations for which tools become harder?

Getting the Python stack from a remote process by reading memory.
> It might be informative to compare this with adding perf support:

Informative how? `perf` support is optional; it doesn't affect profiling or debugging tools other than allowing `perf` to work, and it doesn't collide with this work. I am failing to see the argument here. This issue is called "Make it easier to traverse the frame stack for third party tools" and we are literally discussing changes that will achieve the opposite for some tools; I'm not sure how the perf support is involved.
> if you prefer that they comment here directly individually I can also ask for that.
Yes, please.
> `perf` support is optional

Is it really? In that case let's drop it now before it causes trouble for 3.13.
We aren't going to support `perf` in any future JIT compiler.
To be clear: the choice isn't between the proposed ABI and the status quo. The choice is between a well-defined, if minimal, ABI and no ABI guarantees whatsoever.
The `_PyInterpreterFrame` struct is internal and will change.
For example, we need to insert frames for shims at the exit from `__init__` functions in order to inline them. We might want to do the same for calls to `__setitem__` and `__setattr__` as well. Creating code objects for these shims is a waste of effort.
We might want to replace calls to `tuple` with a surrogate that constructs the tuple in bytecode. We will want to have the `tuple` object as the "executable", so that the frame stack looks the same.
And don't forget the producers of frames, as well as the consumers. JIT compilers may want to insert frames. Cinder does. I suspect we won't, but we might. Cython and pybind11 may want to insert frames, if not all the time, at least if an error occurs. Creating fake code objects is unnecessary overhead.
If the ABI I'm suggesting is not a good one, then you need to suggest a better one.
> I am failing to see the argument here (contrasting with perf support)
The purpose of adding perf support is that tools can see the mixed C/Python stack.
`perf` support does that by faking Python frames on the C stack. I propose adding frames for C callables (and other things) to the Python stack.
Adding them to the Python stack means that they are easily accessible to in-process tools, and with very low overhead.
Adding them to the C stack requires support for the native debugging format and ABI, and has a large cost.
The question (as far as I am concerned) is this: what restrictions should be placed on the "executable" object?
Should those restrictions be structural, i.e. the start of all "executable" objects must have the same layout, or nominal, i.e. only a fixed set of classes would be allowed as the type of the "executable"?
The tighter the restrictions, the easier it is for introspection tools. The looser the restrictions, the easier it is for producers of frames.
> We aren't going to support perf in any future JIT compiler.

I am sorry, Mark, but the support is already there, and I fail to see who has decided, and how, that we won't be supporting perf in 3.13.
> Is it really? In that case let's drop it now before it causes trouble for 3.13.

I am quite surprised by this request. To be clear: you are asking to drop a feature we have already worked on for 3.12, that already appears in the what's new, and that has already been released in several alphas and tested by people? Sorry, Mark, but I think your request is unrealistic.

I have to say that I am a bit sad that you think it is OK to dismiss the work of others this way.
> The purpose of adding perf support is that tools can see the mixed C/Python stack.

No, the purpose is that users can use `perf` with both C and Python frames.
Isn't that the same thing?
> Isn't that the same thing?

No. One thing is allowing users to use the `perf` tool, which is a VERY SPECIFIC profiler; another thing (what you said) is to allow TOOLS (generic word here, so assuming you refer to OTHER tools) to see the mixed C/Python stack.
Also, to clarify what I meant by optional: users get to decide whether to pay the cost on process startup, and unless it is specifically requested via the APIs we added, users won't pay the cost. In short: it is optional for users, not for us.
So it doesn't mean optional for us and we are stuck with it? In that case it needs a PEP.
Supporting `perf` in a JIT compiler is going to be a lot of extra work, with a benefit for a few large corporations that have the infrastructure to support perf profiling in production, and a cost (worse performance and/or reliability) for many.
> So it doesn't mean optional for us and we are stuck with it? In that case it needs a PEP.

The feature has already landed in 3.12 and has been released in alphas. I respect your position but I disagree with it. I suggest that if you want to discuss this, we do it in a more real-time channel than a GitHub issue.
> The tighter the restrictions, the easier it is for introspection tools. The looser the restrictions, the easier it is for producers of frames.

Refocusing the discussion on the original issue at hand: I think that if we add some kind of metadata to the frame that tells the tool what kind of frame this is (so, basically, what's going to be in the `f_executable` field), that's already a win. As you mention, tools may want to skip frames they cannot handle.

On the other hand, if we support simple structs or simple Python objects (as opposed to custom classes or even dictionaries) in `f_executable`, that's also a win.

Additionally, I would like us to keep the current structure as preserved as possible (by this I mean that most frames will have an `f_executable` that points to a well-defined code object).
Also, I think having these fields as you propose:

```c
typedef struct _PyInterpreterFrame {
    PyCodeObject *f_code;
    struct _PyInterpreterFrame *previous;
    ...
```

is a big win, as tools don't need to update these definitions every single time.
The "f_executable" field is its own metadata, as Python objects are self-describing. Additional data adds bulk to the frame, and slows down frame creation.
We can restrict the number of classes that are officially supported. If "f_executable" object is one of those, then tools can use that information. It is something else, they can just ignore it.
In the minimal case of just supporting code objects:
```c
while (frame) {
    if (frame->f_executable->ob_type == &PyCode_Type) {
        do_stuff_with_frame(frame);
    }
    frame = frame->previous;
}
```
As for what should be supported:

- Code objects
- Classes
- Builtin functions
- Method descriptors
- `(name, filename, lineno)` tuples
- (Maybe other C callables, like slot wrappers)*

*I'd like to get rid of slot wrappers and other oddities and merge them into builtin functions, but that's another issue.
The VM might create frames for other objects, but tools should ignore them.
The "f_executable" field is its own metadata, as Python objects are self-describing. Additional data adds bulk to the frame, and slows down frame creation.
We can restrict the number of classes that are officially supported. If "f_executable" object is one of those, then tools can use that information. It is something else, they can just ignore it.
In the minimal case of just supporting code objects:
I understand, but this makes life harder for inspection tools, because it forces them to copy much more stuff (the class and the name, at least) instead of inspecting an enum. I would like to have what is in `f_executable` indicated in the frame itself. I understand that you don't, but I want to state that I do :)
> As for what should be supported:
>
> - Code objects
> - Classes
> - Builtin functions
> - Method descriptors
> - `(name, filename, lineno)` tuples
> - (Maybe other C callables, like slot wrappers)*
I see where you are coming from, but allowing all of these things is going to make implementing these tools a nightmare, because supporting all these possibilities is a lot. I would like to restrict this list to just simple stuff like tuples, code objects, and maybe some C-like struct that can be used for more exotic cases. This is basically what you said here:
> The tighter the restrictions, the easier it is for introspection tools. The looser the restrictions, the easier it is for producers of frames.
I am advocating for much tighter restrictions, but I understand that's not the direction that you want to go and I respect that.
We could use a tagged pointer in `f_executable` to provide an easy-to-read "flag" indicating the frame type without adding any additional bulk to frames, and not much extra cost to frame creation. This is what Cinder shadow frames do today: https://github.com/facebookincubator/cinder/blob/cinder/3.10/Include/internal/pycore_shadow_frame_struct.h#L42-L60

In the minimal form, we can leave the low bit 0 to indicate "normal frame, pointer to code object" (then there is also zero overhead in normal interpreter frame creation) and set it to 1 to mean "pointer to something new and different." Then existing tools that only want to handle normal code objects, as they already do, need just a single bit test to discard frames they don't want to deal with.

There are another two bits we could play with if we want to provide a streamlined indication of other common cases (builtin function, tuple form, maybe?).
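A small sketch of the bit test this implies; only the low-bit convention comes from the comment above, and the mask values are illustrative:

```c
#include <stdint.h>

#define VM_FRAME_TAG_BITS ((uintptr_t)0x3)  /* two low bits available */

/* Low bit clear => normal frame, f_executable is a plain code object. */
static int
is_plain_code_frame(const void *f_executable)
{
    return ((uintptr_t)f_executable & 0x1) == 0;
}

/* Strip the tag before dereferencing the pointer. */
static void *
untagged_executable(void *f_executable)
{
    return (void *)((uintptr_t)f_executable & ~VM_FRAME_TAG_BITS);
}
```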
I hope we can make life easier for existing inspection tools by making it really easy to detect the common cases they want to care about, but I also hope (from the Cinder JIT perspective) that at least one of the valid options for `f_executable` is "extensible". E.g. if a `(name, filename, lineno)` tuple is allowed, it should also be valid to have a longer tuple carrying additional payload, with the first three elements interpreted as `name, filename, lineno`.
> Classes

I'm curious: why would we want frames to hold a pointer to a class (I assume while executing the class body?) rather than to the code object of the class body?
> I'm curious, why would we want frames to hold a pointer to a class

```python
class C: pass
c = C()  # This is a call to a class
```
@pablogsal

Is checking for five or six distinct values, rather than two or three, that big a deal?

Also, why is comparing to an address a problem, whereas comparing to an int is not? `frame->f_kind == CODE_OBJECT_KIND` is a 32-bit comparison; `frame->f_executable == &PyCode_Type` is a 64-bit comparison. Fetching the address of `PyCode_Type` will need the symbol table, but you'll need that anyway.
@carljm

I don't see the value in tagging bits. The tag you propose holds no additional information, as the same value can be got with the simple comparison `f_executable == &PyCodeObject`.
> Fetching the address of `PyCode_Type` will need the symbol table, but you'll need that anyway.
Not necessarily. These tools sometimes need to work with stripped binaries or core files, and requiring the symbol table there can be a huge pain compared with just checking against an integer, as we are vendoring the headers anyway. In particular, as an example (please don't focus too much on this), in core files it is a huge pain because the address may not be in the core if it is in the `.rodata` segment.
Currently, once you get the frame, you access the `f_code` pointer and you KNOW it is a code object, so once you have the layout for it (because we vendor the headers) you know how to extract the function name and the filename.

If we have an integer in the frame that tells us what `f_executable` will hold, then we can compare against it directly and know what we are going to find. No extra information or copies are required.
If we need to compare with something, we now require more information. But an enum describing what the frame contains allows us to KNOW what the pointer will hold, and, for instance, to be sure that the pointer is some custom thing we won't be able to understand, instead of having to "guess" based on "oh, this pointer is not one of the ones we know about (code, tuples, ...)".
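For illustration, the enum-tagged layout being advocated here might look like this (hypothetical names, not the proposed struct):

```c
#include <Python.h>
#include <stdint.h>

typedef enum {
    VM_FRAME_CODE,   /* f_executable is a code object */
    VM_FRAME_SHIM,   /* internal shim, skip */
    VM_FRAME_TUPLE,  /* (name, filename, ...) tuple */
    VM_FRAME_OTHER,  /* custom stuff a tool cannot interpret */
} VMFrameKind;

typedef struct TaggedVMFrame {
    uint32_t f_kind;                 /* one of VMFrameKind: readable remotely
                                        with a single fixed-offset read */
    PyObject *f_executable;
    struct TaggedVMFrame *previous;
} TaggedVMFrame;
```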
I appreciate that having extra information will make life easier for a few tool authors, but it might make things a little bit slower for very many Python users.
How do you get the frame without any symbols?
> If we allow random classes...
Any class outside of the fixed set (whatever that ends up being) should be ignored, so no "random" classes.
> How do you get the frame without any symbols?

Find the interpreter state; having the headers vendored, we know the offsets of the pointers in every struct, and we know what we are going to find because at the moment it is fully determined. The interpreter state can be found because we (CPython) place the runtime structure in its own section, so it can be found without symbols. Although this is technically not needed, because it can also be found by locating the cycle interpreter state <-> thread state via scanning the `bss`, which is what `py-spy` does.
> I don't see the value in tagging bits. The tag you propose holds no additional information, as the same value can be got with the simple comparison `f_executable == &PyCodeObject`

Sure, if you have `&PyCodeObject` available.

Tagging bits could "make life easier for a few tool authors" in the scenarios @pablogsal is mentioning without "making things slower for very many Python users."
EDIT: also, it's not `f_executable == &PyCodeObject`, it's `f_executable->ob_type == &PyCodeObject`, so it's adding an extra pointer chase for every frame as well.
Tagging might solve the performance issue, but we need to support 32-bit machines, so we only have 2 bits to play with, which is not enough.
If tools can find the runtime, then we can put an array of pointers there. No runtime overhead, at the cost of ~40 bytes.

```c
PyObject *callable_types[] = {
    (PyObject *)&PyCode_Type,
    ...
};
```
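A sketch of the tool-side check this enables, assuming the tool has already located the runtime and read the array's contents out of the target process (the helper below is hypothetical):

```c
#include <stddef.h>

/* Compare a frame's (remotely read) type pointer against the pointers
 * copied from the runtime's callable_types array; no symbol table is
 * needed once the runtime has been located. */
static int
is_known_executable_type(const void *remote_type_ptr,
                         const void *known_types[], size_t n_known)
{
    for (size_t i = 0; i < n_known; i++) {
        if (remote_type_ptr == known_types[i]) {
            return 1;   /* one of the officially supported classes */
        }
    }
    return 0;           /* unknown class: skip this frame */
}
```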
> If tools can find the runtime, then we can put an array of pointers there.
That would be an acceptable compromise, I think.
OK, let's go with that then.
FTR, one other reason not to use an enum is this: what happens when the enumeration and the executable don't match? We can be fairly sure it won't happen in our code, but it would be an easy mistake to make in third-party code.
By allowing objects of any class, but designating a small set of "approved" classes, the system is much more robust.
@pablogsal Where should the array go, exactly?
@carljm

> I hope we can make life easier for existing inspection tools by making it really easy to detect the common cases they want to care about, but I also hope (from the Cinder JIT perspective) that at least one of the valid options for f_executable is "extensible". E.g. if a `(name, filename, lineno)` tuple is allowed, it should also be valid to have a longer tuple carrying additional payload, with the first three elements interpreted as `name, filename, lineno`.
I don't see a problem with that.
Tools should check the length of the tuple before extracting the contents, for safety (a sketch of such a check follows below).
We could allow a tuple of any length, specifying only that the first three elements, if they exist, should be the name, filename and line number.
`("foo",)` and `("foo", "foo.py", 121, "special-data-34.8")` should both be acceptable.

@pablogsal Would this be OK, or is this too complex for your tastes?
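A minimal sketch of the defensive, length-checked extraction this implies, using the real tuple API in-process; `read_tuple_executable` is an illustrative name:

```c
#include <Python.h>

/* Unpack the tuple form of f_executable, checking the length before
 * touching each element, as suggested above. */
static void
read_tuple_executable(PyObject *exe)
{
    Py_ssize_t n = PyTuple_GET_SIZE(exe);
    PyObject *name     = n > 0 ? PyTuple_GET_ITEM(exe, 0) : NULL;
    PyObject *filename = n > 1 ? PyTuple_GET_ITEM(exe, 1) : NULL;
    PyObject *lineno   = n > 2 ? PyTuple_GET_ITEM(exe, 2) : NULL;
    /* Elements beyond the first three are producer-specific payload. */
    (void)name; (void)filename; (void)lineno;
}
```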
> Some comments from authors:
>
> I feel it won't be too easy to decipher the type of the object remotely. This would likely increase the number of private structures that we need to copy over from Python headers to parse this information (e.g. tuples), making things more complex. Of course one could just try treating the object as a PyCodeObject and check for failures, but this would now imply a potential loss of captured information, unless all the other object types that can appear here are also handled. Perhaps an extra int field that specifies the type of the object being passed with f_executable might help in this direction, to some extent. But perhaps one simplification that depends on a positive answer to the following question could be adopted: is the value f_executable crucial for the actual execution, or is it just added to carry the frame's metadata (e.g. filename, function name, line number, ...)? If that is added just for the metadata, perhaps that could be added directly to the _PyVMFrame structure in the form of extra fields? There could be a core set of fields that are common to all object types (filename, function qualname, location data), plus a generic PyObject reference that can be consumed easily by in-process tools. However, I can see the downside being that the cost would probably end up being slightly more than just 4 W and 3 R operations in general.
This would be me, the maintainer of Austin. For context, Austin uses system calls like `process_vm_readv` to read memory out of process.
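For context, a minimal sketch of such an out-of-process read on Linux; the remote address and the struct offsets must still come from vendored headers, which is the crux of this discussion:

```c
#define _GNU_SOURCE
#include <sys/uio.h>
#include <sys/types.h>

/* Copy len bytes from remote_addr in process pid into local_buf.
 * Returns 0 on success, -1 on a short or failed read. */
static int
read_remote(pid_t pid, void *remote_addr, void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    return process_vm_readv(pid, &local, 1, &remote, 1, 0) == (ssize_t)len
        ? 0 : -1;
}
```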
> How do you get the frame without any symbols?

Austin uses symbols to locate `_PyRuntime`, but if those are not available, there is a fallback BSS scan to locate something that looks like `_PyRuntime` or an interpreter state. So symbols are not strictly required (but good to have, of course!).
Apologies if I slightly derail the conversation, but I wanted to express the following thought. Based on my experience with Austin, I would regard frame stack unwinding as just one aspect of the more general topic of observability into the Python VM. For example, one other thing that Austin tries to do is to sample the GC state to give an idea of how much CPU time is being spent on GC, or to detect who is holding the GIL to give a better estimate of RSS allocations. Therefore, I would tend to view frame stacks as just a part of what can be observed out of process. So the way I see a tool like Austin extracting this information in the future is by looking into an "observability entry point", much like `_PyRuntime`, but specifically engineered for out-of-process tools. From there one could rely on an ever-growing (in an ideally backwards-compatible fashion) list of things one can observe, e.g.
```
.section _PyRuntimeStateABI
runtime_state {
    interpreter_state {
        thread_count: int,
        threads: [{
            top_frame: { ... },
            ...,
        }]
    },
    gc_state: ...,
    gil_state: ...,
}
```
Profilers and debuggers need to traverse the frame stack, but the layout of the stack is an internal implementation detail. However, we can make some limited promises to make porting tools between Python versions a bit easier.

In order to traverse the stack, the offset of the `previous` pointer needs to be known. To understand the frame, more information is needed.

@pablogsal @Yhg1s expressed interest in this.