python / cpython

The Python programming language
https://www.python.org

Add a new interface for external profilers and debuggers to get the call stack efficiently #115946

Open pablogsal opened 7 months ago

pablogsal commented 7 months ago

Currently, external profilers and debuggers need to issue one system call per frame object when they are retrieving the call stack from a remote process, because the frames form a linked list. Furthermore, for every frame object they need at least two or three more system calls to retrieve the code object and the code object's name and file. This has several disadvantages:

For these reasons, I propose to add a new interface consisting of a contiguous array of pointers that always contains all the code objects necessary to retrieve the call stack. This has the following advantages:

Linked PRs

gvanrossum commented 7 months ago

This certainly requires the attention of @markshannon and @Fidget-Spinner.

The continuous cost is just one pointer write per function call, which is negligible.

Looking at the code in the PR I see one pointer write, one counter increment, and several reads. We probably want a benchmark run.

pablogsal commented 7 months ago

Looking at the code in the PR I see one pointer write, one counter increment, and several reads. We probably want a benchmark run.

Absolutely! Also, the PR doesn't work in its current state because I am missing a lot of the work in the new interpreter structure, so I may need some help here to know where to put the tracking code. EDIT: I have also updated the description to reflect that it's not just one write.
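
For readers who haven't opened the PR, here is a purely illustrative sketch of the kind of bookkeeping being discussed (the names and layout are made up for this comment, not taken from the PR): on every call the interpreter writes one code-object pointer into a per-thread contiguous array and bumps a counter, and on every return it decrements the counter.

#include <stddef.h>

/* Stand-in for PyCodeObject; this sketch only tracks pointers to it. */
typedef struct PyCodeObjectStub PyCodeObjectStub;

/* Hypothetical per-thread structure mirroring the frame stack. */
typedef struct {
    PyCodeObjectStub **codes;   /* contiguous array, innermost entry last */
    size_t depth;
    size_t capacity;
} code_stack;

static inline void code_stack_push(code_stack *s, PyCodeObjectStub *code) {
    if (s->depth < s->capacity)      /* growth and error handling elided */
        s->codes[s->depth] = code;   /* one pointer write per call... */
    s->depth++;                      /* ...plus one counter increment */
}

static inline void code_stack_pop(code_stack *s) {
    if (s->depth > 0)
        s->depth--;                  /* one decrement per return */
}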

pablogsal commented 7 months ago

@gvanrossum Do you think it's worth discussing this in one Faster CPython sync?

gvanrossum commented 7 months ago

Perhaps. Especially if you're not 100% satisfied with my answers. (Wait a day for @markshannon to pitch in as well.)

I don't know if you still have the meeting in your calendar -- we haven't seen you in a while, but of course you're always welcome! We now meet every other week. The next one is Wed March 13. The Teams meeting is now owned by Michael Droettboom.

pablogsal commented 7 months ago

Perhaps. Especially if you're not 100% satisfied with my answers. (Wait a day for @markshannon to pitch in as well.)

I am satisfied (and thanks a lot for taking the time to read my comments and to answer, I really appreciate it). It's more that there are some moving pieces and some compromises to be made, and I think a real-time discussion can make this easier.

I don't know if you still have the meeting in your calendar -- we haven't seen you in a while, but of course you're always welcome! We now meet every other week. The next one is Wed March 13. The Teams meeting is now owned by Michael Droettboom.

I do have it still! Should I reach out to Michael (after leaving time for @markshannon to answer first) to schedule the topic?

gvanrossum commented 7 months ago

Agreed.

I do have it still! Should I reach out to Michael (after leaving time for @markshannon to answer first) to schedule the topic?

Wouldn't hurt to get this on the agenda early.

pablogsal commented 7 months ago

Wouldn't hurt to get this on the agenda early.

👍 I sent @mdboom an email

markshannon commented 7 months ago

While this might make life easier for profilers and out-of-process debuggers (what's wrong with in-process debuggers?), it will make all Python programs slower. To me, this cost considerably outweighs the benefits.

pablogsal commented 7 months ago

While this might make life easier for profilers and out-of-process debuggers (what's wrong with in-process debuggers?), it will make all Python programs slower. To me, this cost considerably outweighs the benefits.

Think of this not just as a way to make profilers' lives easier, but also as a way to keep them from breaking once we start inlining things or the JIT stops pushing frames. If we don't do this, what do you propose we should do to not break these tools once the JIT is there or we start inlining stuff and frames start to be missing?

(what's wrong with in-process debuggers?)

That you cannot use them if Python is frozen or deadlocked, and you need them if you want to attach to an already-running process.

gvanrossum commented 7 months ago

I'd like to remind all participants in this conversation to remain open to the issues others are trying to deal with. We will eventually need some kind of compromise. Please try working towards one.

pablogsal commented 7 months ago

@markshannon What would be an acceptable approach/compromise for you to ensure that we can still offer reliable information for these tools in a way that doesn't impact the runtime too much?

gvanrossum commented 7 months ago

@pablogsal I wonder if you could write a bit more about how such tools currently use the frame stack (and how this has evolved through the latest Python versions, e.g. how they coped with 3.10/3.11 call inlining, whether 3.12 posed any new issues, how different 3.13 is currently?) And I'd like to understand how having just a stack of code objects (as this PR seems to be doing) will help. I think I recall you mentioned that reading the frame stack is inefficient -- a bit more detail would be helpful.

IIUC the current JIT still maintains the frame stack, just as the Tier 2 interpreter does -- this is done using the _PUSH_FRAME and _POP_FRAME uops, and a few related ones that you can find by searching bytecodes.c for _PUSH_FRAME.

The current (modest) plans for call inlining in 3.13 (if we do it at all) only remove the innermost frame, and only if that frame doesn't make any calls into C code that could recursively enter the interpreter. (This is supported by escape analysis which can be overridden by the "pure" qualifier in bytecodes.c.) When you catch it executing uops (either in the JIT or in the Tier 2 interpreter) the frame is that of the caller and the inlined code just uses some extra stack space. (I realize there's an open issue regarding how we'd calculate the line number when the VM is in this state.)

pablogsal commented 7 months ago

Absolutely! Let me try to describe the general idea here, and if any specific aspects are not clear, tell me and we can go into detail. This way you don't need to read a big wall of text with too much detail.

How the tools work

All the tools do variations of the same thing. The setup is that there are two processes: the Python process being traced and the debugger/profiler (which can be written in any language). The algorithm is as follows (go directly to step 5 if you want to read specifically how the frame stack is used; a rough sketch of that loop follows the list):

  1. The profiler/debugger identifies the version of Python in the process being traced. The profiler/debugger has a copy of the headers of all Python versions. This is needed to correctly know how many bytes need to be copied when copying an object from the process being traced and the offsets to the members in the structs.
  2. The profiler/debugger locates in the address space of the process being traced where the _PyRuntime structure is. This can be done in multiple ways, but the easiest way is if there are symbols in the executable (in which case you locate the symbol offset and you add the address where the binary is mapped) or by inspecting the section table. You can see how this is done for Linux here.
  3. Once you know the address of the _PyRuntime, you copy the memory from the process being traced into the profiler/debugger using process_vm_readv in Linux or similar system calls in other OSes.
  4. From the memory you just copied, get the address of the interpreter head _PyRuntime->interpreters.head. You copy that structure and from it you get the thread state address interpreter->tstate_head. You copy that structure and then you get the current frame address from tstate->current_frame.
  5. Once the profiler/debugger knows the address of the current frame being executed it will:
    • Copy the _PyInterpreterFrame from the process being traced.
    • Get the address of the code object (interp_frame->f_executable) and copy it over.
    • Get the address of every relevant field of the code object (co_name, co_filename), copy them over and get the string data.
    • Get the address of the previous field of the frame (interp_frame->previous).
    • Repeat from the first sub-step until the previous field is NULL.
  6. Step 5 is repeated for every thread in the thread linked list (starting from interpreter->tstate_head).
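
To make step 5 concrete, here is a rough sketch of that loop on Linux. The struct offsets are placeholders (real tools read them from the version-specific headers mentioned in step 1), and every copy_remote() call below is one system call (process_vm_readv):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Copy `len` bytes from address `remote` in process `pid` into `local`. */
static int copy_remote(pid_t pid, uintptr_t remote, void *local, size_t len) {
    struct iovec lv = { local, len };
    struct iovec rv = { (void *)remote, len };
    return process_vm_readv(pid, &lv, 1, &rv, 1, 0) == (ssize_t)len ? 0 : -1;
}

/* Placeholder offsets into _PyInterpreterFrame / PyCodeObject; a real tool
 * takes these from the headers of the exact Python version being traced. */
static const size_t OFF_F_EXECUTABLE = 0;   /* frame -> code object (assumed) */
static const size_t OFF_PREVIOUS     = 8;   /* frame -> previous    (assumed) */
static const size_t OFF_CO_NAME      = 0;   /* code  -> co_name     (assumed) */

static void walk_remote_stack(pid_t pid, uintptr_t frame_addr) {
    while (frame_addr != 0) {
        uintptr_t code_addr, name_addr, prev_addr;
        /* One system call per field copied from the traced process. */
        if (copy_remote(pid, frame_addr + OFF_F_EXECUTABLE, &code_addr, sizeof(code_addr)) < 0)
            break;
        if (copy_remote(pid, code_addr + OFF_CO_NAME, &name_addr, sizeof(name_addr)) < 0)
            break;
        /* ...more copies here for co_filename and the string payloads... */
        printf("frame %#lx -> code object %#lx (co_name at %#lx)\n",
               (unsigned long)frame_addr, (unsigned long)code_addr,
               (unsigned long)name_addr);
        if (copy_remote(pid, frame_addr + OFF_PREVIOUS, &prev_addr, sizeof(prev_addr)) < 0)
            break;
        frame_addr = prev_addr;   /* a NULL previous pointer ends the list */
    }
}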

If you want to get the idea by reading code, you can read the test that I added for this here: https://github.com/python/cpython/blob/main/Modules/_testexternalinspection.c . This test does a simplified version of what these tools do (it only fetches function names, not file names nor line offsets).

How having just a stack of code objects will help.

This is a compromise. The best thing would have been to get all the frame objects in a contiguous array. That way you don't need to pay one system call per frame plus a bunch of system calls per code object; you could copy ALL THE FRAMES in one system call.

The reason I decided to do just code objects is that I assumed that with inlining, the JIT and other optimisations, having the frames themselves there will be very difficult, as frames will be changing or missing. Having code objects would be a much more stable interface for the tools (code objects don't change that much) and it would be easier for the optimisations to maintain the state (as frames tend to mutate a lot depending on several factors, such as when ownership is handed over to PyFrameObject* and more).

Unfortunately, with this method tools won't be able to get the current line being executed, so this is indeed suboptimal, but this is precisely one of the compromises I was trying to make.

Why reading the frame stack is inefficient

This is inefficient because every time you copy memory from the traced process you pay for a system call. Locating the current thread state only needs to be done once, but every time you take a snapshot you need to do:

• One system call to copy the frame
• One system call to copy the code object
• One system call to copy the line table data
• Two system calls to copy the name of the code object and the file

for every frame. This means that you pay 5*n_frames system calls to take a single snapshot. If we have all the code objects in a contiguous array we can copy the whole array in one go (you still need more system calls per member, but you get all the "stacks" in one go).

This is a problem especially for profilers because the VM can change the stack as this is happening, yielding incorrect results at high stack levels. So reducing the number of system calls is a big deal.
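
To make the "copy the whole array in one go" point concrete, here is a sketch of the bulk copy such a contiguous array enables (Linux, process_vm_readv). The addresses of the array and its length are placeholders for wherever the final interface exposes them, and copy_remote is the same helper as in the earlier sketch:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>

static int copy_remote(pid_t pid, uintptr_t remote, void *local, size_t len) {
    struct iovec lv = { local, len };
    struct iovec rv = { (void *)remote, len };
    return process_vm_readv(pid, &lv, 1, &rv, 1, 0) == (ssize_t)len ? 0 : -1;
}

/* Two bulk copies (the length, then the whole array of remote PyCodeObject
 * addresses) instead of roughly five system calls per frame; the
 * per-code-object reads that follow can be cached across samples. */
static uintptr_t *read_code_array(pid_t pid, uintptr_t len_addr,
                                  uintptr_t array_addr, size_t *n_out) {
    size_t n = 0;
    if (copy_remote(pid, len_addr, &n, sizeof(n)) < 0 || n == 0)
        return NULL;
    uintptr_t *codes = malloc(n * sizeof(uintptr_t));
    if (codes != NULL &&
        copy_remote(pid, array_addr, codes, n * sizeof(uintptr_t)) == 0) {
        *n_out = n;
        return codes;
    }
    free(codes);
    return NULL;
}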

How the tools coped with 3.12 inlining

To get the Python stack, tools just needed to add an extra step between 4 and 5 that fetches the cframe and, from it, the current frame. That is, doing tstate->cframe->current_frame instead of tstate->current_frame. This of course involves copying over the cframe first, before the second dereference. This made the tools a bit slower because they need to copy one extra field to get to the current frame.

The painful point was that these tools also want to merge the C stack with the Python stack. That is, instead of showing just the Python stack or the C stack (like in GDB), they want to show both merged. This is achieved by substituting every PyEval_EvalFrameDefault call in the C stack with its corresponding Python function name. In 3.11, with inlined calls, there was a one-to-many relationship between PyEval_EvalFrameDefault and Python frames, so tools need to know where the entry frames are in order to know how many Python frames to show for every PyEval_EvalFrameDefault. This is achieved by checking f->owner == FRAME_OWNED_BY_CSTACK (which you also need to copy over).
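
To make the entry-frame check concrete, here is a simplified sketch of that merge bookkeeping. The struct below is a local mirror of fields a tool has already copied over, not the real CPython layout, and the FRAME_OWNED_BY_CSTACK value must be taken from the headers of the Python version being traced:

#include <stdio.h>

#define FRAME_OWNED_BY_CSTACK 3   /* check the target version's pycore_frame.h */

typedef struct {
    const char *func_name;   /* already resolved from co_name */
    int owner;               /* copied from the remote _PyInterpreterFrame */
} copied_frame;

/* Walk the copied frames (innermost first) and assign each Python frame to
 * the C-level _PyEval_EvalFrameDefault activation it belongs to, so every
 * PyEval_EvalFrameDefault in the C stack can be replaced by its group. */
static void group_by_entry_frame(const copied_frame *frames, size_t n) {
    size_t c_frame = 0;
    for (size_t i = 0; i < n; i++) {
        if (frames[i].owner == FRAME_OWNED_BY_CSTACK) {
            /* Entry frame: boundary between two eval-loop activations. */
            c_frame++;
            continue;
        }
        printf("C frame #%zu: %s\n", c_frame, frames[i].func_name);
    }
}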


Please tell me if any of this is too confusing or if you need more context. I tried my best to summarise it without writing a wall of text, but it is a bit challenging to include just the minimum context, so feel free to tell me if I need to explain something in more detail, show you example code, or help in any other way :)

gvanrossum commented 7 months ago

This is super helpful!

It looks like you would be a fan of @Fidget-Spinner's proposal for a different layout of the frame stack: https://github.com/faster-cpython/ideas/issues/657

That is not an accepted proposal, but if we could make that work without slowing things down compared to the current approach, it would presumably work for you too -- you'd get all the frames in a single array! (Or maybe it's chunked and there might be a few segments to it, but still, many frames per chunk.)

Coming back to @markshannon's objection, it is indeed difficult to accept even a very small slowdown to support tools like this, since most users won't ever be using those tools. An obvious compromise would be to have some sort of option so users can choose between more speed or trace tool support, without giving up all speed. Ideally there would be a command-line flag (some -X option probably, and/or a environment variable) to select trace tool support. But, depending on how that support is implemented, even the check for the flag might be too much, and we might have to make it a build-time flag. (Though I suspect that finding a flag in the Interpreter structure in the _PUSH_FRAME and _POP_FRAME uops would be much less noticeable than what you are proposing in this PR.)

Regarding the JIT causing frames to go missing, that's a slightly separate topic into which I have less insight. (I believe currently it faithfully implements _PUSH_FRAME and _POP_FRAME, but I don't know what the medium-term plans are.) I hope we can discuss that with @brandtbucher at the meeting next Wednesday.

pablogsal commented 7 months ago

Coming back to @markshannon's objection, it is indeed difficult to accept even a very small slowdown to support tools like this, since most users won't ever be using those tools.

I totally understand this position. I would ask then that we at least try not to make it impossible for these tools to work. So even if they need to do a bit of extra work to copy some extra fields in the case of the JIT or inlining, let's at least ensure that we don't make it so hard that it's not feasible. These tools have lots of users, so preventing them from working (even if they need to do a lot of work to keep working) would be quite a breaking change.

An obvious compromise would be to have some sort of option so users can choose between more speed or trace tool support, without giving up all speed. Ideally there would be a command-line flag (some -X option probably, and/or a environment variable) to select trace tool support.

I think that is an acceptable compromise (although it doesn't come without challenges). For profilers, it would be totally OK because asking users to run Python in a specific way when profiling is fine. For debuggers, it's more challenging. Users normally reach for debuggers when Python is frozen, deadlocked or has crashed (notice all this analysis can be done on a core file as well, not only on a live process) and they want to know what the stack is. This use case will be noticeably affected because you don't know whether you are going to need a debugger until you do, at which point it's too late. On the other hand, I think it is OK if we say "you can sacrifice a bit of speed if you want to be able to use one of these tools".

But, depending on how that support is implemented, even the check for the flag might be too much, and we might have to make it a build-time flag. (Though I suspect that finding a flag in the Interpreter structure in the _PUSH_FRAME and _POP_FRAME uops would be much less noticeable than what you are proposing in this PR.)

The problem with a compile-time flag is that most users won't be able to use these tools, because most users do not compile Python themselves. I think this would be too much because it raises the ask on the end user a lot.

gvanrossum commented 7 months ago

I guess if you have one isolated hang in production you can't debug it well, alas. (Though presumably things aren't hopeless, there will still be a current frame in the thread state, and it will point to a recent ancestor of where you actually are.) But once you have multiple hangs you can advise a team to turn on a flag so their hangs will be more debuggable.

Understood about the build-time flag, those are always a pain.

pablogsal commented 7 months ago

I guess if you have one isolated hang in production you can't debug it well, alas. (Though presumably things aren't hopeless, there will still be a current frame in the thread state, and it will point to a recent ancestor of where you actually are.) But once you have multiple hangs you can advise a team to turn on a flag so their hangs will be more debuggable.

Yeah, I think missing one frame at the end is not bad at all. I was working on this PR in case we plan to start missing a lot of intermediate frames or a lot of consecutive ones, in which case the traces would be very confusing

brandtbucher commented 7 months ago

Regarding the JIT causing frames to go missing, that's a slightly separate topic into which I have less insight. (I believe currently it faithfully implements _PUSH_FRAME and _POP_FRAME, but I don't know what the medium-term plans are.)

Just to clear this up: when talking about Python frames, the effect of the JIT can be ignored. It just does whatever the tier two interpreter does.

When dealing with C frames, the difference is that there may be frames with no debugging info in the C stack. That could be a challenge for tools that try to interleave the two stacks, or are specifically inspecting C code. However, I think frame pointers are still present, so it should be possible to unwind through JIT frames without knowing much about them (@mdboom recently had some success using perf with the tier two JIT enabled over here... my guess is that perf is just parsing /proc/self/maps to find JIT code). The caller of jitted code should always be _PyEval_EvalFrameDefault.

In the future, we could explicitly emit debug info in much the way that the perf trampoline does if this proves to be useful enough (it can be reasonably efficient to say "this is a JIT frame, ignore" once we switch to an arena-style JIT allocator).
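
For context, the simplest flavour of that debug info is the /tmp/perf-<pid>.map file that perf already understands: one "START SIZE name" line per region of jitted code, with START and SIZE in hex. Here is a minimal sketch of emitting one entry, in the spirit of what the perf trampoline support does (CPython's actual implementation differs in the details):

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Append one symbol entry for a region of jitted code so perf can name it. */
static void write_perf_map_entry(const void *code, size_t size, const char *name) {
    char path[64];
    snprintf(path, sizeof(path), "/tmp/perf-%d.map", (int)getpid());
    FILE *f = fopen(path, "a");   /* the map file is append-only, one entry per line */
    if (f == NULL)
        return;
    fprintf(f, "%lx %lx %s\n",
            (unsigned long)(uintptr_t)code, (unsigned long)size, name);
    fclose(f);
}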

pablogsal commented 7 months ago

(@mdboom recently had some success using perf with the tier two JIT enabled over here... my guess is that perf is just parsing /proc/self/maps to find JIT code).

I think this is because for his use case it's only using the leaf frame. If you look at every individual stack trace they are probably incomplete. I will run a test tomorrow to confirm.

BTW: I have accumulated quite a lot of experience working with Perf and the JIT interface so I may be able to help with that particular issue.

markshannon commented 7 months ago

If we don't do this, what do you propose we should do to not break these tools once the JIT is there or we start inlining stuff and frames start to be missing?

First, let's worry about that when it actually happens.

Second, mapping the actual machine state back to the canonical VM state is a more general and tricky problem, needing a carefully designed solution. It is not just "function inlining" (whatever is meant exactly by that), but deferred creation of objects, moving local variables into registers, etc. What PyPy calls "virtualization".

We need a solution that allows the original state to be reconstructed, but at a low enough cost to not undermine those optimizations. So, let's not rush into implementing partial solutions for a problem that we don't have yet.

markshannon commented 7 months ago

Looking at the list of system calls:

• One system call to copy the frame
• One system call to copy the code object
• One system call to copy the line table data
• Two system calls to copy the name of the code object and the file.

Once the first system call has been performed, the only way you can get incorrect information is if the code object is destroyed before you have completed calls 4. I don't see how your proposal would prevent that. Also, why do all that work of 2-4 per frame, instead of per code-object?

pablogsal commented 7 months ago

Once the first system call has been performed, the only way you can get incorrect information is if the code object is destroyed before you have completed calls 4. I don't see how your proposal would prevent that.

For profilers this is a potential problem because they don't stop the program but they are prepared to deal with this (detect corruption or incorrect object copies). They need to do this race-y operation anyway because they must get the stack without stopping the process.

For debuggers it's not a problem at all because the process is either stopped or it has already crashed.
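
To make the "prepared to deal with this" part concrete, here is one possible (and deliberately simplified) consistency check: re-read the thread state's current frame pointer after the walk and drop the sample if the stack moved underneath us. This is just an illustration of the kind of defence profilers need, not what any particular tool does; copy_remote is the same Linux helper sketched earlier:

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

static int copy_remote(pid_t pid, uintptr_t remote, void *local, size_t len) {
    struct iovec lv = { local, len };
    struct iovec rv = { (void *)remote, len };
    return process_vm_readv(pid, &lv, 1, &rv, 1, 0) == (ssize_t)len ? 0 : -1;
}

/* Returns 1 if the frame that was current when the sample started is still
 * current once the walk has finished, 0 if the VM changed the stack (or the
 * read failed) and the sample should be discarded. */
static int sample_still_valid(pid_t pid, uintptr_t current_frame_field_addr,
                              uintptr_t frame_seen_before) {
    uintptr_t frame_after = 0;
    if (copy_remote(pid, current_frame_field_addr, &frame_after, sizeof(frame_after)) < 0)
        return 0;
    return frame_after == frame_seen_before;
}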

Also, why do all that work of 2-4 per frame, instead of per code-object?

What do you mean with "2-4"? Not sure I understand the question correctly.

mdboom commented 7 months ago

(@mdboom recently had some success using perf with the tier two JIT enabled over here... my guess is that perf is just parsing /proc/self/maps to find JIT code).

I think this is because for his use case it's only using the leaf frame. If you look at every individual stack trace they are probably incomplete. I will run a test tomorrow to confirm.

That's correct that for my use case I'm only collecting leaf frames, so I don't really know if it can seek "up through" a JIT frame or not.

markshannon commented 6 months ago

What do you mean with "2-4"? Not sure I understand the question correctly.

Items 2 to 4 on the list above, that is:

• One system call to copy the code object
• One system call to copy the line table data
• Two system calls to copy the name of the code object and the file.

Why do these need to be done every sample? Can't they be done once per code-object and cached?
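
For what it's worth, on the tool side that caching is straightforward; a sketch, keyed by the remote address of the code object (illustrative only; a real tool also needs an invalidation story for freed or reused addresses):

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uintptr_t remote_addr;   /* address of the PyCodeObject in the traced process */
    char name[256];          /* co_name, resolved once */
    char filename[256];      /* co_filename, resolved once */
} code_info;

typedef struct {
    code_info *entries;
    size_t len, cap;
} code_cache;

/* Return the cached entry for a remote code object, or NULL if it has not
 * been resolved yet (in which case the caller pays the per-code-object
 * system calls once and stores the result with code_cache_put). */
static code_info *code_cache_get(code_cache *c, uintptr_t addr) {
    for (size_t i = 0; i < c->len; i++)
        if (c->entries[i].remote_addr == addr)
            return &c->entries[i];
    return NULL;
}

static code_info *code_cache_put(code_cache *c, const code_info *info) {
    if (c->len == c->cap) {
        size_t ncap = c->cap ? c->cap * 2 : 64;
        code_info *grown = realloc(c->entries, ncap * sizeof(code_info));
        if (grown == NULL)
            return NULL;
        c->entries = grown;
        c->cap = ncap;
    }
    c->entries[c->len] = *info;
    return &c->entries[c->len++];
}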

mdboom commented 6 months ago

It looks like perf can trace through the JIT stack frames. For example, I'm seeing things like this:

    34.03%    19.33%  python       python [.] _PyEval_EvalFrameDefault

...

     0.22%     0.00%  python       [JIT] tid 4418                                    
            |
            ---0x7f9e898e9288
               |          
                --0.18%--list_contains
                          |          
                           --0.13%--PyObject_RichCompareBool

The "trick" to getting them to show is to decrease the threshold for which perf report shows children of a function in the graph. (The default is 0.5% of total runtime, I set it to 0.1% with `perf report -g graph,0.1

Note also these are "anonymous" JIT frames -- we aren't currently adding /tmp/perf-$PID.map entries.

pablogsal commented 6 months ago

Hummm, that's interesting, but it's still not conclusive because we are not seeing full individual stacks that we can check are not missing anything. We should inspect the actual full per-event stacks. We can do that by running "perf script" over the file and checking that we are not missing frames.

I will run a bunch of tests if someone can point me to some example code that surely gets jitted and has a bunch of stack frames.

pablogsal commented 6 months ago

Additionally, there is another check: perf uses libunwind (or libdw) as its unwinder. What happens if you try to unwind using libunwind under a jitted frame?

@brandtbucher and I did the test back in the day and it failed.

Also what happens if gdb tries to unwind? That's an easier check to do 🤔

brandtbucher commented 6 months ago

I will run a bunch of tests if someone can point me to some example code that surely gets jitted and has a bunch of stack frames.

import operator

def f(n):
    if n <= 0:
        return
    for _ in range(16):
        operator.call(f, n - 1)

f(6)

You should usually see n jit frames with C code between them once it's warmed up a bit. Feel free to crank the numbers up higher if you need a deeper stack or a longer-running process.

brandtbucher commented 6 months ago

@brandtbucher and I did the test back in the day and it failed.

One important thing has changed since then: the JIT no longer omits frame pointers when compiling templates. So this may just be fixed now (not sure).

pablogsal commented 6 months ago

Ok, I tested with the latest JIT and I can confirm neither gdb nor libunwind can unwind through the JIT. I used this extension:

#include <Python.h>
#include <libunwind.h>
#include <stdio.h>

static PyObject* print_full_stack(PyObject* self, PyObject* args) {
    unw_cursor_t cursor;
    unw_context_t context;

    unw_getcontext(&context);
    unw_init_local(&cursor, &context);

    while (unw_step(&cursor) > 0) {
        unw_word_t offset, pc;
        char symbol[256];

        unw_get_reg(&cursor, UNW_REG_IP, &pc);
        unw_get_proc_name(&cursor, symbol, sizeof(symbol), &offset);

        printf("0x%lx: (%s+0x%lx)\n", (unsigned long)pc, symbol, (unsigned long)offset);
    }

    Py_RETURN_NONE;
}

static PyMethodDef methods[] = {
    {"print_full_stack", print_full_stack, METH_NOARGS, "Print the full stack trace"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT,
    "full_stack_trace",
    NULL,
    -1,
    methods
};

PyMODINIT_FUNC PyInit_full_stack_trace(void) {
    return PyModule_Create(&module);
}

with @brandtbucher's example:

import operator
import full_stack_trace

def f(n):
    if n <= 0:
        full_stack_trace.print_full_stack()
        return
    for _ in range(16):
        operator.call(f, n - 1)

f(6)

Attaching with gdb and unwinding shows:

Breakpoint 1, 0x0000fffff76a0938 in print_full_stack () from /src/full_stack_trace.so
(gdb) bt
#0  0x0000fffff76a0938 in print_full_stack () from /src/full_stack_trace.so
#1  0x0000aaaaaabf8ccc in cfunction_vectorcall_NOARGS (func=0xfffff7b33e20, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/methodobject.c:484
#2  0x0000aaaaaab97920 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0xfffff7b33e20, tstate=0xaaaaab059a90 <_PyRuntime+231872>)
    at ./Include/internal/pycore_call.h:168
#3  PyObject_Vectorcall (callable=0xfffff7b33e20, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:327
#4  0x0000aaaaaab32b78 in _PyEval_EvalFrameDefault (tstate=0xaaaaab059a90 <_PyRuntime+231872>, frame=0xfffff7fbe3b0, throwflag=-144045772) at Python/generated_cases.c.h:817
#5  0x0000aaaaaab97920 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0xfffff7ac6e80, tstate=0xaaaaab059a90 <_PyRuntime+231872>)
    at ./Include/internal/pycore_call.h:168
#6  PyObject_Vectorcall (callable=0xfffff7ac6e80, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:327
#7  0x0000fffff76b9578 in ?? ()
#8  0x0000fffff7b517b0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

and libunwind shows:

0xaaaaaabf8ccc: (cfunction_vectorcall_NOARGS+0x6c)
0xffffffffd040: (+0x6c)
0xffffffffd040: (+0x6c)

This is with frame pointers:

./python.exe -m sysconfig | grep frame
    CFLAGS = "-fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall  -fno-omit-frame-pointer"
    LIBEXPAT_CFLAGS = "-I./Modules/expat -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall  -fno-omit-frame-pointer -D_Py_JIT -std=c11 -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden  -I./Include/internal -I./Include/internal/mimalloc -I. -I./Include -fPIC"
    LIBHACL_CFLAGS = "-I./Modules/_hacl/include -D_BSD_SOURCE -D_DEFAULT_SOURCE -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall  -fno-omit-frame-pointer -D_Py_JIT -std=c11 -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden  -I./Include/internal -I./Include/internal/mimalloc -I. -I./Include -fPIC"
    LIBMPDEC_CFLAGS = "-I./Modules/_decimal/libmpdec -DCONFIG_64=1 -DANSI=1 -DHAVE_UINT128_T=1 -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall  -fno-omit-frame-pointer -D_Py_JIT -std=c11 -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden  -I./Include/internal -I./Include/internal/mimalloc -I. -I./Include -fPIC"
    PYTHONFRAMEWORKDIR = "no-framework"
    PY_BUILTIN_MODULE_CFLAGS = "-fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall  -fno-omit-frame-pointer -D_Py_JIT -std=c11 -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden  -I./Include/internal -I./Include/internal/mimalloc -I. -I./Include -DPy_BUILD_CORE_BUILTIN"
    PY_CFLAGS = "-fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall  -fno-omit-frame-pointer"
    PY_CORE_CFLAGS = "-fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall  -fno-omit-frame-pointer -D_Py_JIT -std=c11 -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden  -I./Include/internal -I./Include/internal/mimalloc -I. -I./Include -DPy_BUILD_CORE"
    PY_STDMODULE_CFLAGS = "-fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall  -fno-omit-frame-pointer -D_Py_JIT -std=c11 -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden  -I./Include/internal -I./Include/internal/mimalloc -I. -I./Include"

pablogsal commented 6 months ago

I will check with perf later in the day

pablogsal commented 6 months ago

I checked and I can confirm perf cannot unwind through the JIT (with or without frame pointers). For this I used @brandtbucher's script:

import operator

def f(n):
    if n <= 0:
        return
    for _ in range(16):
        operator.call(f, n - 1)

f(10)

I started it with ./python script.py, let it run for a while so the JIT is warmed up, and then attached with perf from another terminal:

sudo -E perf record -F9999 -g -p PID

I let it run for a while and then ran perf script over the result file and got this:

python 2794949 13914841.174912:     100010 cpu-clock:pppH:
            564f70647239 _PyEval_EvalFrameDefault+0x3e9 (/home/ubuntu/cpython/python)
            564f706b368c PyObject_Vectorcall+0x3c (/home/ubuntu/cpython/python)
            7f8abbece445 [unknown] (/tmp/perf-2794949.map)
            564f70b17440 [unknown] ([unknown])
            564f70aeb940 [unknown] ([unknown])

python 2794949 13914841.175012:     100010 cpu-clock:pppH:
            564f7080bb43 _PyEval_Vector+0xa3 (/home/ubuntu/cpython/python)
            564f706b368c PyObject_Vectorcall+0x3c (/home/ubuntu/cpython/python)
            7f8abbece445 [unknown] (/tmp/perf-2794949.map)
            564f70b17440 [unknown] ([unknown])
            564f70aeb940 [unknown] ([unknown])

python 2794949 13914841.175112:     100010 cpu-clock:pppH:
            564f7064e535 _PyEval_EvalFrameDefault+0x76e5 (/home/ubuntu/cpython/python)
            564f706b368c PyObject_Vectorcall+0x3c (/home/ubuntu/cpython/python)
            7f8abbece445 [unknown] (/tmp/perf-2794949.map)
            564f70b17440 [unknown] ([unknown])
            564f70aeb940 [unknown] ([unknown])

python 2794949 13914841.175212:     100010 cpu-clock:pppH:
            564f70809aa8 _PyEvalFramePushAndInit+0x8 (/home/ubuntu/cpython/python)
            564f7080bb36 _PyEval_Vector+0x96 (/home/ubuntu/cpython/python)
            564f706b368c PyObject_Vectorcall+0x3c (/home/ubuntu/cpython/python)
            7f8abbece445 [unknown] (/tmp/perf-2794949.map)
            564f70b17440 [unknown] ([unknown])
            564f70aeb940 [unknown] ([unknown])

python 2794949 13914841.175312:     100010 cpu-clock:pppH:
            564f70647234 _PyEval_EvalFrameDefault+0x3e4 (/home/ubuntu/cpython/python)
            564f706b368c PyObject_Vectorcall+0x3c (/home/ubuntu/cpython/python)
            7f8abbece445 [unknown] (/tmp/perf-2794949.map)
            564f70b17440 [unknown] ([unknown])
            564f70aeb940 [unknown] ([unknown])

python 2794949 13914841.175412:     100010 cpu-clock:pppH:
            564f70808866 initialize_locals+0x6 (/home/ubuntu/cpython/python)
            564f70809b5e _PyEvalFramePushAndInit+0xbe (/home/ubuntu/cpython/python)
            564f7080bb36 _PyEval_Vector+0x96 (/home/ubuntu/cpython/python)
            564f706b368c PyObject_Vectorcall+0x3c (/home/ubuntu/cpython/python)
            7f8abbece445 [unknown] (/tmp/perf-2794949.map)
            564f70b17440 [unknown] ([unknown])
            564f70aeb940 [unknown] ([unknown])

So as you can see, every time perf hits a jitted frame, it fails to unwind. @mdboom I think what you are getting in your script is these partial frames reordered, not the full stack. You can confirm it by running perf script, finding one stack that has jitted frames, and checking whether the topmost frame is main() or __libc_start_main.

pablogsal commented 6 months ago

Note that if we confirm this, it means that any current attribution by perf to jitted frames is inaccurate.

mdboom commented 6 months ago

@pablogsal: Can you be more specific about how you are running perf script?

Note that if we confirm this, it means that any current attribution by perf to jitted frames is inaccurate.

If I'm understanding correctly, I see how this means that where the JIT'ted code sits in the stack is inaccurate, but I don't see how it follows that time attributed to JIT'ted code is inaccurate (which I realize is the easier subset of this whole problem). The numbers certainly add up and make sense.

pablogsal commented 6 months ago

@pablogsal: Can you be more specific about how you are running perf script?

Assuming your result file is called "perf.data" (the default), you can just run perf script; otherwise you need to specify the name of the file with perf script -i.

If I'm understanding correctly, I see how this means that where the JIT'ted code sits in the stack is inaccurate, but I don't see how it follows that time attributed to JIT'ted code is inaccurate (which I realize is the easier subset of this whole problem). The numbers certainly add up and make sense.

This is not a problem if you only account for leaf frames and we believe perf is correctly identifying jitted frames in all cases under these conditions. If you need to account for callers, then because it cannot unwind the stack it cannot discover that above a given frame there are multiple jitted frames (like in @brandtbucher's example), and therefore if you measure anything other than self-time (as opposed to self-time + children time, as in flamegraphs) it will be inaccurate.

mdboom commented 6 months ago

therefore if you measure anything other than self-time (as opposed to self-time + children time, as in flamegraphs) it will be inaccurate.

Got it, that makes sense.

I can confirm the output of perf script doesn't trace all the way to the root. This is true even if you provide perf-$PID.map entries (it was a long shot, but I thought maybe providing the code size metadata might help).