Python: frames missing due to failure to resolve autoTLSKey

What happened?

Under certain conditions (two examples detailed below), the python loader fails to resolve the autoTLSKey, which will cause python frames to be missing from the flame graph. Both examples are on amd64 ~and it's unclear whether this also potentially affects arm64 (TBD)~. Update: Only amd64 is impacted.

The two examples are detailed below:

1. Python 3.12+ compiled from source without LTO

Starting with Python 3.12 (commit), PyGILState_GetThisThreadState calls PyThread_tss_is_created first before calling PyThread_tss_get, see: https://github.com/python/cpython/blob/3.12/Python/pystate.c#L2171-L2174

On amd64 default builds of python (without --enable-optimizations, --with-lto, which are not enabled by default https://docs.python.org/3/using/configure.html#performance-options), the calls to PyThread_tss_is_created and PyThread_tss_get are not inlined, so the value of autoTLSKey is stored in a register (%rbp) before being passed to both function calls:

(gdb) disassemble PyGILState_GetThisThreadState
Dump of assembler code for function PyGILState_GetThisThreadState:
   0x000000000029c5d0 <+0>: mov    0x260381(%rip),%rax        # 0x4fc958
   0x000000000029c5d7 <+7>: push   %rbp
   0x000000000029c5d8 <+8>: lea    0x608(%rax),%rbp
   0x000000000029c5df <+15>:    mov    %rbp,%rdi
   0x000000000029c5e2 <+18>:    call   0x10afb0 <PyThread_tss_is_created@plt>
   0x000000000029c5e7 <+23>:    test   %eax,%eax
   0x000000000029c5e9 <+25>:    je     0x29c5f8 <PyGILState_GetThisThreadState+40>
   0x000000000029c5eb <+27>:    mov    %rbp,%rdi
   0x000000000029c5ee <+30>:    pop    %rbp
   0x000000000029c5ef <+31>:    jmp    0x10f080 <PyThread_tss_get@plt>
   0x000000000029c5f4 <+36>:    nopl   0x0(%rax)
   0x000000000029c5f8 <+40>:    xor    %eax,%eax
   0x000000000029c5fa <+42>:    pop    %rbp
   0x000000000029c5fb <+43>:    ret
End of assembler dump.

This causes the decode disassembler to not find the value in the call instruction since this is not supported in https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/bd37e9c/interpreter/python/decode_amd64.c#L11-L20

This is reproducible by downloading and building python from source, and running the profiler next to a python process.

2. Python bundled with `google-cloud-sdk`

With Google Cloud SDK 502.0.0, on amd64, which bundles python 3.11, PyGILState_GetThisThreadState disassembles as follows (gdb -q /lib/google-cloud-sdk/platform/bundledpythonunix/lib/libpython3.11.so):

(gdb) disassemble PyGILState_GetThisThreadState
Dump of assembler code for function PyGILState_GetThisThreadState:
   0x0000000000377d20 <+0>: mov    0x1129d61(%rip),%rax        # 0x14a1a88
   0x0000000000377d27 <+7>: cmpq   $0x0,0x248(%rax)
   0x0000000000377d2f <+15>:    je     0x377d42 <PyGILState_GetThisThreadState+34>
   0x0000000000377d31 <+17>:    mov    $0x250,%edi
   0x0000000000377d36 <+22>:    add    0x1129d4b(%rip),%rdi        # 0x14a1a88
   0x0000000000377d3d <+29>:    jmp    0x20eba0 <PyThread_tss_get@plt>
   0x0000000000377d42 <+34>:    xor    %eax,%eax
   0x0000000000377d44 <+36>:    ret
End of assembler dump.

The decode disassemble will find the mov $0x250,%edi instruction and assume that 0x250 is the address at which autoTSSKey will be found. However, this is modified later on by an ulterior add on %rdi.

Eventually, 0x250 will fail the distance check in: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/bd37e9ce23f9904ea7b11389daf50a115ac2389e/interpreter/python/python.go#L701-L704

Since it 0x250 is an offset to PyRuntime rather than the absolute value of autoTSSKey.

Potential fixes

A couple potential fixes we though about:

a. Disassemble a different function: It's possible to fix (1.) (but not 2.) by alternatively disassembling PyGILState_Release, either as a fallback (see patch in https://github.com/DataDog/opentelemetry-ebpf-profiler/commit/630fd97803b41dca87e5a0f6e1e45c31c7e43b90) or directly. Since python 3.0, and still as of python 3.13, this method directly calls PyThread_tss_get.

b. Add more logic in disassembly code: Add logic in decode_amd64.c to take into account variables in temporary registers, and the case where %rdi is populated via mov then add. We could potentially fix both issues at the cost of additionally complexity.

c. Use constant header values: Hardcode the offset of autoTSSKey from PyRuntime, with a different value depending on the python version.

Let me know whether there's any preference / different ideas as to how we should fix this 🙇 !

open-telemetry / opentelemetry-ebpf-profiler