python / cpython

The Python programming language
https://www.python.org
Other
63.33k stars 30.31k forks source link

Missing DTrace probes #104280

Open furkanonder opened 1 year ago

furkanonder commented 1 year ago

Bug report

On Linux, you can list available DTrace probes by running /bcc/tools/tplist.py tool;

$ ~/bcc/tools/tplist.py -l cpython/python    
cpython/python python:function__entry
cpython/python python:function__return
cpython/python python:line
cpython/python python:import__find__load__done
cpython/python python:import__find__load__start
cpython/python python:audit
cpython/python python:gc__start
cpython/python python:gc__done

The function__entry, function__return and line probes appear to be missing after the GH-103082: Implementation of PEP 669: Low Impact Monitoring for CPython update.

$ ~/bcc/tools/tplist.py -l cpython/python
cpython/python python:import__find__load__done
cpython/python python:import__find__load__start
cpython/python python:audit
cpython/python python:gc__start
cpython/python python:gc__done

Your environment

Linked PRs

sunmy2019 commented 1 year ago

CC: @markshannon @gaogaotiantian

gaogaotiantian commented 1 year ago

I think DTrace was intentionally removed because it was not being used. I forgot where I read that, but Mark will have a better idea of it.

brandtbucher commented 1 year ago

I think a bigger issue is that I'm pretty sure it isn't tested on any CI jobs or buildbots, so stuff can break (like #98894) without us noticing.

If it's not tested, it effectively isn't supported. Not sure whether the right path forward is for some motivated person to fix and maintain it, for us to remove the broken stuff and keep the working stuff (with a big disclaimer in the docs), or to just rip it out entirely.

Anecdotally, it's always been a bit uncomfortable to touch this stuff when modifying the interpreter, since I'm not very sure how it works or how to test it.

@ambv might have an idea of what the right path forward is here.

furkanonder commented 1 year ago

I think DTrace was intentionally removed because it was not being used. I forgot where I read that, but Mark will have a better idea of it.

Dtrace support was actually being used. DTrace markers, also known as "probes", are not only used by DTtrace. For example, these probes can be used with bpftrace.

There are tools written using these probes. I will even give a talk about these probes at EuroPython 2023 :)

kchoudhu commented 1 year ago

Just chiming in here: we use these probes in anger in production, and their absence is a blocker preventing us from upgrading to 3.11.

If tests are needed to make sure dtrace support is accounted for in the release process, I am happy to contribute time to write tests and set them up.

furkanonder commented 1 year ago

@ambv @brandtbucher I wrote a test case to check for the presence of probes. Could you review the PR?

encukou commented 1 year ago

The newly added tests fail in Fedora:

0:07:40 load avg: 5.34 Re-running test_dtrace in verbose mode (matching: test_check_probes, test_check_probes, test_check_probes, test_check_probes, test_check_probes)
readelf version: 2.40
test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) ... 
  test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: import__find__load__done') ... FAIL
  test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: import__find__load__start') ... FAIL
  test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: audit') ... FAIL
  test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: gc__start') ... FAIL
  test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: gc__done') ... FAIL
======================================================================
FAIL: test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: import__find__load__done')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/builddir/build/BUILD/Python-3.12.0rc1/Lib/test/test_dtrace.py", line 241, in test_check_probes
    self.assertIn(probe_name, readelf_output)
AssertionError: 'Name: import__find__load__done' not found in '\nDisplaying notes found in: .note.gnu.property\n  Owner                Data size \tDescription\n  GNU                  0x00000010\tNT_GNU_PROPERTY_TYPE_0\n      Properties: AArch64 feature: BTI, PAC\n\nDisplaying notes found in: .note.gnu.build-id\n  Owner                Data size \tDescription\n  GNU                  0x00000014\tNT_GNU_BUILD_ID (unique build ID bitstring)\n    Build ID: 3b49ac28c70a95104b93815144881aadefb7bad1\n\nDisplaying notes found in: .note.ABI-tag\n  Owner                Data size \tDescription\n  GNU                  0x00000010\tNT_GNU_ABI_TAG (ABI version tag)\n    OS: Linux, ABI: 3.7.0\n\nDisplaying notes found in: .note.package\n  Owner                Data size \tDescription\n  FDO                  0x00000084\tFDO_PACKAGING_METADATA\n    Packaging Metadata: {"type":"rpm","name":"python3.12","version":"3.12.0~rc1-1.fc39","architecture":"aarch64","osCpe":"cpe:/o:fedoraproject:fedora:39"}\n\nDisplaying notes found in: .gnu.build.attributes\n  Owner                Data size \tDescription\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x100c0 to 0x100f4\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x100f4 to 0x10108\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10000 to 0x10010\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10204 to 0x10210\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10110 to 0x101d8\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10204 to 0x10204\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10204 to 0x10204\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10010 to 0x1001c\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10210 to 0x1021c\n'

Full log: https://kojipkgs.fedoraproject.org//work/tasks/6927/104486927/build.log

Here's a readelf output printed out:

Displaying notes found in: .note.gnu.property
  Owner                Data size    Description
  GNU                  0x00000010   NT_GNU_PROPERTY_TYPE_0
      Properties: AArch64 feature: BTI, PAC

Displaying notes found in: .note.gnu.build-id
  Owner                Data size    Description
  GNU                  0x00000014   NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 3b49ac28c70a95104b93815144881aadefb7bad1

Displaying notes found in: .note.ABI-tag
  Owner                Data size    Description
  GNU                  0x00000010   NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 3.7.0

Displaying notes found in: .note.package
  Owner                Data size    Description
  FDO                  0x00000084   FDO_PACKAGING_METADATA
    Packaging Metadata: {"type":"rpm","name":"python3.12","version":"3.12.0~rc1-1.fc39","architecture":"aarch64","osCpe":"cpe:/o:fedoraproject:fedora:39"}

Displaying notes found in: .gnu.build.attributes
  Owner                Data size    Description
  GA$<version>3a1      0x00000010   OPEN
    Applies to region from 0x100c0 to 0x100f4
  GA$<version>3a1      0x00000010   OPEN
    Applies to region from 0x100f4 to 0x10108
  GA$<version>3a1      0x00000010   OPEN
    Applies to region from 0x10000 to 0x10010
  GA$<version>3a1      0x00000010   OPEN
    Applies to region from 0x10204 to 0x10210
  GA$<version>3a1      0x00000010   OPEN
    Applies to region from 0x10110 to 0x101d8
  GA$<version>3a1      0x00000010   OPEN
    Applies to region from 0x10204 to 0x10204
  GA$<version>3a1      0x00000010   OPEN
    Applies to region from 0x10204 to 0x10204
  GA$<version>3a1      0x00000010   OPEN
    Applies to region from 0x10010 to 0x1001c
  GA$<version>3a1      0x00000010   OPEN
    Applies to region from 0x10210 to 0x1021c

I'm unfortunately unfamiliar with DTrace. Is this expected? How can I debug this?

ambv commented 1 year ago

Are you building your Python --with-dtrace? If not, it's a surprise that sysconfig.get_config_var('WITH_DTRACE') returns 1 in your case...

When I build --with-dtrace locally, I get 1 but when I don't, that config var returns 0. In this case the test is just skipped:

setUpClass (test.test_dtrace.CheckDtraceProbes) ... skipped 'CPython must be configured with the --with-dtrace option.'
ambv commented 1 year ago

BTW, this says "dtrace" but those probes are also SystemTap / eBPF.

encukou commented 1 year ago

Yes, --with-dtrace. I get the same failure in Codespaces.

ambv commented 1 year ago

OK, sorry for the trouble. I will look into this today.

encukou commented 1 year ago

No trouble at all! Thanks for looking in to it.

furkanonder commented 1 year ago

The newly added tests fail in Fedora:

0:07:40 load avg: 5.34 Re-running test_dtrace in verbose mode (matching: test_check_probes, test_check_probes, test_check_probes, test_check_probes, test_check_probes)
readelf version: 2.40
test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) ... 
  test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: import__find__load__done') ... FAIL
  test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: import__find__load__start') ... FAIL
  test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: audit') ... FAIL
  test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: gc__start') ... FAIL
  test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: gc__done') ... FAIL
======================================================================
FAIL: test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) (probe_name='Name: import__find__load__done')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/builddir/build/BUILD/Python-3.12.0rc1/Lib/test/test_dtrace.py", line 241, in test_check_probes
    self.assertIn(probe_name, readelf_output)
AssertionError: 'Name: import__find__load__done' not found in '\nDisplaying notes found in: .note.gnu.property\n  Owner                Data size \tDescription\n  GNU                  0x00000010\tNT_GNU_PROPERTY_TYPE_0\n      Properties: AArch64 feature: BTI, PAC\n\nDisplaying notes found in: .note.gnu.build-id\n  Owner                Data size \tDescription\n  GNU                  0x00000014\tNT_GNU_BUILD_ID (unique build ID bitstring)\n    Build ID: 3b49ac28c70a95104b93815144881aadefb7bad1\n\nDisplaying notes found in: .note.ABI-tag\n  Owner                Data size \tDescription\n  GNU                  0x00000010\tNT_GNU_ABI_TAG (ABI version tag)\n    OS: Linux, ABI: 3.7.0\n\nDisplaying notes found in: .note.package\n  Owner                Data size \tDescription\n  FDO                  0x00000084\tFDO_PACKAGING_METADATA\n    Packaging Metadata: {"type":"rpm","name":"python3.12","version":"3.12.0~rc1-1.fc39","architecture":"aarch64","osCpe":"cpe:/o:fedoraproject:fedora:39"}\n\nDisplaying notes found in: .gnu.build.attributes\n  Owner                Data size \tDescription\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x100c0 to 0x100f4\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x100f4 to 0x10108\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10000 to 0x10010\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10204 to 0x10210\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10110 to 0x101d8\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10204 to 0x10204\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10204 to 0x10204\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10010 to 0x1001c\n  GA$<version>3a1      0x00000010\tOPEN\n    Applies to region from 0x10210 to 0x1021c\n'

Full log: https://kojipkgs.fedoraproject.org//work/tasks/6927/104486927/build.log

Here's a readelf output printed out:

Displaying notes found in: .note.gnu.property
  Owner                Data size  Description
  GNU                  0x00000010 NT_GNU_PROPERTY_TYPE_0
      Properties: AArch64 feature: BTI, PAC

Displaying notes found in: .note.gnu.build-id
  Owner                Data size  Description
  GNU                  0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 3b49ac28c70a95104b93815144881aadefb7bad1

Displaying notes found in: .note.ABI-tag
  Owner                Data size  Description
  GNU                  0x00000010 NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 3.7.0

Displaying notes found in: .note.package
  Owner                Data size  Description
  FDO                  0x00000084 FDO_PACKAGING_METADATA
    Packaging Metadata: {"type":"rpm","name":"python3.12","version":"3.12.0~rc1-1.fc39","architecture":"aarch64","osCpe":"cpe:/o:fedoraproject:fedora:39"}

Displaying notes found in: .gnu.build.attributes
  Owner                Data size  Description
  GA$<version>3a1      0x00000010 OPEN
    Applies to region from 0x100c0 to 0x100f4
  GA$<version>3a1      0x00000010 OPEN
    Applies to region from 0x100f4 to 0x10108
  GA$<version>3a1      0x00000010 OPEN
    Applies to region from 0x10000 to 0x10010
  GA$<version>3a1      0x00000010 OPEN
    Applies to region from 0x10204 to 0x10210
  GA$<version>3a1      0x00000010 OPEN
    Applies to region from 0x10110 to 0x101d8
  GA$<version>3a1      0x00000010 OPEN
    Applies to region from 0x10204 to 0x10204
  GA$<version>3a1      0x00000010 OPEN
    Applies to region from 0x10204 to 0x10204
  GA$<version>3a1      0x00000010 OPEN
    Applies to region from 0x10010 to 0x1001c
  GA$<version>3a1      0x00000010 OPEN
    Applies to region from 0x10210 to 0x1021c

I'm unfortunately unfamiliar with DTrace. Is this expected? How can I debug this?

Did you run the readelf tool with the -n parameter? I don't see the probes in your output.

Now I have build cpython and I can see the probes.

[ted@rasp4 cpython]$ readelf -n ./python

Displaying notes found in: .note.gnu.build-id
  Owner                Data size    Description
  GNU                  0x00000014   NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: ac4de5af42642e3a69705be808d9c8a73aeee03f

Displaying notes found in: .note.ABI-tag
  Owner                Data size    Description
  GNU                  0x00000010   NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 3.7.0

Displaying notes found in: .note.stapsdt
  Owner                Data size    Description
  stapsdt              0x00000043   NT_STAPSDT (SystemTap probe descriptors)
    Provider: python
    Name: import__find__load__done
    Location: 0x00000000002bad4c, Base: 0x0000000000478105, Semaphore: 0x00000000006c3c7c
    Arguments: 8@x0 -4@x1
  stapsdt              0x0000003e   NT_STAPSDT (SystemTap probe descriptors)
    Provider: python
    Name: import__find__load__start
    Location: 0x00000000002bb084, Base: 0x0000000000478105, Semaphore: 0x00000000006c3c7a
    Arguments: 8@x0
  stapsdt              0x00000031   NT_STAPSDT (SystemTap probe descriptors)
    Provider: python
    Name: audit
    Location: 0x00000000002ef1e0, Base: 0x0000000000478105, Semaphore: 0x00000000006c3c7e
    Arguments: 8@x21 8@x20
  stapsdt              0x00000036   NT_STAPSDT (SystemTap probe descriptors)
    Provider: python
    Name: gc__start
    Location: 0x000000000030a3c8, Base: 0x0000000000478105, Semaphore: 0x00000000006c3c76
    Arguments: -4@[sp, 132]
  stapsdt              0x0000002f   NT_STAPSDT (SystemTap probe descriptors)
    Provider: python
    Name: gc__done
    Location: 0x000000000030a8e4, Base: 0x0000000000478105, Semaphore: 0x00000000006c3c78
    Arguments: -8@x19
[ted@rasp4 cpython]$ 

I didn't see a problem in the tests.

[ted@rasp4 cpython]$ sudo ./python -m test test_dtrace
0:00:00 load avg: 0.46 Run tests sequentially
0:00:00 load avg: 0.46 [1/1] test_dtrace

== Tests result: SUCCESS ==

1 test OK.

Total duration: 4.1 sec
Tests result: SUCCESS
[ted@rasp4 cpython]$ 
[ted@rasp4 cpython]$ ./python -m test test_dtrace -v
== CPython 3.13.0a0 (heads/main:0e6e32fb84, Aug 8 2023, 19:47:11) [GCC 12.1.0]
== Linux-5.15.89-1-MANJARO-ARM-RPI-aarch64-with-glibc2.35 little-endian
== Python build: release dtrace
== cwd: /home/ted/cpython/build/test_python_2604899æ
== CPU count: 4
== encodings: locale=UTF-8, FS=utf-8
0:00:00 load avg: 0.52 Run tests sequentially
0:00:00 load avg: 0.52 [1/1] test_dtrace
readelf version: 2.38
test_check_probes (test.test_dtrace.CheckDtraceProbes.test_check_probes) ... ok
test_missing_probes (test.test_dtrace.CheckDtraceProbes.test_missing_probes) ... expected failure
skipped 'dtrace(1) failed: /usr/bin/dtrace invalid option -q\nUsage /usr/bin/dtrace [--help] [-h | -G] [-C [-I<Path>]] -s File.d [-o <File>]'
setUpClass (test.test_dtrace.DTraceOptimizedTests) ... skipped 'dtrace(1) failed: /usr/bin/dtrace invalid option -q\nUsage /usr/bin/dtrace [--help] [-h | -G] [-C [-I<Path>]] -s File.d [-o <File>]'
setUpClass (test.test_dtrace.SystemTapNormalTests) ... skipped 'stap(1) failed: You are trying to run systemtap as a normal user.\nYou should either be root, or be part of the group "stapusr" and possibly one of the groups "stapsys" or "stapdev".  [man stap]\nTry \'--help\' for more information. [man stap]\nA message like [man foo] means for more info, run % man foo'
setUpClass (test.test_dtrace.SystemTapOptimizedTests) ... skipped 'stap(1) failed: You are trying to run systemtap as a normal user.\nYou should either be root, or be part of the group "stapusr" and possibly one of the groups "stapsys" or "stapdev".  [man stap]\nTry \'--help\' for more information. [man stap]\nA message like [man foo] means for more info, run % man foo'

----------------------------------------------------------------------
Ran 2 tests in 1.927s

OK (skipped=4, expected failures=1)

== Tests result: SUCCESS ==

1 test OK.

Total duration: 2.1 sec
Tests result: SUCCESS
[ted@rasp4 cpython]$
ambv commented 1 year ago

Your readelf versions differ. Maybe there's a difference in output between 2.38 and 2.40?

encukou commented 1 year ago

Oh, got it! We build with --enable-shared, so thetraces and notes are in the shared library. (And when I went to reproduce in Codespaces, I copied my favourite ./config invocation without thinking.)

$ readelf -n libpython3.13d.so

Displaying notes found in: .note.gnu.build-id
  Owner                Data size    Description
  GNU                  0x00000014   NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 4765c16f47a5e0e0a0fadcd18056ec129b71a557

Displaying notes found in: .note.stapsdt
  Owner                Data size    Description
  stapsdt              0x00000046   NT_STAPSDT (SystemTap probe descriptors)
    Provider: python
    Name: import__find__load__start
    Location: 0x00000000003346c4, Base: 0x00000000005611f0, Semaphore: 0x000000000078d0ba
    Arguments: 8@-144(%rbp)
[...]

Getting to that library might be complicated -- I can't think of something easier than ldd on sys.executable and extracting the path corresponding to sysconfig.get_config_var('LDLIBRARY').

hrnciar commented 1 year ago

Note: This affects Fedora also on 3.11.5. We will skip the test_check_probes in the meantime.

markshannon commented 1 year ago

dtrace is supposed to have zero (or very close to it) overhead when not in use. The CPython implementation failed in this. If it had, perhaps PEP 669 would not have been needed. I think it is best to push users towards using sys.monitoring now that 3.12 has been released.

I realize that's not the ideal solution for existing users of dtrace, but sys.monitoring doesn't need a special build and is tested on CI.

markshannon commented 1 year ago

It also looks like dtrace function entry robes are broken in 3.11, as they are only invoked if sys.settrace tracing is also turned on. The probe point is present, but the normal code path skips it.

kchoudhu commented 1 year ago

With dtrace support effectively removed in 3.12, a few thoughts.

My initial read of sys.monitoring is that while it is great for bespoke, python-specific monitoring and tooling, it doesn't quite work in a world that is increasingly moving towards one-size fits all distributed tracing via ebpf (i.e. dtrace).

Is there room in Python for both visions of tracing? If not, why? If yes, what kind of resourcing would be needed to make sure that dtrace isn't removed again from the build?

Zheaoli commented 1 year ago

dtrace is supposed to have zero (or very close to it) overhead when not in use. The CPython implementation failed in this. If it had, perhaps PEP 669 would not have been needed. I think it is best to push users towards using sys.monitoring now that 3.12 has been released.

I realize that's not the ideal solution for existing users of dtrace, but sys.monitoring doesn't need a special build and is tested on CI.

This is a little bit weird, the DTrace and the PEP 669 are different features.

the DTrace API is for people who want to trace the Python process without restarting the process and injecting some code for the PEP 669

We can solve the process tracing problem by using #96123 in some circumstances. But there are still some issues.

  1. the perf API seems unstable
  2. we can only get the address after the process has been started. it will cause us to miss some information we need.

If we decide to remove DTrace Support officially, we may need to update the documentation https://docs.python.org/3/howto/instrumentation.html

avioligo commented 9 months ago

When building python 3.11.7 from source, I ran into this error: python configure: error: dtrace command not found on $PATH

Fixed using: sudo apt-get install systemtap-sdt-dev By installing dtrace (well almost) toolkit - the binary is not really used for tracing but only to list probes if I am correct.

Bill-Sommerfeld commented 8 months ago

I took a look at this yesterday in 3.12 as part of my current effort to package 3.12 for OpenIndiana (illumos) and I got the function entry/exit probes restored and working well enough to pass the test_dtrace.py test.

I'll share when I get a chance - I'm doing this with original dtrace on illumos (the open-source continuation of opensolaris) but it would likely also work for those who want to use these probes via systemtap.

I'll also look at the line probes when I get a chance.

Bill-Sommerfeld commented 8 months ago

https://github.com/Bill-Sommerfeld/cpython/tree/dtrace-3.12 has patches that restore the function-entry and function-return probes for me on illumos. The tests now pass for me, except for test.test_dtrace.DTraceNormalTests.test_line and test.test_dtrace.DTraceOptimizedTests.test_line which depend on the line probe which I haven't restored yet.

I'd appreciate it if others could give these a whirl on other platforms.