maartenbreddels / per4m

Profiling and tracing information for Python using viztracer and perf, the GIL exposed.
MIT License
55 stars 5 forks source link

Possibly incorrect assertion on sequence: take_gil -> drop_gil -> drop_gil__return #7

Open olumide-x opened 2 months ago

olumide-x commented 2 months ago

I've been running per4m extensively on perf data sampled from a heavily multithreaded application that my team is developing and per4m consistently asserts on sequences take_gil -> drop_gil -> drop_gil__return, for example

       viztracer  5533 [003]  3220.317244274:         python:take_gil: (55fe99b0d01e)
       viztracer  5533 [003]  3220.317407813:         python:drop_gil: (55fe99b0cf20)
       viztracer  5533 [003]  3220.317412443: python:drop_gil__return: (55fe99b0cf20 <- 55fe99baaa9e)  # example 1
       viztracer  5533 [003]  3220.317419189:         python:take_gil: (55fe99b0d01e)
       viztracer  5533 [003]  3220.317422951:         python:drop_gil: (55fe99b0cf20)
       viztracer  5533 [003]  3220.317425869: python:drop_gil__return: (55fe99b0cf20 <- 55fe99baaa9e)  # example 1

Yet a read of the Python source shows that the function take_gil can call drop_gil and the function drop_gil returns to take_gil.

Is this a corner case that per4m does not anticipate?

maartenbreddels commented 2 months ago

Oof, it’s been a long time since I’ve worked on this or used it. So I don’t know. If you can improve it, I’m happy to accept a pr

olumide-x commented 1 month ago

I've been profiling the bug some more and I now have a pattern of GIL acquisition that reproduces the bug:

sudo perf script -i perf.data --no-inline --ns | grep -n "python  2269"
280262:          python  2269 [002]   574.166612735:         python:take_gil: (562c85985b10)
280265:          python  2269 [002]   574.166826298: python:take_gil__return: (562c85985b10 <- 562c85a14636)
280267:          python  2269 [002]   574.167545160:         python:drop_gil: (562c85985e65)
280330:          python  2269 [002]   574.171798239:         python:take_gil: (562c8598601e)
280332:          python  2269 [002]   574.173052402:         python:drop_gil: (562c85985e65)
280333:          python  2269 [002]   574.175078510:         python:take_gil: (562c8598601e)
280335:          python  2269 [002]   574.175685923:         python:drop_gil: (562c85985e65)
280372:          python  2269 [002]   574.185163792:         python:take_gil: (562c8598601e)
280375:          python  2269 [002]   574.186388026:         python:drop_gil: (562c85985e65)
280380:          python  2269 [002]   574.194330472:         python:take_gil: (562c8598601e)
280387:          python  2269 [002]   574.194721644:         python:drop_gil: (562c85985e65)
280415:          python  2269 [002]   574.195718706:         python:take_gil: (562c8598601e)
280417:          python  2269 [002]   574.196368758:         python:drop_gil: (562c85985e65)
280468:          python  2269 [000]   574.223191977:         python:take_gil: (562c8598601e)
280470:          python  2269 [000]   574.223241056:         python:drop_gil: (562c85985e65)
280471:          python  2269 [000]   574.225114334:         python:take_gil: (562c8598601e)
280472:          python  2269 [000]   574.225522201:         python:drop_gil: (562c85985e65)
280473:          python  2269 [000]   574.227129472:         python:take_gil: (562c8598601e)
280475:          python  2269 [000]   574.227508047:         python:drop_gil: (562c85985e65)
280548:          python  2269 [000]   574.244793681:         python:take_gil: (562c8598601e)
280550:          python  2269 [000]   574.245304399:         python:drop_gil: (562c85985e65)
280551:          python  2269 [000]   574.246684149:         python:take_gil: (562c8598601e)
280552:          python  2269 [000]   574.246733449:         python:drop_gil: (562c85985e65)
280553:          python  2269 [000]   574.246754309:         python:take_gil: (562c8598601e)
280554:          python  2269 [000]   574.246828468:         python:drop_gil: (562c85985f20)
280555:          python  2269 [000]   574.246832768: python:drop_gil__return: (562c85985f20 <- 562c85a23a9e)

The problem is that, when processing the event python:drop_gil__return, the timestamp of the event python:take_gil__return is incorrectly assigned to time_gil_take. Therefore line 267 of perf2trace.py asserts: 574166826.29 >= 574246754.3000001.

I'm reading to code properly to determine how best to fix the bug but, state machine-wise, I am not sure why:

  1. The event python:take_gil__return matters if it follows a python:take_gil, or why
  2. The event python:drop_gil__return matters if it follows a python:drop_gil.

Here are a couple of an excerpts of a trace with --call-graph dwarf enabled:

python  3952 [003]  8795.807072:         python:drop_gil: (563fb7a79f20)
            563fb7a79f20 drop_gil+0x0 (/usr/local/bin/python3.12)
            563fb7b17a9d _PyThreadState_DeleteCurrent+0x2d (/usr/local/bin/python3.12)
            563fb7b222a7 pythread_wrapper+0x17 (/usr/local/bin/python3.12)
            7f476a074608 start_thread+0xd8 (/usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
            7f4769e3f352 __GI___clone+0x42 (inlined)

python  3952 [003]  8795.807082: python:drop_gil__return: (563fb7a79f20 <- 563fb7b17a9e)
            563fb7b17a9e free_threadstate+0x2e (inlined)
            563fb7b17a9e _PyThreadState_DeleteCurrent+0x2e (/usr/local/bin/python3.12)
            563fb7b222a7 pythread_wrapper+0x17 (/usr/local/bin/python3.12)
            7f476a074608 start_thread+0xd8 (/usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
            7f4769e3f352 __GI___clone+0x42 (inlined)

Similarly

python  3891 [001]  8769.617141:         python:take_gil: (563fb7a79b10)
            563fb7a79b10 take_gil+0x0 (/usr/local/bin/python3.12)
            563fb7b08175 _PyEval_InitGIL+0x45 (/usr/local/bin/python3.12)
            563fb7b15040 init_interp_create_gil+0x60 (/usr/local/bin/python3.12)
            563fb7b14e7e pycore_create_interpreter+0x1be (inlined)
            563fb7b14e7e pyinit_config+0x1be (inlined)
            563fb7b14e7e pyinit_core+0x1be (inlined)
            563fb7b14b96 Py_InitializeFromConfig+0x56 (/usr/local/bin/python3.12)
            563fb7b2567a pymain_init+0x9a (/usr/local/bin/python3.12)
            563fb7b255a0 pymain_main+0x20 (/usr/local/bin/python3.12)
            563fb7b2555c Py_BytesMain+0x3c (/usr/local/bin/python3.12)
            7f4769d44082 __libc_start_main+0xf2 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)
            563fb7a908cd _start+0x2d (/usr/local/bin/python3.12)

python  3891 [001]  8769.617190: python:take_gil__return: (563fb7a79b10 <- 563fb7b08176)
            563fb7b08176 _PyEval_InitGIL+0x46 (/usr/local/bin/python3.12)
            563fb7b15040 init_interp_create_gil+0x60 (/usr/local/bin/python3.12)
            563fb7b14e7e pycore_create_interpreter+0x1be (inlined)
            563fb7b14e7e pyinit_config+0x1be (inlined)
            563fb7b14e7e pyinit_core+0x1be (inlined)
            563fb7b14b96 Py_InitializeFromConfig+0x56 (/usr/local/bin/python3.12)
            563fb7b2567a pymain_init+0x9a (/usr/local/bin/python3.12)
            563fb7b255a0 pymain_main+0x20 (/usr/local/bin/python3.12)
            563fb7b2555c Py_BytesMain+0x3c (/usr/local/bin/python3.12)
            7f4769d44082 __libc_start_main+0xf2 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)
            563fb7a908cd _start+0x2d (/usr/local/bin/python3.12)

Any guidance would be appreciated.