Open olumide-x opened 2 months ago
Oof, it’s been a long time since I’ve worked on this or used it. So I don’t know. If you can improve it, I’m happy to accept a pr
I've been profiling the bug some more and I now have a pattern of GIL acquisition that reproduces the bug:
sudo perf script -i perf.data --no-inline --ns | grep -n "python 2269"
280262: python 2269 [002] 574.166612735: python:take_gil: (562c85985b10)
280265: python 2269 [002] 574.166826298: python:take_gil__return: (562c85985b10 <- 562c85a14636)
280267: python 2269 [002] 574.167545160: python:drop_gil: (562c85985e65)
280330: python 2269 [002] 574.171798239: python:take_gil: (562c8598601e)
280332: python 2269 [002] 574.173052402: python:drop_gil: (562c85985e65)
280333: python 2269 [002] 574.175078510: python:take_gil: (562c8598601e)
280335: python 2269 [002] 574.175685923: python:drop_gil: (562c85985e65)
280372: python 2269 [002] 574.185163792: python:take_gil: (562c8598601e)
280375: python 2269 [002] 574.186388026: python:drop_gil: (562c85985e65)
280380: python 2269 [002] 574.194330472: python:take_gil: (562c8598601e)
280387: python 2269 [002] 574.194721644: python:drop_gil: (562c85985e65)
280415: python 2269 [002] 574.195718706: python:take_gil: (562c8598601e)
280417: python 2269 [002] 574.196368758: python:drop_gil: (562c85985e65)
280468: python 2269 [000] 574.223191977: python:take_gil: (562c8598601e)
280470: python 2269 [000] 574.223241056: python:drop_gil: (562c85985e65)
280471: python 2269 [000] 574.225114334: python:take_gil: (562c8598601e)
280472: python 2269 [000] 574.225522201: python:drop_gil: (562c85985e65)
280473: python 2269 [000] 574.227129472: python:take_gil: (562c8598601e)
280475: python 2269 [000] 574.227508047: python:drop_gil: (562c85985e65)
280548: python 2269 [000] 574.244793681: python:take_gil: (562c8598601e)
280550: python 2269 [000] 574.245304399: python:drop_gil: (562c85985e65)
280551: python 2269 [000] 574.246684149: python:take_gil: (562c8598601e)
280552: python 2269 [000] 574.246733449: python:drop_gil: (562c85985e65)
280553: python 2269 [000] 574.246754309: python:take_gil: (562c8598601e)
280554: python 2269 [000] 574.246828468: python:drop_gil: (562c85985f20)
280555: python 2269 [000] 574.246832768: python:drop_gil__return: (562c85985f20 <- 562c85a23a9e)
The problem is that, when processing the event python:drop_gil__return
, the timestamp of the event python:take_gil__return
is incorrectly assigned to time_gil_take
. Therefore line 267 of perf2trace.py asserts: 574166826.29 >= 574246754.3000001
.
I'm reading to code properly to determine how best to fix the bug but, state machine-wise, I am not sure why:
python:take_gil__return
matters if it follows a python:take_gil
, or whypython:drop_gil__return
matters if it follows a python:drop_gil
.Here are a couple of an excerpts of a trace with --call-graph dwarf
enabled:
python 3952 [003] 8795.807072: python:drop_gil: (563fb7a79f20)
563fb7a79f20 drop_gil+0x0 (/usr/local/bin/python3.12)
563fb7b17a9d _PyThreadState_DeleteCurrent+0x2d (/usr/local/bin/python3.12)
563fb7b222a7 pythread_wrapper+0x17 (/usr/local/bin/python3.12)
7f476a074608 start_thread+0xd8 (/usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
7f4769e3f352 __GI___clone+0x42 (inlined)
python 3952 [003] 8795.807082: python:drop_gil__return: (563fb7a79f20 <- 563fb7b17a9e)
563fb7b17a9e free_threadstate+0x2e (inlined)
563fb7b17a9e _PyThreadState_DeleteCurrent+0x2e (/usr/local/bin/python3.12)
563fb7b222a7 pythread_wrapper+0x17 (/usr/local/bin/python3.12)
7f476a074608 start_thread+0xd8 (/usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
7f4769e3f352 __GI___clone+0x42 (inlined)
Similarly
python 3891 [001] 8769.617141: python:take_gil: (563fb7a79b10)
563fb7a79b10 take_gil+0x0 (/usr/local/bin/python3.12)
563fb7b08175 _PyEval_InitGIL+0x45 (/usr/local/bin/python3.12)
563fb7b15040 init_interp_create_gil+0x60 (/usr/local/bin/python3.12)
563fb7b14e7e pycore_create_interpreter+0x1be (inlined)
563fb7b14e7e pyinit_config+0x1be (inlined)
563fb7b14e7e pyinit_core+0x1be (inlined)
563fb7b14b96 Py_InitializeFromConfig+0x56 (/usr/local/bin/python3.12)
563fb7b2567a pymain_init+0x9a (/usr/local/bin/python3.12)
563fb7b255a0 pymain_main+0x20 (/usr/local/bin/python3.12)
563fb7b2555c Py_BytesMain+0x3c (/usr/local/bin/python3.12)
7f4769d44082 __libc_start_main+0xf2 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)
563fb7a908cd _start+0x2d (/usr/local/bin/python3.12)
python 3891 [001] 8769.617190: python:take_gil__return: (563fb7a79b10 <- 563fb7b08176)
563fb7b08176 _PyEval_InitGIL+0x46 (/usr/local/bin/python3.12)
563fb7b15040 init_interp_create_gil+0x60 (/usr/local/bin/python3.12)
563fb7b14e7e pycore_create_interpreter+0x1be (inlined)
563fb7b14e7e pyinit_config+0x1be (inlined)
563fb7b14e7e pyinit_core+0x1be (inlined)
563fb7b14b96 Py_InitializeFromConfig+0x56 (/usr/local/bin/python3.12)
563fb7b2567a pymain_init+0x9a (/usr/local/bin/python3.12)
563fb7b255a0 pymain_main+0x20 (/usr/local/bin/python3.12)
563fb7b2555c Py_BytesMain+0x3c (/usr/local/bin/python3.12)
7f4769d44082 __libc_start_main+0xf2 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)
563fb7a908cd _start+0x2d (/usr/local/bin/python3.12)
Any guidance would be appreciated.
I've been running per4m extensively on perf data sampled from a heavily multithreaded application that my team is developing and per4m consistently asserts on sequences take_gil -> drop_gil -> drop_gil__return, for example
Yet a read of the Python source shows that the function
take_gil
can calldrop_gil
and the functiondrop_gil
returns totake_gil
.Is this a corner case that per4m does not anticipate?