Closed phlax closed 1 year ago
What Pants command or commands do you run in CI?. If it's just 1 command, you should definitely try adding --no-pantsd
, this may offer a workaround. If there is more than one command you can still use this workaround, but will suffer slower builds than you would otherwise have with pantsd.
I have seen behavior in this ballpark noted here but do not have a solution yet: https://github.com/pantsbuild/scie-pants/blob/4f06c9de4e99e667bacde43c6367accd930b7013/package/src/main.rs#L861-L879
I have since learned how to get ssh access to GitHub actions CI jobs; so I'll try to repro this out there tomorrow and examine the core dump for better clues.
Im pretty sure i tested with --no-pantsd
- iirc it seemed to lessen but still happened - i tried quite a few other things to resolve - like running uncached - but the flake seemed to persist (albeit infrequently)
i reverted the update as i couldnt find a solution but you can see some failed runs here https://github.com/envoyproxy/pytooling/actions?query=is%3Afailure+branch%3Amain - there are some other pr ones where i was trying to fix also
this may give more of a clue https://github.com/envoyproxy/pytooling/actions/runs/4049513918/jobs/6965968013#step:6:144
...
README (README.md) is up-to-date
Fatal Python error: PyGILState_Release: thread state 0x7f3068001140 must be current when releasing
Python runtime state: finalizing (tstate=0xb092f0)
Thread 0x00007f308227e000 (most recent call first):
<no Python frame>
/home/runner/work/_temp/0ac5f650-e9a0-482b-9a38-eefb912f3a2a.sh: line 1: 1696 Aborted (core dumped) ./scripts/pants readme --check=README.md ::
What Pants command or commands do you run in CI?
The commands running are mostly the standard lint/check/pytest ones - with an additional hacky readme goal that i added
https://github.com/envoyproxy/pytooling/blob/main/.github/workflows/ci.yml
Ok, I was able to repro by using the GH Actions ssh trick @benjyw taught me. https://github.com/pantsbuild/scie-pants/pull/93 yielded a session where I was able to generate 6 core dumps. I'll post the 1st one here (last in time order), but I continue to examine the others. This is for Pants 2.15.0rc0 and was triggered out on a GH Actions ubuntu 20.04 box with:
runner@fv-az218-666:~/work/scie-pants/scie-pants$ while pants -V 2>&1 | tee -a log; do sleep 1; if [ -z "${PANTS_VERSION}" ]; then export PANTS_VERSION=2.14.1; else unset PANTS_VERSION && export PANTS_VERSION && rm -rf ~/.cache/nce; fi done
16:47:44.26 [INFO] Initializing scheduler...
16:47:44.93 [INFO] Scheduler initialized.
2.15.0rc0
Fatal Python error: PyGILState_Release: thread state 0x7fb6080011b0 must be current when releasing
Python runtime state: finalizing (tstate=0x159ae30)
Thread 0x00007fb6200b0000 (most recent call first):
<no Python frame>
The backtrace looks like:
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139257791374912) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=139257791374912) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=139257791374912, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007f27784b8476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#4 <signal handler called>
#5 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139257791374912) at ./nptl/pthread_kill.c:44
#6 __pthread_kill_internal (signo=6, threadid=139257791374912) at ./nptl/pthread_kill.c:78
#7 __GI___pthread_kill (threadid=139257791374912, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#8 0x00007f27784b8476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#9 0x00007f277849e7f3 in __GI_abort () at ./stdlib/abort.c:79
#10 0x00007f27784ff45c in __libc_message (action=do_abort, fmt=0x7f27786517b1 "%s", fmt=0x7f27786517b1 "%s", action=do_abort) at ../sysdeps/posix/libc_fatal.c:155
#11 0x00007f27784ff770 in __GI___libc_fatal (message=<optimized out>) at ../sysdeps/posix/libc_fatal.c:164
#12 0x00007f2778513476 in unwind_cleanup (reason=<optimized out>, exc=<optimized out>) at ./nptl/unwind.c:114
#13 0x00007f277752ad0f in panic_unwind::real_imp::cleanup () at library/panic_unwind/src/gcc.rs:78
#14 panic_unwind::__rust_panic_cleanup () at library/panic_unwind/src/lib.rs:100
#15 0x00007f27769216ba in std::panicking::try::cleanup () at library/std/src/panicking.rs:473
#16 0x00007f2776bef17c in std::panicking::try::do_catch<engine::externs::interface::__pyfunction_scheduler_execute::{closure_env#0}, core::result::Result<*mut pyo3_ffi::object::PyObject, pyo3::err::PyErr>> (
payload=0x7ea77b2fecb0, data=<optimized out>) at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panicking.rs:517
#17 std::panicking::try<core::result::Result<*mut pyo3_ffi::object::PyObject, pyo3::err::PyErr>, engine::externs::interface::__pyfunction_scheduler_execute::{closure_env#0}> (f=...)
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panicking.rs:456
#18 std::panic::catch_unwind<engine::externs::interface::__pyfunction_scheduler_execute::{closure_env#0}, core::result::Result<*mut pyo3_ffi::object::PyObject, pyo3::err::PyErr>> (f=...)
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panic.rs:137
#19 engine::externs::interface::__pyfunction_scheduler_execute (_slf=<optimized out>, _args=<optimized out>, _nargs=<optimized out>, _kwnames=<optimized out>) at src/externs/interface.rs:1060
#20 0x00007f2778a8c3c4 in cfunction_vectorcall_FASTCALL_KEYWORDS (func=0x7f27779872c0, args=0x7ee454647df8, nargsf=<optimized out>, kwnames=0x0) at Objects/methodobject.c:446
#21 0x00007f2778aca0db in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f27779872c0, args=0x7ee454647df8, nargsf=9223372036854775811, kwnames=0x0) at ./Include/cpython/abstract.h:118
#22 PyObject_Vectorcall (callable=0x7f27779872c0, args=0x7ee454647df8, nargsf=9223372036854775811, kwnames=0x0) at ./Include/cpython/abstract.h:127
#23 call_function (tstate=0x7ea28e4d3e20, oparg=3, kwnames=0x0, pp_stack=<optimized out>) at Python/ceval.c:5077
#24 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee454647c40, throwflag=<optimized out>) at Python/ceval.c:3489
#25 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee454647c40, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#26 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f2776510920, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#27 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
--Type <RET> for more, q to quit, c to continue without paging--c
#28 0x00007f2778ac8a62 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2776486430, args=0x7ee454659d10, nargsf=9223372036854775810, kwnames=0x0) at ./Include/cpython/abstract.h:118
#29 PyObject_Vectorcall (callable=0x7f2776486430, args=0x7ee454659d10, nargsf=9223372036854775810, kwnames=0x0) at ./Include/cpython/abstract.h:127
#30 call_function (tstate=0x7ea28e4d3e20, oparg=2, kwnames=0x0, pp_stack=<optimized out>) at Python/ceval.c:5077
#31 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee454659b80, throwflag=<optimized out>) at Python/ceval.c:3506
#32 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee454659b80, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#33 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f2776510c90, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#34 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
#35 0x00007f2778ac8a62 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2776486550, args=0x7ee4545dd7c0, nargsf=9223372036854775810, kwnames=0x0) at ./Include/cpython/abstract.h:118
#36 PyObject_Vectorcall (callable=0x7f2776486550, args=0x7ee4545dd7c0, nargsf=9223372036854775810, kwnames=0x0) at ./Include/cpython/abstract.h:127
#37 call_function (tstate=0x7ea28e4d3e20, oparg=2, kwnames=0x0, pp_stack=<optimized out>) at Python/ceval.c:5077
#38 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee4545dd610, throwflag=<optimized out>) at Python/ceval.c:3506
#39 0x00007f2778a6864b in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee4545dd610, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#40 _PyEval_EvalCode (tstate=0x7ea28e4d3e20, _co=0x7f2776510f50, globals=<optimized out>, locals=0x0, args=<optimized out>, argcount=<optimized out>, kwnames=0x7f2775891b98, kwcount=3, kwstep=1, defs=0x0, defcount=0, kwdefs=0x7f27764d7a00, closure=<optimized out>, name=<optimized out>, qualname=0x7f27765115d0, kwargs=<optimized out>) at Python/ceval.c:4329
#41 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:396
#42 0x00007f2778a6aedd in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2776486670, args=0x7ee4545dd5b0, nargsf=2, kwnames=0x7ea77b2fac40) at ./Include/cpython/abstract.h:118
#43 method_vectorcall (method=<optimized out>, args=0x7ee4545dd5b8, nargsf=<optimized out>, kwnames=0x7ea77b2fac40) at Objects/classobject.c:53
#44 0x00007f2778ac8047 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f277474e240, args=0x0, nargsf=<optimized out>, kwnames=0x7f2775891b80) at ./Include/cpython/abstract.h:118
#45 PyObject_Vectorcall (callable=0x7f277474e240, args=0x0, nargsf=<optimized out>, kwnames=0x7f2775891b80) at ./Include/cpython/abstract.h:127
#46 call_function (tstate=0x7ea28e4d3e20, oparg=<optimized out>, kwnames=0x7f2775891b80, pp_stack=<optimized out>) at Python/ceval.c:5077
#47 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee4545dd420, throwflag=<optimized out>) at Python/ceval.c:3537
#48 0x00007f2778a6864b in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee4545dd420, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#49 _PyEval_EvalCode (tstate=0x7ea28e4d3e20, _co=0x7f27765a1b30, globals=<optimized out>, locals=0x0, args=<optimized out>, argcount=<optimized out>, kwnames=0x7f277659a358, kwcount=1, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=<optimized out>, name=<optimized out>, qualname=0x7f27754c40d0, kwargs=<optimized out>) at Python/ceval.c:4329
#50 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:396
#51 0x00007f2778a6aedd in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f27754c3c10, args=0x7ee4541739f8, nargsf=2, kwnames=0x7ea77b2fac40) at ./Include/cpython/abstract.h:118
#52 method_vectorcall (method=<optimized out>, args=0x7ee454173a00, nargsf=<optimized out>, kwnames=0x7ea77b2fac40) at Objects/classobject.c:53
#53 0x00007f2778ac8047 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7ee454143400, args=0x0, nargsf=<optimized out>, kwnames=0x7f277659a340) at ./Include/cpython/abstract.h:118
#54 PyObject_Vectorcall (callable=0x7ee454143400, args=0x0, nargsf=<optimized out>, kwnames=0x7f277659a340) at ./Include/cpython/abstract.h:127
#55 call_function (tstate=0x7ea28e4d3e20, oparg=<optimized out>, kwnames=0x7f277659a340, pp_stack=<optimized out>) at Python/ceval.c:5077
#56 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee454173840, throwflag=<optimized out>) at Python/ceval.c:3537
#57 0x00007f2778a6864b in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee454173840, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#58 _PyEval_EvalCode (tstate=0x7ea28e4d3e20, _co=0x7f277550b030, globals=<optimized out>, locals=0x0, args=<optimized out>, argcount=<optimized out>, kwnames=0x7f277659a358, kwcount=1, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=<optimized out>, name=<optimized out>, qualname=0x7f27754c6030, kwargs=<optimized out>) at Python/ceval.c:4329
#59 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:396
#60 0x00007f2778a6aedd in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f27754c3ca0, args=0x7ee4546479d0, nargsf=1, kwnames=0x7ea77b2fac40) at ./Include/cpython/abstract.h:118
#61 method_vectorcall (method=<optimized out>, args=0x7ee4546479d8, nargsf=<optimized out>, kwnames=0x7ea77b2fac40) at Objects/classobject.c:53
#62 0x00007f2778ac8047 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f276c041b00, args=0x0, nargsf=<optimized out>, kwnames=0x7f277659a340) at ./Include/cpython/abstract.h:118
#63 PyObject_Vectorcall (callable=0x7f276c041b00, args=0x0, nargsf=<optimized out>, kwnames=0x7f277659a340) at ./Include/cpython/abstract.h:127
#64 call_function (tstate=0x7ea28e4d3e20, oparg=<optimized out>, kwnames=0x7f277659a340, pp_stack=<optimized out>) at Python/ceval.c:5077
#65 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee454647840, throwflag=<optimized out>) at Python/ceval.c:3537
#66 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee454647840, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#67 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f27754c1c90, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#68 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
#69 0x00007f2778a6b004 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f27754c3ee0, args=0x7ea77b2fceb8, nargsf=1, kwnames=0x0) at ./Include/cpython/abstract.h:118
#70 method_vectorcall (method=<optimized out>, args=0x7f27780f8058, nargsf=<optimized out>, kwnames=0x7ea77b2fac40) at Objects/classobject.c:61
#71 0x00007f2778ac96b3 in PyVectorcall_Call (callable=0x7ee4545e2cc0, tuple=0x0, kwargs=0x7ee4545e2b40) at Objects/call.c:231
#72 _PyObject_Call (tstate=<optimized out>, callable=0x7ee4545e2cc0, args=0x0, kwargs=0x7ee4545e2b40) at Objects/call.c:266
#73 PyObject_Call (callable=0x7ee4545e2cc0, args=0x0, kwargs=0x7ee4545e2b40) at Objects/call.c:293
#74 do_call_core (tstate=0x7ea28e4d3e20, func=0x7ee4545e2cc0, callargs=0x0, kwdict=0x7ee4545e2b40) at Python/ceval.c:5125
#75 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7f276c07a3e0, throwflag=<optimized out>) at Python/ceval.c:3582
#76 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7f276c07a3e0, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#77 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f2777e775b0, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#78 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
#79 0x00007f2778ac8a62 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2777e86160, args=0x7ee4546475b8, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:118
#80 PyObject_Vectorcall (callable=0x7f2777e86160, args=0x7ee4546475b8, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:127
#81 call_function (tstate=0x7ea28e4d3e20, oparg=1, kwnames=0x0, pp_stack=<optimized out>) at Python/ceval.c:5077
#82 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee454647440, throwflag=<optimized out>) at Python/ceval.c:3506
#83 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee454647440, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#84 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f2777e77920, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#85 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
#86 0x00007f2778ac8a62 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2777e86430, args=0x7f276c0761b8, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:118
#87 PyObject_Vectorcall (callable=0x7f2777e86430, args=0x7f276c0761b8, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:127
#88 call_function (tstate=0x7ea28e4d3e20, oparg=1, kwnames=0x0, pp_stack=<optimized out>) at Python/ceval.c:5077
#89 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7f276c076040, throwflag=<optimized out>) at Python/ceval.c:3506
#90 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7f276c076040, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#91 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f2777e77660, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#92 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
#93 0x00007f2778a6b004 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2777e861f0, args=0x7ea77b2fdda8, nargsf=1, kwnames=0x0) at ./Include/cpython/abstract.h:118
#94 method_vectorcall (method=<optimized out>, args=0x7f27780f8058, nargsf=<optimized out>, kwnames=0x7ea77b2fac40) at Objects/classobject.c:61
#95 0x00007f2778c6f6ca in t_bootstrap (boot_raw=0x7ee4545e8900) at ./Modules/_threadmodule.c:1040
#96 0x00007f2778c01891 in pythread_wrapper (arg=<optimized out>) at Python/thread_pthread.h:245
#97 0x00007f277850ab43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#98 0x00007f277859ca00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)
The full backtrace is attached here: gdb.txt
The lines surround the panicing line are:
#18 std::panic::catch_unwind<engine::externs::interface::__pyfunction_scheduler_execute::{closure_env#0}, core::result::Result<*mut pyo3_ffi::object::PyObject, pyo3::err::PyErr>> (f=...) at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panic.rs:137
No locals.
#19 engine::externs::interface::__pyfunction_scheduler_execute (_slf=<optimized out>, _args=<optimized out>, _nargs=<optimized out>, _kwnames=<optimized out>) at src/externs/interface.rs:1060
_py = <optimized out>
gil = pyo3::gil::GILPool {start: core::option::Option<usize>::Some(0), _not_send: pyo3::impl_::not_send::NotSend (core::marker::PhantomData<*mut pyo3::marker::Python>)}
#20 0x00007f2778a8c3c4 in cfunction_vectorcall_FASTCALL_KEYWORDS (func=0x7f27779872c0, args=0x7ee454647df8, nargsf=<optimized out>, kwnames=0x0) at Objects/methodobject.c:446
tstate = 0x7ea28e4d3e20
nargs = 3
meth = 0x0
result = <optimized out>
The line that panics is here at 1060 in interface.rs
:
https://github.com/pantsbuild/pants/blob/54712e15fdee9ea6cb38e134761222eb3284a3a0/src/rust/engine/src/externs/interface.rs#L1060-L1088
The complete full backtrace of all threads is here: full-bt.log
@stuhood this is probably a lot to page in, but is this helpful? FWICT the macro-expanded infra code is panicing - i.e. that's why the line number is 1060 - and not our code.
It may or may not be real or important, but I could not trigger the repro until I switched running pnats -V
to pants -V ... | tee
. My motivation was the fact that the scie-pants CI runs that hit this error all run Pants via Rust subprocess executions via Command. That led me to think about stdio re-direction / terminals or lack thereof, etc. It struck me piping output would simulate the failures I had seen. Less than fuzzy thinking, may not be a direct correlate, give credence at your own risk.
Thanks for that: I'll take a look.
I don't think that I'll be able to tell from the backtraces whether pantsd
was in use. Was it?
That's ~in the 1st code block, but yes, pantsd is in use. @phlax weakly claims --no-pantsd
did not help, but scie-pants ITs definitely are helped by this. They ~80%+ failed before and 100% pass after disabling pantsd here: https://github.com/pantsbuild/scie-pants/blob/4f06c9de4e99e667bacde43c6367accd930b7013/package/src/main.rs#L861-L879
not sure if it didnt help - fairly sure it didnt cure
✓ mypy succeeded.
FATAL: exception not rethrown
/home/runner/work/_temp/77ef4a23-a00c-431e-abc8-b8ae9a15d4b8.sh: line 1: 1718 Aborted (core dumped) ./scripts/pants --no-pantsd check ::
https://github.com/envoyproxy/pytooling/actions/runs/4053370248/jobs/6973910733#step:4:46
So, the first observation here is that:
@phlax: Are you able to try running with the most recent PANTS_SHA
from this branch: https://github.com/pantsbuild/pants/commits/stuhood/2.15.x-panic-abort , and see if that changes the output at all? It is 2.15.0rc2
plus one commit.
so preliminary reports using 25f1cee6b5796f001e8f993b1c2b207eb673a6a4
in order to do the upgrade i had to disable my readme goal (i can look further at updating this to work with >=2.15 but initial attempts failed)
the first run of gh actions did not throw the rethrown
issue - so with the caveat that a flake not failing is not proof of it not existing - it seems to have worked
the question im now wondering is whether it was this commit or the disabling of my custom goal that (may have) resolved the issue - i guess i can test with my current version (2.14.1) and the goal disabled to see if that also works
disabling the readme goal seems to work without fail (2 x CI runs) with current version
arguably then the bug is on my side - with a custom goal that needs some work
otoh, the fact that it works flawlessly using ./pants
, works mostly when using ./pantsup.sh
and that the error message gives so little away in terms of what has gone wrong, there is still an issue on the pants side i think
(assuming it just didnt flake in my tests - the flake i was seeing previously failed ~1 in 5)
i think i spoke too soon:
✓ mypy succeeded.
FATAL: exception not rethrown
/home/runner/work/_temp/1e159c1d-68bb-4fd0-940c-c04b7c297776.sh: line 1: 1723 Aborted (core dumped) ./scripts/pants check ::
https://github.com/envoyproxy/pytooling/actions/runs/4074228494/jobs/7019114947#step:5:44
ill revert to using the unreleased commit and give it a few ci runs to see if that helps
reverting to 25f1cee6b5796f001e8f993b1c2b207eb673a6a4 seems to have worked - 3x CI successful runs
https://github.com/envoyproxy/pytooling/actions?query=branch%3Ascie-testing
Even if there is a bug in your custom goal, I wouldn't expect it to be able to fail so spectacularly, so presumably that just exposed a problem in the Pants engine. A problem that seems, however, to be fixed at https://github.com/pantsbuild/pants/commit/25f1cee6b5796f001e8f993b1c2b207eb673a6a4 ...
Thanks to the backtrace from @jsirois's repro, I have a slightly clearer picture of what is going on there.
scheduler_execute
method, and being told by the Python runtime: "hey, you can't still be running: I'm exiting!": it then aborts.So it is definitely related to #16105. The followup questions then are:
pantsd
is in fact being used. That would mean that pantsd
is the process exiting, and AFAICT it should not be doing that in this case (unless there is something about John's repro loop causing it to restart itself? Possibly the PANTS_VERSION fiddling).
.pants.d/pants.log
for your repro case? At -ldebug
, ideally.pantsd
is shutting down, what task is still trying to do work in Thread 1? pantsd
is supposed to shut itself down in an orderly manner, which involves waiting for the exit of any threads which could be poking the scheduler.
-ldebug
log highlights this.@phlax : I think that because your error is so different, we will need to try and capture a backtrace similar to the one John was able to capture. To do that, you would need:
gdb
installedgdb $python_interpreter $core
~/.cache/nce/*/cpython-*/python/bin/python
/var/lib/apport/coredump
(see https://askubuntu.com/a/1349048) thread apply all bt
I recognize that that is a lot to ask though. If you're able to do it it would be appreciated, but I will also attempt to capture a similar thing from Pants' own CI in parallel.
@stuhood "my repro case" is remote. The job I linked here: https://github.com/pantsbuild/scie-pants/pull/93 used this GH Actions YAML to expose ssh and I just ssh'd into the box and used the command I showed to eventually get a crash:
while pants -V 2>&1 | tee -a log; do sleep 1; if [ -z "${PANTS_VERSION}" ]; then export PANTS_VERSION=2.14.1; else unset PANTS_VERSION && export PANTS_VERSION && rm -rf ~/.cache/nce; fi done
I won't be back at a keyboard to devote time to this until Friday or later, but anyone can use the above too; otherwise I'll report back here at some point.
FWIW this action lets you ssh into a github actions runner: https://github.com/marketplace/actions/debugging-with-ssh
I was able to get a repro of a FATAL: exception not rethrown
failure in this harness: #18155. Thanks for the SSH tip: that's a game changer.
The backtrace is: exception-not-rethrown.bt.txt
So, based on the "main thread exiting, so other thread experiences abort" shape of both of these errors, I now do suspect that they are related. Because this is a unit test which uses the engine, rather than a full blown Pants run, this one exhibits a little bit differently.
I think that I know what changes will be necessary to fix the unit-test teardown case, but I'm going to try to get the log output from John's to increase the chances that a fix gets both codepaths.
Hm, ok. John's crash is different, because the reason pantsd
is exiting in that case is a SIGTERM from a Pants client with a different fingerprint (it toggles between two Pants versions). I've opened #18159 to cover that one, as the signal handling aspect adds a new dimension that should make it much more rare.
I'll try to get a patch out for this one first thing tomorrow, but @benjyw : if it has the shape I'm thinking it will, we probably will need to pick it to 2.15.x
.
happy to debug further (and excited to try out the gh ssh access) but a bit snowed under today - ill try and have a look over the weekend
i sshed into a job and tried running
$ ./scripts/pants check ::
$ rm -rf ~/.cache
but i havent been able to trigger the issue so far - not sure how you activate the ssh on failure (other than in a subsequent job) - i just see running a job standalone
trying to trigger this from the gh action - the following never (seems to) trigger:
while ./scripts/pants -V 2>&1; do sleep 1; rm -rf ~/.cache/nce; done
adding a pipe does however fairly readily trigger ~the issue
while ./scripts/pants -V 2>&1 | tee -a log; do sleep 1; rm -rf ~/.cache/nce; done
not sure if that is just switching to a different bug as discussed above, or whether this is the bug that is being triggered elsewhere
i tried the above using ./scripts/pants check ::
and no pipe also - and was unable to trigger after running it for a while
in the normal CI jobs the error persists (with more or less info)
✓ mypy succeeded.
Fatal Python error: PyGILState_Release: thread state 0x7fc540001140 must be current when releasing
Python runtime state: finalizing (tstate=0x14210a0)
FATAL: exception not rethrown
/home/runner/work/_temp/19b3bc93-8085-4072-8f0a-70c8e87c5201.sh: line 1: 1700 Aborted (core dumped) ./scripts/pants check ::
putting aside the ssh debugging im wondering if its possible to get backtrace when it fails
fwiw this is the backtrace from looping output to a pipe as mentioned above
Thread 5 (Thread 0x7f1bacd38b80 (LWP 5868)):
#0 0x00007f1bad3d12c4 in update_refs (containers=0xdf23f0) at Modules/gcmodule.c:414
#1 deduce_unreachable (base=0xdf23f0, unreachable=0x7ffe2a147ef0) at Modules/gcmodule.c:1091
#2 collect (tstate=0xdf4fd0, generation=2, n_collected=0x7ffe2a147fd0, n_uncollectable=0x7ffe2a147fc8, nofail=0) at Modules/gcmodule.c:1212
#3 0x00007f1bad4cc6b7 in collect_with_callback (generation=2, tstate=<optimized out>) at Modules/gcmodule.c:1387
#4 PyGC_Collect () at Modules/gcmodule.c:2075
#5 0x00007f1bad3c7450 in _PyGC_CollectIfEnabled () at Modules/gcmodule.c:2086
#6 Py_FinalizeEx () at Python/pylifecycle.c:1423
#7 0x00007f1bad2594d8 in Py_Exit (sts=14623736) at Python/pylifecycle.c:2433
#8 0x00007f1bad25b75a in handle_system_exit () at Python/pythonrun.c:723
#9 0x00007f1bad25b3d9 in _PyErr_PrintEx (tstate=0xdf23f8, set_sys_last_vars=0) at Python/pythonrun.c:733
#10 0x00007f1bad25adf8 in PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:828
#11 PyErr_Print () at Python/pythonrun.c:834
#12 pyrun_simple_file (fp=<optimized out>, filename=<optimized out>, closeit=1, flags=0x7ffe2a148398) at Python/pythonrun.c:456
#13 0x00007f1bad25ac0f in PyRun_SimpleFileExFlags (fp=0xdea340, filename=<optimized out>, closeit=1, flags=0x7ffe2a148398) at Python/pythonrun.c:483
#14 0x00007f1bad263d98 in pymain_run_file (config=<optimized out>, cf=0x7ffe2a148398) at Modules/main.c:373
#15 0x00007f1bad4cc02b in pymain_run_python (exitcode=0x7ffe2a148394) at Modules/main.c:598
#16 Py_RunMain () at Modules/main.c:677
#17 0x00007f1bad264537 in Py_BytesMain (argc=<optimized out>, argv=0x0) at Modules/main.c:731
#18 0x00007f1bacd63d90 in __libc_start_call_main (main=main@entry=0x401060 <main>, argc=argc@entry=5, argv=argv@entry=0x7ffe2a148548) at ../sysdeps/nptl/libc_start_call_main.h:58
#19 0x00007f1bacd63e40 in __libc_start_main_impl (main=0x401060 <main>, argc=5, argv=0x7ffe2a148548, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe2a148538) at ../csu/libc-start.c:392
#20 0x000000000040108e in _start ()
Thread 4 (Thread 0x7f1ba9d93640 (LWP 5918)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f1babf4dd79 in ?? ()
#2 0x00007f1ba9fa8f00 in ?? ()
#3 0x00007f1b9c000cd0 in ?? ()
#4 0x0000000001154cd8 in ?? ()
#5 0x0000000001154cc8 in ?? ()
#6 0x0000000000000000 in ?? ()
Thread 3 (Thread 0x7f1ba9f94640 (LWP 5917)):
#0 0x00007f1bace5ffde in epoll_wait (epfd=5, events=0x1477170, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007f1babf4bcae in ?? ()
#2 0x0000000001300cf8 in ?? ()
#3 0x00007f1babf293d1 in ?? ()
#4 0x0000000001300cf8 in ?? ()
#5 0x00007f1ba9f93940 in ?? ()
#6 0x0000000001477170 in ?? ()
#7 0x0000000000000400 in ?? ()
#8 0x0000000000000000 in ?? ()
Thread 2 (Thread 0x7f1ba9b92640 (LWP 5922)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f1babf4dd3d in ?? ()
#2 0x00007f1bac409808 in ?? ()
#3 0x00007f1ba0000cd0 in ?? ()
#4 0x00000000010308c0 in ?? ()
#5 0x0000000001030958 in ?? ()
#6 0x0000000000000001 in ?? ()
#7 0x00000000062dc517 in ?? ()
#8 0x00000000000008f4 in ?? ()
#9 0x0000000000000001 in ?? ()
#10 0x0000000000000009 in ?? ()
#11 0x000000003b9abb8c in ?? ()
#12 0x00007f1ba9b91c40 in ?? ()
#13 0x00000000010308c0 in ?? ()
#14 0x00000000010308b0 in ?? ()
#15 0x0000000001173f01 in ?? ()
#16 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x7f1ba9991640 (LWP 5923) (Exiting)):
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139756786226752) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=139756786226752) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=139756786226752, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007f1bacd7c476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007f1bacd627f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007f1bacdc345c in __libc_message (action=do_abort, fmt=0x7f1bacf157b1 "%s", fmt=0x7f1bacf157b1 "%s", action=do_abort) at ../sysdeps/posix/libc_fatal.c:155
#6 0x00007f1bacdc3770 in __GI___libc_fatal (message=<optimized out>) at ../sysdeps/posix/libc_fatal.c:164
#7 0x00007f1bacdd7476 in unwind_cleanup (reason=<optimized out>, exc=<optimized out>) at ./nptl/unwind.c:114
#8 0x00007f1bac02c1cf in ?? ()
#9 0x00007f1ba9991cb0 in ?? ()
#10 0x00007f1bac3f0388 in ?? ()
Do we understand why scie-pants is exposing this bug, which seems like it's always been lurking? I am seeing it consistently in my PR to use pants
in the setup repo: https://github.com/pantsbuild/setup/pull/142/checks
Besides clearly being a difference in PBS vs other Python distributions I think no. I say clearly because both ./pants
and pants
exec
into a pants
venv console script; so the only relevant difference can be the venv interpreter at that point. The PBS obviously differs in the ~10 patch files it applies to CPython sources and any C build / link optimizations it uses above and beyond vanilla CPython builds / system package builds. I have not dug into either the patch content or the optimization differences myself.
@benjyw I believe it's not the case for @phlax but it is the case for me / scie-pants ITs in CI that disabling pantsd works around: https://github.com/pantsbuild/scie-pants/blob/7fc1471d6d0d66615fb0217459d5942cd161a799/package/src/main.rs#L867-L885
The solution for me was to upgrade from Pants 2.14.0 to 2.14.2. That version has updates (from https://github.com/pantsbuild/pants/pull/18642) that appear to have resolved this issue.
EDIT: Nope, 2.14.2 still didn't work (see https://github.com/pantsbuild/pants/issues/19269). Looks like the theory is that this is fixed in the 2.15.x latest version, but I haven't confirmed it yet.
Describe the bug
I upgraded a repo im working on to remove
./pants
and add./pantsup.sh
, ran the new script and started using thepants
it created.Seemingly only in CI (github actions) it flakily fails with (for example):
Pants version 2.14.1
OS Linux
Additional info