pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0
3.32k stars 636 forks source link

Using `pants` from `pantsup.sh` can fail with `FATAL: exception not rethrown` #18135

Closed phlax closed 1 year ago

phlax commented 1 year ago

Describe the bug

I upgraded a repo im working on to remove ./pants and add ./pantsup.sh, ran the new script and started using the pants it created.

Seemingly only in CI (github actions) it flakily fails with (for example):


✓ mypy succeeded.
FATAL: exception not rethrown
FATAL: exception not rethrown
/home/runner/work/_temp/f3725c23-b66e-48a4-b13a-9692347a0610.sh: line 1:  1684 Aborted                 (core dumped) ./scripts/pants check ::
Error: Process completed with exit code 134.

Pants version 2.14.1

OS Linux

Additional info

jsirois commented 1 year ago

What Pants command or commands do you run in CI?. If it's just 1 command, you should definitely try adding --no-pantsd, this may offer a workaround. If there is more than one command you can still use this workaround, but will suffer slower builds than you would otherwise have with pantsd.

I have seen behavior in this ballpark noted here but do not have a solution yet: https://github.com/pantsbuild/scie-pants/blob/4f06c9de4e99e667bacde43c6367accd930b7013/package/src/main.rs#L861-L879

I have since learned how to get ssh access to GitHub actions CI jobs; so I'll try to repro this out there tomorrow and examine the core dump for better clues.

phlax commented 1 year ago

Im pretty sure i tested with --no-pantsd - iirc it seemed to lessen but still happened - i tried quite a few other things to resolve - like running uncached - but the flake seemed to persist (albeit infrequently)

i reverted the update as i couldnt find a solution but you can see some failed runs here https://github.com/envoyproxy/pytooling/actions?query=is%3Afailure+branch%3Amain - there are some other pr ones where i was trying to fix also

this may give more of a clue https://github.com/envoyproxy/pytooling/actions/runs/4049513918/jobs/6965968013#step:6:144

...
README (README.md) is up-to-date
Fatal Python error: PyGILState_Release: thread state 0x7f3068001140 must be current when releasing
Python runtime state: finalizing (tstate=0xb092f0)

Thread 0x00007f308227e000 (most recent call first):
<no Python frame>
/home/runner/work/_temp/0ac5f650-e9a0-482b-9a38-eefb912f3a2a.sh: line 1:  1696 Aborted                 (core dumped) ./scripts/pants readme --check=README.md ::

What Pants command or commands do you run in CI?

The commands running are mostly the standard lint/check/pytest ones - with an additional hacky readme goal that i added

https://github.com/envoyproxy/pytooling/blob/main/.github/workflows/ci.yml

jsirois commented 1 year ago

Ok, I was able to repro by using the GH Actions ssh trick @benjyw taught me. https://github.com/pantsbuild/scie-pants/pull/93 yielded a session where I was able to generate 6 core dumps. I'll post the 1st one here (last in time order), but I continue to examine the others. This is for Pants 2.15.0rc0 and was triggered out on a GH Actions ubuntu 20.04 box with:

runner@fv-az218-666:~/work/scie-pants/scie-pants$ while pants -V 2>&1 | tee -a log; do sleep 1; if [ -z "${PANTS_VERSION}" ]; then export PANTS_VERSION=2.14.1; else unset PANTS_VERSION && export PANTS_VERSION && rm -rf ~/.cache/nce; fi done

16:47:44.26 [INFO] Initializing scheduler...
16:47:44.93 [INFO] Scheduler initialized.
2.15.0rc0
Fatal Python error: PyGILState_Release: thread state 0x7fb6080011b0 must be current when releasing
Python runtime state: finalizing (tstate=0x159ae30)

Thread 0x00007fb6200b0000 (most recent call first):
<no Python frame>

The backtrace looks like:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139257791374912) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139257791374912) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139257791374912, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f27784b8476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139257791374912) at ./nptl/pthread_kill.c:44
#6  __pthread_kill_internal (signo=6, threadid=139257791374912) at ./nptl/pthread_kill.c:78
#7  __GI___pthread_kill (threadid=139257791374912, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#8  0x00007f27784b8476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#9  0x00007f277849e7f3 in __GI_abort () at ./stdlib/abort.c:79
#10 0x00007f27784ff45c in __libc_message (action=do_abort, fmt=0x7f27786517b1 "%s", fmt=0x7f27786517b1 "%s", action=do_abort) at ../sysdeps/posix/libc_fatal.c:155
#11 0x00007f27784ff770 in __GI___libc_fatal (message=<optimized out>) at ../sysdeps/posix/libc_fatal.c:164
#12 0x00007f2778513476 in unwind_cleanup (reason=<optimized out>, exc=<optimized out>) at ./nptl/unwind.c:114
#13 0x00007f277752ad0f in panic_unwind::real_imp::cleanup () at library/panic_unwind/src/gcc.rs:78
#14 panic_unwind::__rust_panic_cleanup () at library/panic_unwind/src/lib.rs:100
#15 0x00007f27769216ba in std::panicking::try::cleanup () at library/std/src/panicking.rs:473
#16 0x00007f2776bef17c in std::panicking::try::do_catch<engine::externs::interface::__pyfunction_scheduler_execute::{closure_env#0}, core::result::Result<*mut pyo3_ffi::object::PyObject, pyo3::err::PyErr>> (
    payload=0x7ea77b2fecb0, data=<optimized out>) at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panicking.rs:517
#17 std::panicking::try<core::result::Result<*mut pyo3_ffi::object::PyObject, pyo3::err::PyErr>, engine::externs::interface::__pyfunction_scheduler_execute::{closure_env#0}> (f=...)
    at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panicking.rs:456
#18 std::panic::catch_unwind<engine::externs::interface::__pyfunction_scheduler_execute::{closure_env#0}, core::result::Result<*mut pyo3_ffi::object::PyObject, pyo3::err::PyErr>> (f=...)
    at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panic.rs:137
#19 engine::externs::interface::__pyfunction_scheduler_execute (_slf=<optimized out>, _args=<optimized out>, _nargs=<optimized out>, _kwnames=<optimized out>) at src/externs/interface.rs:1060
#20 0x00007f2778a8c3c4 in cfunction_vectorcall_FASTCALL_KEYWORDS (func=0x7f27779872c0, args=0x7ee454647df8, nargsf=<optimized out>, kwnames=0x0) at Objects/methodobject.c:446
#21 0x00007f2778aca0db in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f27779872c0, args=0x7ee454647df8, nargsf=9223372036854775811, kwnames=0x0) at ./Include/cpython/abstract.h:118
#22 PyObject_Vectorcall (callable=0x7f27779872c0, args=0x7ee454647df8, nargsf=9223372036854775811, kwnames=0x0) at ./Include/cpython/abstract.h:127
#23 call_function (tstate=0x7ea28e4d3e20, oparg=3, kwnames=0x0, pp_stack=<optimized out>) at Python/ceval.c:5077
#24 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee454647c40, throwflag=<optimized out>) at Python/ceval.c:3489
#25 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee454647c40, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#26 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f2776510920, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#27 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
--Type <RET> for more, q to quit, c to continue without paging--c
#28 0x00007f2778ac8a62 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2776486430, args=0x7ee454659d10, nargsf=9223372036854775810, kwnames=0x0) at ./Include/cpython/abstract.h:118
#29 PyObject_Vectorcall (callable=0x7f2776486430, args=0x7ee454659d10, nargsf=9223372036854775810, kwnames=0x0) at ./Include/cpython/abstract.h:127
#30 call_function (tstate=0x7ea28e4d3e20, oparg=2, kwnames=0x0, pp_stack=<optimized out>) at Python/ceval.c:5077
#31 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee454659b80, throwflag=<optimized out>) at Python/ceval.c:3506
#32 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee454659b80, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#33 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f2776510c90, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#34 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
#35 0x00007f2778ac8a62 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2776486550, args=0x7ee4545dd7c0, nargsf=9223372036854775810, kwnames=0x0) at ./Include/cpython/abstract.h:118
#36 PyObject_Vectorcall (callable=0x7f2776486550, args=0x7ee4545dd7c0, nargsf=9223372036854775810, kwnames=0x0) at ./Include/cpython/abstract.h:127
#37 call_function (tstate=0x7ea28e4d3e20, oparg=2, kwnames=0x0, pp_stack=<optimized out>) at Python/ceval.c:5077
#38 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee4545dd610, throwflag=<optimized out>) at Python/ceval.c:3506
#39 0x00007f2778a6864b in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee4545dd610, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#40 _PyEval_EvalCode (tstate=0x7ea28e4d3e20, _co=0x7f2776510f50, globals=<optimized out>, locals=0x0, args=<optimized out>, argcount=<optimized out>, kwnames=0x7f2775891b98, kwcount=3, kwstep=1, defs=0x0, defcount=0, kwdefs=0x7f27764d7a00, closure=<optimized out>, name=<optimized out>, qualname=0x7f27765115d0, kwargs=<optimized out>) at Python/ceval.c:4329
#41 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:396
#42 0x00007f2778a6aedd in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2776486670, args=0x7ee4545dd5b0, nargsf=2, kwnames=0x7ea77b2fac40) at ./Include/cpython/abstract.h:118
#43 method_vectorcall (method=<optimized out>, args=0x7ee4545dd5b8, nargsf=<optimized out>, kwnames=0x7ea77b2fac40) at Objects/classobject.c:53
#44 0x00007f2778ac8047 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f277474e240, args=0x0, nargsf=<optimized out>, kwnames=0x7f2775891b80) at ./Include/cpython/abstract.h:118
#45 PyObject_Vectorcall (callable=0x7f277474e240, args=0x0, nargsf=<optimized out>, kwnames=0x7f2775891b80) at ./Include/cpython/abstract.h:127
#46 call_function (tstate=0x7ea28e4d3e20, oparg=<optimized out>, kwnames=0x7f2775891b80, pp_stack=<optimized out>) at Python/ceval.c:5077
#47 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee4545dd420, throwflag=<optimized out>) at Python/ceval.c:3537
#48 0x00007f2778a6864b in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee4545dd420, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#49 _PyEval_EvalCode (tstate=0x7ea28e4d3e20, _co=0x7f27765a1b30, globals=<optimized out>, locals=0x0, args=<optimized out>, argcount=<optimized out>, kwnames=0x7f277659a358, kwcount=1, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=<optimized out>, name=<optimized out>, qualname=0x7f27754c40d0, kwargs=<optimized out>) at Python/ceval.c:4329
#50 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:396
#51 0x00007f2778a6aedd in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f27754c3c10, args=0x7ee4541739f8, nargsf=2, kwnames=0x7ea77b2fac40) at ./Include/cpython/abstract.h:118
#52 method_vectorcall (method=<optimized out>, args=0x7ee454173a00, nargsf=<optimized out>, kwnames=0x7ea77b2fac40) at Objects/classobject.c:53
#53 0x00007f2778ac8047 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7ee454143400, args=0x0, nargsf=<optimized out>, kwnames=0x7f277659a340) at ./Include/cpython/abstract.h:118
#54 PyObject_Vectorcall (callable=0x7ee454143400, args=0x0, nargsf=<optimized out>, kwnames=0x7f277659a340) at ./Include/cpython/abstract.h:127
#55 call_function (tstate=0x7ea28e4d3e20, oparg=<optimized out>, kwnames=0x7f277659a340, pp_stack=<optimized out>) at Python/ceval.c:5077
#56 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee454173840, throwflag=<optimized out>) at Python/ceval.c:3537
#57 0x00007f2778a6864b in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee454173840, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#58 _PyEval_EvalCode (tstate=0x7ea28e4d3e20, _co=0x7f277550b030, globals=<optimized out>, locals=0x0, args=<optimized out>, argcount=<optimized out>, kwnames=0x7f277659a358, kwcount=1, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=<optimized out>, name=<optimized out>, qualname=0x7f27754c6030, kwargs=<optimized out>) at Python/ceval.c:4329
#59 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:396
#60 0x00007f2778a6aedd in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f27754c3ca0, args=0x7ee4546479d0, nargsf=1, kwnames=0x7ea77b2fac40) at ./Include/cpython/abstract.h:118
#61 method_vectorcall (method=<optimized out>, args=0x7ee4546479d8, nargsf=<optimized out>, kwnames=0x7ea77b2fac40) at Objects/classobject.c:53
#62 0x00007f2778ac8047 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f276c041b00, args=0x0, nargsf=<optimized out>, kwnames=0x7f277659a340) at ./Include/cpython/abstract.h:118
#63 PyObject_Vectorcall (callable=0x7f276c041b00, args=0x0, nargsf=<optimized out>, kwnames=0x7f277659a340) at ./Include/cpython/abstract.h:127
#64 call_function (tstate=0x7ea28e4d3e20, oparg=<optimized out>, kwnames=0x7f277659a340, pp_stack=<optimized out>) at Python/ceval.c:5077
#65 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee454647840, throwflag=<optimized out>) at Python/ceval.c:3537
#66 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee454647840, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#67 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f27754c1c90, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#68 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
#69 0x00007f2778a6b004 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f27754c3ee0, args=0x7ea77b2fceb8, nargsf=1, kwnames=0x0) at ./Include/cpython/abstract.h:118
#70 method_vectorcall (method=<optimized out>, args=0x7f27780f8058, nargsf=<optimized out>, kwnames=0x7ea77b2fac40) at Objects/classobject.c:61
#71 0x00007f2778ac96b3 in PyVectorcall_Call (callable=0x7ee4545e2cc0, tuple=0x0, kwargs=0x7ee4545e2b40) at Objects/call.c:231
#72 _PyObject_Call (tstate=<optimized out>, callable=0x7ee4545e2cc0, args=0x0, kwargs=0x7ee4545e2b40) at Objects/call.c:266
#73 PyObject_Call (callable=0x7ee4545e2cc0, args=0x0, kwargs=0x7ee4545e2b40) at Objects/call.c:293
#74 do_call_core (tstate=0x7ea28e4d3e20, func=0x7ee4545e2cc0, callargs=0x0, kwdict=0x7ee4545e2b40) at Python/ceval.c:5125
#75 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7f276c07a3e0, throwflag=<optimized out>) at Python/ceval.c:3582
#76 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7f276c07a3e0, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#77 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f2777e775b0, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#78 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
#79 0x00007f2778ac8a62 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2777e86160, args=0x7ee4546475b8, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:118
#80 PyObject_Vectorcall (callable=0x7f2777e86160, args=0x7ee4546475b8, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:127
#81 call_function (tstate=0x7ea28e4d3e20, oparg=1, kwnames=0x0, pp_stack=<optimized out>) at Python/ceval.c:5077
#82 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7ee454647440, throwflag=<optimized out>) at Python/ceval.c:3506
#83 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7ee454647440, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#84 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f2777e77920, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#85 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
#86 0x00007f2778ac8a62 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2777e86430, args=0x7f276c0761b8, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:118
#87 PyObject_Vectorcall (callable=0x7f2777e86430, args=0x7f276c0761b8, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:127
#88 call_function (tstate=0x7ea28e4d3e20, oparg=1, kwnames=0x0, pp_stack=<optimized out>) at Python/ceval.c:5077
#89 _PyEval_EvalFrameDefault (tstate=0x7ea28e4d3e20, f=0x7f276c076040, throwflag=<optimized out>) at Python/ceval.c:3506
#90 0x00007f2778a682f4 in _PyEval_EvalFrame (tstate=0x7ea28e4d3e20, f=0x7f276c076040, throwflag=0) at ./Include/internal/pycore_ceval.h:40
#91 function_code_fastcall (tstate=0x7ea28e4d3e20, co=0x7f2777e77660, args=<optimized out>, nargs=<optimized out>, globals=<optimized out>) at Objects/call.c:330
#92 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:367
#93 0x00007f2778a6b004 in _PyObject_VectorcallTstate (tstate=0x7ea28e4d3e20, callable=0x7f2777e861f0, args=0x7ea77b2fdda8, nargsf=1, kwnames=0x0) at ./Include/cpython/abstract.h:118
#94 method_vectorcall (method=<optimized out>, args=0x7f27780f8058, nargsf=<optimized out>, kwnames=0x7ea77b2fac40) at Objects/classobject.c:61
#95 0x00007f2778c6f6ca in t_bootstrap (boot_raw=0x7ee4545e8900) at ./Modules/_threadmodule.c:1040
#96 0x00007f2778c01891 in pythread_wrapper (arg=<optimized out>) at Python/thread_pthread.h:245
#97 0x00007f277850ab43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#98 0x00007f277859ca00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)

The full backtrace is attached here: gdb.txt

The lines surround the panicing line are:

#18 std::panic::catch_unwind<engine::externs::interface::__pyfunction_scheduler_execute::{closure_env#0}, core::result::Result<*mut pyo3_ffi::object::PyObject, pyo3::err::PyErr>> (f=...) at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panic.rs:137
No locals.
#19 engine::externs::interface::__pyfunction_scheduler_execute (_slf=<optimized out>, _args=<optimized out>, _nargs=<optimized out>, _kwnames=<optimized out>) at src/externs/interface.rs:1060
        _py = <optimized out>
        gil = pyo3::gil::GILPool {start: core::option::Option<usize>::Some(0), _not_send: pyo3::impl_::not_send::NotSend (core::marker::PhantomData<*mut pyo3::marker::Python>)}
#20 0x00007f2778a8c3c4 in cfunction_vectorcall_FASTCALL_KEYWORDS (func=0x7f27779872c0, args=0x7ee454647df8, nargsf=<optimized out>, kwnames=0x0) at Objects/methodobject.c:446
        tstate = 0x7ea28e4d3e20
        nargs = 3
        meth = 0x0
        result = <optimized out>

The line that panics is here at 1060 in interface.rs: https://github.com/pantsbuild/pants/blob/54712e15fdee9ea6cb38e134761222eb3284a3a0/src/rust/engine/src/externs/interface.rs#L1060-L1088

The complete full backtrace of all threads is here: full-bt.log

@stuhood this is probably a lot to page in, but is this helpful? FWICT the macro-expanded infra code is panicing - i.e. that's why the line number is 1060 - and not our code.

jsirois commented 1 year ago

It may or may not be real or important, but I could not trigger the repro until I switched running pnats -V to pants -V ... | tee. My motivation was the fact that the scie-pants CI runs that hit this error all run Pants via Rust subprocess executions via Command. That led me to think about stdio re-direction / terminals or lack thereof, etc. It struck me piping output would simulate the failures I had seen. Less than fuzzy thinking, may not be a direct correlate, give credence at your own risk.

stuhood commented 1 year ago

Thanks for that: I'll take a look.

I don't think that I'll be able to tell from the backtraces whether pantsd was in use. Was it?

jsirois commented 1 year ago

That's ~in the 1st code block, but yes, pantsd is in use. @phlax weakly claims --no-pantsd did not help, but scie-pants ITs definitely are helped by this. They ~80%+ failed before and 100% pass after disabling pantsd here: https://github.com/pantsbuild/scie-pants/blob/4f06c9de4e99e667bacde43c6367accd930b7013/package/src/main.rs#L861-L879

phlax commented 1 year ago

not sure if it didnt help - fairly sure it didnt cure

phlax commented 1 year ago

✓ mypy succeeded.
FATAL: exception not rethrown
/home/runner/work/_temp/77ef4a23-a00c-431e-abc8-b8ae9a15d4b8.sh: line 1:  1718 Aborted                 (core dumped) ./scripts/pants --no-pantsd check ::

https://github.com/envoyproxy/pytooling/actions/runs/4053370248/jobs/6973910733#step:4:46

stuhood commented 1 year ago

So, the first observation here is that:

@phlax: Are you able to try running with the most recent PANTS_SHA from this branch: https://github.com/pantsbuild/pants/commits/stuhood/2.15.x-panic-abort , and see if that changes the output at all? It is 2.15.0rc2 plus one commit.

phlax commented 1 year ago

so preliminary reports using 25f1cee6b5796f001e8f993b1c2b207eb673a6a4

in order to do the upgrade i had to disable my readme goal (i can look further at updating this to work with >=2.15 but initial attempts failed)

the first run of gh actions did not throw the rethrown issue - so with the caveat that a flake not failing is not proof of it not existing - it seems to have worked

the question im now wondering is whether it was this commit or the disabling of my custom goal that (may have) resolved the issue - i guess i can test with my current version (2.14.1) and the goal disabled to see if that also works

phlax commented 1 year ago

disabling the readme goal seems to work without fail (2 x CI runs) with current version

arguably then the bug is on my side - with a custom goal that needs some work

otoh, the fact that it works flawlessly using ./pants, works mostly when using ./pantsup.sh and that the error message gives so little away in terms of what has gone wrong, there is still an issue on the pants side i think

(assuming it just didnt flake in my tests - the flake i was seeing previously failed ~1 in 5)

phlax commented 1 year ago

i think i spoke too soon:

✓ mypy succeeded.
FATAL: exception not rethrown
/home/runner/work/_temp/1e159c1d-68bb-4fd0-940c-c04b7c297776.sh: line 1:  1723 Aborted                 (core dumped) ./scripts/pants check ::

https://github.com/envoyproxy/pytooling/actions/runs/4074228494/jobs/7019114947#step:5:44

ill revert to using the unreleased commit and give it a few ci runs to see if that helps

phlax commented 1 year ago

reverting to 25f1cee6b5796f001e8f993b1c2b207eb673a6a4 seems to have worked - 3x CI successful runs

https://github.com/envoyproxy/pytooling/actions?query=branch%3Ascie-testing

benjyw commented 1 year ago

Even if there is a bug in your custom goal, I wouldn't expect it to be able to fail so spectacularly, so presumably that just exposed a problem in the Pants engine. A problem that seems, however, to be fixed at https://github.com/pantsbuild/pants/commit/25f1cee6b5796f001e8f993b1c2b207eb673a6a4 ...

stuhood commented 1 year ago

Thanks to the backtrace from @jsirois's repro, I have a slightly clearer picture of what is going on there.

So it is definitely related to #16105. The followup questions then are:


@phlax : I think that because your error is so different, we will need to try and capture a backtrace similar to the one John was able to capture. To do that, you would need:

  1. SSH access after a crash (it sounds like @jsirois / @benjyw have a recipe for that)
  2. gdb installed
  3. To load the core dump that is reported in your error, using gdb $python_interpreter $core
    • The relevant interpreter will be somewhere like: ~/.cache/nce/*/cpython-*/python/bin/python
    • The core will be somewhere specific to Ubuntu: probably /var/lib/apport/coredump (see https://askubuntu.com/a/1349048)
  4. Capturing the backtrace with thread apply all bt

I recognize that that is a lot to ask though. If you're able to do it it would be appreciated, but I will also attempt to capture a similar thing from Pants' own CI in parallel.

jsirois commented 1 year ago

@stuhood "my repro case" is remote. The job I linked here: https://github.com/pantsbuild/scie-pants/pull/93 used this GH Actions YAML to expose ssh and I just ssh'd into the box and used the command I showed to eventually get a crash:

while pants -V 2>&1 | tee -a log; do sleep 1; if [ -z "${PANTS_VERSION}" ]; then export PANTS_VERSION=2.14.1; else unset PANTS_VERSION && export PANTS_VERSION && rm -rf ~/.cache/nce; fi done

I won't be back at a keyboard to devote time to this until Friday or later, but anyone can use the above too; otherwise I'll report back here at some point.

benjyw commented 1 year ago

FWIW this action lets you ssh into a github actions runner: https://github.com/marketplace/actions/debugging-with-ssh

stuhood commented 1 year ago

I was able to get a repro of a FATAL: exception not rethrown failure in this harness: #18155. Thanks for the SSH tip: that's a game changer.

The backtrace is: exception-not-rethrown.bt.txt

So, based on the "main thread exiting, so other thread experiences abort" shape of both of these errors, I now do suspect that they are related. Because this is a unit test which uses the engine, rather than a full blown Pants run, this one exhibits a little bit differently.


I think that I know what changes will be necessary to fix the unit-test teardown case, but I'm going to try to get the log output from John's to increase the chances that a fix gets both codepaths.

stuhood commented 1 year ago

Hm, ok. John's crash is different, because the reason pantsd is exiting in that case is a SIGTERM from a Pants client with a different fingerprint (it toggles between two Pants versions). I've opened #18159 to cover that one, as the signal handling aspect adds a new dimension that should make it much more rare.

I'll try to get a patch out for this one first thing tomorrow, but @benjyw : if it has the shape I'm thinking it will, we probably will need to pick it to 2.15.x.

phlax commented 1 year ago

happy to debug further (and excited to try out the gh ssh access) but a bit snowed under today - ill try and have a look over the weekend

phlax commented 1 year ago

i sshed into a job and tried running

$ ./scripts/pants check ::
$ rm -rf ~/.cache

but i havent been able to trigger the issue so far - not sure how you activate the ssh on failure (other than in a subsequent job) - i just see running a job standalone

phlax commented 1 year ago

trying to trigger this from the gh action - the following never (seems to) trigger:

while ./scripts/pants -V 2>&1; do sleep 1; rm -rf ~/.cache/nce; done

adding a pipe does however fairly readily trigger ~the issue

while ./scripts/pants -V 2>&1 | tee -a log; do sleep 1; rm -rf ~/.cache/nce; done

not sure if that is just switching to a different bug as discussed above, or whether this is the bug that is being triggered elsewhere

i tried the above using ./scripts/pants check :: and no pipe also - and was unable to trigger after running it for a while

in the normal CI jobs the error persists (with more or less info)

✓ mypy succeeded.
Fatal Python error: PyGILState_Release: thread state 0x7fc540001140 must be current when releasing
Python runtime state: finalizing (tstate=0x14210a0)
FATAL: exception not rethrown
/home/runner/work/_temp/19b3bc93-8085-4072-8f0a-70c8e87c5201.sh: line 1:  1700 Aborted                 (core dumped) ./scripts/pants check ::

putting aside the ssh debugging im wondering if its possible to get backtrace when it fails

phlax commented 1 year ago

fwiw this is the backtrace from looping output to a pipe as mentioned above

Thread 5 (Thread 0x7f1bacd38b80 (LWP 5868)):
#0  0x00007f1bad3d12c4 in update_refs (containers=0xdf23f0) at Modules/gcmodule.c:414
#1  deduce_unreachable (base=0xdf23f0, unreachable=0x7ffe2a147ef0) at Modules/gcmodule.c:1091
#2  collect (tstate=0xdf4fd0, generation=2, n_collected=0x7ffe2a147fd0, n_uncollectable=0x7ffe2a147fc8, nofail=0) at Modules/gcmodule.c:1212
#3  0x00007f1bad4cc6b7 in collect_with_callback (generation=2, tstate=<optimized out>) at Modules/gcmodule.c:1387
#4  PyGC_Collect () at Modules/gcmodule.c:2075
#5  0x00007f1bad3c7450 in _PyGC_CollectIfEnabled () at Modules/gcmodule.c:2086
#6  Py_FinalizeEx () at Python/pylifecycle.c:1423
#7  0x00007f1bad2594d8 in Py_Exit (sts=14623736) at Python/pylifecycle.c:2433
#8  0x00007f1bad25b75a in handle_system_exit () at Python/pythonrun.c:723
#9  0x00007f1bad25b3d9 in _PyErr_PrintEx (tstate=0xdf23f8, set_sys_last_vars=0) at Python/pythonrun.c:733
#10 0x00007f1bad25adf8 in PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:828
#11 PyErr_Print () at Python/pythonrun.c:834
#12 pyrun_simple_file (fp=<optimized out>, filename=<optimized out>, closeit=1, flags=0x7ffe2a148398) at Python/pythonrun.c:456
#13 0x00007f1bad25ac0f in PyRun_SimpleFileExFlags (fp=0xdea340, filename=<optimized out>, closeit=1, flags=0x7ffe2a148398) at Python/pythonrun.c:483
#14 0x00007f1bad263d98 in pymain_run_file (config=<optimized out>, cf=0x7ffe2a148398) at Modules/main.c:373
#15 0x00007f1bad4cc02b in pymain_run_python (exitcode=0x7ffe2a148394) at Modules/main.c:598
#16 Py_RunMain () at Modules/main.c:677
#17 0x00007f1bad264537 in Py_BytesMain (argc=<optimized out>, argv=0x0) at Modules/main.c:731
#18 0x00007f1bacd63d90 in __libc_start_call_main (main=main@entry=0x401060 <main>, argc=argc@entry=5, argv=argv@entry=0x7ffe2a148548) at ../sysdeps/nptl/libc_start_call_main.h:58
#19 0x00007f1bacd63e40 in __libc_start_main_impl (main=0x401060 <main>, argc=5, argv=0x7ffe2a148548, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe2a148538) at ../csu/libc-start.c:392
#20 0x000000000040108e in _start ()

Thread 4 (Thread 0x7f1ba9d93640 (LWP 5918)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f1babf4dd79 in ?? ()
#2  0x00007f1ba9fa8f00 in ?? ()
#3  0x00007f1b9c000cd0 in ?? ()
#4  0x0000000001154cd8 in ?? ()
#5  0x0000000001154cc8 in ?? ()
#6  0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f1ba9f94640 (LWP 5917)):
#0  0x00007f1bace5ffde in epoll_wait (epfd=5, events=0x1477170, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f1babf4bcae in ?? ()
#2  0x0000000001300cf8 in ?? ()
#3  0x00007f1babf293d1 in ?? ()
#4  0x0000000001300cf8 in ?? ()
#5  0x00007f1ba9f93940 in ?? ()
#6  0x0000000001477170 in ?? ()
#7  0x0000000000000400 in ?? ()
#8  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f1ba9b92640 (LWP 5922)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f1babf4dd3d in ?? ()
#2  0x00007f1bac409808 in ?? ()
#3  0x00007f1ba0000cd0 in ?? ()
#4  0x00000000010308c0 in ?? ()
#5  0x0000000001030958 in ?? ()
#6  0x0000000000000001 in ?? ()
#7  0x00000000062dc517 in ?? ()
#8  0x00000000000008f4 in ?? ()
#9  0x0000000000000001 in ?? ()
#10 0x0000000000000009 in ?? ()
#11 0x000000003b9abb8c in ?? ()
#12 0x00007f1ba9b91c40 in ?? ()
#13 0x00000000010308c0 in ?? ()
#14 0x00000000010308b0 in ?? ()
#15 0x0000000001173f01 in ?? ()
#16 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f1ba9991640 (LWP 5923) (Exiting)):
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139756786226752) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139756786226752) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139756786226752, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f1bacd7c476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f1bacd627f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f1bacdc345c in __libc_message (action=do_abort, fmt=0x7f1bacf157b1 "%s", fmt=0x7f1bacf157b1 "%s", action=do_abort) at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007f1bacdc3770 in __GI___libc_fatal (message=<optimized out>) at ../sysdeps/posix/libc_fatal.c:164
#7  0x00007f1bacdd7476 in unwind_cleanup (reason=<optimized out>, exc=<optimized out>) at ./nptl/unwind.c:114
#8  0x00007f1bac02c1cf in ?? ()
#9  0x00007f1ba9991cb0 in ?? ()
#10 0x00007f1bac3f0388 in ?? ()
benjyw commented 1 year ago

Do we understand why scie-pants is exposing this bug, which seems like it's always been lurking? I am seeing it consistently in my PR to use pants in the setup repo: https://github.com/pantsbuild/setup/pull/142/checks

jsirois commented 1 year ago

Besides clearly being a difference in PBS vs other Python distributions I think no. I say clearly because both ./pants and pants exec into a pants venv console script; so the only relevant difference can be the venv interpreter at that point. The PBS obviously differs in the ~10 patch files it applies to CPython sources and any C build / link optimizations it uses above and beyond vanilla CPython builds / system package builds. I have not dug into either the patch content or the optimization differences myself.

jsirois commented 1 year ago

@benjyw I believe it's not the case for @phlax but it is the case for me / scie-pants ITs in CI that disabling pantsd works around: https://github.com/pantsbuild/scie-pants/blob/7fc1471d6d0d66615fb0217459d5942cd161a799/package/src/main.rs#L867-L885

blimmer commented 1 year ago

The solution for me was to upgrade from Pants 2.14.0 to 2.14.2. That version has updates (from https://github.com/pantsbuild/pants/pull/18642) that appear to have resolved this issue.

EDIT: Nope, 2.14.2 still didn't work (see https://github.com/pantsbuild/pants/issues/19269). Looks like the theory is that this is fixed in the 2.15.x latest version, but I haven't confirmed it yet.