pantsbuild / pants

thread state must be current when releasing error when using scie-pants #19269

Closed: blimmer closed this issue 5 months ago

blimmer commented 1 year ago

Describe the bug

I tried switching from the ./pants wrapper script to scie-pants, but I'm encountering a reproducible error in my CI environment when running pants:

Fatal Python error: PyGILState_Release: thread state 0x7fc564001140 must be current when releasing
Python runtime state: finalizing (tstate=0x55d327fc0dc0)

Thread 0x00007fc587a07000 (most recent call first):
<no Python frame>
./pants: line 22:  9558 Aborted                 (core dumped) pants "$@"

The script I'm running looks like this:

#! /bin/bash

set -eo pipefail

# Pants doesn't have built-in tracking of successful deploys, so we use git tags to track them. See monorepo-pants-success for the tagging logic.
# The use of `dd` is important here because this script runs with `pipefail` set, which fails the script if any command in the pipeline fails. Once a
# repo has a significant number of tags, `head` exits after the first line and `git tag` is killed with SIGPIPE, failing the pipeline. `dd` drains the
# remaining output so the writer never sees a broken pipe, allowing the script to continue.
# See https://github.com/stedolan/jq/issues/1017#issuecomment-845399005 for the inspiration for this fix.
PROJECT_NAME_PERL_REGEX='{project_name} $_ = s/(.*\/)(.*):.*/$2/r' # this makes the parallel output more readable
most_recent_deployed_tag=$(git tag -l "$MONOREPO_PANTS_TAG_PREFIX-$STAGE-*" --sort "-version:refname" | (head -n 1; dd status=none of=/dev/null))
echo "Previous deployment found: $most_recent_deployed_tag. Deploying only stacks that have changed since then."
pants --changed-since="$most_recent_deployed_tag" --changed-dependees=transitive --filter-tag-regex='^cdk_deploy$' list | parallel -L1 --rpl "$PROJECT_NAME_PERL_REGEX" --tagstring '{project_name} |' -P 1 pants run

What I'm trying to do here is pretty straightforward, but not directly supported by pants (in lerna, all of this is just yarn lerna run --since '' cdk -- deploy). This script just runs the cdk_deploy task (an experimental_run_shell_command) in every project that has changed since the last deployment.
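
As an aside, the broken-pipe failure the `dd` trick works around is easy to reproduce in isolation. Here's a minimal sketch, with `seq` standing in for `git tag -l` on a repo with enough tags to overflow the pipe buffer:

#!/bin/bash
set -eo pipefail

# Fails: `head` exits after the first line, `seq` is killed with SIGPIPE
# (exit status 141), and pipefail propagates the failure.
first=$(seq 1 100000 | head -n 1) || echo "pipeline failed with status $?"

# Works: `dd` drains the rest of the stream after `head` exits, so the
# producer never sees a broken pipe.
first=$(seq 1 100000 | (head -n 1; dd status=none of=/dev/null))
echo "first line: $first"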

However, I consistently get the exception above when using scie-pants. I went back to the ./pants wrapper and don't encounter this problem anymore.

Pants version

2.14.2

OS

This process only runs in our CI (CircleCI), an Ubuntu 22.04 Linux environment.

Additional info

Here's the gdb output from the CoreDump. Shout out to @jsirois, who patiently helped me a ton in Slack.

circleci@ip-10-0-102-195:/var/crash/unpack$ ls ~/.cache/nce/
22652cf60b12e7a187ea0cf21d7bc9d8234bbb630fa94ca6f9c28655bd6a81fb
2b6e146234a4ef2a8946081fc3fbfffe0765b80b690425a49ebe40b47c33445b
5a50ec9774acbbcb281fa6e51f5162a777d759e96ec375b8b9df120091c447eb
5b7e43d9b857d76bdfaa946b6b58c47c04d02fb5b087532e0f7f403bffe5eddb
c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d
circleci@ip-10-0-102-195:/var/crash/unpack$ gdb -iex "set solib-search-path /home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/lib/python3.9/site-packages/pants/engine/internals:/home/circleci/.cache/nce/2b6e146234a4ef2a8946081fc3fbfffe0765b80b690425a49ebe40b47c33445b/cpython-3.9.16+20230507-x86_64-unknown-linux-gnu-install_only.tar.gz/python/lib/" ~/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/bin/python3.9 CoreDump
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/bin/python3.9...

warning: core file may not match specified executable file.
[New LWP 9643]
[New LWP 9644]
[New LWP 9645]
[New LWP 9612]
[New LWP 9613]
[New LWP 9641]
[New LWP 9646]
[New LWP 9647]
[New LWP 9558]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140486283343424)
    at ./nptl/pthread_kill.c:44
44  ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7fc58303b640 (LWP 9643))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140486283343424)
    at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140486283343424) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140486283343424, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007fc586242476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fc5862287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007fc5867ab94a in fatal_error_exit (status=9558) at Python/pylifecycle.c:2201
#6  0x00007fc5867ab7ff in fatal_error (stream=<optimized out>, header=0, prefix=0x0,
    msg=<optimized out>, status=-1) at Python/pylifecycle.c:2216
#7  0x00007fc5867abd52 in _Py_FatalErrorFormat (func=0x7fc587390645 "PyGILState_Release",
    format=0x7fc587390699 "thread state %p must be current when releasing")
    at Python/pylifecycle.c:2335
#8  0x00007fc5869fe747 in PyGILState_Release (oldstate=PyGILState_UNLOCKED) at Python/pystate.c:1410
#9  0x00007fc584d29ea7 in pyo3::marker::Python::with_gil<engine::externs::scheduler::{impl#2}::__new__::{closure#0}::{closure_env#1}, ()> (f=...)
    at /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.16.5/src/marker.rs:315
#10 engine::externs::scheduler::{impl#2}::__new__::{closure#0} () at src/externs/scheduler.rs:37
#11 0x00007fc5854276d5 in tokio::runtime::blocking::pool::Inner::run (self=<optimized out>,
    worker_thread_id=<optimized out>) at src/runtime/blocking/pool.rs:304
#12 tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure#0} ()
    at src/runtime/blocking/pool.rs:294
#13 std::sys_common::backtrace::__rust_begin_short_backtrace<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()> (f=...)
    at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/sys_common/backtrace.rs:122
#14 0x00007fc585429e64 in std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure#0}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()> ()
    at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/thread/mod.rs:505
#15 core::panic::unwind_safe::{impl#23}::call_once<(), std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()>> (
    self=<error reading variable: Cannot access memory at address 0x0>, _args=<optimized out>)
    at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panic/unwind_safe.rs:271
#16 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()>>, ()> (data=<optimized out>)
    at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:492
#17 std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()>>> (f=<error reading variable: Cannot access memory at address 0x0>)
    at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:456
#18 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()>>, ()> (f=<error reading variable: Cannot access memory at address 0x0>)
    at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panic.rs:137
#19 std::thread::{impl#0}::spawn_unchecked_::{closure#1}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()> ()
    at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/thread/mod.rs:504
#20 core::ops::function::FnOnce::call_once<std::thread::{impl#0}::spawn_unchecked_::{closure_env#1}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()>, ()> ()
    at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248
#21 0x00007fc58552fb33 in alloc::boxed::{impl#44}::call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1951
#22 alloc::boxed::{impl#44}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1951
#23 std::sys::unix::thread::{impl#2}::new::thread_start () at library/std/src/sys/unix/thread.rs:108
#24 0x00007fc586294b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#25 0x00007fc586326a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)
quit

CircleCI CrashDump Parsing Steps

For future travelers, here's what I did to capture a stack trace on CircleCI (a consolidated sketch of the commands follows the numbered steps):

  1. Re-run your job using SSH (docs). This gives you interactive access to the instance; SSH in following the instructions in the docs.
  2. Run ulimit -c unlimited. If you get an error, the limit is likely already set to unlimited.
  3. Create an apport config file. I had to do this to get the crash report saved to a known location.
    mkdir -p ~/.config/apport
    printf '[main]\nunpackaged=true\n' >> ~/.config/apport/settings
  4. Run the process that produces the crashdump.
  5. Extract the coredump from the apport .crash file:
    cd /var/crash
    apport-unpack <crash-file>.crash unpack
  6. Start gdb with the proper .so files linked up. NOTE: you will need to update the ~/.cache/nce paths below to match what's on your runner instance.
    cd unpack
    gdb -iex "set solib-search-path /home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/lib/python3.9/site-packages/pants/engine/internals:/home/circleci/.cache/nce/2b6e146234a4ef2a8946081fc3fbfffe0765b80b690425a49ebe40b47c33445b/cpython-3.9.16+20230507-x86_64-unknown-linux-gnu-install_only.tar.gz/python/lib/" ~/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/bin/python3.9 CoreDump
  7. Get the backtrace
    (gdb) bt
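
Putting steps 2 through 6 together: here's a rough consolidated sketch, not a battle-tested script. The nce cache hashes are machine-specific, so the globs are assumptions based on the layout shown above; verify them against your own ~/.cache/nce before trusting the paths.

#!/bin/bash
set -e

# Step 2: allow core dumps. An error here usually means the limit is
# already unlimited.
ulimit -c unlimited || true

# Step 3: tell apport to keep crash reports for unpackaged binaries.
mkdir -p ~/.config/apport
printf '[main]\nunpackaged=true\n' >> ~/.config/apport/settings

# Step 4: run the crashing command here (e.g. the deploy script above).

# Step 5: unpack the newest .crash file apport wrote.
cd /var/crash
apport-unpack "$(ls -t ./*.crash | head -n 1)" unpack

# Step 6: locate the engine internals, the CPython libs, and the venv
# python in the nce cache. These globs assume the layout shown in the
# gdb transcript above and that each matches exactly one path.
engine_dir=$(echo ~/.cache/nce/*/bindings/venvs/*/lib/python*/site-packages/pants/engine/internals)
python_lib=$(echo ~/.cache/nce/*/cpython-*/python/lib)
python_bin=$(echo ~/.cache/nce/*/bindings/venvs/*/bin/python3*)

cd unpack
gdb -iex "set solib-search-path ${engine_dir}:${python_lib}" "$python_bin" CoreDump
# Step 7: at the (gdb) prompt, run `bt` to get the backtrace.
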
WorkerPants commented 1 year ago

Welcome to the Pantsbuild Community. This looks like your first issue here. Thanks for taking the time to write it.

If you haven't already, feel free to come say hi on Slack.

If you have questions, or just want to surface this issue, check out the #development channel. (If you want to check it out without logging in, check out our Linen mirror)

Thanks again, and we look forward to your next Issue/PR :smile:!

jsirois commented 1 year ago

@blimmer so ... what about this: https://github.com/pantsbuild/pants/issues/18135#issuecomment-1563196282

That comment and this bug seem to contradict each other.

jsirois commented 1 year ago

@blimmer it looks to me like the fix for #18135 (#18166) only landed in 2.15.1+.

blimmer commented 1 year ago

@blimmer so ... what about this: https://github.com/pantsbuild/pants/issues/18135#issuecomment-1563196282

Yes, I'll update that comment. I thought it was fixed, but then I encountered this issue.

blimmer commented 5 months ago

I finally got back to this and upgraded to pants 2.19.0. I haven't seen this issue again. Thanks for the help debugging this!