heroku / heroku-buildpack-php

Heroku's classic buildpack for PHP applications.
https://devcenter.heroku.com/categories/php-support
MIT License
808 stars 1.59k forks source link

Improve handling of concurrent or unknown signals in boot scripts #697

Closed dzuelke closed 9 months ago

dzuelke commented 9 months ago

It could happen that a controlling process sends SIGTERM several times in quick succession. For example, timeout sometimes behaves this way.

When we are, for whatever reason, sent the same signal multiple times, the very first operation inside the trap (trap - ...) currently re-sets the signal handler (to SIG_DFL). Since Bash 4.3 RC2, delivery of signals is no longer blocked during the execution of traps for those same signals (i.e., the same trap handler that is executing fires again during its own execution). Since we've re-set to SIG_DFL at the beginning of the trap, if another TERM signal arrives while the trap is running, the default behavior of immediate termination kicks in, which also kills the currently running trap and its cleanup procedure.

Instead, we now perform a trap "" ... to ignore this signal immediately (SIG_IGN), perform the cleanup, and then re-set the trap to the default handler before triggering it by sending the signal to ourselves - this is the correct behavior for a program that terminates in response to a signal in order for other programs to be able to correctly determine the cause of a termination (so e.g. when receiving SIGTERM, a program must, as its final action, kill itself using that signal, and not exit with e.g. 0).

This means that the subshells will now exit with a non-0 exit status; this triggers an error condition in the main shell, which fires off the signal handler for the signal received - so a subshell that terminates with kill -USR1 $$ in response to a USR1 signal will have an exit status of 128+USR1, and trigger a USR1 signal in the main shell. That's why we ignore that USR1 signal before issuing the cleanup, since it would otherwise trigger the (now reset) EXIT handler, and terminate the outer shell.

This behavior is quite useful for any unhandled signals now - if the main program or a subshell receive e.g. a USR2, which by default causes termination, the main EXIT trap fires and tears stuff down.

There is a possible race condition in Bash's delivery of signals to trap handlers that will occasionally cause the internal trap list to become corrupted if signals arrive in rapid succession between the execution of a trap and the changing of a signal handler (to SIG_IGN in our case). This is often, but not always, accompanied by a message like "warning: run_pending_traps: bad value in trap_list[15]", and it will happen if signals arrive in rapid succession (see run_pending_traps() in Bash's trap.c).

If this occurs, the re-setting of the trap to SIG_DFL, and the subsequent termination due to the kill -$SIGNAL $$, will not execute (and the bash process will hang indefinitely) until the trap exits for another reason. We "trigger" this exiting with an explicit exit 0 call in the outer shell's traps (in this case, the exit 0 will not actually execute, and the termination status of the outer process will be the correct e.g. "143" status code representing termination in response to SIGTERM). In the subshells, this is not necessary, because they get cleaned up by the outer bash process later.

The tests for termination behavior now fire off two SIGTERMs in rapid succession (using &), then a third one after a brief sleep, to assert that this behavior is correct in all cases.

As a bonus, the main script and the subshells now behave correctly (i.e., perform orderly shutdown) when unhandled signals arrive, and even e.g. a single USR1 from outside to one of the subshells tears stuff down correctly.

GUS-W-15071619

dzuelke commented 9 months ago

For ease of review, here is the diff of diffs between bin/heroku-php-apache2 and bin/heroku-php-nginx:

diff -u <(git diff main..fix-boot-sigterm-races -- bin/heroku-php-apache2) <(git diff main..fix-boot-sigterm-races -- bin/heroku-php-nginx)

--- /dev/fd/11  2024-02-16 20:25:10.630578564 +0100
+++ /dev/fd/12  2024-02-16 20:25:10.633258159 +0100
@@ -1,7 +1,7 @@
-diff --git a/bin/heroku-php-apache2 b/bin/heroku-php-apache2
-index cfae5bf4..ff90beb3 100755
---- a/bin/heroku-php-apache2
-+++ b/bin/heroku-php-apache2
+diff --git a/bin/heroku-php-nginx b/bin/heroku-php-nginx
+index 4964afd3..93a1f6bb 100755
+--- a/bin/heroku-php-nginx
++++ b/bin/heroku-php-nginx
 @@ -492,18 +492,23 @@ if [[ -n "${DYNO:-}" && "${HEROKU_PHP_GRACEFUL_SIGTERM:-1}" != "0" ]]; then
  fi

@@ -110,23 +110,23 @@
  ) & pids+=($!)

  (
--  trap 'echo "httpd" >&3;' EXIT
+-  trap 'echo "nginx" >&3;' EXIT
 +  trap 'echo "nginx" >&3; wait 2> /dev/null' EXIT
    trap '' INT;

    # wait for FPM to write its PID file, which it does after initialization
 @@ -645,12 +654,12 @@ fi
    # execute the command we built earlier, with the correct quoting etc expanded
-   "${httpd_command[@]}" & pid=$!
+   "${nginx_command[@]}" & pid=$!
    # wait for the pidfile in the trap; otherwise, a previous subshell failing may result in us getting a SIGTERM and forwarding it to the child process before that child process is ready to process signals
--  trap 'echo "Stopping httpd gracefully..." >&2; wait_pid_and_pidfile $pid "$httpd_pidfile"; kill -WINCH $pid 2> /dev/null || true' USR1
-+  trap 'trap "" USR1; trap - EXIT; echo "Stopping httpd gracefully..." >&2; wait_pid_and_pidfile $pid "$httpd_pidfile"; kill -WINCH $pid 2> /dev/null || true; wait $pid 2> /dev/null || true; trap - USR1; kill -USR1 $$' USR1
-   # we always want to try and stop gracefully, especially since on Heroku we might be getting a process group wide SIGTERM but running a patched HTTPD that ignores SIGTERM; so if that env var is set, we ignore SIGTERM in this subshell as well
+-  trap 'echo "Stopping nginx gracefully..." >&2; wait_pid_and_pidfile $pid "$nginx_pidfile"; kill -QUIT $pid 2> /dev/null || true' USR1
++  trap 'trap "" USR1; trap - EXIT; echo "Stopping nginx gracefully..." >&2; wait_pid_and_pidfile $pid "$nginx_pidfile"; kill -QUIT $pid 2> /dev/null || true; wait $pid 2> /dev/null || true; trap - USR1; kill -USR1 $$' USR1
+   # we always want to try and stop gracefully, especially since on Heroku we might be getting a process group wide SIGTERM but running a patched Nginx that ignores SIGTERM; so if that env var is set, we ignore SIGTERM in this subshell as well
    if [[ $graceful_sigterm ]]; then
        trap '' TERM
    else
--      trap 'echo "Stopping httpd..." >&2; wait_pid_and_pidfile $pid "$httpd_pidfile"; kill -TERM $pid 2> /dev/null || true' TERM
-+      trap 'trap "" TERM; echo "Stopping httpd..." >&2; wait_pid_and_pidfile $pid "$httpd_pidfile"; kill -TERM $pid 2> /dev/null || true; wait $pid 2> /dev/null || true; trap - TERM; kill -TERM $$' TERM
+-      trap 'echo "Stopping nginx..." >&2; wait_pid_and_pidfile $pid "$nginx_pidfile"; kill -TERM $pid 2> /dev/null || true' TERM
++      trap 'trap "" TERM; echo "Stopping nginx..." >&2; wait_pid_and_pidfile $pid "$nginx_pidfile"; kill -TERM $pid 2> /dev/null || true; wait $pid 2> /dev/null || true; trap - TERM; kill -TERM $$' TERM
    fi

    # first, we wait for the pid file to be written, so that we can emit a ready message