Closed dzuelke closed 9 months ago
For ease of review, here is the diff of diffs between bin/heroku-php-apache2
and bin/heroku-php-nginx
:
diff -u <(git diff main..fix-boot-sigterm-races -- bin/heroku-php-apache2) <(git diff main..fix-boot-sigterm-races -- bin/heroku-php-nginx)
--- /dev/fd/11 2024-02-16 20:25:10.630578564 +0100
+++ /dev/fd/12 2024-02-16 20:25:10.633258159 +0100
@@ -1,7 +1,7 @@
-diff --git a/bin/heroku-php-apache2 b/bin/heroku-php-apache2
-index cfae5bf4..ff90beb3 100755
---- a/bin/heroku-php-apache2
-+++ b/bin/heroku-php-apache2
+diff --git a/bin/heroku-php-nginx b/bin/heroku-php-nginx
+index 4964afd3..93a1f6bb 100755
+--- a/bin/heroku-php-nginx
++++ b/bin/heroku-php-nginx
@@ -492,18 +492,23 @@ if [[ -n "${DYNO:-}" && "${HEROKU_PHP_GRACEFUL_SIGTERM:-1}" != "0" ]]; then
fi
@@ -110,23 +110,23 @@
) & pids+=($!)
(
-- trap 'echo "httpd" >&3;' EXIT
+- trap 'echo "nginx" >&3;' EXIT
+ trap 'echo "nginx" >&3; wait 2> /dev/null' EXIT
trap '' INT;
# wait for FPM to write its PID file, which it does after initialization
@@ -645,12 +654,12 @@ fi
# execute the command we built earlier, with the correct quoting etc expanded
- "${httpd_command[@]}" & pid=$!
+ "${nginx_command[@]}" & pid=$!
# wait for the pidfile in the trap; otherwise, a previous subshell failing may result in us getting a SIGTERM and forwarding it to the child process before that child process is ready to process signals
-- trap 'echo "Stopping httpd gracefully..." >&2; wait_pid_and_pidfile $pid "$httpd_pidfile"; kill -WINCH $pid 2> /dev/null || true' USR1
-+ trap 'trap "" USR1; trap - EXIT; echo "Stopping httpd gracefully..." >&2; wait_pid_and_pidfile $pid "$httpd_pidfile"; kill -WINCH $pid 2> /dev/null || true; wait $pid 2> /dev/null || true; trap - USR1; kill -USR1 $$' USR1
- # we always want to try and stop gracefully, especially since on Heroku we might be getting a process group wide SIGTERM but running a patched HTTPD that ignores SIGTERM; so if that env var is set, we ignore SIGTERM in this subshell as well
+- trap 'echo "Stopping nginx gracefully..." >&2; wait_pid_and_pidfile $pid "$nginx_pidfile"; kill -QUIT $pid 2> /dev/null || true' USR1
++ trap 'trap "" USR1; trap - EXIT; echo "Stopping nginx gracefully..." >&2; wait_pid_and_pidfile $pid "$nginx_pidfile"; kill -QUIT $pid 2> /dev/null || true; wait $pid 2> /dev/null || true; trap - USR1; kill -USR1 $$' USR1
+ # we always want to try and stop gracefully, especially since on Heroku we might be getting a process group wide SIGTERM but running a patched Nginx that ignores SIGTERM; so if that env var is set, we ignore SIGTERM in this subshell as well
if [[ $graceful_sigterm ]]; then
trap '' TERM
else
-- trap 'echo "Stopping httpd..." >&2; wait_pid_and_pidfile $pid "$httpd_pidfile"; kill -TERM $pid 2> /dev/null || true' TERM
-+ trap 'trap "" TERM; echo "Stopping httpd..." >&2; wait_pid_and_pidfile $pid "$httpd_pidfile"; kill -TERM $pid 2> /dev/null || true; wait $pid 2> /dev/null || true; trap - TERM; kill -TERM $$' TERM
+- trap 'echo "Stopping nginx..." >&2; wait_pid_and_pidfile $pid "$nginx_pidfile"; kill -TERM $pid 2> /dev/null || true' TERM
++ trap 'trap "" TERM; echo "Stopping nginx..." >&2; wait_pid_and_pidfile $pid "$nginx_pidfile"; kill -TERM $pid 2> /dev/null || true; wait $pid 2> /dev/null || true; trap - TERM; kill -TERM $$' TERM
fi
# first, we wait for the pid file to be written, so that we can emit a ready message
It could happen that a controlling process sends
SIGTERM
several times in quick succession. For example,timeout
sometimes behaves this way.When we are, for whatever reason, sent the same signal multiple times, the very first operation inside the trap (
trap - ...
) currently re-sets the signal handler (toSIG_DFL
). Since Bash 4.3 RC2, delivery of signals is no longer blocked during the execution of traps for those same signals (i.e., the same trap handler that is executing fires again during its own execution). Since we've re-set toSIG_DFL
at the beginning of the trap, if anotherTERM
signal arrives while the trap is running, the default behavior of immediate termination kicks in, which also kills the currently running trap and its cleanup procedure.Instead, we now perform a
trap "" ...
to ignore this signal immediately (SIG_IGN
), perform the cleanup, and then re-set the trap to the default handler before triggering it by sending the signal to ourselves - this is the correct behavior for a program that terminates in response to a signal in order for other programs to be able to correctly determine the cause of a termination (so e.g. when receivingSIGTERM
, a program must, as its final action, kill itself using that signal, and notexit
with e.g. 0).This means that the subshells will now exit with a non-0 exit status; this triggers an error condition in the main shell, which fires off the signal handler for the signal received - so a subshell that terminates with
kill -USR1 $$
in response to aUSR1
signal will have an exit status of128+USR1
, and trigger aUSR1
signal in the main shell. That's why we ignore thatUSR1
signal before issuing the cleanup, since it would otherwise trigger the (now reset)EXIT
handler, and terminate the outer shell.This behavior is quite useful for any unhandled signals now - if the main program or a subshell receive e.g. a
USR2
, which by default causes termination, the mainEXIT
trap fires and tears stuff down.There is a possible race condition in Bash's delivery of signals to trap handlers that will occasionally cause the internal trap list to become corrupted if signals arrive in rapid succession between the execution of a trap and the changing of a signal handler (to
SIG_IGN
in our case). This is often, but not always, accompanied by a message like "warning: run_pending_traps: bad value in trap_list[15]
", and it will happen if signals arrive in rapid succession (seerun_pending_traps()
in Bash'strap.c
).If this occurs, the re-setting of the trap to
SIG_DFL
, and the subsequent termination due to thekill -$SIGNAL $$
, will not execute (and the bash process will hang indefinitely) until the trap exits for another reason. We "trigger" this exiting with an explicitexit 0
call in the outer shell's traps (in this case, theexit 0
will not actually execute, and the termination status of the outer process will be the correct e.g. "143
" status code representing termination in response toSIGTERM
). In the subshells, this is not necessary, because they get cleaned up by the outer bash process later.The tests for termination behavior now fire off two
SIGTERMs
in rapid succession (using&
), then a third one after a briefsleep
, to assert that this behavior is correct in all cases.As a bonus, the main script and the subshells now behave correctly (i.e., perform orderly shutdown) when unhandled signals arrive, and even e.g. a single
USR1
from outside to one of the subshells tears stuff down correctly.GUS-W-15071619