strongloop / node-foreman

A Node.js Version of Foreman
http://strongloop.github.io/node-foreman/

`nf` claims to have delivered `SIGINT` to all children on exit from one, but does not actually #176

Open benweint opened 11 months ago

benweint commented 11 months ago

The README says:

If your processes exit, Node Foreman will assume an error has occurred and shut your application down.

nf does seem to detect the exit of a single child process, and claims to send a SIGINT to all children in response, but in fact it does not deliver the SIGINT in all cases.

Here's a simple repro case:

❯ cat wait-for-sigint.sh 
#!/bin/bash

function handle_sigint {
    echo "got SIGINT, exiting ..."
    exit
}

trap handle_sigint SIGINT

echo "started, sleeping forever awaiting SIGINT"
sleep 1000000

❯ cat Procfile  
a: sleep 10 && exit 1
b: ./wait-for-sigint.sh

❯ nf start
12:38:47 PM b.1 |  started, sleeping forever awaiting SIGINT
[DONE] Killing all processes with signal  SIGINT
12:38:57 PM a.1 Exited with exit code null

< ... `nf` does not actually exit here, nor does the `b` child process running `wait-for-sigint.sh` ... >

Observations

If I modify wait-for-sigint.sh to emit a constant stream of output while it is waiting, then the test case works as expected:

❯ cat Procfile 
a: sleep 5 && exit 1
b: ./wait-for-sigint-with-output.sh

❯ cat wait-for-sigint-with-output.sh 
#!/bin/bash

function handle_sigint {
    echo "got SIGINT, exiting ..."
    exit
}

trap handle_sigint SIGINT

echo "started, sleeping forever awaiting SIGINT"

while true
do
  echo 'still here'
  sleep 1
done

❯ nf start                               
12:43:41 PM b.1 |  started, sleeping forever awaiting SIGINT
12:43:41 PM b.1 |  still here
12:43:42 PM b.1 |  still here
12:43:43 PM b.1 |  still here
12:43:44 PM b.1 |  still here
12:43:45 PM b.1 |  still here
[DONE] Killing all processes with signal  SIGINT
12:43:45 PM a.1 Exited with exit code null
12:43:46 PM b.1 |  got SIGINT, exiting ...
12:43:46 PM b.1 Exited Successfully

Comparison to other implementations

foreman (Ruby)

❯ foreman start
12:47:50 a.1    | started with pid 59856
12:47:50 b.1    | started with pid 59857
12:47:50 b.1    | started, sleeping forever awaiting SIGINT
12:47:55 a.1    | exited with code 1
12:47:55 system | sending SIGTERM to all processes
12:47:56 b.1    | terminated by SIGTERM

goreman (Go)

goreman has different default behavior with respect to a single child process exiting:

❯ goreman start
12:45:24 a | Starting a on port 5000
12:45:24 b | Starting b on port 5100
12:45:24 b | started, sleeping forever awaiting SIGINT
12:45:29 a | Terminating a

... but with -exit-on-error ('Exit goreman if a subprocess quits with a nonzero return code'):

❯ goreman -exit-on-error start
12:46:10 a | Starting a on port 5000
12:46:10 b | Starting b on port 5100
12:46:10 b | started, sleeping forever awaiting SIGINT
12:46:15 a | Terminating a
12:46:15 b | got SIGINT, exiting ...
12:46:15 b | Terminating b
goreman: exit status 1

benweint commented 11 months ago

Turns out I had misdiagnosed this!

nf really was delivering SIGINT to all of its direct children. However, it doesn't put each spawned child into its own process group, so grandchildren never receive the signal. If a child process spawned its own children and didn't respond to SIGINT by exiting or by forwarding the signal to them, nf would just hang after one child exited.

In the repro case that I gave, the process tree looks like this after a exits:

❯ pstree -s nf.js
... snip ...
     \-+= 83622 ben node nf.js start
       \-+- 83628 ben /bin/bash ./wait-for-sigint.sh
         \--- 83632 ben sleep 1000000

The bash process (pid 83628) actually has received the SIGINT, but per the bash manual:

When Bash receives a signal for which a trap has been set while waiting for a command to complete, the trap will not be executed until the command completes.

So in this example:

  1. a exited
  2. nf sent SIGINT to the direct child process for b (bash, pid=83628)
  3. bash got the SIGINT, but was waiting to invoke the trap handler until the sleep command (pid=83632) exited
  4. The sleep command itself never received the SIGINT
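The deferral in step 3 can be observed in isolation with the small script below. It is a sketch, not part of nf or of the original report: the child is a hypothetical, shortened variant of `wait-for-sigint.sh` whose foreground `sleep` lasts 4 seconds instead of 1000000, and a helper subshell sends SIGINT to the bash process alone, mimicking nf signalling only its direct child. It assumes bash, `mktemp`, `sed`, `date`, and a `sleep` that accepts fractional seconds.

```shell
#!/bin/bash
# Sketch: bash defers a SIGINT trap until its current foreground command exits.
# The child script is a hypothetical, shortened wait-for-sigint.sh whose
# foreground sleep lasts 4 s instead of 1000000 s, so the demo terminates.

child_script=$(mktemp)
trap 'rm -f "$child_script"' EXIT

cat > "$child_script" <<'EOF'
trap 'echo "trap ran at $(date +%s)"' INT
# Helper subshell: after 0.5 s, send SIGINT to this bash process only
# ($$ still names the parent bash inside a subshell), never to its sleep.
( sleep 0.5; echo "SIGINT sent at $(date +%s)"; kill -INT $$ ) &
# Foreground command: bash cannot run the INT trap until this exits.
sleep 4
EOF

# Run the child in the foreground (not with &) so it does not start with
# SIGINT ignored and its trap can actually be installed.
output=$(bash "$child_script")
echo "$output"

sent=$(sed -n 's/^SIGINT sent at //p' <<<"$output")
ran=$(sed -n 's/^trap ran at //p' <<<"$output")
echo "trap ran about $((ran - sent)) s after the signal (whole seconds)"
```

If bash behaves as its manual describes, the two timestamps should differ by roughly the remaining length of the `sleep` rather than by zero. The same rule explains why the variant that loops over `sleep 1` reacts to SIGINT within about a second, while the original `sleep 1000000` never lets the trap run.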

goreman solves this by creating a process group for each spawned child and then delivering SIGINT to the whole group rather than only to the direct child.
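A minimal shell sketch of the process-group approach, side by side with signalling only the direct child. This is not goreman's code or the fork's change; it assumes that bash job control (`set -m`) makes each background job the leader of its own process group (process group ID equal to the job's PID), and that `kill` with a negative PID after `--` signals that whole group. The child script mirrors `wait-for-sigint.sh`, and `is_alive` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: signal a child's whole process group (goreman-style) versus only
# the direct child (nf-style). Not taken from goreman or nf.

# Job control: each `&` job becomes the leader of its own process group
# (PGID == job PID) and, unlike async jobs in a script without job control,
# does not start with SIGINT ignored, so the child's trap can be installed.
set -m

child_script=$(mktemp)
trap 'rm -f "$child_script"' EXIT

# Same shape as wait-for-sigint.sh from the report
cat > "$child_script" <<'EOF'
trap 'echo "got SIGINT, exiting ..."; exit' INT
echo "started, sleeping forever awaiting SIGINT"
sleep 1000000
EOF

# Hypothetical helper: succeeds while the process still exists
is_alive() { kill -0 "$1" 2>/dev/null; }

# 1. Signal only the direct child, as nf does
bash "$child_script" &
direct=$!
sleep 1                                 # give the child time to install its trap
kill -INT "$direct"                     # reaches bash only, not its sleep
sleep 1
if is_alive "$direct"; then direct_result="still running"; else direct_result="exited"; fi
kill -TERM -- -"$direct" 2>/dev/null    # clean up the stuck bash and its sleep

# 2. Signal the child's whole process group, as goreman does
bash "$child_script" &
grouped=$!
sleep 1
kill -INT -- -"$grouped"                # negative PID: every process in the group
sleep 1
if is_alive "$grouped"; then group_result="still running"; else group_result="exited"; fi
kill -TERM -- -"$grouped" 2>/dev/null   # no-op if the group already exited

echo "direct child only: $direct_result"
echo "process group:     $group_result"
```

Under those assumptions the first check should find the child still running (its `sleep` never saw the signal, so the trap stays deferred) and the second should find it gone. Signalling the group means the foreground `sleep` also receives SIGINT and dies from it, after which bash runs its trap right away.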

benweint commented 11 months ago

I've implemented support for using process groups in my fork (https://github.com/benweint/node-foreman/commit/5cb9ee5009772fce10eb1cafd9ffa00b7d780102) and can PR it if there's interest, but it looks like this project might be dead.