valhalla / valhalla

Open Source Routing Engine for OpenStreetMap
https://valhalla.github.io/valhalla/

docker: trap SIGTERM in valhalla_route_service #634

Open missinglink opened 7 years ago

missinglink commented 7 years ago

heya, I've been running valhalla using Docker and docker-compose.

when running the docker-compose down command a SIGTERM signal is first sent to the process and then it waits an interval (default 10s) before sending it a SIGKILL.

it seems like valhalla_route_service is ignoring the SIGTERM signal.

because docker runs the process as PID 1, signal handling behaves a little differently; the signal may need to be trapped explicitly in code.

eg: (in nodejs):

process.on('SIGTERM', function(){ app.close(); });

the benefit is that valhalla docker containers would restart almost instantly, rather than waiting ~10s for the SIGKILL.

missinglink commented 7 years ago

ref: https://www.ctl.io/developers/blog/post/gracefully-stopping-docker-containers/

explains it better than me :)

docker stop

When you issue a docker stop command Docker will first ask nicely for the process to stop and if it doesn't comply within 10 seconds it will forcibly kill it. If you've ever issued a docker stop and had to wait 10 seconds for the command to return you've seen this in action.

The docker stop command attempts to stop a running container first by sending a SIGTERM signal to the root process (PID 1) in the container. If the process hasn't exited within the timeout period a SIGKILL signal will be sent.

Whereas a process can choose to ignore a SIGTERM, a SIGKILL goes straight to the kernel which will terminate the process. The process never even gets to see the signal.

When using docker stop the only thing you can control is the number of seconds that the Docker daemon will wait before sending the SIGKILL:

docker stop --time=30 foo
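
The stop sequence described in the quote (SIGTERM, a grace period, then SIGKILL) can be sketched in a few lines of POSIX C++. This is not Docker's code; `stop_with_grace` is a hypothetical helper that assumes `pid` is a child of the caller, and it only illustrates why a process that never reacts to SIGTERM costs the full timeout.

```cpp
#include <chrono>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

// Hypothetical helper (not part of Docker or valhalla): ask a child process to
// stop with SIGTERM, poll for up to `grace`, then fall back to SIGKILL.
// Returns the signal that terminated the child, 0 if it exited normally,
// or -1 on error.
int stop_with_grace(pid_t pid, std::chrono::milliseconds grace) {
  if (kill(pid, SIGTERM) != 0) return -1;  // the polite request
  const auto deadline = std::chrono::steady_clock::now() + grace;
  int status = 0;
  for (;;) {
    const pid_t reaped = waitpid(pid, &status, WNOHANG);
    if (reaped == pid) break;  // the child has terminated
    if (reaped < 0) return -1;
    if (std::chrono::steady_clock::now() >= deadline) {
      kill(pid, SIGKILL);  // cannot be caught or ignored
      if (waitpid(pid, &status, 0) != pid) return -1;
      break;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A child that honours SIGTERM makes this return almost immediately with SIGTERM; one that ignores it makes the caller wait the whole grace period before SIGKILL is reported.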
kevinkreiser commented 7 years ago

@missinglink yeah we've done this for test programs (so that they get killed by SIGALRM) but sadly haven't yet made it a habit for the main programs. as a first step we can add this to the main all-in-one service and essentially copy-paste this bit of code wherever it's needed. all we need to do is something like:

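#include <csignal>
#include <cstdlib>
int main(int argc, char** argv) {
  // exit as soon as SIGTERM arrives instead of relying on the default disposition
  std::signal(SIGTERM, [](int sig_num)->void{ std::exit(sig_num); });
  // ... start the service here ...
  return 0;
}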
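
One caveat with calling `exit` from a handler: POSIX does not list `exit` among the async-signal-safe functions (`_exit` and `_Exit` are). A common alternative, sketched below as an assumption rather than as what valhalla ended up doing, is to have the handler only set a flag and let the main loop shut down in its own time. The names `install_stop_handlers` and `stop_requested` are made up for this example.

```cpp
#include <csignal>

// Written by the signal handler, read by the main loop.
volatile std::sig_atomic_t g_stop_requested = 0;

// The handler does only async-signal-safe work: set the flag and return.
void on_stop_signal(int) { g_stop_requested = 1; }

// Hypothetical helper: trap the signals that normally mean "please stop".
void install_stop_handlers() {
  std::signal(SIGTERM, on_stop_signal);
  std::signal(SIGINT, on_stop_signal);
}

// The main loop polls this between units of work.
bool stop_requested() { return g_stop_requested != 0; }
```

A service would call `install_stop_handlers()` first thing in `main` and then loop on `while (!stop_requested())`, leaving through the normal return path so destructors and cleanup still run.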
kevinkreiser commented 7 years ago

something strange seems to be going on in docker land... so without any change to the code i just wanted to see that indeed our process sticks around if you send SIGTERM to it...

kkreiser@HP-ZBook-15:~/sandbox/valhalla/valhalla/.libs$ ./valhalla_route_service ../conf.json 1
2017/03/30 16:46:16.921048 [INFO] Tile extract successfully loaded
^C
kkreiser@HP-ZBook-15:~/sandbox/valhalla/valhalla/.libs$ ./valhalla_route_service ../conf.json 1 &
[1] 11599
kkreiser@HP-ZBook-15:~/sandbox/valhalla/valhalla/.libs$ 2017/03/30 16:46:30.327750 [INFO] Tile extract successfully loaded

kkreiser@HP-ZBook-15:~/sandbox/valhalla/valhalla/.libs$ kill -l
 1) SIGHUP   2) SIGINT   3) SIGQUIT  4) SIGILL   5) SIGTRAP
 6) SIGABRT  7) SIGBUS   8) SIGFPE   9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG  24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF 28) SIGWINCH    29) SIGIO   30) SIGPWR
31) SIGSYS  34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX    
kkreiser@HP-ZBook-15:~/sandbox/valhalla/valhalla/.libs$ kill -15 11599
kkreiser@HP-ZBook-15:~/sandbox/valhalla/valhalla/.libs$ 
[1]+  Terminated              ./valhalla_route_service ../conf.json 1
kkreiser@HP-ZBook-15:~/sandbox/valhalla/valhalla/.libs$ 

the above would suggest that the program quits when it gets SIGTERM so i'm at a loss here as to what's going on in docker :frowning_face: any ideas?

missinglink commented 7 years ago

hmm... I'm not sure what's going on then..

I can confirm that the docker container appears to be hanging around until the timeout is reached (10s) when a docker-compose down is executed.

few ideas:

I will investigate further tomorrow and see what's going on, it's possible that I made an error in my configuration which is causing the problem, so I'll double-check that.

missinglink commented 7 years ago

this is the relevant section of the Dockerfile:

CMD valhalla_route_service valhalla.json 1

what does the 1 do there? I copied that off a readme

edit: never mind, I RTFM

  //number of workers to use at each stage
  auto worker_concurrency = std::thread::hardware_concurrency();
  if(argc > 2)
    worker_concurrency = std::stoul(argv[2]);
kevinkreiser commented 7 years ago

@missinglink it shouldn't take time to wind down, when i ctrl-c it, it goes down instantly. i think your third bullet is the most likely culprit here, of course it's also the hardest to test. maybe worth writing a small dockerized program just to see what signals are sent when. actually, we could do that with just some bash...
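
The "small program that shows which signals arrive" idea could also be sketched in C++ instead of bash. The following is only an illustration: `watch_signals` and the choice of signals are assumptions for this example, and the handler sticks to `write`, which is async-signal-safe.

```cpp
#include <csignal>
#include <signal.h>
#include <unistd.h>

// Bookkeeping the handler updates; sig_atomic_t is safe to write from a handler.
volatile std::sig_atomic_t g_last_signal = 0;
volatile std::sig_atomic_t g_signal_count = 0;

// Record the signal and emit a short note on stderr.
void record_signal(int sig) {
  g_last_signal = sig;
  g_signal_count = g_signal_count + 1;
  const char msg[] = "signal received\n";
  const ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
  (void)ignored;
}

// Hypothetical helper: install the recording handler for a handful of signals.
// Returns how many handlers were installed successfully.
int watch_signals() {
  const int signals[] = {SIGHUP, SIGINT, SIGQUIT, SIGTERM, SIGUSR1, SIGUSR2};
  struct sigaction sa = {};
  sa.sa_handler = record_signal;
  sigemptyset(&sa.sa_mask);
  int installed = 0;
  for (int sig : signals) {
    if (sigaction(sig, &sa, nullptr) == 0) ++installed;
  }
  return installed;
}
```

Run as the container's main process with a loop that calls `pause()` and prints `g_last_signal` after each wake-up, this would show whether SIGTERM reaches the process at all during `docker-compose down`.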

missinglink commented 7 years ago

so.. I have officially dived down the rabbit-hole which is docker and came up with only more questions... :)

I set up a container running the server with the CMD as such:

CMD valhalla_route_service valhalla.json 1

I then created an interactive bash shell inside the running container and ran ps:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        26  0.0  0.0  19880  3644 ?        Ss   16:06   0:00 bash
root        39  0.0  0.0  36088  3204 ?        R+   16:07   0:00  \_ ps auxf
root         1  0.0  0.0   4512   804 ?        Ss   16:02   0:00 /bin/sh -c valhalla_route_service /data/valhalla.json 1
root         6  0.0  0.1 1276788 32476 ?       Sl   16:02   0:00 valhalla_route_service /data/valhalla.json 1

it seems that when the CMD is in 'shell form', it is executed with /bin/sh -c.

sending kill -15 1 had no effect but sending kill -15 6 killed the process and the container exited.


I then changed the CMD definition to exec form (ie. array form) as such:

CMD ["valhalla_route_service", "/data/valhalla.json", "1"]

now the binary is executed directly and becomes PID 1

root@18bf075d3d11:/data# ps auxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        25  0.3  0.0  19880  3560 ?        Ss   16:15   0:00 bash
root        34  0.0  0.0  36088  3184 ?        R+   16:15   0:00  \_ ps auxf
root         1  0.5  0.1 1276788 28496 ?       Ssl  16:15   0:00 valhalla_route_service /data/valhalla.json 1

I again opened a bash shell in the running container and ran kill -15 1, nothing happened!


In either form I find that the container takes >10s to come 'down':

$ time docker-compose down
Stopping valhalla ... done
Removing valhalla ... done
Removing network valhallaissue634_default

real    0m11.280s
user    0m0.340s
sys     0m0.036s
¯\_(ツ)_/¯
missinglink commented 7 years ago

docker files: https://github.com/missinglink/valhalla-issue-634

missinglink commented 7 years ago

right! so I read this https://www.fpcomplete.com/blog/2016/10/docker-demons-pid1-orphans-zombies-signals.

The reason for this is some Linux kernel magic: the kernel treats a process with PID 1 specially, and does not, by default, kill the process when receiving the SIGTERM or SIGINT signals. This can be very surprising behavior.

the tl;dr is that an explicit signal handler must be defined in any process which could be run as PID 1
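
To make that rule checkable, a process can inspect its own SIGTERM disposition at startup. The sketch below uses plain POSIX calls; `has_default_disposition` and `sigterm_would_be_dropped` are hypothetical names, and the PID 1 check reflects the kernel behaviour described in the quote rather than anything specific to valhalla.

```cpp
#include <signal.h>
#include <unistd.h>

// True if `sig` still has its default disposition in this process,
// i.e. nobody has installed a handler or chosen to ignore it.
bool has_default_disposition(int sig) {
  struct sigaction current = {};
  if (sigaction(sig, nullptr, &current) != 0) return false;
  return current.sa_handler == SIG_DFL;
}

// Hypothetical startup check: as PID 1 with no SIGTERM handler installed,
// the process is likely to sit through docker stop until SIGKILL arrives.
bool sigterm_would_be_dropped() {
  return getpid() == 1 && has_default_disposition(SIGTERM);
}
```

Logging a warning when `sigterm_would_be_dropped()` returns true would have pointed at the cause of the 10s delay straight away.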

missinglink commented 7 years ago

interesting read: https://github.com/phusion/baseimage-docker/blob/next/README.md