tilt-dev / tilt-extensions

Extensions for Tilt
https://tilt.dev/
Apache License 2.0
200 stars 156 forks source link

`restart_process` restart-helper leaks zombie processes #570

Open glennpratt opened 4 months ago

glennpratt commented 4 months ago
root@api-5dd8c8769-nwdv7:/# ps aux wwwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          49  0.0  0.0   4188  3332 pts/0    Ss   14:30   0:00 bash
root          80  0.0  0.0   8088  3940 pts/0    R+   14:31   0:00  \_ ps aux wwwf
root           1  0.0  0.0 703588  1664 ?        Ssl  05:37   0:00 /tilt-restart-wrapper --watch_file=/tmp/.restart-proc sh -c /api-server
root          11  0.0  0.0   1220   104 ?        S    05:37   0:00 /entr -rz sh -c /api-server
root          70  0.0  0.0   2576   860 ?        S    14:30   0:00  \_ sh -c /api-server
root          71  0.1  0.4 1287044 72528 ?       Sl   14:30   0:00      \_ /api-server
root          13  0.0  0.0      0     0 ?        Z    05:37   0:02 [api-server] <defunct>
root          38  0.0  0.0      0     0 ?        Z    14:29   0:00 [api-server] <defunct>

Seems like it might need a minimal init (tini, dumb-init, etc).

I noticed this because entr also doesn't reliably kill the process before starting a new one... perhaps this is because of sh, I'm not sure. That's a different issue.

nicks commented 4 months ago

hmmm...i was not able to reproduce this problem. here are the steps i tried:

  1. cd into tilt-extensions/restart_process/test
  2. Run tilt up -f custom_deploy.Tiltfile
  3. Clicked test_update a bunch of times
  4. Exec'd into the pod and ran ps aux wwwf
  5. Saw that there was exactly one start.sh, as i expected.

when i poked around in the entr repo, i found this issue - https://github.com/eradman/entr/pull/38 - which seems to indicate to me that it's at least expected for entr to kill the process

samuliy commented 1 month ago

I'm running into a sort of similar situation where a process can be left around hanging as a zombie process.

I tried to create a reproducable testcase and came up with something which creates a zombie process. This is not supposed to be a useful or optimized usecase for anything, just commands put together to create a zombie process.

diff --git a/restart_process/test/Dockerfile.test b/restart_process/test/Dockerfile.test
index cfbeb9f..89e11d3 100644
--- a/restart_process/test/Dockerfile.test
+++ b/restart_process/test/Dockerfile.test
@@ -1,5 +1,7 @@
 FROM alpine

+RUN apk add nginx
+
 RUN echo 0 > restart_count.txt

 ADD start.sh /
diff --git a/restart_process/test/start.sh b/restart_process/test/start.sh
index c9bc981..6277158 100755
--- a/restart_process/test/start.sh
+++ b/restart_process/test/start.sh
@@ -12,8 +12,13 @@ handle_sigterm() {

 trap handle_sigterm SIGTERM

-while true
-do
-  echo running
-  sleep 5
+while pgrep -f nginx >/dev/null ; do
+  echo "waiting for nginx to shut down"
+  set +e
+  pkill -f nginx
+  set -e
+  sleep 3
 done
+nginx
+
+tail -F /var/log/nginx/*
  1. Run tilt up -f custom_deploy.Tiltfile
  2. The restart should happen right away with the first start.
  3. The new run of the entrypoint script is left waiting for the previous instance of nginx to shut down, since it will remain there as a zombie process because tilt-restart-wrapper as pid 1 does not call waitpid on it.

Going into the container and running ps faux:

ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 /tilt-restart-wrapper --watch_file=/tmp/.restart-proc sh -c /start.sh
   17 root      0:00 /entr -rz sh -c /start.sh
   25 root      0:00 [nginx]
   62 root      0:00 {start.sh} /bin/sh /start.sh
   78 root      0:00 sleep 3
   79 root      0:00 sh
   85 root      0:00 ps faux

The pid 25 is there as a zombie.


I assume the issue is tilt-restart-wrapper not reaping zombies, which should be a job of a process running as PID 1.