Closed kelson42 closed 2 years ago
After restarting the library.kiwix.org docker container, the load went back to normal and all the defunct kiwix-serve processes vanished.
Do you have an idea what is happening? Where do those defunct processes come from? Do they correspond to threads handling concurrent requests? Has the kiwix-serve process run continuously since it was started, or was it automatically restarted multiple times because it became unresponsive? Are there any logs?
@veloman-yunkan The problem seems to come from the kill operation we run here: https://github.com/kiwix/maintenance/blob/master/library-docker/bin/restart-kiwix-serve.sh#L19.
I cannot reproduce the problem locally, but I have tried directly on library.kiwix.org and this really seems to be the root cause. I have even tried with signal 15 (TERM), but I still get zombies.
So, even if we don't have an obvious performance problem (for the moment), we have a clear problem: it does not seem possible to kill the process cleanly without leaving zombies.
An immediate question is: why is a KILL signal used to stop kiwix-serve? Processes should be killed with a KILL signal only as a last resort.
@veloman-yunkan Sending SIGTERM leads to the same behaviour. What would you propose as an alternative way to ask the kiwix-serve daemon to stop?
@kelson42 SIGTERM must be used for terminating a process. Any problems with handling of SIGTERM must be debugged and fixed.
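The "SIGTERM first, SIGKILL only as a last resort" policy can be sketched in Python as follows. This is an illustration only, not the content of restart-kiwix-serve.sh: the function name `stop_gracefully`, the 10-second timeout and the stand-in child process are all assumptions.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc, timeout=10.0):
    """Ask proc to stop with SIGTERM; fall back to SIGKILL only on timeout.

    Returns the exit code reported by wait(). A negative value -N means
    the process was ended by signal N.
    """
    proc.terminate()  # sends SIGTERM on POSIX systems
    try:
        # wait() also reaps the child, so it does not linger as <defunct>.
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # last resort: SIGKILL
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a long-running daemon (hypothetical, not kiwix-serve).
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    rc = stop_gracefully(child)
    print("ended by SIGTERM:", rc == -signal.SIGTERM)
```

Note that this only works for a direct child of the calling process. A daemon that has detached itself has to be reaped by whichever process became its parent.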
@veloman-yunkan Yesterday I rebuilt and relaunched the library container with the latest kiwix-serve and used the opportunity to switch to a proper SIGTERM signal, see: https://github.com/kiwix/maintenance/pull/217/files
After the restart I had no <defunct> kiwix-serve process... but this morning the first ones have already appeared :(
$ ps aux | grep kiwix-serve
root 3607 0.4 0.0 0 0 ? Z 07:02 1:30 [kiwix-serve] <defunct>
kelson 5708 0.0 0.0 12752 876 pts/0 S+ 12:19 0:00 grep kiwix-serve
root 6343 0.2 0.0 0 0 ? Z 09:03 0:24 [kiwix-serve] <defunct>
root 7992 0.0 0.0 0 0 ? Z 09:17 0:09 [kiwix-serve] <defunct>
root 8436 1.9 5.7 97090236 3779152 ? Sl 09:19 3:28 kiwix-serve --daemon --port=8000 --library --monitorLibrary --threads=16 --nodatealias /var/www/library.kiwix.org/library.kiwix.org.xml
root 13923 0.0 0.0 0 0 ? Z 04:19 0:06 [kiwix-serve] <defunct>
root 14456 1.1 0.0 0 0 ? Z 04:21 5:21 [kiwix-serve] <defunct>
root 15077 0.0 0.0 0 0 ? Z 08:40 0:04 [kiwix-serve] <defunct>
root 15643 0.2 0.0 0 0 ? Z 08:42 0:28 [kiwix-serve] <defunct>
root 40612 1.4 0.0 0 0 ? Z Mar24 14:41 [kiwix-serve] <defunct>
My analysis of kiwix/maintenance/library-docker/bin/start.sh and kiwix/maintenance/library-docker/bin/restart-kiwix-serve.sh suggests that the zombie processes of kiwix-serve have nothing to do with its reaction to termination requests.
In library-docker/bin/start.sh, cron is configured to execute restart-kiwix-serve.sh every minute without any arguments. That command only checks if kiwix-serve is alive and, if so, does nothing. Otherwise, it starts a new instance of kiwix-serve.
Attempts to restart (i.e. terminate the running instance of kiwix-serve and start a new one) are made only if restart-kiwix-serve.sh is passed the restart command line argument. This is configured only for the start-up of the container and is meaningless, since on start-up no kiwix-serve should be expected to be already running.
Therefore zombie/<defunct> processes of kiwix-serve appear independently of restart-kiwix-serve.sh (which simply recovers from that situation by launching kiwix-serve again). Why kiwix-serve goes <defunct> has to be debugged.
"My analysis of kiwix/maintenance/library-docker/bin/start.sh and kiwix/maintenance/library-docker/bin/restart-kiwix-serve.sh suggests that the zombie processes of kiwix-serve have nothing to do with its reaction to termination requests."
This is probably wrong. Here is something I did live on library.kiwix.org (unfortunately I cannot reproduce it locally):
$ docker-compose up --detach --force-recreate library
Recreating library ... done
$ date
Sat Mar 26 10:29:08 CET 2022
$ ps aux | grep kiwix-serve
root 6157 0.0 0.0 12752 948 pts/0 S+ 10:29 0:00 grep kiwix-serve
root 36150 19.8 0.8 3991372 561872 ? Sl 10:28 0:06 kiwix-serve --daemon --port=8000 --library --monitorLibrary --threads=16 --nodatealias /var/www/library.kiwix.org/library.kiwix.org.xml
$ date
Sat Mar 26 10:29:45 CET 2022
$ kill -TERM 36150
$ ps aux | grep kiwix-serve
root 7022 0.0 0.0 12752 964 pts/0 S+ 10:30 0:00 grep kiwix-serve
root 36150 6.8 0.0 0 0 ? Z 10:28 0:07 [kiwix-serve] <defunct>
But your analysis is sound: since we use the --monitorLibrary option, kiwix-serve is no longer restarted arbitrarily from this script (besides at container start). That said, kill -TERM still generates such <defunct> processes. Therefore, either there is another root cause or something else kills the process. At least I believe that using kill -TERM might be an interesting "artificial" way to reproduce the bug and diagnose it.
Do you kill a process belonging to a docker container from outside the container?
@veloman-yunkan This is the case here, yes.
It looks like the problem lies with the parent process of kiwix-serve. See https://linuxreviews.org/Defunct_process
Unix manages an explicit parent-child relationship between processes (Windows does not do this).
When a child process dies, the parent process receives a notification. It is then the duty of the parent process to explicitly take notice of the child's demise by using the wait() system call.
The return value of wait() is the process ID of the child, which gives the parent exact control over which of its children are still alive. Upon returning, wait() will have set the integer pointed to by its argument to the exit status of the child. A shell program like "bash" could then decide how to process following commands and set the special $? variable accordingly.
As long as the parent hasn't called wait(), the system needs to keep the dead child in the global process list, because that's the only place where the process ID is stored. The purpose of the "zombies" is really just for the system to remember the process ID, so that it can inform the parent process about it on request.
If the parent "forgets" to collect on its children, then the zombie will stay undead forever.
Well, almost forever. If the parent itself dies, then "init" (the system process with the ID 1) will take over fostership of its children and catch up on the neglected parental duties. This is why you need to identify the parent process and stop or restart it in order to get rid of defunct processes. If the zombie process has id nnnnn, you can do ps -ef | grep nnnnn and find the id of the parent process, which you can then kill if it is no longer needed. Then the defunct process will be removed from the list.
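The mechanism described in the quoted article can be observed directly. The following is a minimal Python sketch for Linux, not taken from kiwix-serve or the maintenance scripts; the helper names `process_state` and `zombie_demo` are made up for illustration. It forks a child that exits at once, checks that the child is reported in state Z for as long as the parent has not called waitpid(), and checks that the process entry is gone afterwards.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None  # no such process, or no /proc on this system
    # The state is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_demo():
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with status 7

    # Parent: deliberately do NOT wait yet. Poll until the kernel reports
    # the child as a zombie (or /proc is unavailable and we get None).
    deadline = time.monotonic() + 5.0
    state_before = process_state(pid)
    while state_before not in ("Z", None) and time.monotonic() < deadline:
        time.sleep(0.01)
        state_before = process_state(pid)

    reaped_pid, status = os.waitpid(pid, 0)  # the parent's duty: reap the child
    state_after = process_state(pid)  # None once the entry has been removed
    return state_before, reaped_pid == pid, os.WEXITSTATUS(status), state_after


if __name__ == "__main__":
    print(zombie_demo())
```

A parent that never reaches the waitpid() call leaves the child in the state that ps shows as <defunct>.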
@kelson42 Will you please find out the parent process of the defunct kiwix-serve processes using the ps -ef command?
$ ps -ef | grep kiwix-serve
root 852 35156 0 10:45 ? 00:00:06 [kiwix-serve] <defunct>
root 966 35156 1 10:47 ? 00:04:26 [kiwix-serve] <defunct>
root 3331 35156 0 Mar26 ? 00:01:53 [kiwix-serve] <defunct>
root 7583 35156 0 Mar26 ? 00:02:24 [kiwix-serve] <defunct>
root 7632 35156 0 Mar26 ? 00:06:35 [kiwix-serve] <defunct>
kelson 21314 21143 0 15:49 pts/2 00:00:00 grep kiwix-serve
root 22859 35156 0 00:23 ? 00:00:06 [kiwix-serve] <defunct>
root 23620 35156 1 00:26 ? 00:12:09 [kiwix-serve] <defunct>
root 35907 35156 0 Mar26 ? 00:03:12 [kiwix-serve] <defunct>
root 36150 35156 0 Mar26 ? 00:00:07 [kiwix-serve] <defunct>
root 39560 35156 0 Mar26 ? 00:01:32 [kiwix-serve] <defunct>
root 43009 35156 0 14:38 ? 00:00:04 [kiwix-serve] <defunct>
root 43192 35156 2 14:40 ? 00:01:47 kiwix-serve --daemon --port=8000 --library --monitorLibrary --threads=16 --nodatealias /var/www/library.kiwix.org/library.kiwix.org.xml
and
$ ps aux | grep 35156
kelson 21660 0.0 0.0 12752 904 pts/2 S+ 15:51 0:00 grep 35156
systemd+ 35156 0.0 0.0 18604 1428 ? SLs Mar26 0:05 /usr/sbin/varnishd -F -a :80 -T localhost:6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,256m
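The lookup done by hand above (read the PPID column of the defunct entries, then look that PID up) can be sketched as a small Python helper. This is only an illustration: the function name `defunct_by_parent` is hypothetical, and it assumes the usual `ps -ef` column order (UID PID PPID C STIME TTY TIME CMD).

```python
from collections import defaultdict

# Three lines copied from the ps -ef output above.
SAMPLE = """\
root 852 35156 0 10:45 ? 00:00:06 [kiwix-serve] <defunct>
root 966 35156 1 10:47 ? 00:04:26 [kiwix-serve] <defunct>
kelson 21314 21143 0 15:49 pts/2 00:00:00 grep kiwix-serve
"""


def defunct_by_parent(ps_ef_output):
    """Group the PIDs of <defunct> processes by their parent PID."""
    parents = defaultdict(list)
    for line in ps_ef_output.splitlines():
        fields = line.split(None, 7)  # the 8th field is the whole command
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # skip the header line and anything malformed
        pid, ppid, cmd = int(fields[1]), int(fields[2]), fields[7]
        if "<defunct>" in cmd:
            parents[ppid].append(pid)
    return dict(parents)


if __name__ == "__main__":
    for ppid, pids in defunct_by_parent(SAMPLE).items():
        print(f"parent {ppid} has not reaped: {pids}")
```

A single parent PID collecting all the zombies, as here, points at that parent rather than at the way the children were terminated.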
This problem seems strongly tied to Docker. It looks to me like the PPID of kiwix-serve is somehow the Varnish process instead of the Bourne shell process. Might be a duplicate of https://github.com/kiwix/maintenance/issues/204.
I think that everything is correct. When started in --daemon mode, kiwix-serve detaches itself from the parent shell process and becomes a child of the init process (which in this container is varnishd).
@veloman-yunkan So, what is wrong in your opinion?
I guess the problem is that we don't have a proper init process in our container (dumb-init or tini, see https://stackoverflow.com/questions/37374310/how-critical-is-dumb-init-for-docker). It would be needed even if kiwix/maintenance#204 is implemented.
@veloman-yunkan Thank you for your analysis on this. @rgaudin @mgautierfr Should we close this ticket in favour of https://github.com/kiwix/maintenance/issues/204 (as duplicate)?
You should use dumb-init (or tini), as @veloman-yunkan said and as was done in https://github.com/kiwix/kiwix-tools/pull/489
If the parent itself dies, then "init" (the system process with the ID 1) will take over fostership of its children and catch up on the neglected parental duties
In our case varnish is PID 1 inside the container, so it becomes the parent of the kiwix-serve process (and never calls wait() for it, as it is not an init process).
kiwix/maintenance#204 is not related to this issue.
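One duty that an init process such as dumb-init or tini takes on, and that varnishd does not, is reaping: calling wait() for whatever children it ends up with. A rough Python sketch of that single duty follows. It is an illustration under assumptions, not the code of either tool, and `reap_children` is a hypothetical name.

```python
import os
import time


def reap_children():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, code) pairs, where code is the exit status,
    or the negated signal number if the child was ended by a signal.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        if os.WIFEXITED(status):
            code = os.WEXITSTATUS(status)
        else:
            code = -os.WTERMSIG(status)
        reaped.append((pid, code))
    return reaped


if __name__ == "__main__":
    # Start a few short-lived children, then reap them as they finish.
    expected = 3
    for code in range(expected):
        if os.fork() == 0:
            os._exit(code)
    done = []
    while len(done) < expected:
        done.extend(reap_children())
        time.sleep(0.05)
    print(sorted(done))
```

A real init would typically run such a loop whenever it receives SIGCHLD, and it also forwards signals such as SIGTERM to the service it supervises; neither part is shown here.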
Closing this as we're not using this image anymore. The new container(s) use the plain kiwix-tools image with kiwix-serve run directly (through its dumb-init), and we're not killing it manually anymore.
We have rolled out this option on library.kiwix.org and it is now really slow. We are not sure this is related, but we are suspicious. CPU usage seems fine, but the overall load is very high.