kiwix / container-images

10 stars 4 forks source link

kiwix-serve leaves zombies if killed (was: Does kiwix-serve --monitoringLibrary uses too much resources?) #222

Closed kelson42 closed 2 years ago

kelson42 commented 2 years ago

We have rolled out this option on library.kiwix.org and this is now really slow. We are not sure this is related, but we are suspicious. CPU seems good but overall load is super high

$ ps aux | grep kiwix-serve
root      1435  0.0  0.0      0     0 ?        Z    Feb21   2:04 [kiwix-serve] <defunct>
root      2157  0.0  0.0      0     0 ?        Z    Feb24   0:03 [kiwix-serve] <defunct>
root      2417  0.0  0.0      0     0 ?        Z    Feb24   0:11 [kiwix-serve] <defunct>
root      3527  0.0  0.0      0     0 ?        Z    Feb26   0:33 [kiwix-serve] <defunct>
root      3656  0.0  0.0      0     0 ?        Z    Feb24   0:03 [kiwix-serve] <defunct>
root      4185  0.0  0.0      0     0 ?        Z    Feb24   0:07 [kiwix-serve] <defunct>
root      4340  0.1  0.0      0     0 ?        Z    Feb23   8:09 [kiwix-serve] <defunct>
root      5053  0.0  0.0      0     0 ?        Z    Feb23   4:25 [kiwix-serve] <defunct>
root      5606  0.0  0.0    220     0 ?        Ss   Jan31   0:00 /usr/bin/dumb-init -- kiwix-serve-folder /data
root      5979  0.5  3.3 43307740 2218284 ?    Ssl  Jan31 212:42 kiwix-serve --library --monitorLibrary /tmp/library.xml
root      6002  0.0  0.0      0     0 ?        Z    Jan31   0:00 [kiwix-serve-fol] <defunct>
root      6680  0.0  0.0      0     0 ?        Z    Feb23   5:14 [kiwix-serve] <defunct>
root      7740  0.0  0.0      0     0 ?        Z    Feb25   0:06 [kiwix-serve] <defunct>
root      8432  0.0  0.0      0     0 ?        Z    Feb23   1:59 [kiwix-serve] <defunct>
root      8568  0.0  0.0      0     0 ?        Z    Feb25   3:15 [kiwix-serve] <defunct>
root      9381  0.0  0.0      0     0 ?        Z    Feb23   1:04 [kiwix-serve] <defunct>
root     10692  1.2  0.0      0     0 ?        Z    08:29   0:16 [kiwix-serve] <defunct>
root     12333  0.0  0.0      0     0 ?        Z    Feb24   2:06 [kiwix-serve] <defunct>
root     12432  0.0  0.0      0     0 ?        Z    Feb24   2:42 [kiwix-serve] <defunct>
root     12676  0.0  0.0      0     0 ?        Z    Feb25   0:07 [kiwix-serve] <defunct>
root     13696  0.2  0.0      0     0 ?        Z    Feb25  12:47 [kiwix-serve] <defunct>
root     15056  0.0  0.0      0     0 ?        Z    04:16   0:07 [kiwix-serve] <defunct>
root     15137  0.0  0.0      0     0 ?        Z    Feb27   0:03 [kiwix-serve] <defunct>
root     15384  0.0  0.0      0     0 ?        Z    Feb27   0:04 [kiwix-serve] <defunct>
root     15941  0.0  0.0      0     0 ?        Z    Feb27   0:25 [kiwix-serve] <defunct>
root     15957  0.0  0.0      0     0 ?        Z    Feb21   0:05 [kiwix-serve] <defunct>
root     16407  0.0  0.0      0     0 ?        Z    Feb27   0:49 [kiwix-serve] <defunct>
root     16886  0.0  0.0      0     0 ?        Z    Feb26   0:07 [kiwix-serve] <defunct>
root     17830  0.0  0.0      0     0 ?        Z    Feb26   1:14 [kiwix-serve] <defunct>
root     17999  0.1  0.0      0     0 ?        Z    Feb23  10:31 [kiwix-serve] <defunct>
root     18212  0.0  0.0      0     0 ?        Z    Feb27   0:04 [kiwix-serve] <defunct>
root     18309  0.0  0.0      0     0 ?        Z    Feb27   0:04 [kiwix-serve] <defunct>
root     18464  0.0  0.0      0     0 ?        Z    Feb27   0:04 [kiwix-serve] <defunct>
root     18570  0.0  0.0      0     0 ?        Z    Feb27   0:19 [kiwix-serve] <defunct>
root     18753  0.1  0.0      0     0 ?        Z    Feb27   2:16 [kiwix-serve] <defunct>
root     19434  3.8  0.4 2230528 286528 ?      Sl   08:50   0:04 kiwix-serve --daemon --port=8000 --library --monitorLibrary --threads=16 --nodatealias /var/www/library.kiwix.org/library.kiwix.org.xml
kelson   20185  0.0  0.0  12752   952 pts/1    S+   08:52   0:00 grep kiwix-serve
root     20218  0.0  0.0      0     0 ?        Z    Feb25   0:04 [kiwix-serve] <defunct>
root     21295  0.3  0.0      0     0 ?        Z    Feb22  30:13 [kiwix-serve] <defunct>
root     21363  0.0  0.0      0     0 ?        Z    Feb27   0:04 [kiwix-serve] <defunct>
root     21743  0.3  0.0      0     0 ?        Z    Feb27   6:42 [kiwix-serve] <defunct>
root     21912  0.1  0.0      0     0 ?        Z    Feb27   0:55 [kiwix-serve] <defunct>
root     22078  0.0  0.0      0     0 ?        Z    Feb25   0:03 [kiwix-serve] <defunct>
root     22452  0.0  0.0      0     0 ?        Z    Feb27   0:03 [kiwix-serve] <defunct>
root     22687  0.0  0.0      0     0 ?        Z    Feb25   2:24 [kiwix-serve] <defunct>
root     22739  0.2  0.0      0     0 ?        Z    Feb27   1:33 [kiwix-serve] <defunct>
root     23013  0.0  0.0      0     0 ?        Z    Feb25   0:09 [kiwix-serve] <defunct>
root     23647  0.0  0.0      0     0 ?        Z    Feb25   0:03 [kiwix-serve] <defunct>
root     23764  0.0  0.0      0     0 ?        Z    Feb27   0:13 [kiwix-serve] <defunct>
root     23816  0.0  0.0      0     0 ?        Z    Feb25   0:21 [kiwix-serve] <defunct>
root     23915  0.1  0.0      0     0 ?        Z    Feb26   3:08 [kiwix-serve] <defunct>
root     23955  0.0  0.0      0     0 ?        Z    Feb27   0:01 [kiwix-serve] <defunct>
root     24189  0.0  0.0      0     0 ?        Z    Feb26   0:02 [kiwix-serve] <defunct>
root     24624  0.0  0.0      0     0 ?        Z    Feb25   0:51 [kiwix-serve] <defunct>
root     25329  0.0  0.0      0     0 ?        Z    Feb24   0:06 [kiwix-serve] <defunct>
root     25681  0.0  0.0      0     0 ?        Z    Feb26   1:38 [kiwix-serve] <defunct>
root     26777  0.0  0.0      0     0 ?        Z    Feb25   0:06 [kiwix-serve] <defunct>
root     27225  0.0  0.0      0     0 ?        Z    Feb25   3:02 [kiwix-serve] <defunct>
root     27407  0.4  0.0      0     0 ?        Z    Feb27   4:40 [kiwix-serve] <defunct>
root     27626  0.1  0.0      0     0 ?        Z    Feb23  14:56 [kiwix-serve] <defunct>
root     28213  0.0  0.0      0     0 ?        Z    Feb21   3:00 [kiwix-serve] <defunct>
root     28293  0.0  0.0      0     0 ?        Z    Feb27   0:03 [kiwix-serve] <defunct>
root     28335  0.0  0.0      0     0 ?        Z    Feb25   1:01 [kiwix-serve] <defunct>
root     28454  0.0  0.0      0     0 ?        Z    Feb27   0:04 [kiwix-serve] <defunct>
root     28791  0.0  0.0      0     0 ?        Z    Feb27   0:17 [kiwix-serve] <defunct>
root     28840  0.0  0.0      0     0 ?        Z    Feb27   0:06 [kiwix-serve] <defunct>
root     29214  0.0  0.0      0     0 ?        Z    Feb27   0:03 [kiwix-serve] <defunct>
root     29332  0.0  0.0      0     0 ?        Z    02:18   0:07 [kiwix-serve] <defunct>
root     29557  0.0  0.0      0     0 ?        Z    Feb27   0:07 [kiwix-serve] <defunct>
root     29966  0.0  0.0      0     0 ?        Z    03:16   0:03 [kiwix-serve] <defunct>
root     30199  0.0  0.0      0     0 ?        Z    Feb25   0:42 [kiwix-serve] <defunct>
root     30322  0.0  0.0      0     0 ?        Z    Feb27   0:03 [kiwix-serve] <defunct>
root     30518  0.0  0.0      0     0 ?        Z    03:18   0:06 [kiwix-serve] <defunct>
root     30606  0.0  0.0      0     0 ?        Z    Feb27   0:03 [kiwix-serve] <defunct>
root     30895  0.0  0.0      0     0 ?        Z    Feb27   0:06 [kiwix-serve] <defunct>
root     30988  0.0  0.0      0     0 ?        Z    03:20   0:07 [kiwix-serve] <defunct>
root     31396  0.0  0.0      0     0 ?        Z    03:22   0:04 [kiwix-serve] <defunct>
root     31499  0.0  0.0      0     0 ?        Z    Feb25   0:03 [kiwix-serve] <defunct>
root     31674  0.0  0.0      0     0 ?        Z    02:25   0:10 [kiwix-serve] <defunct>
root     31812  0.0  0.0      0     0 ?        Z    03:24   0:07 [kiwix-serve] <defunct>
root     31834  0.0  0.0      0     0 ?        Z    Feb27   0:04 [kiwix-serve] <defunct>
root     31960  0.0  0.0      0     0 ?        Z    Feb25   0:03 [kiwix-serve] <defunct>
root     32141  0.0  0.0      0     0 ?        Z    Feb27   0:06 [kiwix-serve] <defunct>
root     32237  0.0  0.0      0     0 ?        Z    Feb25   0:03 [kiwix-serve] <defunct>
root     32511  0.0  0.0      0     0 ?        Z    Feb27   0:12 [kiwix-serve] <defunct>
root     32575  0.0  0.0      0     0 ?        Z    Feb25   0:08 [kiwix-serve] <defunct>
root     32763  0.0  0.0      0     0 ?        Z    Feb27   0:03 [kiwix-serve] <defunct>
root     32812  0.0  0.0      0     0 ?        Z    03:28   0:03 [kiwix-serve] <defunct>
root     32980  0.0  0.0      0     0 ?        Z    03:29   0:04 [kiwix-serve] <defunct>
root     33311  0.0  0.0      0     0 ?        Z    04:19   0:06 [kiwix-serve] <defunct>
root     33375  0.0  0.0      0     0 ?        Z    03:31   0:12 [kiwix-serve] <defunct>
root     33380  0.2  0.0      0     0 ?        Z    Feb25   7:47 [kiwix-serve] <defunct>
root     33464  0.0  0.0      0     0 ?        Z    Feb27   0:08 [kiwix-serve] <defunct>
root     33520  0.0  0.0      0     0 ?        Z    Feb27   0:33 [kiwix-serve] <defunct>
root     33791  0.1  0.0      0     0 ?        Z    04:21   0:24 [kiwix-serve] <defunct>
root     34100  0.0  0.0      0     0 ?        Z    Feb25   0:04 [kiwix-serve] <defunct>
root     34272  0.0  0.0      0     0 ?        Z    Feb25   0:17 [kiwix-serve] <defunct>
root     35003  0.0  0.0      0     0 ?        Z    Feb25   0:51 [kiwix-serve] <defunct>
root     35182  0.0  0.0      0     0 ?        Z    01:50   0:04 [kiwix-serve] <defunct>
root     35327  0.0  0.0      0     0 ?        Z    03:39   0:03 [kiwix-serve] <defunct>
root     35700  0.0  0.0      0     0 ?        Z    03:40   0:04 [kiwix-serve] <defunct>
root     35894  0.0  0.0      0     0 ?        Z    01:52   0:09 [kiwix-serve] <defunct>
root     36334  0.0  0.0      0     0 ?        Z    03:42   0:03 [kiwix-serve] <defunct>
root     36359  0.0  0.0      0     0 ?        Z    02:37   0:04 [kiwix-serve] <defunct>
root     36391  0.0  0.0      0     0 ?        Z    Feb24   0:04 [kiwix-serve] <defunct>
root     36657  0.0  0.0      0     0 ?        Z    03:43   0:07 [kiwix-serve] <defunct>
root     36755  0.0  0.0      0     0 ?        Z    Feb24   0:04 [kiwix-serve] <defunct>
root     36768  0.0  0.0      0     0 ?        Z    Feb26   0:03 [kiwix-serve] <defunct>
root     36989  0.0  0.0      0     0 ?        Z    02:39   0:06 [kiwix-serve] <defunct>
root     37267  0.0  0.0      0     0 ?        Z    03:46   0:04 [kiwix-serve] <defunct>
root     37304  0.0  0.0      0     0 ?        Z    Feb24   0:05 [kiwix-serve] <defunct>
root     37499  0.0  0.0      0     0 ?        Z    00:51   0:03 [kiwix-serve] <defunct>
root     37511  0.0  0.0      0     0 ?        Z    Feb26   0:06 [kiwix-serve] <defunct>
root     37702  0.0  0.0      0     0 ?        Z    Feb24   0:09 [kiwix-serve] <defunct>
root     37733  0.0  0.0      0     0 ?        Z    02:42   0:03 [kiwix-serve] <defunct>
root     37772  0.0  0.0      0     0 ?        Z    00:52   0:02 [kiwix-serve] <defunct>
root     37808  0.0  0.0      0     0 ?        Z    03:48   0:12 [kiwix-serve] <defunct>
root     37838  0.0  0.0      0     0 ?        Z    01:56   0:12 [kiwix-serve] <defunct>
root     37862  0.0  0.0      0     0 ?        Z    Feb26   0:09 [kiwix-serve] <defunct>
root     38067  0.0  0.0      0     0 ?        Z    00:54   0:03 [kiwix-serve] <defunct>
root     38205  0.0  0.0      0     0 ?        Z    02:44   0:05 [kiwix-serve] <defunct>
root     38253  0.1  0.0      0     0 ?        Z    00:55   0:31 [kiwix-serve] <defunct>
root     38409  0.0  0.0      0     0 ?        Z    Feb24   1:08 [kiwix-serve] <defunct>
root     38464  0.0  0.0      0     0 ?        Z    Feb26   0:19 [kiwix-serve] <defunct>
root     38882  0.0  0.0      0     0 ?        Z    02:48   0:04 [kiwix-serve] <defunct>
root     39462  0.0  0.0      0     0 ?        Z    02:50   0:04 [kiwix-serve] <defunct>
root     39720  0.0  0.0      0     0 ?        Z    Feb26   0:03 [kiwix-serve] <defunct>
root     39769  0.0  0.0      0     0 ?        Z    02:52   0:07 [kiwix-serve] <defunct>
root     40105  0.0  0.0      0     0 ?        Z    03:53   0:10 [kiwix-serve] <defunct>
root     40198  0.0  0.0      0     0 ?        Z    02:55   0:12 [kiwix-serve] <defunct>
root     40470  0.0  0.0      0     0 ?        Z    Feb26   0:29 [kiwix-serve] <defunct>
root     40506  0.0  0.0      0     0 ?        Z    Feb22   0:41 [kiwix-serve] <defunct>
root     40566  0.0  0.0      0     0 ?        Z    Feb26   0:09 [kiwix-serve] <defunct>
root     40661  0.0  0.0      0     0 ?        Z    Feb26   0:04 [kiwix-serve] <defunct>
root     40772  0.0  0.0      0     0 ?        Z    02:07   0:09 [kiwix-serve] <defunct>
root     40841  0.1  0.0      0     0 ?        Z    Feb26   2:23 [kiwix-serve] <defunct>
root     41031  0.0  0.0      0     0 ?        Z    Feb24   0:06 [kiwix-serve] <defunct>
root     41237  0.0  0.0      0     0 ?        Z    Feb26   0:06 [kiwix-serve] <defunct>
root     41278  0.0  0.0      0     0 ?        Z    06:51   0:04 [kiwix-serve] <defunct>
root     41302  0.0  0.0      0     0 ?        Z    Feb24   1:37 [kiwix-serve] <defunct>
root     41637  0.1  0.0      0     0 ?        Z    06:53   0:13 [kiwix-serve] <defunct>
root     41688  0.0  0.0      0     0 ?        Z    Feb26   0:03 [kiwix-serve] <defunct>
root     41865  0.0  0.0      0     0 ?        Z    Feb26   0:14 [kiwix-serve] <defunct>
root     42059  0.0  0.0      0     0 ?        Z    Feb26   0:47 [kiwix-serve] <defunct>
root     42237  0.0  0.0      0     0 ?        Z    04:58   0:04 [kiwix-serve] <defunct>
root     42740  0.4  0.0      0     0 ?        Z    05:01   1:02 [kiwix-serve] <defunct>
root     43091  0.0  0.0      0     0 ?        Z    Feb26   0:26 [kiwix-serve] <defunct>
root     43371  0.1  0.0      0     0 ?        Z    07:03   0:10 [kiwix-serve] <defunct>
root     43786  0.0  0.0      0     0 ?        Z    Feb26   0:03 [kiwix-serve] <defunct>
root     44008  0.0  0.0      0     0 ?        Z    Feb22   0:03 [kiwix-serve] <defunct>
root     44095  0.0  0.0      0     0 ?        Z    04:07   0:05 [kiwix-serve] <defunct>
root     44244  0.1  0.0      0     0 ?        Z    Feb22   9:35 [kiwix-serve] <defunct>
root     44326  0.0  0.0      0     0 ?        Z    Feb26   0:04 [kiwix-serve] <defunct>
root     44532  0.1  0.0      0     0 ?        Z    07:10   0:07 [kiwix-serve] <defunct>
root     44638  0.0  0.0      0     0 ?        Z    04:09   0:05 [kiwix-serve] <defunct>
root     44682  0.0  0.0      0     0 ?        Z    Feb26   0:24 [kiwix-serve] <defunct>
root     44789  0.0  0.0      0     0 ?        Z    Feb24   2:37 [kiwix-serve] <defunct>
root     45043  0.6  0.0      0     0 ?        Z    07:13   0:37 [kiwix-serve] <defunct>
root     45311  0.0  0.0      0     0 ?        Z    04:13   0:06 [kiwix-serve] <defunct>
root     45338  0.0  0.0      0     0 ?        Z    Feb24   0:04 [kiwix-serve] <defunct>
root     45850  0.0  0.0      0     0 ?        Z    Feb22   0:07 [kiwix-serve] <defunct>
root     46291  0.0  0.0      0     0 ?        Z    Feb22   2:19 [kiwix-serve] <defunct>
root     46385  0.0  0.0      0     0 ?        Z    Feb24   0:06 [kiwix-serve] <defunct>
root     46869  0.0  0.0      0     0 ?        Z    Feb24   0:03 [kiwix-serve] <defunct>
root     47155  0.0  0.0      0     0 ?        Z    Feb24   0:04 [kiwix-serve] <defunct>
root     47401  0.0  0.0      0     0 ?        Z    Feb26   0:38 [kiwix-serve] <defunct>
root     47590  0.0  0.0      0     0 ?        Z    Feb24   0:03 [kiwix-serve] <defunct>
root     47792  0.0  0.0      0     0 ?        Z    Feb24   0:09 [kiwix-serve] <defunct>
root     48492  0.0  0.0      0     0 ?        Z    Feb24   0:14 [kiwix-serve] <defunct>
kelson42 commented 2 years ago

After restarting the library.kiwix.org docker container, then the load was back to normal and all the defunct kiwix-serve process have vanished

veloman-yunkan commented 2 years ago

Do you have an idea what happens? Where do those defunct processes come from? Do they correspond to threads handling concurrent requests? Did the kiwix-serve process run continuously since it was started or it was automatically restarted multiple times because of becoming unresponsive? Are there any logs?

kelson42 commented 2 years ago

@veloman-yunkan The problems seems to come from the kill operation we rung here https://github.com/kiwix/maintenance/blob/master/library-docker/bin/restart-kiwix-serve.sh#L19.

I can not reproduce the problem locally, but I have tried directly on library.kiwix.org and this really seems to be the root cause. I even have tried with 15 (TERM) but doing this I still get zombies.

So, if we don't have an obvious performance problem (for the moment), we have a clear problem that it seems not possible to kill cleanly the process without leaving zombies.

veloman-yunkan commented 2 years ago

@veloman-yunkan The problems seems to come from the kill operation we rung here https://github.com/kiwix/maintenance/blob/master/library-docker/bin/restart-kiwix-serve.sh#L19.

An immediate question is: why is a KILL signal used to stop kiwix-serve? Processes should be killed with a KILL signal only as a last resort.

kelson42 commented 2 years ago

@veloman-yunkan Sending SIGTERM leads to the same behaviour. What would you propose as alternative to request kiwix-serve daemon to stop?

veloman-yunkan commented 2 years ago

@kelson42 SIGTERM must be used for terminating a process. Any problems with handling of SIGTERM must be debugged and fixed.

kelson42 commented 2 years ago

@veloman-yunkan Yesterday, I have rebuilded/relaunched the library container with latest kiwix-serve and used the opportunity to put a proper SIGTERM signal, see: https://github.com/kiwix/maintenance/pull/217/files

After restart I had to <defunct> kiwix-serve process... but this morning, the first one have already appeared :(

$ ps aux | grep kiwix-serve
root      3607  0.4  0.0      0     0 ?        Z    07:02   1:30 [kiwix-serve] <defunct>
kelson    5708  0.0  0.0  12752   876 pts/0    S+   12:19   0:00 grep kiwix-serve
root      6343  0.2  0.0      0     0 ?        Z    09:03   0:24 [kiwix-serve] <defunct>
root      7992  0.0  0.0      0     0 ?        Z    09:17   0:09 [kiwix-serve] <defunct>
root      8436  1.9  5.7 97090236 3779152 ?    Sl   09:19   3:28 kiwix-serve --daemon --port=8000 --library --monitorLibrary --threads=16 --nodatealias /var/www/library.kiwix.org/library.kiwix.org.xml
root     13923  0.0  0.0      0     0 ?        Z    04:19   0:06 [kiwix-serve] <defunct>
root     14456  1.1  0.0      0     0 ?        Z    04:21   5:21 [kiwix-serve] <defunct>
root     15077  0.0  0.0      0     0 ?        Z    08:40   0:04 [kiwix-serve] <defunct>
root     15643  0.2  0.0      0     0 ?        Z    08:42   0:28 [kiwix-serve] <defunct>
root     40612  1.4  0.0      0     0 ?        Z    Mar24  14:41 [kiwix-serve] <defunct>
veloman-yunkan commented 2 years ago

My analysis of kiwix/maintenance/library-docker/bin/start.sh and kiwix/maintenance/library-docker/bin/restart-kiwix-serve.sh suggests that the zombie processes of kiwix-serve have nothing to do with its reaction to termination requests.

In library-docker/bin/start.sh cron is configured to execute restart-kiwix-serve.sh every minute without any arguments. That command only checks if kiwix-serve is alive and, if so, does nothing. Otherwise, it starts a new instance of kiwix-serve.

Attempts to restart (i.e. terminate the running instance of kiwix-serve and start a new one) are made only if restart-kiwix-serve.sh is passed the restart command line argument. This is configured only for the start-up of the container and is meaningless since on start-up no kiwix-serve should be expected to be already running.

Therefore zombie/\<defunct> processes of kiwix-serve appear independently from restart-kiwix-serve.sh (which simply recovers from that situation by launching kiwix-serve again). Why kiwix-serve goes \<defunct> has to be debugged.

kelson42 commented 2 years ago

My analysis of kiwix/maintenance/library-docker/bin/start.sh and kiwix/maintenance/library-docker/bin/restart-kiwix-serve.sh suggests that the zombie processes of kiwix-serve have nothing to do with its reaction to termination requests.

This is probably wrong, here is something I made live on library.kiwix.org (unfortunately I can not reproduce localy)

$ docker-compose up --detach --force-recreate library
Recreating library ... done
$ date
Sat Mar 26 10:29:08 CET 2022
$ ps aux | grep kiwix-serve
root      6157  0.0  0.0  12752   948 pts/0    S+   10:29   0:00 grep kiwix-serve
root     36150 19.8  0.8 3991372 561872 ?      Sl   10:28   0:06 kiwix-serve --daemon --port=8000 --library --monitorLibrary --threads=16 --nodatealias /var/www/library.kiwix.org/library.kiwix.org.xml
$ date
Sat Mar 26 10:29:45 CET 2022
$ kill -TERM 36150
$ ps aux | grep kiwix-serve
root      7022  0.0  0.0  12752   964 pts/0    S+   10:30   0:00 grep kiwix-serve
root     36150  6.8  0.0      0     0 ?        Z    10:28   0:07 [kiwix-serve] <defunct>

But your argumentation/analysis is proper: since we use th e --monitorLibrary, kiwix-serve is not started arbitrarly anymore from this script (beside at container start). That said, the kill -TERM still generates such <defunc> processes. Therefore, either their is an other root cause or something else kills the process. At least I believe, using kill -TERM might be an interesting "artificial" way to reproduce the bug and diagnosis it.

veloman-yunkan commented 2 years ago

here is something I made live on library.kiwix.org (unfortunately I can not reproduce localy)

$ docker-compose up --detach --force-recreate library
Recreating library ... done
$ date
Sat Mar 26 10:29:08 CET 2022
$ ps aux | grep kiwix-serve
root      6157  0.0  0.0  12752   948 pts/0    S+   10:29   0:00 grep kiwix-serve
root     36150 19.8  0.8 3991372 561872 ?      Sl   10:28   0:06 kiwix-serve --daemon --port=8000 --library --monitorLibrary --threads=16 --nodatealias /var/www/library.kiwix.org/library.kiwix.org.xml
$ date
Sat Mar 26 10:29:45 CET 2022
$ kill -TERM 36150
$ ps aux | grep kiwix-serve
root      7022  0.0  0.0  12752   964 pts/0    S+   10:30   0:00 grep kiwix-serve
root     36150  6.8  0.0      0     0 ?        Z    10:28   0:07 [kiwix-serve] <defunct>

Do you kill a process belonging to a docker container from outside the container?

kelson42 commented 2 years ago

@veloman-yunkan This is the case here, yes.

veloman-yunkan commented 2 years ago

It look like the problem belongs to the parent process of kiwix-serve. See https://linuxreviews.org/Defunct_process

Unix manages an explicit parent-child relationships between processes (Windows does not do this).

When a child process dies, the parent process recieves a notification. It is then the duty of the parent process to explicitly take notice of the childs demise by using the wait() system call.

The return value of the wait() is the process ID of the child, which gives the parent exact control about which of its children are still alive. Upon returning, wait() will have set the integer pointed to by its argument to the exit status of the child. A shell programm like "bash" could then decide how to process following commands and set the special $? variable accordingly.

As long as the parent hasn't called wait(), the system needs to keep the dead child in the global process list, because that's the only place where the process ID is stored. The purpose of the "zombies" is really just for the system to remember the process ID, so that it can inform the parent process about it on request.

If the parent "forgets" to collect on its children, then the zombie will stay undead forever.

Well, almost forever. If the parent itself dies, then "init" (the system process with the ID 0) will take over fostership over its children and catch up on the neglected parental duties. This is why you need to identify the parent process and stop or restart it in order to get rid of defunct processes. If the zombie process has id nnnnn, you can do ps -ef | grep nnnnn and find the id of the parent process, which you can then kill if no longer needed. Then the defunct process will be removed from the list.

@kelson42 Will you please find out the parent process of the defunct kiwix-serve processes using the ps -ef command?

kelson42 commented 2 years ago
$ ps -ef | grep kiwix-serve
root       852 35156  0 10:45 ?        00:00:06 [kiwix-serve] <defunct>
root       966 35156  1 10:47 ?        00:04:26 [kiwix-serve] <defunct>
root      3331 35156  0 Mar26 ?        00:01:53 [kiwix-serve] <defunct>
root      7583 35156  0 Mar26 ?        00:02:24 [kiwix-serve] <defunct>
root      7632 35156  0 Mar26 ?        00:06:35 [kiwix-serve] <defunct>
kelson   21314 21143  0 15:49 pts/2    00:00:00 grep kiwix-serve
root     22859 35156  0 00:23 ?        00:00:06 [kiwix-serve] <defunct>
root     23620 35156  1 00:26 ?        00:12:09 [kiwix-serve] <defunct>
root     35907 35156  0 Mar26 ?        00:03:12 [kiwix-serve] <defunct>
root     36150 35156  0 Mar26 ?        00:00:07 [kiwix-serve] <defunct>
root     39560 35156  0 Mar26 ?        00:01:32 [kiwix-serve] <defunct>
root     43009 35156  0 14:38 ?        00:00:04 [kiwix-serve] <defunct>
root     43192 35156  2 14:40 ?        00:01:47 kiwix-serve --daemon --port=8000 --library --monitorLibrary --threads=16 --nodatealias /var/www/library.kiwix.org/library.kiwix.org.xml

and

$ ps aux | grep 35156
kelson   21660  0.0  0.0  12752   904 pts/2    S+   15:51   0:00 grep 35156
systemd+ 35156  0.0  0.0  18604  1428 ?        SLs  Mar26   0:05 /usr/sbin/varnishd -F -a :80 -T localhost:6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,256m
kelson42 commented 2 years ago

This problem seems śtrongly tight to docker. Looks like somehow the PPID of kiwix-serve is the Varnish process in place of ths bourne shell process to me. Might be a duplicate of https://github.com/kiwix/maintenance/issues/204.

veloman-yunkan commented 2 years ago

I think that everything is correct. When started in --daemon mode kiwix-serve detaches itself from the parent shell process and becomes a child of the init process (which in this container is varnishd)

kelson42 commented 2 years ago

@veloman-yunkan So, what is wrong to your opinion?

veloman-yunkan commented 2 years ago

I guess the problem is that we don't have a proper init process in our container (dumbinit or tini, see https://stackoverflow.com/questions/37374310/how-critical-is-dumb-init-for-docker). It would be needed even if kiwix/maintenance#204 is implemented.

kelson42 commented 2 years ago

@veloman-yunkan Thank you for your analysis on this. @rgaudin @mgautierfr Should we close this ticket in favour of https://github.com/kiwix/maintenance/issues/204 (as duplicate)?

mgautierfr commented 2 years ago

You should use dumbinit (or tini) as said by @veloman-yunkan and how it was made in https://github.com/kiwix/kiwix-tools/pull/489

If the parent itself dies, then "init" (the system process with the ID 0) will take over fostership over its children and catch up on the neglected parental duties

In our case varnish is the PID 0 and it becomes parent of kiwix-serve process (and never wait() for it as it is not a init process).

kiwix/maintenance#204 is not related to this issue.

rgaudin commented 2 years ago

Closing this as we're not using this image anymore. New container(s) uses plain kiwix-tools image with kiwix-serve directly (through its dumb-init) and we're not killing it manually anymore.