torrust / torrust-demo

CI and configuration for the Torrust live demo.
https://index.torrust-demo.com

There are a lot of zombie processes #2

Closed by josecelano 5 months ago

josecelano commented 6 months ago

Relates to: https://github.com/torrust/torrust-demo/issues/1

I'm trying to fix this issue on the live demo server. The tracker container restarts every 2 hours because of the healthcheck. I'm still trying to figure out what is happening. However, I've noticed a lot of zombie processes. This may or may not be related to the periodic restart.
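For context, Docker keeps the most recent healthcheck results for each container, so the failing check can be inspected with something like this (the container name tracker is taken from the docker ps output below):

docker inspect --format '{{json .State.Health}}' tracker

The output includes the current status, the failing streak, and the stdout/stderr of the last few healthcheck runs.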

A few minutes after restarting the server, you already see a lot of zombie processes.

This is the server 3 hours after restarting the tracker container:

docker ps
CONTAINER ID   IMAGE                       COMMAND                  CREATED       STATUS                   PORTS                                                                                                                                       NAMES
ef72b037bf26   nginx:mainline-alpine       "/docker-entrypoint.…"   3 hours ago   Up 3 hours               0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp                                                                    proxy
d7618a22d425   torrust/index-gui:develop   "/usr/local/bin/entr…"   3 hours ago   Up 3 hours (unhealthy)   0.0.0.0:3000->3000/tcp, :::3000->3000/tcp                                                                                                   index-gui
3f34f41514bb   torrust/index:develop       "/usr/local/bin/entr…"   3 hours ago   Up 3 hours (healthy)     0.0.0.0:3001->3001/tcp, :::3001->3001/tcp                                                                                                   index
e938bf65ea02   torrust/tracker:develop     "/usr/local/bin/entr…"   3 hours ago   Up 3 hours (unhealthy)   0.0.0.0:1212->1212/tcp, :::1212->1212/tcp, 0.0.0.0:7070->7070/tcp, :::7070->7070/tcp, 1313/tcp, 0.0.0.0:6969->6969/udp, :::6969->6969/udp   tracker

As you can see, the tracker is unhealthy. Running top gives this output:

top - 15:06:45 up 21:41,  1 user,  load average: 9.53, 10.18, 10.44
Tasks: 212 total,   4 running, 121 sleeping,   0 stopped,  87 zombie
%Cpu(s):  3.0 us, 90.8 sy,  0.0 ni,  0.0 id,  1.0 wa,  0.0 hi,  4.6 si,  0.7 st
MiB Mem :    957.4 total,     80.0 free,    834.5 used,     43.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.     25.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 601160 root      20   0 1022516   2644      0 R  41.0   0.3   0:10.81 snapd
     85 root      20   0       0      0      0 S  17.7   0.0 172:28.45 kswapd0
     13 root      20   0       0      0      0 S   7.5   0.0  23:28.68 ksoftirqd/0
    709 root      20   0 1546016  38040      0 S   4.3   3.9 100:53.91 dockerd
 494855 torrust   20   0  573052  12776      0 S   4.3   1.3   9:06.97 torrust-index
 601209 root      20   0  724908   5716      0 R   3.9   0.6   0:01.20 node
 494706 torrust   20   0  815552 538000      0 S   3.6  54.9  21:01.63 torrust-tracker
    655 root      20   0 1357240  18052      0 S   3.3   1.8  13:14.69 containerd
 494683 root      20   0  719640   3568      0 S   3.3   0.4  13:31.96 containerd-shim
 601255 root      20   0 1237648   2796      0 S   2.6   0.3   0:00.17 runc

There are 87 zombie processes, but I've seen more in other cases. That output was captured when the server was already overloaded and kswapd0 was constantly reclaiming memory. Before reaching that point, you get an output like this:

top -U torrust

top - 14:59:08 up 21:33,  1 user,  load average: 13.99, 13.41, 11.21
Tasks: 184 total,   5 running, 116 sleeping,   0 stopped,  63 zombie
%Cpu(s): 14.6 us, 73.5 sy,  0.0 ni,  0.0 id,  4.6 wa,  0.0 hi,  7.0 si,  0.3 st
MiB Mem :    957.4 total,     79.9 free,    823.5 used,     54.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.     30.6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 494706 torrust   20   0  815552 538000      0 S   7.6  54.9  20:33.65 torrust-tracker
 495006 torrust   20   0   21.0g  30616      0 S   0.3   3.1   0:34.54 node
 599470 torrust   20   0   11040   3136   2244 R   0.3   0.3   0:00.07 top
 598211 torrust   20   0   17068   2580    856 S   0.0   0.3   0:00.29 systemd
 598212 torrust   20   0  169404   4000      0 S   0.0   0.4   0:00.00 (sd-pam)
 598290 torrust   20   0   17224   3100    548 S   0.0   0.3   0:01.17 sshd
 598291 torrust   20   0    9980   4656   2108 S   0.0   0.5   0:00.83 bash

Comparing both outputs, you can see how the number of zombie processes keeps increasing.
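A quick way to track just the zombie count over time is to poll the process state column, for example:

watch -n 60 "ps -eo stat= | grep -c '^Z'"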

I have also listed the zombie processes:

ps aux | grep Z

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      589711  0.1  0.0      0     0 ?        Z    14:19   0:04 [node] <defunct>
root      589967  0.0  0.0      0     0 ?        Z    14:23   0:00 [node] <defunct>
root      590976  0.0  0.0      0     0 ?        Z    14:25   0:00 [node] <defunct>
root      591046  0.0  0.0      0     0 ?        Z    14:25   0:00 [node] <defunct>
root      591115  0.0  0.0      0     0 ?        Z    14:26   0:00 [node] <defunct>
root      591182  0.0  0.0      0     0 ?        Z    14:26   0:00 [node] <defunct>
root      591231  0.0  0.0      0     0 ?        Z    14:26   0:00 [http_health_che] <defunct>
root      591255  0.0  0.0      0     0 ?        Z    14:26   0:00 [node] <defunct>
root      591360  0.0  0.0      0     0 ?        Z    14:27   0:00 [node] <defunct>
root      591644  0.0  0.0      0     0 ?        Z    14:28   0:00 [node] <defunct>
root      591727  0.0  0.0      0     0 ?        Z    14:28   0:00 [node] <defunct>
root      591867  0.0  0.0      0     0 ?        Z    14:28   0:00 [node] <defunct>
root      591938  0.0  0.0      0     0 ?        Z    14:29   0:00 [node] <defunct>
root      592020  0.0  0.0      0     0 ?        Z    14:29   0:01 [node] <defunct>
root      592103  0.0  0.0      0     0 ?        Z    14:29   0:00 [node] <defunct>
root      592455  0.0  0.0      0     0 ?        Z    14:30   0:00 [node] <defunct>
root      592528  0.0  0.0      0     0 ?        Z    14:30   0:00 [node] <defunct>
root      593183  0.0  0.0      0     0 ?        Z    14:32   0:00 [node] <defunct>
root      593263  0.0  0.0      0     0 ?        Z    14:33   0:00 [node] <defunct>
root      593704  0.0  0.0      0     0 ?        Z    14:34   0:00 [node] <defunct>
root      593777  0.0  0.0      0     0 ?        Z    14:34   0:00 [node] <defunct>
root      594501  0.0  0.0      0     0 ?        Z    14:36   0:00 [node] <defunct>
root      594891  0.0  0.0      0     0 ?        Z    14:37   0:00 [node] <defunct>
root      595260  0.0  0.0      0     0 ?        Z    14:39   0:00 [node] <defunct>
root      595404  0.0  0.0      0     0 ?        Z    14:39   0:00 [node] <defunct>
root      595494  0.0  0.0      0     0 ?        Z    14:39   0:00 [node] <defunct>
root      595563  0.0  0.0      0     0 ?        Z    14:40   0:00 [node] <defunct>
root      595641  0.0  0.0      0     0 ?        Z    14:40   0:00 [node] <defunct>
root      595664  0.0  0.0      0     0 ?        Z    14:40   0:00 [http_health_che] <defunct>
root      595708  0.0  0.0      0     0 ?        Z    14:40   0:01 [node] <defunct>
root      595782  0.1  0.0      0     0 ?        Z    14:40   0:01 [node] <defunct>
root      595856  0.0  0.0      0     0 ?        Z    14:41   0:00 [node] <defunct>
root      595928  0.0  0.0      0     0 ?        Z    14:41   0:00 [node] <defunct>
root      595999  0.0  0.0      0     0 ?        Z    14:41   0:00 [node] <defunct>
root      596068  0.0  0.0      0     0 ?        Z    14:42   0:00 [node] <defunct>
root      596135  0.1  0.0      0     0 ?        Z    14:42   0:01 [node] <defunct>
root      596207  0.0  0.0      0     0 ?        Z    14:42   0:01 [node] <defunct>
root      596278  0.1  0.0      0     0 ?        Z    14:43   0:01 [node] <defunct>
root      596323  0.0  0.0      0     0 ?        Z    14:43   0:00 [health_check] <defunct>
root      596325  0.0  0.0      0     0 ?        Z    14:43   0:00 [http_health_che] <defunct>
root      596350  0.1  0.0      0     0 ?        Z    14:43   0:01 [node] <defunct>
root      596421  0.0  0.0      0     0 ?        Z    14:44   0:00 [node] <defunct>
root      596488  0.0  0.0      0     0 ?        Z    14:44   0:00 [node] <defunct>
root      596555  0.0  0.0      0     0 ?        Z    14:44   0:00 [node] <defunct>
root      596693  0.0  0.0      0     0 ?        Z    14:45   0:00 [node] <defunct>
root      596761  0.0  0.0      0     0 ?        Z    14:45   0:00 [node] <defunct>
root      596833  0.1  0.0      0     0 ?        Z    14:45   0:01 [node] <defunct>
root      596911  0.3  0.0      0     0 ?        Z    14:46   0:03 [node] <defunct>
root      597029  0.1  0.0      0     0 ?        Z    14:47   0:01 [node] <defunct>
root      597099  0.1  0.0      0     0 ?        Z    14:47   0:00 [node] <defunct>
root      597164  0.1  0.0      0     0 ?        Z    14:47   0:00 [node] <defunct>
root      597234  0.1  0.0      0     0 ?        Z    14:47   0:01 [node] <defunct>
root      597302  0.2  0.0      0     0 ?        Z    14:48   0:01 [node] <defunct>
root      597375  0.0  0.0      0     0 ?        Z    14:48   0:00 [node] <defunct>
root      597443  0.1  0.0      0     0 ?        Z    14:49   0:00 [node] <defunct>
root      597475  0.0  0.0      0     0 ?        Z    14:49   0:00 [health_check] <defunct>
root      597493  0.4  0.0      0     0 ?        Z    14:49   0:03 [node] <defunct>
root      597567  0.3  0.0      0     0 ?        Z    14:50   0:02 [node] <defunct>
root      597620  0.2  0.0      0     0 ?        Z    14:51   0:01 [node] <defunct>
root      597693  0.2  0.0      0     0 ?        Z    14:51   0:01 [node] <defunct>
root      597735  0.0  0.0      0     0 ?        Z    14:51   0:00 [health_check] <defunct>
root      597742  0.0  0.0      0     0 ?        Z    14:51   0:00 [http_health_che] <defunct>
root      597762  0.1  0.0      0     0 ?        Z    14:52   0:00 [node] <defunct>
root      599434  2.3  0.0      0     0 ?        Z    14:58   0:01 [node] <defunct>
root      599527  0.2  0.0      0     0 ?        Z    14:59   0:00 [health_check] <defunct>
root      599593  1.3  0.0      0     0 ?        Z    14:59   0:00 [node] <defunct>
root      599745  3.6  0.0      0     0 ?        Z    14:59   0:00 [node] <defunct>

Those processes are children of the main torrust-tracker, torrust-index, and index-gui processes.

ps -eo pid,ppid,state,command | grep Z

 589711  495006 Z [node] <defunct>
 589967  495006 Z [node] <defunct>
 590976  495006 Z [node] <defunct>
 591046  495006 Z [node] <defunct>
 591115  495006 Z [node] <defunct>
 591182  495006 Z [node] <defunct>
 591231  494706 Z [http_health_che] <defunct>
 591255  495006 Z [node] <defunct>
 591360  495006 Z [node] <defunct>
 591644  495006 Z [node] <defunct>
 591727  495006 Z [node] <defunct>
 591867  495006 Z [node] <defunct>
 591938  495006 Z [node] <defunct>
 592020  495006 Z [node] <defunct>
 592103  495006 Z [node] <defunct>
 592455  495006 Z [node] <defunct>
 592528  495006 Z [node] <defunct>
 593183  495006 Z [node] <defunct>
 593263  495006 Z [node] <defunct>
 593704  495006 Z [node] <defunct>
 593777  495006 Z [node] <defunct>
 594501  495006 Z [node] <defunct>
 594891  495006 Z [node] <defunct>
 595260  495006 Z [node] <defunct>
 595404  495006 Z [node] <defunct>
 595494  495006 Z [node] <defunct>
 595563  495006 Z [node] <defunct>
 595641  495006 Z [node] <defunct>
 595664  494706 Z [http_health_che] <defunct>
 595708  495006 Z [node] <defunct>
 595782  495006 Z [node] <defunct>
 595856  495006 Z [node] <defunct>
 595928  495006 Z [node] <defunct>
 595999  495006 Z [node] <defunct>
 596068  495006 Z [node] <defunct>
 596135  495006 Z [node] <defunct>
 596207  495006 Z [node] <defunct>
 596278  495006 Z [node] <defunct>
 596323  494855 Z [health_check] <defunct>
 596325  494706 Z [http_health_che] <defunct>
 596350  495006 Z [node] <defunct>
 596421  495006 Z [node] <defunct>
 596488  495006 Z [node] <defunct>
 596555  495006 Z [node] <defunct>
 596693  495006 Z [node] <defunct>
 596761  495006 Z [node] <defunct>
 596833  495006 Z [node] <defunct>
 596911  495006 Z [node] <defunct>
 597029  495006 Z [node] <defunct>
 597099  495006 Z [node] <defunct>
 597164  495006 Z [node] <defunct>
 597234  495006 Z [node] <defunct>
 597302  495006 Z [node] <defunct>
 597375  495006 Z [node] <defunct>
 597443  495006 Z [node] <defunct>
 597475  494855 Z [health_check] <defunct>
 597493  495006 Z [node] <defunct>
 597567  495006 Z [node] <defunct>
 597620  495006 Z [node] <defunct>
 597693  495006 Z [node] <defunct>
 597735  494855 Z [health_check] <defunct>
 597742  494706 Z [http_health_che] <defunct>
 597762  495006 Z [node] <defunct>
 599434  495006 Z [node] <defunct>
 599527  494855 Z [health_check] <defunct>
 599593  495006 Z [node] <defunct>
 599745  495006 Z [node] <defunct>
 599872  495006 Z [node] <defunct>
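To see at a glance which parents are accumulating the zombies, the same information can be aggregated by parent PID, for example:

ps -e -o ppid= -o stat= | awk '$2 ~ /^Z/ {z[$1]++} END {for (p in z) print p, z[p]}'

Given the listing above, that would show almost all of them hanging from PID 495006 (node), plus a few under 494706 (torrust-tracker) and 494855 (torrust-index).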

These are the parent processes:

ps -o pid,ppid,cmd -p 494706
    PID    PPID CMD
 494706  494683 /usr/bin/torrust-tracker
ps -o pid,ppid,cmd -p 494855
    PID    PPID CMD
 494855  494833 /usr/bin/torrust-index
ps -o pid,ppid,cmd -p 495006
    PID    PPID CMD
 495006  494983 /nodejs/bin/node /app/.output/server/index.mjs
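To cross-check which container each of these parents belongs to, docker top can be used with the container names from the docker ps listing above:

docker top tracker
docker top index
docker top index-gui

Each command lists the processes Docker associates with that container, together with their host PIDs.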

In the past, we had a similar problem, and it was solved by adding timeouts. That could be the reason for the healthcheck zombies, but I'm not sure. However, most of the zombies come from the node webserver (495006 494983 /nodejs/bin/node /app/.output/server/index.mjs). I guess the webserver is spawning child processes to handle requests, but it is not reaping them when they finish.
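Side note: independent of the root cause, Docker can run a minimal init process as PID 1 inside each container so that orphaned children are reaped automatically, for example:

docker run --init torrust/tracker:develop   # plus the usual ports/volumes/env from the compose file

(Docker Compose has the equivalent per-service option init: true.) That would only hide the symptom, though; it would not explain why these children are not being reaped in the first place.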

josecelano commented 6 months ago

A zombie process, also known as a defunct process, occurs in a Unix-like operating system when a process finishes execution but its entry remains in the process table. In simpler terms, it's a process that has completed its execution but still has an entry in the process table because its parent process hasn't yet retrieved its exit status.

When a process finishes its execution, it typically sends an exit status to its parent process, indicating its completion. The parent process is then responsible for reading this exit status via system calls like wait() or waitpid(). Once the parent process retrieves the exit status, the zombie process is removed from the process table, and its resources are released.

However, if the parent process fails to retrieve the exit status of its child processes (perhaps because it's busy with other tasks or has terminated without cleaning up its child processes), the child process enters a zombie state. In this state, the process table entry remains, but the process itself is essentially defunct; it occupies virtually no system resources, except for its entry in the process table.

Zombie processes are usually harmless by themselves and don't consume significant system resources. However, having too many zombie processes can indicate a problem with process management, such as a bug in the parent process or a resource exhaustion issue. Therefore, while individual zombie processes are not a cause for concern, a large number of them may require investigation and remediation.

The explanation above is from ChatGPT.

I think we should check that the healthcheck binaries exit correctly in all cases. However, it looks like, in this case, the reason could be that the parent process "fails to retrieve the exit status of its child processes".
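For reference, the <defunct> state is easy to reproduce with a parent that never calls wait(); this is just a harmless sketch in plain shell:

(sleep 2 & exec sleep 60) &
sleep 3; ps -o pid,ppid,stat,comm --ppid $!

The subshell spawns sleep 2 in the background and then execs into sleep 60, so the long-lived process stays the parent of the short-lived one but never reaps it. After a couple of seconds, ps shows the child in state Z, just like the node and healthcheck children above, and it stays there until the parent exits and PID 1 reaps the orphan.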