krallin / tini

A tiny but valid `init` for containers
MIT License

not reaping zombie or defunct child processes #164

Open bingerambo opened 3 years ago

bingerambo commented 3 years ago

In a container, I run a Python process as a child process of tini, using the following steps:

  1. Run the container from the Dockerfile.

os info:

[root@node3 ~]# uname -a
Linux node3 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@node3 ~]#
[root@node3 ~]#
[root@node3 ~]#
[root@node3 ~]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

docker version:

[root@node3 ~]# docker version
Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77156
 Built:             Sat May  4 02:34:58 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:19:08 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Dockerfile:

ADD tini /tini
RUN chmod +x /tini
ENV PYTHONPATH=/root/tf/models
WORKDIR /examples

ENTRYPOINT ["/tini", "-g", "-w", "-vvv", "--", "bash"]
CMD ["-c","/usr/bin/python ~/tf/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py  --batch_size=64 --model=official_resnet18 --optimizer=momentum  --num_gpus=2 --num_epochs=1 --weight_decay=1e-4 --data_dir=/tmp/imagenet"]

process status:

tini: PID 1, python: PID 8

root@deee30ff32b4:/examples# ps axjf
  PPID    PID   PGID    SID TTY       TPGID STAT   UID   TIME COMMAND
     0    397    397    397 pts/1       410 Ss       0   0:00 bash
   397    410    410    397 pts/1       410 R+       0   0:00  \_ ps axjf
     0      1      1      1 pts/0         8 Ss       0   0:00 /tini -g -w -vvv -- bash -c /usr/bin/python ~/tf/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py  --batch_size=64 --model=official_resnet
     1      8      8      1 pts/0         8 Sl+      0   2:31 /usr/bin/python /root/tf/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --batch_size=64 --model=official_resnet18 --optimizer=momentum -
  2. Kill the python process.

Kill the python process (the child of tini, PID 8):

root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples# kill -9 8
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples# ps axjf
  PPID    PID   PGID    SID TTY       TPGID STAT   UID   TIME COMMAND
     0    397    397    397 pts/1       411 Ss       0   0:00 bash
   397    411    411    397 pts/1       411 R+       0   0:00  \_ ps axjf
     0      1      1      1 pts/0         8 Ss       0   0:00 /tini -g -w -vvv -- bash -c /usr/bin/python ~/tf/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py  --batch_size=64 --model=official_resnet
     1      8      8      1 pts/0         8 Zl+      0   4:30 [tf_cnn_benchmar] <defunct>
  3. A zombie process appears: the python process is defunct but is not reaped by its parent process (tini), so it stays a zombie.

tini and python output:

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0810 01:44:12.306231 140481052276544 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
[TRACE tini (1)] No child to reap
Initializing graph
WARNING:tensorflow:From /root/tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0810 01:44:13.657472 140481052276544 deprecation.py:323] From /root/tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2020-08-10 01:44:13.911021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2020-08-10 01:44:13.911240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-10 01:44:13.911262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1
2020-08-10 01:44:13.911272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y
2020-08-10 01:44:13.911358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N
2020-08-10 01:44:13.911696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0)
2020-08-10 01:44:13.912073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:88:00.0, compute capability: 7.0)
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
INFO:tensorflow:Running local_init_op.
I0810 01:44:17.589881 140481052276544 session_manager.py:491] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0810 01:44:17.782803 140481052276544 session_manager.py:493] Done running local_init_op.
Running warm up
[TRACE tini (1)] No child to reap
2020-08-10 01:44:18.530569: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[TRACE tini (1)] No child to reap
[... the "[TRACE tini (1)] No child to reap" line repeats continuously ...]

Now I cannot remove the zombie process; only rebooting the machine clears it.

When the python process is hung, tini does not reap the child process, and the defunct python process stays a zombie.

I'm not sure which of the following is the reason (see the sketch after this list):

  1. tini did not receive the SIGCHLD signal from the defunct child process. Did tini fail to handle SIGCHLD?
  2. As a child process of tini, did the python process never send SIGCHLD to tini? Maybe it is hung because of a hardware or program problem.
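
For reference, here is a rough Python sketch of the reaping pattern an init-style process relies on (an illustration only, not tini's actual C implementation): the SIGCHLD handler is just a wake-up, and the real work is a waitpid(-1, WNOHANG) loop, so an already-exited child is collected even if a SIGCHLD were missed. The repeated "No child to reap" trace therefore suggests the kernel does not yet consider the child fully exited.

import os
import signal
import time

def reap_exited_children():
    # Call waitpid(-1, WNOHANG) repeatedly until no exited child is reported.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break                      # no children left at all
        if pid == 0:
            break                      # children exist, but none of them has exited yet
        print(f"reaped child {pid}, wait status {status}")

# SIGCHLD only wakes the loop up; the reaping itself is done by waitpid,
# so a missed SIGCHLD is recovered on the next pass through the loop.
signal.signal(signal.SIGCHLD, lambda signum, frame: None)

child = os.fork()
if child == 0:
    os.execvp("sleep", ["sleep", "2"])  # stand-in for the real workload

for _ in range(5):                      # poll a few times, like an init main loop
    reap_exited_children()
    time.sleep(1)

On a Linux host this sketch reaps the sleep child after roughly two seconds and prints the reaped PID.
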
yosifkit commented 3 years ago

Having to use kill -9 is a red flag to me. See also https://serverfault.com/a/76296/58240
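
A minimal sketch of the graceful-first pattern that answer argues for, written in Python and assuming the target PID is known; the helper name stop_gracefully is hypothetical:

import os
import signal
import time

def stop_gracefully(pid, grace_seconds=10.0):
    # Ask politely with SIGTERM first; escalate to SIGKILL only as a last resort.
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)            # signal 0 only probes whether the process still exists
        except ProcessLookupError:
            return True                # it exited on its own
        time.sleep(0.5)
    os.kill(pid, signal.SIGKILL)       # last resort; the process gets no chance to clean up
    return False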

krallin commented 3 years ago

Do you mind sharing all the lines of output from Tini (from the point at which you start your process until the point at which you're stuck)? You can exclude the other output, but please include all the tini lines. There should at least be some additional output, starting with an info line reporting what Tini spawned.

Note that the point at which you're stuck is basically a loop where Tini asks the kernel, "do I have any children that have exited?", and the kernel answers, "you do not".

What happens if you send SIGTERM to Tini? If nothing happens, what if you send it SIGKILL? Do the processes get torn down?
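
The check described above corresponds to a non-blocking waitpid; a tiny Python stand-in (not tini's actual C call), just to make the "you do not" answer concrete:

import os

# Non-blocking "do I have any children that have exited?" check.
try:
    pid, status = os.waitpid(-1, os.WNOHANG)
    if pid == 0:
        print("children exist, but none has exited yet -> 'No child to reap'")
    else:
        print(f"child {pid} has exited and was just reaped, status {status}")
except ChildProcessError:
    print("this process has no children at all")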

bingerambo commented 3 years ago

> Do you mind sharing all the lines of output from Tini (from the point at which you start your process until the point at which you're stuck)? You can exclude the other output, but please include all the tini lines. There should at least be some additional output, starting with an info line reporting what Tini spawned.
>
> Note that the point at which you're stuck is basically a loop where Tini asks the kernel, "do I have any children that have exited?", and the kernel answers, "you do not".
>
> What happens if you send SIGTERM to Tini? If nothing happens, what if you send it SIGKILL? Do the processes get torn down?

@krallin I am sorry that I did not save the complete log information. The python program runs the TensorFlow framework with NVIDIA GPU cards to train a deep learning job. I saved the log information from the problem context and the /proc/<pid>/task/<tid> status, so I expect they may be useful.

  1. The python program started and then blocked. python and tini output:
    
    .............................................

Initializing graph
WARNING:tensorflow:From /root/tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0810 01:44:13.657472 140481052276544 deprecation.py:323] From /root/tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2020-08-10 01:44:13.911021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2020-08-10 01:44:13.911240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-10 01:44:13.911262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1
2020-08-10 01:44:13.911272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y
2020-08-10 01:44:13.911358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N
2020-08-10 01:44:13.911696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0)
2020-08-10 01:44:13.912073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:88:00.0, compute capability: 7.0)
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
INFO:tensorflow:Running local_init_op.
I0810 01:44:17.589881 140481052276544 session_manager.py:491] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0810 01:44:17.782803 140481052276544 session_manager.py:493] Done running local_init_op.
Running warm up
[TRACE tini (1)] No child to reap
2020-08-10 01:44:18.530569: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[TRACE tini (1)] No child to reap
[... the "No child to reap" trace line repeats ...]
.............................................


2. Run the docker stop command to stop the python program.

[root@node3 ~]# docker stop e99a157e24f8


3. tini output: tini received the SIGTERM signal and passed it to the child. After that, tini kept printing "No child to reap" all the time (a rough sketch of this forwarding step follows the log below).

.............................................

[TRACE tini (1)] No child to reap
[... the "No child to reap" trace line repeats ...]
[DEBUG tini (1)] Passing signal: 'Terminated'
[TRACE tini (1)] No child to reap
[... the "No child to reap" trace line repeats ...]

.............................................
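
Below is a rough Python sketch of the "pass the signal on" step; with -g, the signal goes to the child's process group. This is only an illustration under my own assumptions, not tini's implementation.

import os
import signal

# Spawn the child in its own process group, so a group-wide signal (-g style)
# reaches the whole group and not only the immediate child.
child_pid = os.fork()
if child_pid == 0:
    os.setpgid(0, 0)
    os.execvp("sleep", ["sleep", "60"])   # stand-in for the real workload

try:
    os.setpgid(child_pid, child_pid)      # mirror the child's setpgid to avoid a race
except PermissionError:
    pass                                  # the child already exec'd; its pgid is set

def forward(signum, frame):
    # Roughly what the "Passing signal: 'Terminated'" debug line corresponds to.
    print(f"Passing signal: {signal.Signals(signum).name}")
    os.killpg(os.getpgid(child_pid), signum)

for sig in (signal.SIGTERM, signal.SIGINT):
    signal.signal(sig, forward)

os.waitpid(child_pid, 0)                  # wait for the child to actually exit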


4. The python process info, /proc/<pid>/task/<tid> details:
The defunct python process (its name was tf_cnn_benchmar) had 2 threads:
thread taskid 175308, status running
thread taskid 174982, status zombie
Why does the defunct process still contain a running thread?

top - 09:51:14 up 1 day, 1:16, 1 user, load average: 1.21, 1.24, 1.32
Threads: 2 total, 1 running, 0 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.1 us, 1.1 sy, 0.0 ni, 98.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 79106937+total, 72575788+free, 17352112 used, 47959384 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 76956729+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
175308 root      20   0       0      0      0 R  99.9  0.0 157:58.34 tf_cnn_benchmar
174982 root      20   0       0      0      0 Z   0.0  0.0   0:06.41 tf_cnn_benchmar
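
To enumerate the per-thread states directly, something like the small sketch below can be used; the PID is hardcoded from this report purely for illustration (whether 174982 is the thread-group leader here is my assumption), so adjust as needed.

import os

# List the state of every thread in the thread group via /proc/<pid>/task/<tid>/stat.
pid = 174982                                  # PID of the defunct process in this report
task_dir = f"/proc/{pid}/task"
for tid in sorted(os.listdir(task_dir), key=int):
    with open(f"{task_dir}/{tid}/stat") as f:
        fields = f.read().split()
    # /proc stat layout: pid (comm) state ... -> the third field is the state
    # (R running, S sleeping, D uninterruptible sleep, Z zombie, ...).
    print(f"tid {tid}: comm {fields[1]}, state {fields[2]}")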


Zombie thread /proc/<pid>/task/<tid>/stack info:

[root@node3 174982]# cat stack
[] do_exit+0x6bb/0xa40
[] do_group_exit+0x3f/0xa0
[] get_signal_to_deliver+0x1ce/0x5e0
[] do_signal+0x57/0x6e0
[] do_notify_resume+0x72/0xc0
[] int_signal+0x12/0x17
[] 0xffffffffffffffff


Running thread /proc/<pid>/task/<tid>/stack info:

[root@node3 175308]# cat stack
[] uvm_spin_loop+0xc2/0x100 [nvidia_uvm]
[] uvm_tracker_wait+0x8d/0x1a0 [nvidia_uvm]
[] uvm_page_tree_wait+0x1d/0x30 [nvidia_uvm]
[] uvm_page_table_range_vec_init+0x158/0x1d0 [nvidia_uvm]
[] uvm_va_range_map_rm_allocation+0x157/0x310 [nvidia_uvm]
[] uvm_map_external_allocation_on_gpu+0x1b2/0x230 [nvidia_uvm]
[] uvm_api_map_external_allocation+0x27b/0x4c0 [nvidia_uvm]
[] uvm_unlocked_ioctl+0xd57/0xe70 [nvidia_uvm]
[] do_vfs_ioctl+0x350/0x560
[] SyS_ioctl+0xa1/0xc0
[] tracesys+0x9d/0xc3
[] 0xffffffffffffffff



When I sent SIGKILL, the process was not removed; only rebooting the machine cleared it.
bfosberry commented 1 month ago

I'm seeing the same issue with a Java process; the only fix is a reboot of the entire server. Has anyone found a solution to this issue?