Open bingerambo opened 3 years ago
Having to use kill -9 is a red flag to me. See also https://serverfault.com/a/76296/58240
Do you mind sharing all the lines of output from Tini (from the point at which you start your process until the point at which you're stuck)? You can exclude the other output, but please include all the tini ones. There should at least be some additional output starting with an info line reporting what Tini spawned.
Note that the point at which you're stuck is basically a loop where Tini asks the kernel "do I have any children that have exited?", and the kernel is answering "you do not".
What happens if you send SIGTERM to Tini? If nothing happens, what if you send it SIGKILL? Do the processes get torn down?
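To make the "do I have any children that have exited?" question concrete, here is a minimal Python sketch of a non-blocking reap check. It is not Tini's actual code (Tini is written in C); the helper name `reap_once` and the short-lived forked child are assumptions made for illustration only.

```python
import os
import time

def reap_once():
    """Ask the kernel once, without blocking, whether any child has exited.

    Returns (pid, status) for a reaped child, or None when there is
    nothing to reap.
    """
    try:
        pid, status = os.waitpid(-1, os.WNOHANG)
    except ChildProcessError:
        return None  # this process has no children at all
    if pid == 0:
        return None  # children exist, but none has exited yet
    return pid, status

# Hypothetical child that stays alive for half a second.
child_pid = os.fork()
if child_pid == 0:
    time.sleep(0.5)
    os._exit(0)

# While the child is still alive the answer is "you do not".
first = reap_once()

# Keep asking until the child has exited and been reaped.
reaped = None
deadline = time.monotonic() + 5
while reaped is None and time.monotonic() < deadline:
    reaped = reap_once()
    time.sleep(0.05)

print(first, reaped)
```

As long as the kernel does not consider the child fully exited, every call takes the `pid == 0` branch, which would correspond to the repeated "No child to reap" trace lines.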
@krallin I am sorry that I did not save the complete log. The Python program runs the TensorFlow framework on NVIDIA GPU cards to train a deep learning job. I did save the log around the problem and the /proc/pid/taskid status, so I expect they may be useful.
.............................................
Initializing graph
WARNING:tensorflow:From /root/tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: Supervisor.init (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0810 01:44:13.657472 140481052276544 deprecation.py:323] From /root/tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: Supervisor.init (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2020-08-10 01:44:13.911021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2020-08-10 01:44:13.911240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-10 01:44:13.911262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
2020-08-10 01:44:13.911272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
2020-08-10 01:44:13.911358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
2020-08-10 01:44:13.911696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0)
2020-08-10 01:44:13.912073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:88:00.0, compute capability: 7.0)
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
INFO:tensorflow:Running local_init_op.
I0810 01:44:17.589881 140481052276544 session_manager.py:491] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0810 01:44:17.782803 140481052276544 session_manager.py:493] Done running local_init_op.
Running warm up
[TRACE tini (1)] No child to reap
2020-08-10 01:44:18.530569: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
.............................................
2. Run the docker stop command to stop the Python program.
[root@node3 ~]#
[root@node3 ~]# docker stop e99a157e24f8
3. Tini's output: tini received the SIGTERM signal and passed it to the child. After that, tini kept printing "No child to reap" indefinitely.
.............................................
[TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [DEBUG tini (1)] Passing signal: 'Terminated' [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini 
(1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap [TRACE tini (1)] No child to reap
.............................................
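The "Passing signal: 'Terminated'" line in the log above is an init process forwarding SIGTERM to its child. The following Python sketch shows what normally happens next when the child is healthy: it dies from the signal and the parent reaps it. This is an illustration under assumptions, not Tini's implementation; the `forward` handler and the sleeping child are hypothetical stand-ins, and the script sends SIGTERM to itself to simulate docker stop.

```python
import os
import signal
import time

# Hypothetical child standing in for the real workload.
child_pid = os.fork()
if child_pid == 0:
    # Default SIGTERM disposition: the signal terminates the child.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    time.sleep(30)
    os._exit(0)

def forward(signum, frame):
    """Pass the received signal on to the child, as an init process would."""
    os.kill(child_pid, signum)

signal.signal(signal.SIGTERM, forward)

# Simulate `docker stop` delivering SIGTERM to the init process.
os.kill(os.getpid(), signal.SIGTERM)

# A healthy child exits promptly, so this blocking wait returns.
reaped_pid, status = os.waitpid(child_pid, 0)
print(reaped_pid == child_pid, os.WIFSIGNALED(status))
```

In the reported case the equivalent wait never succeeds, which suggests the child never finished exiting after the signal was passed on.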
4. The Python process info, from /proc/pid/taskid:
The defunct Python process (named tf_cnn_benchmar) had 2 threads:
thread taskid 175308, status running
thread taskid 174982, status zombie
Why did the defunct process still contain a running thread?
top - 09:51:14 up 1 day, 1:16, 1 user, load average: 1.21, 1.24, 1.32
Threads: 2 total, 1 running, 0 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.1 us, 1.1 sy, 0.0 ni, 98.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 79106937+total, 72575788+free, 17352112 used, 47959384 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 76956729+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
175308 root 20 0 0 0 0 R 99.9 0.0 157:58.34 tf_cnn_benchmar
174982 root 20 0 0 0 0 Z 0.0 0.0 0:06.41 tf_cnn_benchmar
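The per-thread states shown by top can also be collected directly from procfs. The sketch below assumes the standard Linux layout /proc/<pid>/task/<tid>/status; the helper name `thread_states` is made up for this example. On Linux, a main thread that has exited is shown as a zombie until every other thread in the process has exited too, which would be consistent with one Z and one R thread above.

```python
import os

def thread_states(pid):
    """Return {tid: state_letter} for every thread of `pid` (Linux only)."""
    states = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/status") as fh:
                for line in fh:
                    if line.startswith("State:"):
                        # e.g. "State:\tZ (zombie)" -> "Z"
                        states[int(tid)] = line.split()[1]
                        break
        except OSError:
            continue  # the thread went away while we were listing
    return states

# Demonstrate on the current process when procfs is available.
if os.path.isdir("/proc/self/task"):
    print(thread_states(os.getpid()))
```

Running it against the stuck PID on the host (as root) should list the same two task IDs with their state letters.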
zombie thread /proc/pid/taskid/stack info:
[root@node3 174982]# cat stack
[
running thread /proc/pid/taskid/stack info:
[root@node3 175308]# cat stack
[
When I sent SIGKILL, the process was not removed. Only rebooting the machine cleared it.
I'm seeing the same issue with a Java process; the only fix is a reboot of the entire server. Has anyone found a solution to this issue?
In a container, I run a Python process as a child process of tini, with the following steps:
os info:
docker version:
docker file
process status:
tini: pid 1
python: pid 8
Kill the Python process (the child of tini; its pid is 8):
Output printed by the tini and Python processes:
Now I cannot remove the zombie process; the only option is to reboot the machine.
When the Python process is suspended, tini does not reap the child process, and the defunct Python process becomes a zombie.
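As background on why killing a zombie has no visible effect, here is a small Python sketch of the ordinary zombie case: the child has already exited, SIGKILL changes nothing, and only the parent's wait call removes the entry. It does not reproduce the stuck state reported here, where the parent's wait never succeeds; the exit code 7 is an arbitrary marker chosen for the example.

```python
import os
import signal

pid = os.fork()
if pid == 0:
    os._exit(7)  # child exits immediately with a recognizable code

# Block until the child has exited, but leave it unreaped (a zombie).
os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

# SIGKILL is accepted without error, yet nothing changes:
# the process is already dead and only its table entry remains.
os.kill(pid, signal.SIGKILL)

# Only the parent's wait call removes the zombie entry.
_, status = os.waitpid(pid, 0)
exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
print(exit_code)
```

The status reports a normal exit rather than death by signal, showing that the SIGKILL sent to the zombie was a no-op.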
I'm not sure about either of the following reasons: