splunk / splunk-connect-for-kubernetes

Helm charts associated with kubernetes plug-ins
Apache License 2.0

When fluentd worker finishes unexpectedly with signal SIGKILL it leaves defunct processes behind #642

Closed · macEar closed this issue 3 years ago

macEar commented 3 years ago

What happened: After the fluentd worker inside the Splunk pods unexpectedly finishes with signal SIGKILL, it leaves defunct processes behind. In our case we set insufficient CPU limits for the Splunk logging pods, so the fluentd worker was repeatedly killed and left zombie processes behind. In the pod logs we can see the following messages:

2021-07-29 10:17:34 +0000 [info]: Worker 0 finished unexpectedly with signal SIGKILL 

We noticed that Splunk left more than 2000 defunct processes in a day. Here is the shortened output of the ps -ef --forest command, where we can see that the parent process is fluentd:

....
/usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
root     13462 13445  0 Jul29 ?        00:01:50  |   \_ /usr/bin/fluentd -c /fluentd/etc/fluent.conf
root     13603 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     13605 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     14016 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     14017 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     14367 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     14369 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     14731 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     14733 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     15092 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     15094 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     15484 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     15486 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     15806 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     15808 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     16147 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
root     16149 13462  0 Jul29 ?        00:00:00  |       \_ [sh] <defunct>
....
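
To count them, something like the following can be used (a minimal sketch; PID 13462 is taken from the listing above and will differ on other nodes):

  # total zombie (state Z) processes on the node
  ps -eo stat | grep -c '^Z'

  # defunct children of one fluentd worker, using the parent PID from the listing above
  ps --ppid 13462 | grep -c defunct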

FYI, here are our limits and buffer settings from values.yml:

  resources:
    limits:
      cpu: 200m
      memory: 400Mi
    requests:
      cpu: 100m
      memory: 200Mi

  buffer:
    "@type": memory
    total_limit_size: 300m
    chunk_limit_size: 20m
    chunk_limit_records: 100000
    flush_interval: 5s
    flush_at_shutdown: true
    flush_thread_count: 1
    overflow_action: block
    retry_max_times: 5
    retry_type: periodic

What you expected to happen: No zombie processes

How to reproduce it (as minimally and precisely as possible):

  1. Set low CPU limits for the Splunk logging pods and deploy them in k8s.
  2. Wait until fluentd needs more CPU than it is allowed to consume and the worker finishes with signal SIGKILL.
  3. Check for defunct processes that fluentd might have left behind by issuing ps -ef --forest (see the sketch below).
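
A minimal sketch of the check in step 3, filtering the process tree down to the fluentd parent and its defunct children (the pattern simply matches what appears in the listing above; the brackets keep grep from matching its own command line):

  # show the fluentd worker and any defunct children it has accumulated
  ps -ef --forest | grep -E '[f]luentd|[d]efunct'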

Environment:

rockb1017 commented 3 years ago

I have added liveness probes to all pods in chart v1.4.9. Could you upgrade to it?
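
For reference, a liveness probe in a container spec is roughly of this shape. This is only an illustrative sketch, not the exact template shipped in v1.4.9 (the real definition lives in the chart templates), and it assumes pgrep is available in the image:

  livenessProbe:
    exec:
      command:
        # assumption: pgrep exists in the fluentd image; the probe fails if no fluentd process is running
        - pgrep
        - -f
        - /usr/bin/fluentd
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3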

macEar commented 3 years ago

Yes, we will upgrade shortly, and afterwards I will write back whether it helps or not.

macEar commented 3 years ago

We upgraded to v1.4.9 and decided to take some time to observe whether the problem shows up again. If it doesn't, I guess we can close this issue. I'll report back in a week.

macEar commented 3 years ago

So far so good. I am closing this issue and will reopen it if the problem shows up again. Thanks.