splunk / splunk-connect-for-kubernetes

Helm charts associated with kubernetes plug-ins
Apache License 2.0

metrics and objects deployments generating tons of zombie processes and using up cluster node process limits #857

Open gvoden opened 1 year ago

gvoden commented 1 year ago

What happened: Deploying metrics, the metrics aggregator, and kube-objects (all images tagged 1.2.1) appears to create large numbers of zombie processes on the cluster node where the deployment runs; eventually the node is overwhelmed and crashes (Amazon EKS 1.22).

What you expected to happen: Metrics and object collections should function normally.

How to reproduce it (as minimally and precisely as possible): Deploy Splunk Connect for Kubernetes with the YAML below:

```yaml
global:
  logLevel: info
  splunk:
    hec:
      host: http-inputs-hoopp.splunkcloud.com
      insecureSSL: false
      port: 443
      protocol: https
      token:
splunk-kubernetes-logging:
  enabled: true
  journalLogPath: /var/log/journal
  logs:
    isg-containers:
      logFormatType: cri
      from:
        container: isg-
        pod: '*'
      multiline:
        firstline: /^\d{4}-\d{2}-\d{2} \d{1,2}:\d{1,2}:\d{1,2}.\d{3}/
      sourcetype: kube:container
      timestampExtraction:
        format: '%Y-%m-%d %H:%M:%S.%NZ'
        regexp: time="(?
```
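The values above show only the logging section. For context, a minimal sketch of how the metrics and objects collectors are typically enabled in this chart's values; this is illustrative only, not the exact values deployed here, and the object names are just examples:

```yaml
# Illustrative sketch only -- not the reporter's actual values.
# Enables the metrics collector (daemonset plus aggregator deployment)
# and the objects collector alongside the logging section shown above.
splunk-kubernetes-metrics:
  enabled: true
splunk-kubernetes-objects:
  enabled: true
  objects:
    core:
      v1:
        - name: pods          # example resource; adjust to what should be collected
        - name: events
          mode: watch
    apps:
      v1:
        - name: deployments   # listing apps/v1 resources requires matching RBAC
```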

Scaling the metrics and objects deployments down to 0 replicas makes the zombie processes disappear immediately.

Environment:

gvoden commented 1 year ago

We found the following in our logs:

```
2023-04-27 18:06:26 +0000 [error]: #0 unexpected error error_class=Kubeclient::HttpError error="HTTP status code 403, v1 is forbidden: User \"system:serviceaccount:splunk-connect-k8s:splunk-kubernetes-objects\" cannot list resource \"v1\" in API group \"\" at the cluster scope for GET https://10.100.0.1/api/apps/v1"
```

The service account was missing permission to list the v1 resource. After updating the permissions in our ClusterRole we no longer see this error, the zombie processes are no longer created, and the issue is resolved. Question: why does the pod need access to this v1 endpoint, and did it require that access in prior versions?
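For anyone hitting the same 403, a minimal sketch of the kind of ClusterRole rules that grant the objects collector list access; the resource names below are assumptions based on what the collector is typically configured to gather, not the exact rule applied here:

```yaml
# Illustrative ClusterRole sketch -- adjust apiGroups/resources to match
# what splunk-kubernetes-objects is configured to collect.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: splunk-kubernetes-objects
rules:
  - apiGroups: [""]               # core API group
    resources: ["pods", "namespaces", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]           # the 403 above is for a request against apps/v1
    resources: ["deployments", "daemonsets", "replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
```

The role then needs a ClusterRoleBinding to the chart's service account (system:serviceaccount:splunk-connect-k8s:splunk-kubernetes-objects in the log above).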