splunk / splunk-connect-for-kubernetes

Helm charts associated with kubernetes plug-ins

splunk-kubernetes-objects - error_class=Net::OpenTimeout error="execution expired" #230

Closed: rfrodrigues16 closed this issue 5 years ago

rfrodrigues16 commented 5 years ago

I’m experiencing a problem when trying to use the SCK agent for collecting Kubernetes objects data (splunk-kubernetes-objects). The SCK agent throws an error “execution expired” whenever it tries to post the data using HEC. See the log excerpts below:

```
2019-08-13 21:47:37 +0000 [debug]: #0 Sending 4547 bytes to Splunk.
2019-08-13 21:47:37 +0000 [trace]: #0 POST https://x.x.x.x:8088/services/collector body={"host":"","time":"1565732853.4252117","index":"vecc_csa_kubernetes_logs","source":"namespace:splunk/pod:splunk-connect-splunk-kubernetes-objects-6cb5876855-rmqpf","sourcetype":"kube:objects:pods","fields":{"cluster_name":"cluster_name"},"event":"{\"metadata\":{\"name\":\"splunk-connect-splunk-kubernetes-objects-6cb5876855-rmqpf\",\"generateName\":\"splunk-connect-splunk-kubernetes-objects-6cb5876855-\",\"namespace\":\"splunk\",\"selfLink\":\"/api/v1/namespaces/splunk/pods/splunk-connect-splunk-kubernetes-objects-6cb5876855-rmqpf\",\"uid\":\"f3dc7102-be13-11e9-8dde-fa163e5b4bad\",\"resourceVersion\":\"5401185\",\"creationTimestamp\":\"2019-08-13T21:47:31Z\",\"labels\":{\"app\":\"splunk-kubernetes-objects\",\"engine\":\"fluentd\",\"pod-template-hash\":\"6cb5876855\",\"release\":\"splunk-connect\"},\"annotations\":{\"checksum/config\":\"b14e319dbcdaddd52e542e9fe9b529381b0f918591820bcdfc4a10fd2b572fff\"},\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"kind\":\"ReplicaSet\",\"name\":\"splunk-connect-splunk-kubernetes-objects-6cb5876855\",\"uid\":\"f3da97df-be13-11e9-8dde-fa163e5b4bad\",\"controller\":true,\"blockOwnerDeletion\":true}],\"requestResourceVersion\":\"5401187\"},\"spec\":{\"volumes\":[{\"name\":\"conf-configmap\",\"configMap\":{\"name\":\"splunk-connect-splunk-kubernetes-objects\",\"defaultMode\":420}},{\"name\":\"secrets\",\"secret\":{\"secretName\":\"splunk-kubernetes-objects\",\"defaultMode\":420}},{\"name\":\"splunk-connect-splunk-kubernetes-objects-token-4m9w6\",\"secret\":{\"secretName\":\"splunk-connect-splunk-kubernetes-objects-token-4m9w6\",\"defaultMode\":420}}],\"containers\":[{\"name\":\"splunk-fluentd-k8s-objects\",\"image\":\"k8s-registry.eccd.local/splunk/kube-objects:1.1.0\",\"args\":[\"-c\",\"/fluentd/etc/fluent.conf\"],\"env\":[{\"name\":\"MY_NAMESPACE\",\"valueFrom\":{\"fieldRef\":{\"apiVersion\":\"v1\",\"fieldPath\":\"metadata.namespace\"}}},{\"name\":\"MY_POD_NAME\",\"valueFrom\":{\"fieldRef\":{\"apiVersion\":\"v1\",\"fieldPath\":\"metadata.name\"}}},{\"name\":\"NODE_NAME\",\"valueFrom\":{\"fieldRef\":{\"apiVersion\":\"v1\",\"fieldPath\":\"spec.nodeName\"}}},{\"name\":\"SPLUNK_HEC_TOKEN\",\"valueFrom\":{\"secretKeyRef\":{\"name\":\"splunk-kubernetes-objects\",\"key\":\"splunk_hec_token\"}}}],\"resources\":{\"requests\":{\"cpu\":\"100m\",\"memory\":\"200Mi\"}},\"volumeMounts\":[{\"name\":\"conf-configmap\",\"mountPath\":\"/fluentd/etc\"},{\"name\":\"secrets\",\"readOnly\":true,\"mountPath\":\"/fluentd/etc/splunk\"},{\"name\":\"splunk-connect-splunk-kubernetes-objects-token-4m9w6\",\"readOnly\":true,\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\"}],\"terminationMessagePath\":\"/dev/termination-log\",\"terminationMessagePolicy\":\"File\",\"imagePullPolicy\":\"IfNotPresent\"}],\"restartPolicy\":\"Always\",\"terminationGracePeriodSeconds\":30,\"dnsPolicy\":\"ClusterFirst\",\"serviceAccountName\":\"splunk-connect-splunk-kubernetes-objects\",\"serviceAccount\":\"splunk-connect-splunk-kubernetes-objects\",\"nodeName\":\"worker-pool1-xmbq8dza-csa-eccd-v2\",\"securityContext\":{},\"schedulerName\":\"default-scheduler\",\"tolerations\":[{\"key\":\"node.kubernetes.io/not-ready\",\"operator\":\"Exists\",\"effect\":\"NoExecute\",\"tolerationSeconds\":300},{\"key\":\"node.kubernetes.io/unreachable\",\"operator\":\"Exists\",\"effect\":\"NoExecute\",\"tolerationSeconds\":300}],\"priority\":0,\"enableServiceLinks\":true},\"status\":{\"phase\":\"Running\",\"conditions\":[{\"type\":\"Initialized\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2019-08-13T21:47:31Z\"},{\"type\":\"Ready\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2019-08-13T21:47:32Z\"},{\"type\":\"ContainersReady\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2019-08-13T21:47:32Z\"},{\"type\":\"PodScheduled\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2019-08-13T21:47:31Z\"}],\"hostIP\":\"10.0.10.2\",\"podIP\":\"192.168.20.23\",\"startTime\":\"2019-08-13T21:47:31Z\",\"containerStatuses\":[{\"name\":\"splunk-fluentd-k8s-objects\",\"state\":{\"running\":{\"startedAt\":\"2019-08-13T21:47:32Z\"}},\"lastState\":{},\"ready\":true,\"restartCount\":0,\"image\":\"k8s-registry.eccd.local/splunk/kube-objects:1.1.0\",\"imageID\":\"docker-pullable://k8s-registry.eccd.local/splunk/kube-objects@sha256:209ea3157615eec0bd43bcae5450c4a7e2c962ce5c8dd42e890ca81dfc551103\",\"containerID\":\"docker://4c59e1927a3d386d8707e86033c73e9a82a7e56823811572abefffdaceda5e79\"}],\"qosClass\":\"Burstable\"}}"}
…
2019-08-13 21:48:37 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2019-08-13 21:48:38 +0000 chunk="59006968aad29c0ccc8b3c2c2ee28461" error_class=Net::OpenTimeout error="execution expired"
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/lib/ruby/2.5.0/net/http.rb:937:in `initialize'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/lib/ruby/2.5.0/net/http.rb:937:in `open'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/lib/ruby/2.5.0/net/http.rb:937:in `block in connect'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/lib/ruby/2.5.0/timeout.rb:103:in `timeout'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/lib/ruby/2.5.0/net/http.rb:935:in `connect'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/lib/ruby/2.5.0/net/http.rb:920:in `do_start'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/lib/ruby/2.5.0/net/http.rb:915:in `start'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/bundle/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:692:in `start'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/bundle/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:622:in `connection_for'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/bundle/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:927:in `request'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-splunk-hec-1.1.0/lib/fluent/plugin/out_splunk_hec.rb:355:in `send_to_hec'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-splunk-hec-1.1.0/lib/fluent/plugin/out_splunk_hec.rb:167:in `write'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.4.0/lib/fluent/plugin/output.rb:1125:in `try_flush'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.4.0/lib/fluent/plugin/output.rb:1425:in `flush_thread_run'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.4.0/lib/fluent/plugin/output.rb:454:in `block (2 levels) in start'
2019-08-13 21:48:37 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.4.0/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2019-08-13 21:48:37 +0000 [trace]: #0 enqueueing all chunks in buffer instance=69875636192300
…
2019-08-13 22:02:44 +0000 [error]: #0 failed to flush the buffer, and hit limit for retries. dropping all chunks in the buffer queue. retry_times=3 records=2 error_class=Net::OpenTimeout error="execution expired"
2019-08-13 22:02:44 +0000 [error]: #0 suppressed same stacktrace
```

Nevertheless, I can successfully send the same request from the node using curl:

```sh
curl -k https://x.x.x.x:8088/services/collector/event \
  -H "Authorization: Splunk xxx-xxx-xxx-xxx" \
  -d '{"host":"","time":"1565732853.4252117","index":"vecc_csa_kubernetes_logs","source":"namespace:splunk/pod:splunk-connect-splunk-kubernetes-objects-6cb5876855-rmqpf","sourcetype":"kube:objects:pods","fields":{"cluster_name":"cluster_name"},"event":"{\"metadata\":{\"name\":\"splunk-connect-splunk-kubernetes-objects-6cb5876855-rmqpf\",\"generateName\":\"splunk-connect-splunk-kubernetes-objects-6cb5876855-\",\"namespace\":\"splunk\",\"selfLink\":\"/api/v1/namespaces/splunk/pods/splunk-connect-splunk-kubernetes-objects-6cb5876855-rmqpf\",\"uid\":\"f3dc7102-be13-11e9-8dde-fa163e5b4bad\",\"resourceVersion\":\"5401185\",\"creationTimestamp\":\"2019-08-13T21:47:31Z\",\"labels\":{\"app\":\"splunk-kubernetes-objects\",\"engine\":\"fluentd\",\"pod-template-hash\":\"6cb5876855\",\"release\":\"splunk-connect\"},\"annotations\":{\"checksum/config\":\"b14e319dbcdaddd52e542e9fe9b529381b0f918591820bcdfc4a10fd2b572fff\"},\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"kind\":\"ReplicaSet\",\"name\":\"splunk-connect-splunk-kubernetes-objects-6cb5876855\",\"uid\":\"f3da97df-be13-11e9-8dde-fa163e5b4bad\",\"controller\":true,\"blockOwnerDeletion\":true}],\"requestResourceVersion\":\"5401187\"},\"spec\":{\"volumes\":[{\"name\":\"conf-configmap\",\"configMap\":{\"name\":\"splunk-connect-splunk-kubernetes-objects\",\"defaultMode\":420}},{\"name\":\"secrets\",\"secret\":{\"secretName\":\"splunk-kubernetes-objects\",\"defaultMode\":420}},{\"name\":\"splunk-connect-splunk-kubernetes-objects-token-4m9w6\",\"secret\":{\"secretName\":\"splunk-connect-splunk-kubernetes-objects-token-4m9w6\",\"defaultMode\":420}}],\"containers\":[{\"name\":\"splunk-fluentd-k8s-objects\",\"image\":\"k8s-registry.eccd.local/splunk/kube-objects:1.1.0\",\"args\":[\"-c\",\"/fluentd/etc/fluent.conf\"],\"env\":[{\"name\":\"MY_NAMESPACE\",\"valueFrom\":{\"fieldRef\":{\"apiVersion\":\"v1\",\"fieldPath\":\"metadata.namespace\"}}},{\"name\":\"MY_POD_NAME\",\"valueFrom\":{\"fieldRef\":{\"apiVersion\":\"v1\",\"fieldPath\":\"metadata.name\"}}},{\"name\":\"NODE_NAME\",\"valueFrom\":{\"fieldRef\":{\"apiVersion\":\"v1\",\"fieldPath\":\"spec.nodeName\"}}},{\"name\":\"SPLUNK_HEC_TOKEN\",\"valueFrom\":{\"secretKeyRef\":{\"name\":\"splunk-kubernetes-objects\",\"key\":\"splunk_hec_token\"}}}],\"resources\":{\"requests\":{\"cpu\":\"100m\",\"memory\":\"200Mi\"}},\"volumeMounts\":[{\"name\":\"conf-configmap\",\"mountPath\":\"/fluentd/etc\"},{\"name\":\"secrets\",\"readOnly\":true,\"mountPath\":\"/fluentd/etc/splunk\"},{\"name\":\"splunk-connect-splunk-kubernetes-objects-token-4m9w6\",\"readOnly\":true,\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\"}],\"terminationMessagePath\":\"/dev/termination-log\",\"terminationMessagePolicy\":\"File\",\"imagePullPolicy\":\"IfNotPresent\"}],\"restartPolicy\":\"Always\",\"terminationGracePeriodSeconds\":30,\"dnsPolicy\":\"ClusterFirst\",\"serviceAccountName\":\"splunk-connect-splunk-kubernetes-objects\",\"serviceAccount\":\"splunk-connect-splunk-kubernetes-objects\",\"nodeName\":\"worker-pool1-xmbq8dza-csa-eccd-v2\",\"securityContext\":{},\"schedulerName\":\"default-scheduler\",\"tolerations\":[{\"key\":\"node.kubernetes.io/not-ready\",\"operator\":\"Exists\",\"effect\":\"NoExecute\",\"tolerationSeconds\":300},{\"key\":\"node.kubernetes.io/unreachable\",\"operator\":\"Exists\",\"effect\":\"NoExecute\",\"tolerationSeconds\":300}],\"priority\":0,\"enableServiceLinks\":true},\"status\":{\"phase\":\"Running\",\"conditions\":[{\"type\":\"Initialized\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2019-08-13T21:47:31Z\"},{\"type\":\"Ready\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2019-08-13T21:47:32Z\"},{\"type\":\"ContainersReady\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2019-08-13T21:47:32Z\"},{\"type\":\"PodScheduled\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2019-08-13T21:47:31Z\"}],\"hostIP\":\"10.0.10.2\",\"podIP\":\"192.168.20.23\",\"startTime\":\"2019-08-13T21:47:31Z\",\"containerStatuses\":[{\"name\":\"splunk-fluentd-k8s-objects\",\"state\":{\"running\":{\"startedAt\":\"2019-08-13T21:47:32Z\"}},\"lastState\":{},\"ready\":true,\"restartCount\":0,\"image\":\"k8s-registry.eccd.local/splunk/kube-objects:1.1.0\",\"imageID\":\"docker-pullable://k8s-registry.eccd.local/splunk/kube-objects@sha256:209ea3157615eec0bd43bcae5450c4a7e2c962ce5c8dd42e890ca81dfc551103\",\"containerID\":\"docker://4c59e1927a3d386d8707e86033c73e9a82a7e56823811572abefffdaceda5e79\"}],\"qosClass\":\"Burstable\"}}"}'
```

Response:

```
{"text":"Success","code":0}
```

HEC is configured with insecureSSL=true. The same HEC configuration is also used by SCK to send logs and metrics to Splunk, and the logging and metrics agents are working fine.

Do you have any idea what could be wrong?

chaitanyaphalak commented 5 years ago

This issue is similar to https://github.com/splunk/splunk-connect-for-kubernetes/issues/204

I think we can handle the timeout error in the HEC plugin to solve this issue. Apart from that, I think it might be a DNS issue or Splunk being unavailable to that pod. A pod restart might help. Let me know what you think.
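If you want to try the restart, deleting the objects pod and letting its Deployment recreate it would look roughly like this (the namespace and label come from the pod metadata in the log above):

```sh
# Delete the objects pod; its ReplicaSet/Deployment schedules a replacement.
# Namespace and label are taken from the pod metadata logged earlier.
kubectl -n splunk delete pod -l app=splunk-kubernetes-objects
```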

rfrodrigues16 commented 5 years ago

It doesn't seem to be a DNS or Splunk issue. The same configuration is working fine for splunk-kubernetes-logging and splunk-kubernetes-metrics. I also restarted the pod several times; it doesn't help. I didn't get the idea of handling the issue via the HEC plugin. Is there a configuration option I could try?

dtregonning commented 5 years ago

Hi @rfrodrigues16 - Are you deploying with Helm, and are you setting any explicit timeout values for the bundled HEC plugin module in the configuration files? Also, what version of Splunk are you sending to?
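For context on the timeout question: later releases of fluent-plugin-splunk-hec expose explicit connection-timeout parameters. If the bundled plugin version supports them (worth verifying, since the image here ships plugin 1.1.0), setting them would look something like this sketch:

```
<match kube.**>
  @type splunk_hec
  # ... existing hec_host / hec_port / hec_token settings ...
  # Connection timeouts in seconds. These parameters exist only in newer
  # plugin releases; verify they are supported by your bundled version.
  open_timeout 10
  read_timeout 60
</match>
```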

rfrodrigues16 commented 5 years ago

Hi @dtregonning, I'm not setting any explicit timeout value. The plugin is sending data to an instance of Splunk Enterprise version 7.1.4. I'm pasting the Pod configuration below:

```
<match kube.**>
  @type splunk_hec
  protocol http
  hec_host "x.x.x.x"
  hec_port 8088
  hec_token "x-x-x-x-x"
  host ""
  source_key "source"
  sourcetype_key "sourcetype"
  index "vecc_csa_kubernetes_logs"
  insecure_ssl false
  <fields>
    cluster_name
  </fields>
  <buffer>
    @type "memory"
    chunk_limit_records 10000
    chunk_limit_size 200m
    flush_interval 3s
    flush_thread_count 1
    overflow_action block
    retry_max_times 3
    total_limit_size 600m
  </buffer>
</match>
```

matthewmodestino commented 5 years ago

Can you set the protocol to https and ensure you are running insecureSSL=true unless you have valid certs being served by Splunk HEC?
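In Helm terms, that would be set through the chart's HEC values; a minimal sketch, assuming the standard SCK values layout (verify the key names against your chart's values.yaml):

```yaml
# values.yaml sketch -- key layout assumed, check against your chart version
splunk-kubernetes-objects:
  splunk:
    hec:
      host: x.x.x.x
      port: 8088
      token: xxx-xxx-xxx-xxx
      protocol: https
      # skip certificate verification, equivalent to curl -k
      insecureSSL: true
```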

Your curl worked because it was:

```sh
curl -k https://x.x.x.x:8088/services/collector/event
```

`-k` means the same thing as insecureSSL=true, and you are using https.

rfrodrigues16 commented 5 years ago

I already tried both http and https. Initially, I was using https and insecureSSL=true, with HEC SSL enabled in Splunk Enterprise. I then disabled SSL in the HEC configuration and changed the plugin configuration to http, to check whether the issue was somehow related to the security option. Nevertheless, I still get the same error using http.

matthewmodestino commented 5 years ago

Let's not guess; let's confirm your HEC configs, please.

Once you have verified which protocol HEC is serving, we can validate your configs, which we need to see dumped from the running pod.
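To dump the config from the running pod, something like this works (the pod name is illustrative; the config path comes from the container args shown in the log above):

```sh
# Print the Fluentd config actually mounted into the running objects pod.
# Look up your pod name first with: kubectl -n splunk get pods
kubectl -n splunk exec splunk-connect-splunk-kubernetes-objects-6cb5876855-rmqpf -- \
  cat /fluentd/etc/fluent.conf
```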

rfrodrigues16 commented 5 years ago

Currently, my HEC is configured to use HTTP. These are the global settings in Splunk Enterprise:

*(screenshot of the HEC Global Settings dialog)*
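For reference, HEC's global SSL switch is stored in inputs.conf; with SSL disabled, the stanza typically looks like the sketch below (typical defaults, not values dumped from this deployment):

```
# $SPLUNK_HOME/etc/apps/splunk_httpinput/local/inputs.conf (typical location)
[http]
disabled = 0
port = 8088
# 0 = plain HTTP, which matches "protocol http" on the plugin side
enableSSL = 0
```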

matthewmodestino commented 5 years ago

When you say “from the node”, did you drop a pod in the splunk namespace or try from the OS?

With those settings I would use http and insecureSSL=true.

I would also validate that the HEC hostname is resolvable, or just use the IP.

I’m convinced it’s a reachability thing, based on my experience.
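One way to test reachability from the pod's own network namespace, rather than from the node OS, is a probe like this (assumes curl is available in the image; HEC's health endpoint needs no token):

```sh
# Probe HEC from inside the objects pod, not from the node OS.
# -m 5 caps the total request time so a hang fails fast.
kubectl -n splunk exec <objects-pod-name> -- \
  curl -sk -m 5 https://x.x.x.x:8088/services/collector/health
```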

rfrodrigues16 commented 5 years ago

Hi @matthewmodestino, thank you for your hints. The problem turned out to be connectivity from one of the k8s nodes to the Splunk HEC endpoint.
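That outcome fits the symptoms: splunk-kubernetes-logging runs as a DaemonSet on every node, while the objects agent is a single Deployment pod (note the ReplicaSet ownerReference in the log above), so one node with broken egress to HEC affects only the objects agent. To see which node the pod landed on:

```sh
# Show which node hosts the objects pod; fix that node's egress or reschedule.
kubectl -n splunk get pod -l app=splunk-kubernetes-objects -o wide
```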