influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

took longer to collect than collection interval & InfluxDB Output Error #2780

Closed kotarusv closed 7 years ago

kotarusv commented 7 years ago

I'm using Telegraf and InfluxDB in a large production environment. We have 3 data centers, and Telegraf runs on the nodes in each of them. The configuration is the same across all nodes, managed through automation/templating. Telegraf runs as a DaemonSet on a container platform.

We have a mix of VMs and bare metal nodes across all 3 data centers. Except for 3 bare metal nodes in one data center, everything is working perfectly fine.

Telegraf on these 3 nodes keeps complaining with the errors below.

2017-05-10T03:55:47Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-05-10T03:56:00Z E! ERROR: input [inputs.cpu] took longer to collect than collection interval (1m0s)
2017-05-10T03:56:00Z E! ERROR: input [inputs.netstat] took longer to collect than collection interval (1m0s)
2017-05-10T03:56:01Z E! ERROR: input [inputs.disk] took longer to collect than collection interval (1m0s)
2017-05-10T03:56:01Z E! ERROR: input [inputs.swap] took longer to collect than collection interval (1m0s)
2017-05-10T03:56:01Z E! ERROR: input [inputs.kernel] took longer to collect than collection interval (1m0s)
2017-05-10T03:56:01Z E! ERROR: input [inputs.processes] took longer to collect than collection interval (1m0s)
2017-05-10T03:56:02Z E! ERROR: input [inputs.docker] took longer to collect than collection interval (1m0s)
2017-05-10T03:56:02Z E! ERROR: input [inputs.system] took longer to collect than collection interval (1m0s)
2017-05-10T03:56:03Z E! ERROR: input [inputs.diskio] took longer to collect than collection interval (1m0s)
2017-05-10T03:56:03Z E! ERROR: input [inputs.mem] took longer to collect than collection interval (1m0s)
2017-05-10T03:56:04Z E! ERROR: input [inputs.net] took longer to collect than collection interval (1m0s)
2017-05-10T03:56:20Z E! InfluxDB Output Error: Post http:/<*****>:8086/write?consistency=any&db=hosting&precision=ns&rp=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Looking at the errors, it seems to be caused by some kind of connectivity issue. I performed the troubleshooting below but still get the same errors. I'm wondering whether this is some kind of bug.

  1. Restarted the 3 problematic Telegraf containers; the issue stayed the same.
  2. No issues on any other node in this data center; even the other bare metal nodes in this data center are working fine.
  3. Logged into each Telegraf container and performed the 2 checks below to confirm the errors are genuine:

telegraf -config /etc/telegraf/telegraf.conf -input-filter docker -test or telegraf -config /etc/telegraf/telegraf.conf -input-filter mem -test

and a for loop to test connectivity to InfluxDB using curl:

curl -sl -I http://:8086/ping

HTTP/1.1 204 No Content
Content-Type: application/json
Request-Id: d78ee135-e770-11e6-8008-000000000000
X-Influxdb-Version: 1.2.2
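
A minimal sketch of such a loop, with <influxdb-host> standing in for the elided hostname, would be:

for i in $(seq 1 10); do
  # a healthy InfluxDB answers /ping with HTTP/1.1 204 No Content
  curl -s -I http://<influxdb-host>:8086/ping | head -n 1
  sleep 1
done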

ACLs, network connectivity to InfluxDB, and InfluxDB health (other nodes are writing successfully) all seem to be good.

I'm scratching my head trying to narrow down what is causing these errors and why Telegraf is unable to write to InfluxDB.

Versions: Telegraf 1.2.1, InfluxDB 1.2.2

danielnelson commented 7 years ago

Can you try the latest release candidate? I think it might be fixed there. Official builds are here: #2733

kotarusv commented 7 years ago

I will use it once it is released as GA. I am seeing another error from a few nodes in different clusters.

2017-05-16T00:05:54Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster 2017-05-16T00:06:03Z E! InfluxDB Output Error: {"error":"partial write:\nunable to parse 'docker_container_mem,BZComponent=metrics-cassandra-docker,Name=openshift3/metrics-cassandra,Release=5,Version=3.4.1,architecture=x86_64,authoritative-source-url=registry.access.redhat.com,build-date=2017-03-08T16:44:32.020444,cluster=cae-ga-rcdn,com.redhat.build-host=ip-10-29-120-146.ec2.internal,com.redhat.component=metrics-cassandra-docker,container_image=registry.access.redhat.com/openshift3/metrics-cassandra,container_name=k8s_hawkular-cassandra-1.dc7f6db8_hawkular-cassandra-1-yhejo_openshift-infra_ff64638f-0eb3-11e7-8a1f-005056bcbc98_b55312d5,container_version=3.4.1,datacenter=rcdn,description=The\ Red\ Hat\ Enterprise\ Linux\ Base\ image\ is\ designed\ to\ be\ a\ fully\ supported\ foundation\ for\ your\ containerized\ applications.\ \ This\ base\ image\ provides\ your\ operations\ and\ application\ teams\ with\ the\ packages\,\ language\ runtimes\ and\ tools\ necessary\ to\ run\,\ maintain\,\ and\ troubleshoot\ al

The output is very verbose and includes every image's and pod's information in the stdout logs. Restarting the Telegraf agent doesn't help; it always behaves the same. There are no issues connecting to InfluxDB from Telegraf.

danielnelson commented 7 years ago

It looks like one of the metrics is malformed. Can you use the file output to log the metrics and add the output?

kotarusv commented 7 years ago

I have the same config file mounted across all nodes using an Ansible template. If the other nodes don't complain about formatting, I'm not sure why a single node would.

Here is the config file from the problematic node:

cat /etc/telegraf/telegraf.conf

[global_tags]
  cluster = "**"
  hostname = "****"
  datacenter = "*****"

[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "5s"
  flush_interval = "10s"
  flush_jitter = "5s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = true

[[outputs.influxdb]]
  urls = ["http://********************:8086"]
  database = "hosting"
  write_consistency = "any"
  timeout = "30s"
  username = "****"
  password = "*****"
  user_agent = "telegraf"
  udp_payload = 512

[[outputs.influxdb]]
  urls = ["http://*******************:8086"]
  database = "hosting"
  write_consistency = "any"
  timeout = "30s"
  username = ""
  password = ""
  user_agent = "telegraf"
  udp_payload = 512

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.processes]]

[[inputs.swap]]

[[inputs.system]]

[[inputs.net]]

[[inputs.netstat]]

[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"
  timeout = "15s"
  perdevice = true
  total = true

[[inputs.procstat]]
  exe = "dockerd-current"
  prefix = "docker"

[[inputs.procstat]]
  exe = "openshift"
  prefix = "openshift"

PS: 1) Sensitive information has been replaced with ***. 2) For HA purposes, we write metrics to 2 separate InfluxDB instances.

danielnelson commented 7 years ago

I don't see any issues with your config, but can you use the file output to log the metrics or run telegraf with -test to get a single sample? This should narrow down which input is creating the bad metric.

kotarusv commented 7 years ago

Unfortunately I can't modify the production file right now, but the test command output looks fine. I tested every input and all of the output came back properly.

For example:

telegraf -config /etc/telegraf/telegraf.conf -input-filter mem -test

All input plugins produce output, including docker, without any issues.

danielnelson commented 7 years ago

--test doesn't send data to the outputs, so I wouldn't expect the failure to show itself. In the mem output above, did you remove the hostname and datacenter values?

kotarusv commented 7 years ago

Yes, I removed them for confidentiality reasons.

danielnelson commented 7 years ago

I think one of the lines is bad. Check for tags and fields with no values, and verify that the escaping is correct, especially on the docker metrics. You could try sending them one at a time to InfluxDB using the /write endpoint.

There are some details about proper escaping here: https://docs.influxdata.com/influxdb/v1.2/write_protocols/line_protocol_reference/#special-characters
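
As a sketch, a single suspect line could be posted to the /write endpoint with curl to see exactly which one InfluxDB rejects (the host and the sample line below are placeholders, not taken from your logs):

curl -i -XPOST 'http://<influxdb-host>:8086/write?db=hosting' \
  --data-binary 'docker_container_mem,container_name=example usage_total=123i 1495079049000000000'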

Unfortunately, I can't tell exactly what the problem is without the full output.

kotarusv commented 7 years ago

Let me know exactly what info you are looking for, and give me your email address; I will send it. Remember, we use the same config file across the other nodes as well, and those nodes have not reported any issues, so I strongly doubt this is a config/format issue.

danielnelson commented 7 years ago

Run sudo -u telegraf telegraf -config /etc/telegraf/telegraf.conf -test and send me the full output. You can send it to the email on my GitHub profile; here is my PGP fingerprint, 88F3 8111 EEB6 CCB7 5682 3A17 542B 7AD0 4784 1EAB, if you would like to encrypt it.

kotarusv commented 7 years ago

Sent an email with the details. It is a big file, so it is compressed. Can you take a look? This bug or issue is causing panic, as multiple nodes in different clusters are behaving the same way and restarts don't help. I'm not sure what is making Telegraf unable to write to InfluxDB, even though from inside the Telegraf container I am able to telnet and curl InfluxDB.

danielnelson commented 7 years ago

I looked only at the very last point in the logfile; it has the measurement name docker_container_blkio. The problem is that the tag io.kubernetes.container.terminationMessagePath= has no value. I think this tag exists because there is a label on the container, and all labels are added as tags by default.

You can remove these valueless tags using tagexclude (https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#measurement-filtering). I am sure you will want to remove tags like description and many others as well, since they are too verbose and not very useful.
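
As an illustrative sketch (the exact label names to drop will depend on your containers), such filtering could look like:

[[inputs.docker]]
  ## drop the verbose Kubernetes label tags and the description tag
  tagexclude = ["io.kubernetes.*", "description"]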

In 1.3 there are also the new docker_label_include and docker_label_exclude options to filter labels.

kotarusv commented 7 years ago

I just upgraded Telegraf to the latest 1.3.0 Docker image. The issue still persists.

After looking carefully at the logs, every Docker container metric ends with errors like the ones below.

How do I identify which tags to exclude? Here is an example from the logs for 3 containers' metrics:

unable to parse 'usage_total=40349862699i,container_id="8398699e85021d1c309259960efdd11c46b096129a2ea30c1c0f04e2652f7129" 1495079049000000000': invalid field format unable to parse 'docker_container_cpu,cluster=cae-ga-rcdn,container_image=containers.cisco.com/oneidentity/ubidproxy,io.kubernetes.container.preStopHandler={\"exec\":{\"command\":[\"/apps/latest/bin/stop-proxy\"]}},io.kubernetes.pod.uid=ec68feaa-3732-11e7-9710-005056bcec0e,license=GPLv2,io.kubernetes.container.ports=[{\"name\":\"ldaps\"\,\"containerPort\":3636\,\"protocol\":\"TCP\"}\,{\"name\":\"serflantcp\"\,\"containerPort\":8301\,\"protocol\":\"TCP\"}\,{\"name\":\"serfwantcp\"\,\"containerPort\":8302\,\"protocol\":\"TCP\"}\,{\"name\":\"serflanudp\"\,\"containerPort\":8301\,\"protocol\":\"UDP\"}\,{\"name\":\"serfwanudp\"\,\"containerPort\":8302\,\"protocol\":\"UDP\"}],datacenter=rcdn,io.kubernetes.container.terminationMessagePath=/dev/termination-log,build-date=20161214,io.kubernetes.container.name=proxy-authn-v1,cpu=cpu53,io.kubernetes.pod.data={\"kind\":\"Pod\"\,\"apiVersion\":\"v1\"\,\"metadata\":{\"name\":\"proxy-authn-v1-1\"\,\"generateName\":\"proxy-authn-v1-\"\,\"namespace\":\"coi-dataservices-stg\"\,\"selfLink\":\"/api/v1/namespaces/coi-dataservices-stg/pods/proxy-authn-v1-1\"\,\"uid\":\"ec68feaa-3732-11e7-9710-005056bcec0e\"\,\"resourceVersion\":\"79982417\"\,\"creationTimestamp\":\"2017-05-12T16:49:11Z\"\,\"labels\":{\"app\":\"proxy-authn-v1\"}\,\"annotations\":{\"kubernetes.io/config.seen\":\"2017-05-16T18:24:17.759271218-07:00\"\,\"kubernetes.io/config.source\":\"api\"\,\"kubernetes.io/created-by\":\"{\"kind\":\"SerializedReference\"\,\"apiVersion\":\"v1\"\,\"reference\":{\"kind\":\"PetSet\"\,\"namespace\":\"coi-dataservices-stg\"\,\"name\":\"proxy-authn-v1\"\,\"uid\":\"07ff8068-3732-11e7-b50c-005056bcbc98\"\,\"apiVersion\":\"apps\"\,\"resourceVersion\":\"76848314\"}}\n\"\,\"openshift.io/scc\":\"restricted\"\,\"pod.alpha.kubernetes.io/initialized\":\"true\"\,\"pod.beta.kubernetes.io/hostname\":\"proxy-authn-v1-1\"\,\"pod.beta.kubernetes.io/subdomain\":\"proxy-authn-v1\"}}\,\"spec\":{\"volumes\":[{\"name\":\"default-token-j303p\"\,\"secret\":{\"secretName\":\"default-token-j303p\"\,\"defaultMode\":420}}]\,\"containers\":[{\"name\":\"proxy-authn-v1\"\,\"image\":\"containers.cisco.com/oneidentity/ubidproxy:develop-64\"\,\"ports\":[{\"name\":\"ldaps\"\,\"containerPort\":3636\,\"protocol\":\"TCP\"}\,{\"name\":\"serflantcp\"\,\"containerPort\":8301\,\"protocol\":\"TCP\"}\,{\"name\":\"serfwantcp\"\,\"containerPort\":8302\,\"protocol\":\"TCP\"}\,{\"name\":\"serflanudp\"\,\"containerPort\":8301\,\"protocol\":\"UDP\"}\,{\"name\":\"serfwanudp\"\,\"containerPort\":8302\,\"protocol\":\"UDP\"}]\,\"env\":[{\"name\":\"POD_NAMESPACE\"\,\"valueFrom\":{\"fieldRef\":{\"apiVersion\":\"v1\"\,\"fieldPath\":\"metadata.namespace\"}}}\,{\"name\":\"CONSUL_LOCAL_CONFIG\"\,\"value\":\"{\"datacenter\":\"rcdn-stg\"}\"}\,{\"name\":\"SERVICENAME\"\,\"value\":\"proxy-authn-v1\"}\,{\"name\":\"DEPENDENT_SERVICENAME\"\,\"value\":\"directory-authn-v1\"}\,{\"name\":\"DS_GEO\"\,\"value\":\"test\"}\,{\"name\":\"DS_TYPE\"\,\"value\":\"AUTHN\"}\,{\"name\":\"proxy_PETSET_NAME\"\,\"value\":\"proxy-authn-v1\"}\,{\"name\":\"LOCATION\"\,\"value\":\"rcdn-stg\"}\,{\"name\":\"FAIL_OVER_LOCATION\"\,\"value\":\"TBD\"}\,{\"name\":\"VAULT_TOKEN\"\,\"value\":\"92760921-d690-f6f9-35cf-abae5a872e13\"}]\,\"resources\":{\"limits\":{\"cpu\":\"4\"\,\"memory\":\"16Gi\"}\,\"requests\":{\"cpu\":\"400m\"\,\"memory\":\"4Gi\"}}\,\"volumeMounts\":[{\"name\":\"default-token-j303p\"\,
\"readOnly\":true\,\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\"}]\,\"readinessProbe\":{\"exec\":{\"command\":[\"/apps/scripts/readiness-probe.sh\"]}\,\"initialDelaySeconds\":15\,\"timeoutSeconds\":5\,\"periodSeconds\":10\,\"successThreshold\":1\,\"failureThreshold\":3}\,\"lifecycle\":{\"preStop\":{\"exec\":{\"command\":[\"/apps/latest/bin/stop-proxy\"]}}}\,\"terminationMessagePath\":\"/dev/termination-log\"\,\"imagePullPolicy\":\"Always\"\,\"securityContext\":{\"capabilities\":{\"drop\":[\"KILL\"\,\"MKNOD\"\,\"SETGID\"\,\"SETUID\"\,\"SYS_CHROOT\"]}\,\"privileged\":false\,\"seLinuxOptions\":{\"level\":\"s0:c28\,c7\"}\,\"runAsUser\":1000770000}}]\,\"restartPolicy\":\"Always\"\,\"terminationGracePeriodSeconds\":30\,\"dnsPolicy\":\"ClusterFirst\"\,\"nodeSelector\":{\"environment\":\"ext-nonprod\"}\,\"serviceAccountName\":\"default\"\,\"serviceAccount\":\"default\"\,\"nodeName\":\"cae-ga1-597.cisco.com\"\,\"securityContext\":{\"seLinuxOptions\":{\"level\":\"s0:c28\,c7\"}\,\"fsGroup\":1000770000}\,\"imagePullSecrets\":[{\"name\":\"default-dockercfg-30b52\"}\,{\"name\":\"imagepull-data\"}]}\,\"status\":{\"phase\":\"Running\"\,\"conditions\":[{\"type\":\"Initialized\"\,\"status\":\"True\"\,\"lastProbeTime\":null\,\"lastTransitionTime\":\"2017-05-12T16:49:11Z\"}\,{\"type\":\"Ready\"\,\"status\":\"True\"\,\"lastProbeTime\":null\,\"lastTransitionTime\":\"2017-05-17T01:31:46Z\"}\,{\"type\":\"PodScheduled\"\,\"status\":\"True\"\,\"lastProbeTime\":null\,\"lastTransitionTime\":\"2017-05-12T16:49:11Z\"}]\,\"hostIP\":\"72.163.48.175\"\,\"podIP\":\"10.0.37.158\"\,\"startTime\":\"2017-05-12T16:49:11Z\"\,\"containerStatuses\":[{\"name\":\"proxy-authn-v1\"\,\"state\":{\"running\":{\"startedAt\":\"2017-05-12T16:49:21Z\"}}\,\"lastState\":{}\,\"ready\":true\,\"restartCount\":0\,\"image\":\"containers.cisco.com/oneidentity/ubidproxy:develop-64\"\,\"imageID\":\"docker-pullable://containers.cisco.com/oneidentity/ubidproxy@sha256:943ef60858c39d4a04a692d556a26fac67737cc044e3f26e0161659673b42bc6\"\,\"containerID\":\"docker://8a0f8d9b5693693da43e4e01181bbc5cd2ade3f96ae24c3d06c3e2701bd63126\"}]}}': missing fields unable to parse ',io.kubernetes.pod.terminationGracePeriod=30,vendor=CentOS,hostname=cae-ga1-597,engine_host=cae-ga1-597,container_name=k8s_proxy-authn-v1.7c0b84a4_proxy-authn-v1-1_coi-dataservices-stg_ec68feaa-3732-11e7-9710-005056bcec0e_efda1c6e,container_version=develop-64,io.kubernetes.container.hash=7c0b84a4,io.kubernetes.pod.namespace=coi-dataservices-stg,io.kubernetes.container.restartCount=1,io.kubernetes.pod.name=proxy-authn-v1-1,name=CentOS\ Base\ Image usage_total=16203823101i,container_id="24d4a5fd90f5194e7a840448316eb969090beff39d901425b4b86a9b52d7ee72" 1495079048000000000': missing measurement unable to parse 'docker_container_cpu,name=CentOS\ Base\ 
Image,vendor=CentOS,io.kubernetes.pod.terminationGracePeriod=30,container_image=containers.cisco.com/oneidentity/ubiddatabroker,container_name=k8s_broker-service-v1.d90554d2_broker-service-v1-dwq5h_coi-dataservices-stg_d5ccf972-3734-11e7-9710-005056bcec0e_7dfbdf97,io.kubernetes.container.preStopHandler={\"exec\":{\"command\":[\"sh\"\,\"-c\"\,\"/apps/latest/bin/stop-broker\"\"]}},build-date=20161214,io.kubernetes.container.name=broker-service-v1,io.kubernetes.pod.name=broker-service-v1-dwq5h,io.kubernetes.container.ports=[{\"name\":\"ldaps\"\,\"containerPort\":2636\,\"protocol\":\"TCP\"}\,{\"name\":\"scim\"\,\"containerPort\":8443\,\"protocol\":\"TCP\"}\,{\"name\":\"serflantcp\"\,\"containerPort\":8301\,\"protocol\":\"TCP\"}\,{\"name\":\"serfwantcp\"\,\"containerPort\":8302\,\"protocol\":\"TCP\"}\,{\"name\":\"serflanudp\"\,\"containerPort\":8301\,\"protocol\":\"UDP\"}\,{\"name\":\"serfwanudp\"\,\"containerPort\":8302\,\"protocol\":\"UDP\"}],datacenter=rcdn,container_version=master-2,io.kubernetes.container.hash=d90554d2,io.kubernetes.container.restartCount=1,engine_host=cae-ga1-597,io.kubernetes.pod.namespace=coi-dataservices-stg,cluster=cae-ga-rcdn,io.kubernetes.pod.data={\"kind\":\"Pod\"\,\"apiVersion\":\"v1\"\,\"metadata\":{\"name\":\"broker-service-v1-dwq5h\"\,\"generateName\":\"broker-service-v1-\"\,\"namespace\":\"coi-dataservices-stg\"\,\"selfLink\":\"/api/v1/namespaces/coi-dataservices-stg/pods/broker-service-v1-dwq5h\"\,\"uid\":\"d5ccf972-3734-11e7-9710-005056bcec0e\"\,\"resourceVersion\":\"79981792\"\,\"creationTimestamp\":\"2017-05-12T17:02:52Z\"\,\"labels\":{\"app\":\"broker-service-v1\"}\,\"annotations\":{\"kubernetes.io/config.seen\":\"2017-05-16T18:24:17.759363397-07:00\"\,\"kubernetes.io/config.source\":\"api\"\,\"kubernetes.io/created-by\":\"{\"kind\":\"SerializedReference\"\,\"apiVersion\":\"v1\"\,\"reference\":{\"kind\":\"ReplicationController\"\,\"namespace\":\"coi-dataservices-stg\"\,\"name\":\"broker-service-v1\"\,\"uid\":\"d5c5d28f-3734-11e7-b50c-005056bcbc98\"\,\"apiVersion\":\"v1\"\,\"resourceVersion\":\"76858956\"}}\n\"\,\"openshift.io/scc\":\"restricted\"\,\"pod.alpha.kubernetes.io/initialized\":\"true\"}}\,\"spec\":{\"volumes\":[{\"name\":\"default-token-j303p\"\,\"secret\":{\"secretName\":\"default-token-j303p\"\,\"defaultMode\":420}}]\,\"containers\":[{\"name\":\"broker-service-v1\"\,\"image\":\"containers.cisco.com/oneidentity/ubiddatabroker:master-2\"\,\"ports\":[{\"name\":\"ldaps\"\,\"containerPort\":2636\,\"protocol\":\"TCP\"}\,{\"name\":\"scim\"\,\"containerPort\":8443\,\"protocol\":\"TCP\"}\,{\"name\":\"serflantcp\"\,\"containerPort\":8301\,\"protocol\":\"TCP\"}\,{\"name\":\"serfwantcp\"\,\"containerPort\":8302\,\"protocol\":\"TCP\"}\,{\"name\":\"serflanudp\"\,\"containerPort\":8301\,\"protocol\":\"UDP\"}\,{\"name\":\"serfwanudp\"\,\"containerPort\":8302\,\"protocol\":\"UDP\"}]\,\"env\":[{\"name\":\"POD_NAMESPACE\"\,\"valueFrom\":{\"fieldRef\":{\"apiVersion\":\"v1\"\,\"fieldPath\":\"metadata.namespace\"}}}\,{\"name\":\"CONSUL_LOCAL_CONFIG\"\,\"value\":\"{\"datacenter\":\"rcdn-stg\"}\"}\,{\"name\":\"SERVICENAME\"\,\"value\":\"broker-service-v1\"}\,{\"name\":\"DEPENDENT_SERVICENAME\"\,\"value\":\"proxy-authn-v1\"}\,{\"name\":\"LOCATION\"\,\"value\":\"rcdn-stg\"}\,{\"name\":\"FAIL_OVER_LOCATION\"\,\"value\":\"TBD\"}\,{\"name\":\"VAULT_TOKEN\"\,\"value\":\"92760921-d690-f6f9-35cf-abae5a872e13\"}]\,\"resources\":{\"limits\":{\"cpu\":\"4\"\,\"memory\":\"16Gi\"}\,\"requests\":{\"cpu\":\"400m\"\,\"memory\":\"4Gi\"}}\,\"volumeMounts\":[{\"name\":\"default-token-j30
3p\"\,\"readOnly\":true\,\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\"}]\,\"readinessProbe\":{\"exec\":{\"command\":[\"/apps/scripts/readiness-probe.sh\"]}\,\"initialDelaySeconds\":15\,\"timeoutSeconds\":5\,\"periodSeconds\":10\,\"successThreshold\":1\,\"failureThreshold\":3}\,\"lifecycle\":{\"preStop\":{\"exec\":{\"command\":[\"sh\"\,\"-c\"\,\"/apps/latest/bin/stop-broker\"\"]}}}\,\"terminationMessagePath\":\"/dev/termination-log\"\,\"imagePullPolicy\":\"Always\"\,\"securityContext\":{\"capabilities\":{\"drop\":[\"KILL\"\,\"MKNOD\"\,\"SETGID\"\,\"SETUID\"\,\"SYS_CHROOT\"]}\,\"privileged\":false\,\"seLinuxOptions\":{\"level\":\"s0:c28\,c7\"}\,\"runAsUser\":1000770000}}]\,\"restartPolicy\":\"Always\"\,\"terminationGracePeriodSeconds\":30\,\"dnsPolicy\":\"ClusterFirst\"\,\"nodeSelector\":{\"environment\":\"ext-nonprod\"}\,\"serviceAccountName\":\"default\"\,\"serviceAccount\":\"default\"\,\"nodeName\":\"cae-ga1-597.cisco.com\"\,\"securityContext\":{\"seLinuxOptions\":{\"level\":\"s0:c28\,c7\"}\,\"fsGroup\":1000770000}\,\"imagePullSecrets\":[{\"name\":\"default-dockercfg-30b52\"}\,{\"name\":\"imagepull-data\"}]}\,\"status\":{\"phase\":\"Running\"\,\"conditions\":[{\"type\":\"Initialized\"\,\"status\":\"True\"\,\"lastProbeTime\":null\,\"lastTransitionTime\":\"2017-05-12T17:02:52Z\"}\,{\"type\":\"Ready\"\,\"status\":\"True\"\,\"lastProbeTime\":null\,\"lastTransitionTime\":\"2017-05-17T01:31:42Z\"}\,{\"type\":\"PodScheduled\"\,\"status\":\"True\"\,\"lastProbeTime\":null\,\"lastTransitionTime\":\"2017-05-12T17:02:52Z\"}]\,\"hostIP\":\"72.163.48.175\"\,\"podIP\":\"10.0.37.163\"\,\"startTime\":\"2017-05-12T17:02:52Z\"\,\"containerStatuses\":[{\"name\":\"broker-service-v1\"\,\"state\":{\"running\":{\"startedAt\":\"2017-05-12T17:06:02Z\"}}\,\"lastState\":{}\,\"ready\":true\,\"restartCount\":0\,\"image\":\"containers.cisco.com/oneidentity/ubiddatabroker:master-2\"\,\"imageID\":\"docker-pullable://containers.cisco.com/oneidentity/ubiddatabroker@sha256:429c668d63cd67a8968fc11f4dae362b2cc1550b8f0cddf6d19a41cd56fbe725\"\,\"containerID\":\"docker://83265ee310be5f40e59e607364b56c752786bd9d1bea8b11dd65a0fade523608\"}]}}': missing fields unable to parse ',io.kubernetes.pod.uid=d5ccf972-3734-11e7-9710-005056bcec0e,io.kubernetes.container.terminationMessagePath=/dev/termination-log,license=GPLv2,hostname=cae-ga1-597,cpu=cpu57 usage_total=90133990079i,container_id="faf1b5985b48171cd16b206f4a231c36d241c5869f03d96e71e63fafbac28d09" 1495079049000000000': missing measurement unable to parse 'docker_container_cpu,io.kubernetes.container.name=proxy-authn-v1,container_name=k8s_proxy-authn-v1.7c0b84a4_proxy-authn-v1-1_coi-dataservices-stg_ec68feaa-3732-11e7-9710-005056bcec0e_efda1c6e,io.kubernetes.container.terminationMessagePath=/dev/termination-log,hostname=cae-ga1-597,container_image=containers.cisco.com/oneidentity/ubidproxy,name=CentOS\ Base\ 
Image,cpu=cpu54,container_version=develop-64,io.kubernetes.container.ports=[{\"name\":\"ldaps\"\,\"containerPort\":3636\,\"protocol\":\"TCP\"}\,{\"name\":\"serflantcp\"\,\"containerPort\":8301\,\"protocol\":\"TCP\"}\,{\"name\":\"serfwantcp\"\,\"containerPort\":8302\,\"protocol\":\"TCP\"}\,{\"name\":\"serflanudp\"\,\"containerPort\":8301\,\"protocol\":\"UDP\"}\,{\"name\":\"serfwanudp\"\,\"containerPort\":8302\,\"protocol\":\"UDP\"}],datacenter=rcdn,license=GPLv2,io.kubernetes.container.hash=7c0b84a4,cluster=cae-ga-rcdn,io.kubernetes.pod.namespace=coi-dataservices-stg,engine_host=cae-ga1-597,io.kubernetes.pod.data={\"kind\":\"Pod\"\,\"apiVersion\":\"v1\"\,\"metadata\":{\"name\":\"proxy-authn-v1-1\"\,\"generateName\":\"proxy-authn-v1-\"\,\"namespace\":\"coi-dataservices-stg\"\,\"selfLink\":\"/api/v1/namespaces/coi-dataservices-stg/pods/proxy-authn-v1-1\"\,\"uid\":\"ec68feaa-3732-11e7-9710-005056bcec0e\"\,\"resourceVersion\":\"79982417\"\,\"creationTimestamp\":\"2017-05-12T16:49:11Z\"\,\"labels\":{\"app\":\"proxy-authn-v1\"}\,\"annotations\":{\"kubernetes.io/config.seen\":\"2017-05-16T18:24:17.759271218-07:00\"\,\"kubernetes.io/config.source\":\"api\"\,\"kubernetes.io/created-by\":\"{\"kind\":\"SerializedReference\"\,\"apiVersion\":\"v1\"\,\"reference\":{\"kind\":\"PetSet\"\,\"namespace\":\"coi-dataservices-stg\"\,\"name\":\"proxy-authn-v1\"\,\"uid\":\"07ff8068-3732-11e7-b50c-005056bcbc98\"\,\"apiVersion\":\"apps\"\,\"resourceVersion\":\"76848314\"}}\n\"\,\"openshift.io/scc\":\"restricted\"\,\"pod.alpha.kubernetes.io/initialized\":\"true\"\,\"pod.beta.kubernetes.io/hostname\":\"proxy-authn-v1-1\"\,\"pod.beta.kubernetes.io/subdomain\":\"proxy-authn-v1\"}}\,\"spec\":{\"volumes\":[{\"name\":\"default-token-j303p\"\,\"secret\":{\"secretName\":\"default-token-j303p\"\,\"defaultMode\":420}}]\,\"containers\":[{\"name\":\"proxy-authn-v1\"\,\"image\":\"containers.cisco.com/oneidentity/ubidproxy:develop-64\"\,\"ports\":[{\"name\":\"ldaps\"\,\"containerPort\":3636\,\"protocol\":\"TCP\"}\,{\"name\":\"serflantcp\"\,\"containerPort\":8301\,\"protocol\":\"TCP\"}\,{\"name\":\"serfwantcp\"\,\"containerPort\":8302\,\"protocol\":\"TCP\"}\,{\"name\":\"serflanudp\"\,\"containerPort\":8301\,\"protocol\":\"UDP\"}\,{\"name\":\"serfwanudp\"\,\"containerPort\":8302\,\"protocol\":\"UDP\"}]\,\"env\":[{\"name\":\"POD_NAMESPACE\"\,\"valueFrom\":{\"fieldRef\":{\"apiVersion\":\"v1\"\,\"fieldPath\":\"metadata.namespace\"}}}\,{\"name\":\"CONSUL_LOCAL_CONFIG\"\,\"value\":\"{\"datacenter\":\"rcdn-stg\"}\"}\,{\"name\":\"SERVICENAME\"\,\"value\":\"proxy-authn-v1\"}\,{\"name\":\"DEPENDENT_SERVICENAME\"\,\"value\":\"directory-authn-v1\"}\,{\"name\":\"DS_GEO\"\,\"value\":\"test\"}\,{\"name\":\"DS_TYPE\"\,\"value\":\"AUTHN\"}\,{\"name\":\"proxy_PETSET_NAME\"\,\"value\":\"proxy-authn-v1\"}\,{\"name\":\"LOCATION\"\,\"value\":\"rcdn-stg\"}\,{\"name\":\"FAIL_OVER_LOCATION\"\,\"value\":\"TBD\"}\,{\"name\":\"VAULT_TOKEN\"\,\"value\":\"92760921-d690-f6f9-35cf-abae5a872e13\"}]\,\"resources\":{\"limits\":{\"cpu\":\"4\"\,\"memory\":\"16Gi\"}\,\"requests\":{\"cpu\":\"400m\"\,\"memory\":\"4Gi\"}}\,\"volumeMounts\":[{\"name\":\"default-token-j303p\"\,\"readOnly\":true\,\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\"}]\,\"readinessProbe\":{\"exec\":{\"command\":[\"/apps/scripts/readiness-probe.sh\"]}\,\"initialDelaySeconds\":15\,\"timeoutSeconds\":5\,\"periodSeconds\":10\,\"successThreshold\":1\,\"failureThreshold\":3}\,\"lifecycle\":{\"preStop\":{\"exec\":{\"command\":[\"/apps/latest/bin/stop-proxy\"]}}}\,\"terminationMessagePath\":
\"/dev/termination-log\"\,\"imagePullPolicy\":\"Always\"\,\"securityContext\":{\"capabilities\":{\"drop\":[\"KILL\"\,\"MKNOD\"\,\"SETGID\"\,\"SETUID\"\,\"SYS_CHROOT\"]}\,\"privileged\":false\,\"seLinuxOptions\":{\"level\":\"s0:c28\,c7\"}\,\"runAsUser\":1000770000}}]\,\"restartPolicy\":\"Always\"\,\"terminationGracePeriodSeconds\":30\,\"dnsPolicy\":\"ClusterFirst\"\,\"nodeSelector\":{\"environment\":\"ext-nonprod\"}\,\"serviceAccountName\":\"default\"\,\"serviceAccount\":\"default\"\,\"nodeName\":\"cae-ga1-597.cisco.com\"\,\"securityContext\":{\"seLinuxOptions\":{\"level\":\"s0:c28\,c7\"}\,\"fsGroup\":1000770000}\,\"imagePullSecrets\":[{\"name\":\"default-dockercfg-30b52\"}\,{\"name\":\"imagepull-data\"}]}\,\"status\":{\"phase\":\"Running\"\,\"conditions\":[{\"type\":\"Initialized\"\,\"status\":\"True\"\,\"lastProbeTime\":null\,\"lastTransitionTime\":\"2017-05-12T16:49:11Z\"}\,{\"type\":\"Ready\"\,\"status\":\"True\"\,\"lastProbeTime\":null\,\"lastTransitionTime\":\"2017-05-17T01:31:46Z\"}\,{\"type\":\"PodScheduled\"\,\"status\":\"True\"\,\"lastProbeTime\":null\,\"lastTransitionTime\":\"2017-05-12T16:49:11Z\"}]\,\"hostIP\":\"72.163.48.175\"\,\"podIP\":\"10.0.37.158\"\,\"startTime\":\"2017-05-12T16:49:11Z\"\,\"containerStatuses\":[{\"name\":\"proxy-authn-v1\"\,\"state\":{\"running\":{\"startedAt\":\"2017-05-12T16:49:21Z\"}}\,\"lastState\":{}\,\"ready\":true\,\"restartCount\":0\,\"image\":\"containers.cisco.com/oneidentity/ubidproxy:develop-64\"\,\"imageID\":\"docker-pullable://containers.cisco.com/oneidentity/ubidproxy@sha256:943ef60858c39d4a04a692d556a26fac67737cc044e3f26e0161659673b42bc6\"\,\"containerID\":\"docker://8a0f8d9b5693693da43e4e01181bbc5cd2ade3f96ae24c3d06c3e2701bd63126\"}]}}': missing fields unable to parse ',io.kubernetes.pod.terminationGracePeriod=30,io.kubernetes.pod.uid=ec68feaa-3732-11e7-9710-005056bcec0e,io.kubernetes.container.restartCount=1,build-date=20161214,io.kubernetes.container.preStopHandler={\"exec\":{\"command\":[\"/apps/latest/bin/stop-proxy\"]}},vendor=CentOS,io.kubernetes.pod.name=proxy-authn-v1-1 usage_total=29912823128i,container_id="24d4a5fd90f5194e7a840448316eb969090beff39d901425b4b86a9b52d7ee72" 1495079048000000000': missing measurement

danielnelson commented 7 years ago

I recommend excluding all docker labels as a starting point, and only adding them back in as needed. Try adding this to your docker config:

[[inputs.docker]]
  ... snip ...
  docker_label_exclude = ["*"]
kotarusv commented 7 years ago

Wonderful. That seems to have fixed our issue. I'm no longer seeing the verbose output in the logs from one cluster. However, in another cluster I am seeing the errors below, even though the Telegraf container is working and sending metrics.

Not sure why I am getting these errors in the second cluster.

oc logs -f telegraf-agent-6lhno

2017/05/19 04:55:58 I! Using config file: /etc/telegraf/telegraf.conf
2017-05-19T04:55:58Z I! Starting Telegraf (version 1.3.0)
2017-05-19T04:55:58Z I! Loaded outputs: influxdb influxdb
2017-05-19T04:55:58Z I! Loaded inputs: inputs.mem inputs.net inputs.procstat inputs.procstat inputs.cpu inputs.disk inputs.diskio inputs.kernel inputs.docker inputs.processes inputs.swap inputs.system inputs.netstat

2017-05-19T04:55:58Z I! Agent Config: Interval:1m0s, Quiet:false, Hostname:"", Flush Interval:10s
2017-05-19T04:56:16Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_service-iam-enrollment.7eb77c8c_service-iam-enrollment-50kb3_coi-iamservices-stg_5fd5e68e-3c39-11e7-b8cc-005056ac0765_08e63b81] stats: Error getting docker stats: context deadline exceeded
2017-05-19T04:56:16Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_coi-vo4kb_coi-qa-poc_37319da7-3a73-11e7-b856-005056ac66ba_a411e0c7] stats: Error getting docker stats: context deadline exceeded
2017-05-19T04:56:16Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_graylog.11f30466_graylog-pet-2_coi-management-stg_38190979-3a73-11e7-b856-005056ac66ba_8f8bccdc] stats: Error getting docker stats: context deadline exceeded
2017-05-19T04:56:16Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_db-amservice-1_coi-dataservice2-poc_c3e0b850-3bac-11e7-b8cc-005056ac0765_3d0311d2] stats: Error getting docker stats: context deadline exceeded
2017-05-19T04:56:16Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_kong-app.f37d4bf3_kong-app-2621467567-da9lx_coi-gateway-dev_367ddce5-3a73-11e7-b856-005056ac66ba_7a5e2c97] stats: Error getting docker stats: context deadline exceeded
2017-05-19T04:56:16Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_test-consul.f0a64712_testconsul.com-d6iu1_coi-dataservices-dev_3651c907-3a73-11e7-b856-005056ac66ba_c06e52a7] stats: Error getting docker stats: context deadline exceeded
2017-05-19T04:56:17Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_ux-app.79e46a18_ux-app-2dbky_coi-iamservices-stg_4ecf7974-3ba2-11e7-b8cc-005056ac0765_49835063] stats: Error getting docker stats: context deadline exceeded
2017-05-19T04:56:17Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_service-iam-recovery.831d77a7_service-iam-recovery-539zo_coi-iamservices-poc_3114c62d-34a0-11e7-b998-005056ac0765_381bc64d] stats: Error getting docker stats: context deadline exceeded
2017-05-19T04:56:17Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_es.b35dde17_es-pet-1_coi-management-stg_32e1b458-36ad-11e7-b8cc-005056ac0765_91a3f796] stats: Error getting docker stats: context deadline exceeded

danielnelson commented 7 years ago

There is a timeout option on the docker plugin; this error means that it took longer than that amount of time to get the container stats. You can increase the timeout, but you should keep it below the agent interval.

kotarusv commented 7 years ago

[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"
  timeout = "30s"
  perdevice = true
  total = true
  docker_label_exclude = ["*"]

I increased the timeout from 15s to 30s and am still seeing errors. I'm not sure whether we need to increase it further or whether it simply takes more time to pull stats from Docker.

$ oc logs -f telegraf-agent-5unfw
2017/05/19 19:53:39 I! Using config file: /etc/telegraf/telegraf.conf
2017-05-19T19:53:39Z I! Starting Telegraf (version 1.3.0)
2017-05-19T19:53:39Z I! Loaded outputs: influxdb influxdb
2017-05-19T19:53:39Z I! Loaded inputs: inputs.mem inputs.processes inputs.net inputs.docker inputs.procstat inputs.procstat inputs.disk inputs.diskio inputs.swap inputs.system inputs.netstat inputs.cpu inputs.kernel

2017-05-19T19:53:39Z I! Agent Config: Interval:1m0s, Quiet:false, Hostname:"", Flush Interval:10s 2017-05-19T19:54:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_broker-service-v1-jjc25_coi-dataservices-stg_155bff05-3735-11e7-b856-005056ac66ba_34f6cf03] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:54:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_cassandra.e2e3e449_kong-database-0_coi-gateway-poc_58f60d4d-2a4c-11e7-b998-005056ac0765_cffe1aad] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:54:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_directory-authn-emea-proxy-test-0_coi-dataservice2-poc_87e0d567-377b-11e7-b856-005056ac66ba_0e5b73a1] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:54:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_directory-authn-apac-proxy-test-0_coi-dataservice2-poc_e014d4c9-3811-11e7-b856-005056ac66ba_4490e3e5] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:54:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_cache-amservice.fde51541_cache-amservice-0_coi-dataservices-stg_5f38abbb-3891-11e7-b856-005056ac66ba_27363499] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:54:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_amservice-3473060347-19j5c_coi-gateway-stg_d97f8cfb-371f-11e7-b856-005056ac66ba_e8f23808] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:54:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_coi-base-7vk12_coi-iamservices-poc_c0dcc128-2e2f-11e7-aa8e-005056ac69a9_a1b27199] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:55:32Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_amservice-3473060347-19j5c_coi-gateway-stg_d97f8cfb-371f-11e7-b856-005056ac66ba_e8f23808] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:55:32Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_coi-base-7vk12_coi-iamservices-poc_c0dcc128-2e2f-11e7-aa8e-005056ac69a9_a1b27199] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:55:32Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_cassandra.e2e3e449_kong-database-0_coi-gateway-poc_58f60d4d-2a4c-11e7-b998-005056ac0765_cffe1aad] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:55:32Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_broker-service-v1-jjc25_coi-dataservices-stg_155bff05-3735-11e7-b856-005056ac66ba_34f6cf03] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:55:32Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_directory-authn-emea-proxy-test-0_coi-dataservice2-poc_87e0d567-377b-11e7-b856-005056ac66ba_0e5b73a1] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:55:32Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_directory-authn-apac-proxy-test-0_coi-dataservice2-poc_e014d4c9-3811-11e7-b856-005056ac66ba_4490e3e5] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:55:32Z E! Error in plugin [inputs.docker]: E! 
Error gathering container [/k8s_cache-amservice.fde51541_cache-amservice-0_coi-dataservices-stg_5f38abbb-3891-11e7-b856-005056ac66ba_27363499] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:56:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_directory-authn-apac-proxy-test-0_coi-dataservice2-poc_e014d4c9-3811-11e7-b856-005056ac66ba_4490e3e5] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:56:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_cache-amservice.fde51541_cache-amservice-0_coi-dataservices-stg_5f38abbb-3891-11e7-b856-005056ac66ba_27363499] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:56:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_broker-service-v1-jjc25_coi-dataservices-stg_155bff05-3735-11e7-b856-005056ac66ba_34f6cf03] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:56:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_coi-base-7vk12_coi-iamservices-poc_c0dcc128-2e2f-11e7-aa8e-005056ac69a9_a1b27199] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:56:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_cassandra.e2e3e449_kong-database-0_coi-gateway-poc_58f60d4d-2a4c-11e7-b998-005056ac0765_cffe1aad] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:56:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_amservice-3473060347-19j5c_coi-gateway-stg_d97f8cfb-371f-11e7-b856-005056ac66ba_e8f23808] stats: Error getting docker stats: context deadline exceeded 2017-05-19T19:56:34Z E! Error in plugin [inputs.docker]: E! Error gathering container [/k8s_POD.6faadc2e_directory-authn-emea-proxy-test-0_coi-dataservice2-poc_87e0d567-377b-11e7-b856-005056ac66ba_0e5b73a1] stats: Error getting docker stats: context deadline exceeded

danielnelson commented 7 years ago

You should be able to time the main Docker call outside of Telegraf with curl 7.5 or greater: curl --unix-socket /var/run/docker.sock http://localhost/info. How long does it take for this call to complete?

kotarusv commented 7 years ago

@danielnelson thanks for taking a fresh look again.

Here is the command output:

$ sudo curl -o /dev/null -s -w %{time_connect}:%{time_starttransfer}:%{time_total} --unix-socket /var/run/docker.sock http://localhost/info
0.000:0.472:0.472

It is pretty fast, and I could see the output as well.

I want to get your opinion on a situation I am facing related to the same matter.

We encountered a Docker bug where it hangs when pods get multiple OOMs; we are working with Docker, and a patch will be released soon. Although Docker is hung, with "docker ps" and "docker info" not working or hanging without returning anything, Telegraf is still collecting metrics perfectly. I set up a Grafana alert to notify us if Docker on any node is down, using the n_containers_running measurement. Even while Docker is hung, Grafana still shows a valid n_containers_running count. How is that possible?

During the hang, I tested the equivalent Docker API call directly from the node, and it also failed to list containers since the Docker daemon was hanging.

sudo curl --unix-socket /var/run/docker.sock http:/v1.24/containers/json

But Telegraf is able to collect the same metric.

Srinivas Kotaru

keras commented 7 years ago

I was seeing a similar issue:

2017-06-13T11:50:52Z E! ERROR: input [inputs.cpu] took longer to collect than collection interval (30s)
2017-06-13T11:50:52Z E! ERROR: input [inputs.conntrack] took longer to collect than collection interval (30s)
2017-06-13T11:50:52Z E! ERROR: input [inputs.swap] took longer to collect than collection interval (30s)
2017-06-13T11:50:52Z E! ERROR: input [inputs.mem] took longer to collect than collection interval (30s)
2017-06-13T11:50:52Z E! ERROR: input [inputs.net] took longer to collect than collection interval (30s)
2017-06-13T11:50:52Z E! ERROR: input [inputs.docker] took longer to collect than collection interval (30s)
2017-06-13T11:50:52Z E! ERROR: input [inputs.disk] took longer to collect than collection interval (30s)
2017-06-13T11:50:52Z E! ERROR: input [inputs.diskio] took longer to collect than collection interval (30s)
2017-06-13T11:50:52Z E! ERROR: input [inputs.system] took longer to collect than collection interval (30s)

My issue isn't reproducible anymore on 1.3.0, so it probably wasn't the same issue, but I also found that setting flush_interval to the same value as interval fixed it. Even with that, though, aggregations still triggered this same issue.

I think you could try setting flush_interval and see if it helps.
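
As a sketch, assuming the 30s collection interval shown in the log above, that would mean an agent section roughly like:

[agent]
  interval = "30s"
  ## match the flush interval to the collection interval, per the suggestion above
  flush_interval = "30s"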

kotarusv commented 7 years ago

We frequently see this issue on our container platform, where Telegraf runs as a container and collects both host and Docker metrics.

My configuration is below:

[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "5s"
  flush_interval = "60s"
  flush_jitter = "5s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = true

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.processes]]

[[inputs.swap]]

[[inputs.system]]

[[inputs.net]]

[[inputs.netstat]]

[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"
  timeout = "30s"
  perdevice = true
  total = true
  docker_label_exclude = ["*"]

Can you confirm whether setting the interval and flush_interval to the same value will fix this issue?

Srinivas Kotaru

danielnelson commented 7 years ago

Can you confirm whether setting the interval and flush_interval to the same value will fix this issue?

I'm not sure why it would help. Of course, if your flush interval is too high it will take too long to flush, and inputs have only a 100-metric buffer to work with during the flush.

Even during docker is hanged, Grafana still showing the valid n_containers running. how it is possible?

I'm not sure; you might want to check InfluxDB to verify the metric is still being updated. If it is, then Telegraf must be able to query the Docker socket.
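
One way to check, sketched here with a placeholder host and assuming the docker measurement and n_containers_running field produced by the docker input, is to query the last reported value directly:

curl -G 'http://<influxdb-host>:8086/query' \
  --data-urlencode 'db=hosting' \
  --data-urlencode 'q=SELECT last("n_containers_running") FROM "docker" WHERE time > now() - 5m'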

kotarusv commented 7 years ago

Yes, I was able to see data in InfluxDB while Docker was hung. I was wondering whether the docker ps command and the Telegraf docker plugin use the same API or not.

danielnelson commented 7 years ago

I'm not sure if they make the same requests; all I know is that they both use the socket.

joshughes commented 7 years ago

I have had the same issue...

On the host I can list stats and containers fine, but when Telegraf tries to gather stats it times out.

I pushed my timeout to 120s and still was unable to get past the issue

Using Telegraf 1.3

docker info
Containers: 19
 Running: 19
 Paused: 0
 Stopped: 0
Images: 27
Server Version: 1.11.2
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 4.4.35+
Operating System: Container-Optimized OS from Google
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.619 GiB
Name: foobar
ID: KZKW:KLQ3:32ER:ZAIL:XIVY:YDLB:P5KP:G6SP:O6WU:IKGS:OMN7:PFWS
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
danielnelson commented 7 years ago

I have lost track of what the current issue here is. @joshughes @kotarusv, can you open new issues with the details if needed?