
vSphere input for Telegraf stops working #5057

Closed. v4vamsee closed this issue 5 years ago.

v4vamsee commented 5 years ago

Relevant telegraf.conf:

System info:

CentOS Linux release 7.5.1804, Telegraf 1.9.0, InfluxDB 1.6.3

Telegraf startup log has following:

2018-11-28T21:56:12Z I! Loaded inputs: inputs.influxdb inputs.jolokia2_agent inputs.vsphere inputs.cpu inputs.disk
2018-11-28T21:56:12Z I! Loaded aggregators:
2018-11-28T21:56:12Z I! Loaded processors:
2018-11-28T21:56:12Z I! Loaded outputs: influxdb
2018-11-28T21:56:12Z I! Tags enabled: host=xxxx
2018-11-28T21:56:12Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"rxxxx", Flush Interval:10s
2018-11-28T21:56:12Z D! [agent] Connecting outputs
2018-11-28T21:56:12Z D! [agent] Attempting connection to output: influxdb
2018-11-28T21:56:12Z D! [agent] Successfully connected to output: influxdb
2018-11-28T21:56:12Z D! [agent] Starting service inputs
2018-11-28T21:56:12Z D! [input.vsphere]: Starting plugin
2018-11-28T21:56:12Z D! [input.vsphere]: Running initial discovery and waiting for it to finish
2018-11-28T21:56:12Z D! [input.vsphere]: Creating client: xxxxx
2018-11-28T21:56:12Z I! [input.vsphere] Option query for maxQueryMetrics failed. Using default
2018-11-28T21:56:12Z D! [input.vsphere] vCenter version is: 6.5.0

Steps to reproduce:

  1. Install Telegraf 1.9.0 and InfluxDB 1.6.3.
  2. Configure the vsphere input with the following configuration:

[[inputs.vsphere]]
  vcenters = ["rzzz"]
  username = "zzzz"
  password = "zzz"
  interval = "30s"

  vm_metric_include = [
    "sys.uptime.latest",
    "cpu.usage.average",
    "cpu.ready.summation",
    "cpu.readiness.average",
    "cpu.usagemhz.average",
    "cpu.wait.summation",
    "cpu.system.summation",
    "cpu.used.summation",
    "mem.usage.average",
    "mem.consumed.average",
    "mem.active.average",
    "mem.vmmemctl.average",
    "mem.swapused.average",
    "mem.swapIn.average",
    "mem.swapOut.average",
    "disk.maxTotalLatency.latest",
    "net.usage.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.packetsRx.summation",
    "net.packetsTx.summation",
    "net.received.average",
    "net.transmitted.average",
    "virtualDisk.read.average",
    "virtualDisk.write.average",
    "virtualDisk.totalWriteLatency.average",
    "virtualDisk.totalReadLatency.average",
    "virtualDisk.numberReadAveraged.average",
    "virtualDisk.numberWriteAveraged.average",
    "virtualDisk.readOIO.latest",
    "virtualDisk.writeOIO.latest"
  ]
  vm_metric_exclude = []
  vm_instances = true ## true by default

  host_metric_include = [
    "cpu.usagemhz.average",
    "cpu.usage.average",
    "cpu.corecount.provisioned.average",
    "mem.capacity.provisioned.average",
    "mem.active.average",
    "net.throughput.usage.average",
    "net.throughput.contention.summation",
    "vmop.numSVMotion.latest",
    "vmop.numVMotion.latest",
    "vmop.numXVMotion.latest",
    "storageAdapter.numberReadAveraged.average",
    "storageAdapter.numberWriteAveraged.average",
    "storageAdapter.read.average",
    "storageAdapter.write.average",
    "storageAdapter.totalReadLatency.average",
    "storageAdapter.totalWriteLatency.average",
    "cpu.utilization.average",
    "cpu.readiness.average",
    "cpu.ready.summation",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "virtualDisk.totalWriteLatency.average",
    "virtualDisk.totalReadLatency.average",
    "net.received.average",
    "net.transmitted.average",
    "net.packetsRx.summation",
    "net.packetsTx.summation",
    "mem.consumed.average",
    "mem.totalmb.average"
  ]
  host_metric_exclude = []
  host_instances = true ## true by default

  datastore_metric_include = [
    "datastore.numberReadAveraged.average",
    "datastore.numberWriteAveraged.average",
    "datastore.read.average",
    "datastore.write.average",
    "datastore.totalReadLatency.average",
    "datastore.totalWriteLatency.average",
    "datastore.datastoreVMObservedLatency.latest",
    "disk.capacity.latest",
    "disk.used.latest",
    "disk.numberReadAveraged.average",
    "disk.numberWriteAveraged.average"
  ] ## if omitted or empty, all metrics are collected
  datastore_metric_exclude = []
  datastore_instances = true ## false by default for Datastores only

  datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
  datacenter_metric_exclude = [] ## Datacenters are not collected by default.
  datacenter_instances = false

  cluster_metric_include = [] ## if omitted or empty, all metrics are collected
  cluster_metric_exclude = [] ## Nothing excluded by default
  cluster_instances = false ## true by default

  separator = "_"
  max_query_objects = 70
  max_query_metrics = 70
  collect_concurrency = 4
  discover_concurrency = 1
  force_discover_on_init = true
  object_discovery_interval = "30s"
  timeout = "20s"
  insecure_skip_verify = true

Expected behavior:

Data is collected

Actual behavior:

Data stops being collected after the following message appears in the log:

2018-11-28T21:13:00Z W! [agent] input “inputs.vsphere” did not complete within its interval

Additional info:

The same configuration works with Telegraf 1.8.3. I upgraded Telegraf from 1.8.3 to 1.9.0.

Related forum link: https://community.influxdata.com/t/telegraf-vsphere/7457/11

prydin commented 5 years ago

We are working on this particular issue as we speak. It seems to be related to intermittent network issues not being handled correctly.

prydin commented 5 years ago

@danielnelson This goes back to the discussion we had a while ago about exposing the interval or sending in a context with a timeout to Gather().

Every time I make an API call, I have to wrap it like this:

ctx1, cancel1 := context.WithTimeout(ctx, timeout) // child context that expires after `timeout`
defer cancel1()                                    // always release the timer
APICall(ctx1, params)                              // the call can no longer block forever

As you might have guessed, I forgot that wrapping around one call, which caused it to hang indefinitely when the network dropped.

If I knew the interval, I could have created a root context with a deadline that matches the interval and passed it to all calls. Alternatively, if Gather() took a context, the Telegraf core could cancel it when the interval was exceeded.

Of course I shouldn't have forgotten the wrapping, but having to deal with timeouts manually for every call is very error-prone.
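For illustration, a minimal, self-contained sketch of the first alternative (one root context with a deadline matching the interval, inherited by every call). The names apiCall and gather are hypothetical stand-ins, not Telegraf or govmomi APIs:

package main

import (
	"context"
	"fmt"
	"time"
)

// apiCall stands in for a vCenter request; it blocks until the simulated
// work finishes or the context is done.
func apiCall(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second): // simulated slow or hung network call
		return nil
	case <-ctx.Done():
		return ctx.Err() // canceled or deadline exceeded
	}
}

// gather derives one root context whose deadline matches the collection
// interval; every call inherits it, so nothing can hang past the interval.
func gather(interval time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), interval)
	defer cancel()
	return apiCall(ctx)
}

func main() {
	fmt.Println(gather(2 * time.Second)) // prints: context deadline exceeded
}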

danielnelson commented 5 years ago

It is expected that the input will continue working towards completing the collection even across multiple intervals; this way we don't end up continually restarting the plugin, and it degrades more gracefully. Better to have a reduced sampling rate than no data at all.

I do want to add a context to the Gather function, but it should be used only for canceling on shutdown/restart.
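A sketch of what that could look like; the contextInput shape below is hypothetical, since the actual telegraf.Input interface at this point takes only an Accumulator and no context:

package main

import (
	"context"
	"fmt"
	"time"
)

// contextInput is a hypothetical context-aware variant of telegraf.Input
// (Accumulator omitted for brevity).
type contextInput interface {
	Gather(ctx context.Context) error
}

type slowInput struct{}

func (slowInput) Gather(ctx context.Context) error {
	select {
	case <-time.After(time.Hour): // a collection that never finishes on its own
		return nil
	case <-ctx.Done():
		return ctx.Err() // unblocked promptly when the core cancels
	}
}

func main() {
	// The core would cancel this context only on shutdown/restart, never for
	// a missed interval, so slow collections can span multiple intervals.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(100 * time.Millisecond) // simulated shutdown signal
		cancel()
	}()
	var in contextInput = slowInput{}
	fmt.Println(in.Gather(ctx)) // prints: context canceled
}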

einhirn commented 5 years ago

https://github.com/influxdata/telegraf/blob/release-1.10/plugins/inputs/vsphere/README.md documents the cause of this issue (realtime vs. historical metrics in vSphere) quite nicely and gives a workaround. TL;DR: run two instances of the plugin, one for VM and host (realtime) metrics and a second for the remaining (historical) metrics, as sketched below.
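For reference, a minimal sketch of that two-instance layout, reusing the reporter's placeholder credentials. The 60s/300s intervals and the "*" exclude lists follow the README's recommendation; all other settings are omitted for brevity:

## Realtime instance: VM and host metrics.
[[inputs.vsphere]]
  interval = "60s"
  vcenters = ["rzzz"]
  username = "zzzz"
  password = "zzz"
  insecure_skip_verify = true

  ## Exclude all historical metrics from this instance.
  datastore_metric_exclude = ["*"]
  cluster_metric_exclude = ["*"]
  datacenter_metric_exclude = ["*"]

## Historical instance: datastore/cluster/datacenter metrics.
[[inputs.vsphere]]
  interval = "300s"
  vcenters = ["rzzz"]
  username = "zzzz"
  password = "zzz"
  insecure_skip_verify = true

  ## Exclude all realtime metrics from this instance.
  vm_metric_exclude = ["*"]
  host_metric_exclude = ["*"]

The longer interval on the second instance loses nothing, since vCenter itself only refreshes historical rollups every five minutes.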

danielnelson commented 5 years ago

Closing, this should be fixed by #5113, but also try the workaround that @einhirn pointed out. If this is still a problem with 1.10, let's check whether any similar issues are open and, if not, open a new issue.