influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

[Inputs.vsphere] Error in plugin: ServerFaultCode: XML document element count exceeds configured maximum 500000 #5041

Closed PLColuccio closed 5 years ago

PLColuccio commented 5 years ago

I am receiving this when the plugin receives metrics from the vCenter servers.

Any idea on what is wrong / how to fix?

2018-11-26T22:24:47Z E! [inputs.vsphere]: Error in plugin: ServerFaultCode: XML document element count exceeds configured maximum 500000

while parsing serialized DataObject of type vim.PerformanceManager.MetricId at line 2, column 19637665

while parsing property "metricId" of static type ArrayOfPerfMetricId

while parsing serialized DataObject of type vim.PerformanceManager.QuerySpec at line 2, column 19598059

while parsing call information for method QueryPerf at line 2, column 66

while parsing SOAP body at line 2, column 60

while parsing SOAP envelope at line 2, column 0

while parsing HTTP request for method queryStats on object of type vim.PerformanceManager at line 1, column 0

prydin commented 5 years ago

It looks like the plugin is trying to send a huge query to the server. Limit max_query_objects and/or max_query_metrics. For example:

max_query_objects = 100
max_query_metrics = 100

prydin commented 5 years ago

The next release will also limit queries to 100,000 metrics at a time, regardless of settings. This should prevent this from happening again.

As a side note, @phreak2599, I'd be interested in knowing a bit more about your configuration. Assuming you have the default 256 objects per query, 500,000 metrics sounds incredibly high. How many VMs/hosts are in that vCenter, if you don't mind sharing?

PLColuccio commented 5 years ago

It was actually set to 64, since we are currently running 5.5, although we are in the process of upgrading to 6.5. I have since set it to 32 to see if that helps.

This plugin is currently running in one of our data centers against 4 vCenter hosts. Per vCenter:

host count 29, vm count 548
host count 131, vm count 2343
host count 60, vm count 1733
host count 87, vm count 2168

Not sure which vCenter was causing the issue though.

Update: Seems to be working well with both the query settings set to 32.

Thanks for the help @prydin !

PLColuccio commented 5 years ago

Spoke too soon. Looks like now I am getting:

2018-11-28T15:40:54Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-11-28T15:40:54Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-11-28T15:40:54Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:40:54Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-11-28T15:40:54Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-11-28T15:40:59Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:41:04Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:41:09Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:41:14Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:41:19Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:41:24Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:41:29Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:41:34Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:41:39Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:41:44Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:41:49Z D! [outputs.influxdb] buffer fullness: 0 / 1000000 metrics.
2018-11-28T15:41:54Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-11-28T15:41:54Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-11-28T15:41:54Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-11-28T15:41:54Z W! [agent] input "inputs.vsphere" did not complete within its interval

prydin commented 5 years ago

What's your collect_concurrency setting? Try to increase it to, say, 5.

Also, if you don't need instance-level (per CPU etc) metrics, you can turn that off per resource type, which should save you a lot of collection time.

Another thing you can try is to reduce the number of metrics collected to only those you need.

We're just at the tail end of a huge scale testing and performance tuning effort and should be providing an update soon that has some performance tweaks. In our lab, we're collecting metrics for 7000 VMs, including instance data, in about 6 seconds.
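For reference, the three suggestions above could be written roughly as follows. This is a minimal sketch, not a recommended configuration: the option names are the ones the vsphere input documents, the connection settings are left out, and the two metric names are only examples borrowed from a config posted in this thread.

```toml
[[inputs.vsphere]]
  # Connection settings (vcenters, username, password) omitted from this sketch.

  # Use more goroutines for metric collection; 5 is the value suggested above.
  collect_concurrency = 5

  # Turn off instance-level (per CPU, per NIC, etc.) series for VMs and hosts.
  vm_instances = false
  host_instances = false

  # Collect only the metrics you actually need (example names, adjust to taste).
  vm_metric_include = [
    "cpu.usage.average",
    "mem.usage.average",
  ]
```

Whether a concurrency of 5 is safe depends on how much load the vCenter can take, so treat the number as a starting point.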

PLColuccio commented 5 years ago

Not sure if I understand exactly what is happening, but it seems the initial discovery runs, then the plugin runs fine until the next discovery. When the next discovery runs, it doesn't complete, then the plugin doesn't seem to be sending any metrics, most likely due to the discovery failing.

Does that sound plausible?

If I raise the concurrency settings I think I will have to give more CPU to my vCenter DB servers. They have pegged out when I was playing with those in the past.

prydin commented 5 years ago

Try increasing the discovery interval to 30 minutes. The discovery logic is greatly improved in the version we're about to release. Should run 50-100 times faster!

I can post a binary if you feel like testing it out.
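As a sketch of the discovery-interval suggestion, assuming the interval is given in seconds in the same form used elsewhere in this thread (30 minutes = 1800 seconds):

```toml
[[inputs.vsphere]]
  # Connection settings (vcenters, username, password) omitted from this sketch.

  # Rediscover vCenter objects every 30 minutes instead of the 300s default.
  object_discovery_interval = "1800s"
```

The trade-off is that newly created VMs or hosts may take up to 30 minutes to show up in the collected metrics.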

prydin commented 5 years ago

BTW, the concurrency settings for metric collection shouldn't have a huge impact on database servers, at least not for VM and host metrics, since they are scraped from ESXi memory.

PLColuccio commented 5 years ago

I can try the latest repo. Let me see how difficult it is to compile.

prydin commented 5 years ago

It's not in the main repo, but in my fork. I can point you in the right direction in a little while. In transit this second.


PLColuccio commented 5 years ago

I had the same issue with the new version. The discovery finishes, the initial metric collection seems to finish (but I don't think it does). I think the plugin is hanging, and not ever finishing the initial collection.

I am just getting:

2018-11-29T15:46:36Z D! [input.vsphere] Query for cluster returned metrics for 2 objects
2018-11-29T15:46:36Z D! [input.vsphere] CollectChunk for cluster returned 8 metrics
2018-11-29T15:46:36Z D! [input.vsphere] Query for cluster returned metrics for 1 objects
2018-11-29T15:46:36Z D! [input.vsphere] CollectChunk for cluster returned 8 metrics
2018-11-29T15:46:36Z D! [input.vsphere] Query for cluster returned metrics for 1 objects
2018-11-29T15:46:36Z D! [input.vsphere] CollectChunk for cluster returned 10 metrics
2018-11-29T15:46:36Z D! [input.vsphere] CollectChunk for cluster returned 0 metrics
2018-11-29T15:46:36Z D! [input.vsphere] Query for cluster returned metrics for 1 objects
2018-11-29T15:46:36Z D! [input.vsphere] CollectChunk for cluster returned 10 metrics
2018-11-29T15:46:36Z D! [input.vsphere] Query for cluster returned metrics for 1 objects
2018-11-29T15:46:36Z D! [input.vsphere] CollectChunk for cluster returned 10 metrics
2018-11-29T15:46:36Z D! [input.vsphere] Query for cluster returned metrics for 1 objects
2018-11-29T15:46:36Z D! [input.vsphere] CollectChunk for cluster returned 4 metrics
2018-11-29T15:46:40Z D! [outputs.influxdb] wrote batch of 12 metrics in 3.165227ms
2018-11-29T15:46:40Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-11-29T15:46:45Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-11-29T15:46:50Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-11-29T15:46:55Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-11-29T15:47:00Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-11-29T15:47:05Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-11-29T15:47:10Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-11-29T15:47:15Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-11-29T15:47:20Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-11-29T15:47:20Z W! [agent] input "inputs.vsphere" did not complete within its interval

and the last bit keeps repeating. Never starts collecting metrics again.

prydin commented 5 years ago

Are you collecting datastore metrics? Try disabling that.

datastore_metric_exclude = ["*"]

If that solves the problem, move the datastore collection to a separate instance of [inputs.vsphere] with an interval >= 300s.

Collection of datastore metrics can take a VERY long time due to the way vCenter manages that data. If it doesn't complete within the interval, you'll see these kinds of problems.

Also, let me point you to the very latest version that has some pretty radical performance improvements. Stand by!
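The datastore split described above might look roughly like this. It is a sketch under assumptions: connection settings are omitted, and excluding VM, host and cluster metrics from the second instance is an illustrative choice to avoid collecting them twice, not something stated in this thread.

```toml
# Instance 1: VMs, hosts and clusters on the agent's normal interval, no datastores.
[[inputs.vsphere]]
  # Connection settings (vcenters, username, password) omitted from this sketch.
  datastore_metric_exclude = ["*"]

# Instance 2: datastores only, on a slower interval of at least 300s.
[[inputs.vsphere]]
  # Connection settings (vcenters, username, password) omitted from this sketch.
  interval = "300s"
  vm_metric_exclude = ["*"]
  host_metric_exclude = ["*"]
  cluster_metric_exclude = ["*"]
```

Keeping the slow datastore collection in its own instance means a long-running datastore query cannot hold up the VM and host collection.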

PLColuccio commented 5 years ago

Cool, that seemed to get things going on this current release. Do you know the timeframe for the new release?

prydin commented 5 years ago

The actual release timing is up to the influx team, but I can get you a snapshot from my branch today. Use at your own risk and all that, of course...

prydin commented 5 years ago

Here's a snapshot that's been tested in our lab for a few days without any issues. You're welcome to try it (at your own risk). I attached a compiled binary for Linux. Let me know if you need any other flavors.

https://github.com/prydin/telegraf/releases/tag/PR-SCALE-IMPROVEMENT-BETA1

ghost commented 5 years ago

I still have the same issue with the binary you provided. Even with an interval of 300s on the agent, I still get no data after an hour of collecting metrics.

2018-12-03T13:50:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-12-03T13:52:00Z D! [outputs.influxdb] wrote batch of 34 metrics in 6.219342ms
2018-12-03T13:52:00Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-03T13:54:00Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-03T13:55:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-12-03T13:56:00Z D! [outputs.influxdb] wrote batch of 34 metrics in 5.141758ms
2018-12-03T13:56:00Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-03T13:58:00Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-03T14:00:00Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-03T14:00:00Z W! [agent] input "inputs.vsphere" did not complete within its interval

On the InfluxDB side, the API returns HTTP status 204:

Dec 03 14:42:00 XXXXXXXXX influxd[32499]: [httpd] 10.12.168.11 - - [03/Dec/2018:14:42:00 +0100] "POST /write?db=iaaspriv HTTP/1.1" 204 0 "-" "Telegraf/unknown" 35891433-f701-11e8-846a-005056bc0ddf 5270
Dec 03 14:46:00 XXXXXXXXX influxd[32499]: [httpd] 10.12.168.11 - - [03/Dec/2018:14:46:00 +0100] "POST /write?db=iaaspriv HTTP/1.1" 204 0 "-" "Telegraf/unknown" c4963448-f701-11e8-846b-005056bc0ddf 7489
Dec 03 14:52:00 XXXXXXXXX influxd[32499]: [httpd] 10.12.168.11 - - [03/Dec/2018:14:52:00 +0100] "POST /write?db=iaaspriv HTTP/1.1" 204 0 "-" "Telegraf/unknown" 9b29d8c6-f702-11e8-8488-005056bc0ddf 5070
Dec 03 14:56:00 XXXXXXXXX influxd[32499]: [httpd] 10.12.168.11 - - [03/Dec/2018:14:56:00 +0100] "POST /write?db=iaaspriv HTTP/1.1" 204 0 "-" "Telegraf/unknown" 2a36e2c8-f703-11e8-849e-005056bc0ddf 4267
Dec 03 15:02:00 XXXXXXXXX influxd[32499]: [httpd] 10.12.168.11 - - [03/Dec/2018:15:02:00 +0100] "POST /write?db=iaaspriv HTTP/1.1" 204 0 "-" "Telegraf/unknown" 00ca9e50-f704-11e8-84a6-005056bc0ddf 4848

I am only trying to collect a few metrics from a vCenter that contains 7669 VMs. Here is my conf:

  vm_metric_include = [
    "cpu.usage.average",
    "mem.usage.average",
    "net.received.average",
    "net.transmitted.average",
    "virtualDisk.read.average",
    "virtualDisk.write.average",
    "virtualDisk.writeOIO.latest"
  ]
  host_metric_include = [
    "cpu.usage.average",
    "disk.read.average",
    "disk.write.average",
    "disk.totalReadLatency.average",
    "disk.totalWriteLatency.average",
    "mem.usage.average",
    "net.received.average",
    "net.transmitted.average"
  ]
  cluster_metric_exclude = ["*"]
  datastore_metric_exclude = ["*"]
  datacenter_metric_exclude = [ "*" ]
  max_query_objects = 256
  max_query_metrics = 256
  collect_concurrency = 24
  discover_concurrency = 24
  object_discovery_interval = "600s"
  timeout = "120s"
  insecure_skip_verify = true

prydin commented 5 years ago

The "exclude" statements should read:

datastore_metric_exclude = [ "*" ]

prydin commented 5 years ago

Also, do you get any debug statements starting with [input.vsphere]? You should at least see some statements saying that it's attempting to collect.

ghost commented 5 years ago

datastore_metric_exclude = [ "*" ]

Sorry, I forgot to format my conf as markdown.

And yes, I do get debug entries with [input.vsphere]. Latest log:

2018-12-03T13:31:41Z D! [input.vsphere] Discovering resources for datastore

After that, no more of these entries appear in my telegraf.log.

prydin commented 5 years ago

I'd need to see all the [input.vsphere] log lines to troubleshoot this. It looks like discovering the datastores takes a really long time. How many datastores do you have?

prydin commented 5 years ago

Also, what is the output of telegraf -version?

ghost commented 5 years ago

Telegraf version: Telegraf unknown (git: prydin-scale-improvement aaa67547)

For security reasons I can't give you the complete log, but the last interval didn't show any errors with the key [input.vsphere]. The output only says:

2018-12-03T13:31:40Z D! [input.vsphere] Skipped powered off VM: xxxxx  <= this shows up more than a thousand times, each with a different hostname
2018-12-03T13:31:41Z D! [input.vsphere] Found 11 metrics for foo.bar.io <= this shows up more than a thousand times, each with a different hostname
2018-12-03T13:31:41Z D! [input.vsphere] Discovering resources for datastore

After this last line the process is still running and still sends requests to InfluxDB, but without data (HTTP status 204).

prydin commented 5 years ago

@bashrc666 If possible, could you run telegraf with -pprof-addr 0.0.0.0:6060 added to the end of the command line? Then, once the agent becomes unresponsive, you can get a complete goroutine dump using this command:

curl http://localhost:6060/debug/pprof/goroutine?debug=1

Copy and paste the output to this thread. The output doesn't contain any application data, so it should be safe to share. This will tell me exactly where the code locks up.

ghost commented 5 years ago

Context

I am trying to collect simple VM metrics from a vCenter that manages:

259 hosts
8129 VMs

BUG: After about 25 minutes Telegraf is still running, but no data is inserted into InfluxDB.

GO DUMP :

8 @ 0x42e14b 0x43e12d 0x8893ce 0x88c330 0x45c551
#       0x8893cd        github.com/influxdata/telegraf/agent.(*Agent).gatherOnInterval+0x1cd    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:262
#       0x88c32f        github.com/influxdata/telegraf/agent.(*Agent).runInputs.func1+0xbf      /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:229

5 @ 0x42e14b 0x43e12d 0x17ec6dc 0x17ee29d 0x45c551
#       0x17ec6db       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).pushOut+0xeb        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:56
#       0x17ee29c       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run.func1.1+0xcc    /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:80

1 @ 0x40ae87 0x4431dc 0x737382 0x45c551
#       0x4431db        os/signal.signal_recv+0x9b      /usr/local/Cellar/go/1.11/libexec/src/runtime/sigqueue.go:139
#       0x737381        os/signal.loop+0x21             /usr/local/Cellar/go/1.11/libexec/src/os/signal/signal_unix.go:23

1 @ 0x42e14b 0x429489 0x428b36 0x49818a 0x49829d 0x498fe9 0x5a143f 0x5b5648 0x69da8a 0x45c551
#       0x428b35        internal/poll.runtime_pollWait+0x65             /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173
#       0x498189        internal/poll.(*pollDesc).wait+0x99             /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85
#       0x49829c        internal/poll.(*pollDesc).waitRead+0x3c         /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90
#       0x498fe8        internal/poll.(*FD).Read+0x178                  /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:169
#       0x5a143e        net.(*netFD).Read+0x4e                          /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:202
#       0x5b5647        net.(*conn).Read+0x67                           /usr/local/Cellar/go/1.11/libexec/src/net/net.go:177
#       0x69da89        net/http.(*connReader).backgroundRead+0x59      /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:676

1 @ 0x42e14b 0x429489 0x428b36 0x49818a 0x49829d 0x498fe9 0x5a143f 0x5b5648 0x6ba7e5 0x559cd6 0x559e2f 0x6bb332 0x45c551
#       0x428b35        internal/poll.runtime_pollWait+0x65     /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173
#       0x498189        internal/poll.(*pollDesc).wait+0x99     /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85
#       0x49829c        internal/poll.(*pollDesc).waitRead+0x3c /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90
#       0x498fe8        internal/poll.(*FD).Read+0x178          /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:169
#       0x5a143e        net.(*netFD).Read+0x4e                  /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:202
#       0x5b5647        net.(*conn).Read+0x67                   /usr/local/Cellar/go/1.11/libexec/src/net/net.go:177
#       0x6ba7e4        net/http.(*persistConn).Read+0x74       /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1497
#       0x559cd5        bufio.(*Reader).fill+0x105              /usr/local/Cellar/go/1.11/libexec/src/bufio/bufio.go:100
#       0x559e2e        bufio.(*Reader).Peek+0x3e               /usr/local/Cellar/go/1.11/libexec/src/bufio/bufio.go:132
#       0x6bb331        net/http.(*persistConn).readLoop+0x1a1  /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1645

1 @ 0x42e14b 0x429489 0x428b36 0x49818a 0x49829d 0x49a590 0x5a1d82 0x5c026e 0x5be797 0x6a819f 0x6c8c7c 0x6a6fcf 0x6a6c86 0x6a7c74 0x1a5ac1f 0x45c551
#       0x428b35        internal/poll.runtime_pollWait+0x65             /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173
#       0x498189        internal/poll.(*pollDesc).wait+0x99             /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85
#       0x49829c        internal/poll.(*pollDesc).waitRead+0x3c         /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90
#       0x49a58f        internal/poll.(*FD).Accept+0x19f                /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:384
#       0x5a1d81        net.(*netFD).accept+0x41                        /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:238
#       0x5c026d        net.(*TCPListener).accept+0x2d                  /usr/local/Cellar/go/1.11/libexec/src/net/tcpsock_posix.go:139
#       0x5be796        net.(*TCPListener).AcceptTCP+0x46               /usr/local/Cellar/go/1.11/libexec/src/net/tcpsock.go:247
#       0x6a819e        net/http.tcpKeepAliveListener.Accept+0x2e       /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:3232
#       0x6a6fce        net/http.(*Server).Serve+0x22e                  /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2826
#       0x6a6c85        net/http.(*Server).ListenAndServe+0xb5          /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2764
#       0x6a7c73        net/http.ListenAndServe+0x73                    /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:3004
#       0x1a5ac1e       main.main.func2+0x17e                           /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:274

1 @ 0x42e14b 0x42e1f3 0x405a8e 0x4057bb 0x88a13c 0x88bde4 0x45c551
#       0x88a13b        github.com/influxdata/telegraf/agent.(*Agent).runOutputs+0x2ab  /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:451
#       0x88bde3        github.com/influxdata/telegraf/agent.(*Agent).Run.func4+0x83    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:123

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x17e6a7c 0x17edd08 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                                            /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                                             /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x17e6a7b       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect+0x2ab         /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:611
#       0x17edd07       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*VSphere).Gather.func1+0x87      /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsphere.go:269

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x17ec378 0x77f66d 0x88c40f 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                                    /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                                     /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x17ec377       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*VSphere).Gather+0x167   /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsphere.go:282
#       0x77f66c        github.com/influxdata/telegraf/internal/models.(*RunningInput).Gather+0x6c      /Users/prydin/go/src/github.com/influxdata/telegraf/internal/models/running_input.go:86
#       0x88c40e        github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1+0x3e             /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:283

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x17ec908 0x17e81de 0x17ed4fe 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                                            /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                                             /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x17ec907       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Drain+0x87          /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:117
#       0x17e81dd       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).collectResource+0x99d /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:758
#       0x17ed4fd       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect.func1+0x9d    /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:604

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x17ee4dd 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                                            /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                                             /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x17ee4dc       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run.func1+0xec      /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:88

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x8887c1 0x1a590a0 0x1a587a8 0x1a59f4a 0x42dd57 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                            /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                             /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x8887c0        github.com/influxdata/telegraf/agent.(*Agent).Run+0x470 /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:129
#       0x1a5909f       main.runAgent+0x85f                                     /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:185
#       0x1a587a7       main.reloadLoop+0x247                                   /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:101
#       0x1a59f49       main.main+0x4b9                                         /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:381
#       0x42dd56        runtime.main+0x206                                      /usr/local/Cellar/go/1.11/libexec/src/runtime/proc.go:201

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x8891d8 0x88b9d4 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                    /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                     /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x8891d7        github.com/influxdata/telegraf/agent.(*Agent).runInputs+0x287   /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:232
#       0x88b9d3        github.com/influxdata/telegraf/agent.(*Agent).Run.func1+0xa3    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:69

1 @ 0x42e14b 0x43e12d 0x17ecb5a 0x45c551
#       0x17ecb59       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).startDiscovery.func1+0xd9     /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:230

1 @ 0x42e14b 0x43e12d 0x19fa24d 0x45c551
#       0x19fa24c       github.com/influxdata/telegraf/vendor/go.opencensus.io/stats/view.(*worker).start+0xdc  /Users/prydin/go/src/github.com/influxdata/telegraf/vendor/go.opencensus.io/stats/view/worker.go:150

1 @ 0x42e14b 0x43e12d 0x1a5a8cf 0x45c551
#       0x1a5a8ce       main.reloadLoop.func1+0xae      /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:88

1 @ 0x42e14b 0x43e12d 0x6bc8f3 0x45c551
#       0x6bc8f2        net/http.(*persistConn).writeLoop+0x112 /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1885

1 @ 0x42e14b 0x43e12d 0x8896b3 0x889326 0x88c330 0x45c551
#       0x8896b2        github.com/influxdata/telegraf/agent.(*Agent).gatherOnce+0x232          /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:287
#       0x889325        github.com/influxdata/telegraf/agent.(*Agent).gatherOnInterval+0x125    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:257
#       0x88c32f        github.com/influxdata/telegraf/agent.(*Agent).runInputs.func1+0xbf      /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:229

1 @ 0x42e14b 0x43e12d 0x88a3d2 0x88c8c3 0x45c551
#       0x88a3d1        github.com/influxdata/telegraf/agent.(*Agent).flush+0x1a1               /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:496
#       0x88c8c2        github.com/influxdata/telegraf/agent.(*Agent).runOutputs.func1+0xa2     /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:447

1 @ 0x42e14b 0x43e12d 0x88b82b 0x45c551
#       0x88b82a        github.com/influxdata/telegraf/agent.(*Ticker).relayTime+0x12a  /Users/prydin/go/src/github.com/influxdata/telegraf/agent/tick.go:46

1 @ 0x72d048 0x72ce50 0x7298b4 0x735cf0 0x7365c3 0x6a3f24 0x6a5bb7 0x6a6b5b 0x6a2f86 0x45c551
#       0x72d047        runtime/pprof.writeRuntimeProfile+0x97  /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:707
#       0x72ce4f        runtime/pprof.writeGoroutine+0x9f       /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:669
#       0x7298b3        runtime/pprof.(*Profile).WriteTo+0x3e3  /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:328
#       0x735cef        net/http/pprof.handler.ServeHTTP+0x20f  /usr/local/Cellar/go/1.11/libexec/src/net/http/pprof/pprof.go:245
#       0x7365c2        net/http/pprof.Index+0x722              /usr/local/Cellar/go/1.11/libexec/src/net/http/pprof/pprof.go:268
#       0x6a3f23        net/http.HandlerFunc.ServeHTTP+0x43     /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:1964
#       0x6a5bb6        net/http.(*ServeMux).ServeHTTP+0x126    /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2361
#       0x6a6b5a        net/http.serverHandler.ServeHTTP+0xaa   /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2741
#       0x6a2f85        net/http.(*conn).serve+0x645            /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:1847

Latest Log :

2018-12-04T09:56:00Z D! [outputs.influxdb] wrote batch of 136 metrics in 7.138475ms
2018-12-04T09:56:00Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-04T09:56:30Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-12-04T09:57:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-12-04T09:57:30Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-12-04T09:58:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-12-04T09:58:00Z D! [outputs.influxdb] wrote batch of 136 metrics in 10.431907ms
2018-12-04T09:58:00Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-04T09:58:30Z W! [agent] input "inputs.vsphere" did not complete within its interval

prydin commented 5 years ago

THANK YOU!!!! This gives me a pretty good idea what's wrong!

prydin commented 5 years ago

@bashrc666 Thanks again for the detailed information. It was extremely helpful.

Here's a pre-release of what's on PR #5113

https://github.com/prydin/telegraf/releases/tag/PR-SCALE-IMPROVEMENT-RC1

Try it if you like. As always with a pre-release, you use it at your own risk.

ghost commented 5 years ago

Hello,

I still have the same issue with the same vcenter.

version :

Telegraf unknown (git: prydin-scale-improvement 646c5960)

GO DUMP

goroutine profile: total 30
8 @ 0x42e14b 0x43e12d 0x88941e 0x88c380 0x45c551
#       0x88941d        github.com/influxdata/telegraf/agent.(*Agent).gatherOnInterval+0x1cd    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:262
#       0x88c37f        github.com/influxdata/telegraf/agent.(*Agent).runInputs.func1+0xbf      /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:229

2 @ 0x42e14b 0x43e12d 0x6bc943 0x45c551
#       0x6bc942        net/http.(*persistConn).writeLoop+0x112 /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1885

1 @ 0x40ae87 0x4431dc 0x7373d2 0x45c551
#       0x4431db        os/signal.signal_recv+0x9b      /usr/local/Cellar/go/1.11/libexec/src/runtime/sigqueue.go:139
#       0x7373d1        os/signal.loop+0x21             /usr/local/Cellar/go/1.11/libexec/src/os/signal/signal_unix.go:23

1 @ 0x42e14b 0x429489 0x428b36 0x49818a 0x49829d 0x498fe9 0x5a148f 0x5b5698 0x603f29 0x60442d 0x6079b1 0x6ba835 0x559d26 0x559e7f 0x6bb382 0x45c551
#       0x428b35        internal/poll.runtime_pollWait+0x65     /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173
#       0x498189        internal/poll.(*pollDesc).wait+0x99     /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85
#       0x49829c        internal/poll.(*pollDesc).waitRead+0x3c /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90
#       0x498fe8        internal/poll.(*FD).Read+0x178          /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:169
#       0x5a148e        net.(*netFD).Read+0x4e                  /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:202
#       0x5b5697        net.(*conn).Read+0x67                   /usr/local/Cellar/go/1.11/libexec/src/net/net.go:177
#       0x603f28        crypto/tls.(*block).readFromUntil+0x88  /usr/local/Cellar/go/1.11/libexec/src/crypto/tls/conn.go:492
#       0x60442c        crypto/tls.(*Conn).readRecord+0xdc      /usr/local/Cellar/go/1.11/libexec/src/crypto/tls/conn.go:593
#       0x6079b0        crypto/tls.(*Conn).Read+0xf0            /usr/local/Cellar/go/1.11/libexec/src/crypto/tls/conn.go:1145
#       0x6ba834        net/http.(*persistConn).Read+0x74       /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1497
#       0x559d25        bufio.(*Reader).fill+0x105              /usr/local/Cellar/go/1.11/libexec/src/bufio/bufio.go:100
#       0x559e7e        bufio.(*Reader).Peek+0x3e               /usr/local/Cellar/go/1.11/libexec/src/bufio/bufio.go:132
#       0x6bb381        net/http.(*persistConn).readLoop+0x1a1  /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1645

1 @ 0x42e14b 0x429489 0x428b36 0x49818a 0x49829d 0x498fe9 0x5a148f 0x5b5698 0x69dada 0x45c551
#       0x428b35        internal/poll.runtime_pollWait+0x65             /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173
#       0x498189        internal/poll.(*pollDesc).wait+0x99             /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85
#       0x49829c        internal/poll.(*pollDesc).waitRead+0x3c         /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90
#       0x498fe8        internal/poll.(*FD).Read+0x178                  /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:169
#       0x5a148e        net.(*netFD).Read+0x4e                          /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:202
#       0x5b5697        net.(*conn).Read+0x67                           /usr/local/Cellar/go/1.11/libexec/src/net/net.go:177
#       0x69dad9        net/http.(*connReader).backgroundRead+0x59      /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:676

1 @ 0x42e14b 0x429489 0x428b36 0x49818a 0x49829d 0x498fe9 0x5a148f 0x5b5698 0x6ba835 0x559d26 0x559e7f 0x6bb382 0x45c551
#       0x428b35        internal/poll.runtime_pollWait+0x65     /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173
#       0x498189        internal/poll.(*pollDesc).wait+0x99     /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85
#       0x49829c        internal/poll.(*pollDesc).waitRead+0x3c /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90
#       0x498fe8        internal/poll.(*FD).Read+0x178          /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:169
#       0x5a148e        net.(*netFD).Read+0x4e                  /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:202
#       0x5b5697        net.(*conn).Read+0x67                   /usr/local/Cellar/go/1.11/libexec/src/net/net.go:177
#       0x6ba834        net/http.(*persistConn).Read+0x74       /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1497
#       0x559d25        bufio.(*Reader).fill+0x105              /usr/local/Cellar/go/1.11/libexec/src/bufio/bufio.go:100
#       0x559e7e        bufio.(*Reader).Peek+0x3e               /usr/local/Cellar/go/1.11/libexec/src/bufio/bufio.go:132
#       0x6bb381        net/http.(*persistConn).readLoop+0x1a1  /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1645

1 @ 0x42e14b 0x429489 0x428b36 0x49818a 0x49829d 0x49a590 0x5a1dd2 0x5c02be 0x5be7e7 0x6a81ef 0x6c8ccc 0x6a701f 0x6a6cd6 0x6a7cc4 0x1a5a9ff 0x45c551
#       0x428b35        internal/poll.runtime_pollWait+0x65             /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173
#       0x498189        internal/poll.(*pollDesc).wait+0x99             /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85
#       0x49829c        internal/poll.(*pollDesc).waitRead+0x3c         /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90
#       0x49a58f        internal/poll.(*FD).Accept+0x19f                /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:384
#       0x5a1dd1        net.(*netFD).accept+0x41                        /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:238
#       0x5c02bd        net.(*TCPListener).accept+0x2d                  /usr/local/Cellar/go/1.11/libexec/src/net/tcpsock_posix.go:139
#       0x5be7e6        net.(*TCPListener).AcceptTCP+0x46               /usr/local/Cellar/go/1.11/libexec/src/net/tcpsock.go:247
#       0x6a81ee        net/http.tcpKeepAliveListener.Accept+0x2e       /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:3232
#       0x6a701e        net/http.(*Server).Serve+0x22e                  /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2826
#       0x6a6cd5        net/http.(*Server).ListenAndServe+0xb5          /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2764
#       0x6a7cc3        net/http.ListenAndServe+0x73                    /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:3004
#       0x1a5a9fe       main.main.func2+0x17e                           /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:274

1 @ 0x42e14b 0x42e1f3 0x404ead 0x404c85 0x17ec185 0x17e7689 0x17e7e05 0x17e8b60 0x17eda7e 0x45c551
#       0x17ec184       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*ThrottledExecutor).Run+0x54     /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/throttled_exec.go:25
#       0x17e7688       github.com/influxdata/telegraf/plugins/inputs/vsphere.submitChunkJob+0x88               /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:667
#       0x17e7e04       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).chunkify+0x734        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:732
#       0x17e8b5f       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).collectResource+0x7cf /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:789
#       0x17eda7d       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect.func1+0x9d    /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:651

1 @ 0x42e14b 0x42e1f3 0x405a8e 0x4057bb 0x88a18c 0x88be34 0x45c551
#       0x88a18b        github.com/influxdata/telegraf/agent.(*Agent).runOutputs+0x2ab  /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:451
#       0x88be33        github.com/influxdata/telegraf/agent.(*Agent).Run.func4+0x83    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:123

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x17e74dc 0x17ee225 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                                            /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                                             /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x17e74db       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect+0x2ab         /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:658
#       0x17ee224       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*VSphere).Gather.func1+0x84      /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsphere.go:268

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x17ecd22 0x77f6bd 0x88c45f 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                                    /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                                     /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x17ecd21       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*VSphere).Gather+0xe1    /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsphere.go:280
#       0x77f6bc        github.com/influxdata/telegraf/internal/models.(*RunningInput).Gather+0x6c      /Users/prydin/go/src/github.com/influxdata/telegraf/internal/models/running_input.go:86
#       0x88c45e        github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1+0x3e             /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:283

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x888811 0x1a58e80 0x1a58588 0x1a59d2a 0x42dd57 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                            /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                             /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x888810        github.com/influxdata/telegraf/agent.(*Agent).Run+0x470 /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:129
#       0x1a58e7f       main.runAgent+0x85f                                     /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:185
#       0x1a58587       main.reloadLoop+0x247                                   /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:101
#       0x1a59d29       main.main+0x4b9                                         /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:381
#       0x42dd56        runtime.main+0x206                                      /usr/local/Cellar/go/1.11/libexec/src/runtime/proc.go:201

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x889228 0x88ba24 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                    /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                     /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x889227        github.com/influxdata/telegraf/agent.(*Agent).runInputs+0x287   /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:232
#       0x88ba23        github.com/influxdata/telegraf/agent.(*Agent).Run.func1+0xa3    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:69

1 @ 0x42e14b 0x43e12d 0x17ecffa 0x45c551
#       0x17ecff9       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).startDiscovery.func1+0xd9     /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:237

1 @ 0x42e14b 0x43e12d 0x19fa02d 0x45c551
#       0x19fa02c       github.com/influxdata/telegraf/vendor/go.opencensus.io/stats/view.(*worker).start+0xdc  /Users/prydin/go/src/github.com/influxdata/telegraf/vendor/go.opencensus.io/stats/view/worker.go:150

1 @ 0x42e14b 0x43e12d 0x1a5a6af 0x45c551
#       0x1a5a6ae       main.reloadLoop.func1+0xae      /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:88

1 @ 0x42e14b 0x43e12d 0x6bd36a 0x6b3a01 0x69c165 0x6617db 0x6614fa 0x662b88 0x6628a5 0x172ca14 0x172d4a3 0x173f650 0x1738e18 0x17d419a 0x17e1785 0x17e93b7 0x17edc27 0x17edb23 0x17ee0ad 0x45c551
#       0x6bd369        net/http.(*persistConn).roundTrip+0x569                                                                 /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:2101
#       0x6b3a00        net/http.(*Transport).roundTrip+0x9b0                                                                   /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:465
#       0x69c164        net/http.(*Transport).RoundTrip+0x34                                                                    /usr/local/Cellar/go/1.11/libexec/src/net/http/roundtrip.go:17
#       0x6617da        net/http.send+0x14a                                                                                     /usr/local/Cellar/go/1.11/libexec/src/net/http/client.go:250
#       0x6614f9        net/http.(*Client).send+0xf9                                                                            /usr/local/Cellar/go/1.11/libexec/src/net/http/client.go:174
#       0x662b87        net/http.(*Client).do+0x2a7                                                                             /usr/local/Cellar/go/1.11/libexec/src/net/http/client.go:641
#       0x6628a4        net/http.(*Client).Do+0x34                                                                              /usr/local/Cellar/go/1.11/libexec/src/net/http/client.go:509
#       0x172ca13       github.com/influxdata/telegraf/vendor/github.com/vmware/govmomi/vim25/soap.(*Client).do+0x113           /Users/prydin/go/src/github.com/influxdata/telegraf/vendor/github.com/vmware/govmomi/vim25/soap/client.go:442
#       0x172d4a2       github.com/influxdata/telegraf/vendor/github.com/vmware/govmomi/vim25/soap.(*Client).RoundTrip+0x882    /Users/prydin/go/src/github.com/influxdata/telegraf/vendor/github.com/vmware/govmomi/vim25/soap/client.go:524
#       0x173f64f       github.com/influxdata/telegraf/vendor/github.com/vmware/govmomi/vim25.(*Client).RoundTrip+0x7f          /Users/prydin/go/src/github.com/influxdata/telegraf/vendor/github.com/vmware/govmomi/vim25/client.go:89
#       0x1738e17       github.com/influxdata/telegraf/vendor/github.com/vmware/govmomi/vim25/methods.QueryPerf+0xb7            /Users/prydin/go/src/github.com/influxdata/telegraf/vendor/github.com/vmware/govmomi/vim25/methods/methods.go:9899
#       0x17d4199       github.com/influxdata/telegraf/vendor/github.com/vmware/govmomi/performance.(*Manager).Query+0x1a9      /Users/prydin/go/src/github.com/influxdata/telegraf/vendor/github.com/vmware/govmomi/performance/manager.go:276
#       0x17e1784       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Client).QueryMetrics+0x104                      /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/client.go:268
#       0x17e93b6       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).collectChunk+0x2c6                    /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:830
#       0x17edc26       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).collectResource.func1+0xe6            /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:791
#       0x17edb22       github.com/influxdata/telegraf/plugins/inputs/vsphere.submitChunkJob.func1+0x42                         /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:668
#       0x17ee0ac       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*ThrottledExecutor).Run.func1+0x7c               /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/throttled_exec.go:31

1 @ 0x42e14b 0x43e12d 0x6bf3df 0x45c551
#       0x6bf3de        net/http.setRequestCancel.func3+0xce    /usr/local/Cellar/go/1.11/libexec/src/net/http/client.go:321

1 @ 0x42e14b 0x43e12d 0x889703 0x889376 0x88c380 0x45c551
#       0x889702        github.com/influxdata/telegraf/agent.(*Agent).gatherOnce+0x232          /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:287
#       0x889375        github.com/influxdata/telegraf/agent.(*Agent).gatherOnInterval+0x125    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:257
#       0x88c37f        github.com/influxdata/telegraf/agent.(*Agent).runInputs.func1+0xbf      /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:229

1 @ 0x42e14b 0x43e12d 0x88a422 0x88c913 0x45c551
#       0x88a421        github.com/influxdata/telegraf/agent.(*Agent).flush+0x1a1               /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:496
#       0x88c912        github.com/influxdata/telegraf/agent.(*Agent).runOutputs.func1+0xa2     /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:447

1 @ 0x42e14b 0x43e12d 0x88b87b 0x45c551
#       0x88b87a        github.com/influxdata/telegraf/agent.(*Ticker).relayTime+0x12a  /Users/prydin/go/src/github.com/influxdata/telegraf/agent/tick.go:46

1 @ 0x72d098 0x72cea0 0x729904 0x735d40 0x736613 0x6a3f74 0x6a5c07 0x6a6bab 0x6a2fd6 0x45c551
#       0x72d097        runtime/pprof.writeRuntimeProfile+0x97  /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:707
#       0x72ce9f        runtime/pprof.writeGoroutine+0x9f       /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:669
#       0x729903        runtime/pprof.(*Profile).WriteTo+0x3e3  /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:328
#       0x735d3f        net/http/pprof.handler.ServeHTTP+0x20f  /usr/local/Cellar/go/1.11/libexec/src/net/http/pprof/pprof.go:245
#       0x736612        net/http/pprof.Index+0x722              /usr/local/Cellar/go/1.11/libexec/src/net/http/pprof/pprof.go:268
#       0x6a3f73        net/http.HandlerFunc.ServeHTTP+0x43     /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:1964
#       0x6a5c06        net/http.(*ServeMux).ServeHTTP+0x126    /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2361
#       0x6a6baa        net/http.serverHandler.ServeHTTP+0xaa   /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2741
#       0x6a2fd5        net/http.(*conn).serve+0x645            /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:1847

STDOUT

panic: runtime error: index out of range

goroutine 2061 [running]:
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).collectChunk(0xc000206900, 0x25051a0, 0xc00003e098, 0xc0016a4000, 0x13, 0x100, 0x21d7320, 0x9, 0xc000c3a090, 0x2512c40, ...)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:831 +0x17c2
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).collectResource.func1(0x25051a0, 0xc00003e098, 0x1c50cc0, 0xc000af96e0, 0x0, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:734 +0xff
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run.func1.1(0xc001376b80, 0xc000c675a0, 0x25051a0, 0xc00003e098, 0xc000155310)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:80 +0x8e
created by github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run.func1
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:72 +0xcd

prydin commented 5 years ago

@bashrc666 that output doesn't match the thread dump. The WorkerPool class doesn't exist anymore. Are you sure that's the right output?

As for the dump, it looks like it's stuck on a slow call to vCenter. What's your concurrency setting? Is the vCenter slow in general?

ghost commented 5 years ago

My conf

  vcenters = [ 'http://foo.bar/sdk' ]
  username = 'ADUSER'
  password = "supersecurepassword"

  vm_metric_include = [
    "cpu.usage.average",
    "mem.usage.average",
    "net.received.average",
    "net.transmitted.average",
    "virtualDisk.read.average",
    "virtualDisk.write.average",
    "virtualDisk.writeOIO.latest"
  ]
  host_metric_include = [
    "cpu.usage.average",
    "disk.read.average",
    "disk.write.average",
    "disk.totalReadLatency.average",
    "disk.totalWriteLatency.average",
    "mem.usage.average",
    "net.received.average",
    "net.transmitted.average"
  ]
  cluster_metric_exclude = []
  datastore_metric_exclude = [] 
  datacenter_metric_exclude = [ "*" ]
  collect_concurrency = 10
  discover_concurrency = 4
  object_discovery_interval = "3000s"
  insecure_skip_verify = true

It only happens on this one very large vCenter, which contains 29 clusters, 259 hosts, 8129 VMs, and a large number of datastores.

Maybe there is something I could improve in this config?

@prydin Thanks so much for the help.

prydin commented 5 years ago

@bashrc666 It's probably the datastore collection that takes a long time. Break it out into a separate declaration of [[inputs.vsphere]] and set the interval for that instance to 300s. Also, you're collecting every metric on the datastores. You can save some collection time by specifying a smaller set.
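
A minimal sketch of that split, assuming the same placeholder vCenter URL and credentials as in the config above. The metric lists are only illustrative: the two datastore counters are an assumed example of a "smaller set", not a recommendation, so pick the ones you actually need.

```toml
# Instance 1: VMs and hosts only, collected at the agent's default interval.
[[inputs.vsphere]]
  vcenters = [ "http://foo.bar/sdk" ]    # placeholder URL from the config above
  username = "ADUSER"
  password = "supersecurepassword"
  insecure_skip_verify = true

  vm_metric_include = [ "cpu.usage.average", "mem.usage.average" ]
  host_metric_include = [ "cpu.usage.average", "mem.usage.average" ]
  cluster_metric_exclude = [ "*" ]
  datastore_metric_exclude = [ "*" ]     # datastores are left to the second instance
  datacenter_metric_exclude = [ "*" ]

# Instance 2: datastores only, collected every 300s.
[[inputs.vsphere]]
  interval = "300s"
  vcenters = [ "http://foo.bar/sdk" ]
  username = "ADUSER"
  password = "supersecurepassword"
  insecure_skip_verify = true

  vm_metric_exclude = [ "*" ]
  host_metric_exclude = [ "*" ]
  cluster_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]
  # Assumed example subset; an empty include list would collect every datastore metric.
  datastore_metric_include = [ "disk.capacity.latest", "disk.used.latest" ]
```

With this layout a slow datastore query should only delay the second instance, so the VM and host instance can still finish within its own interval.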

ghost commented 5 years ago

@prydin I've decided to get rid of the datastore metrics for the moment and come back to them once I'm sure that VM and host collection works on that vCenter. But Telegraf still stops working after 10 to 20 minutes.

CONFIG

[[inputs.vsphere]]
  vcenters = [ 'https://foor.bar/sdk' ]
  username = 'ADUSER'
  password = "SUPERSTRONGPASSWORD"
  vm_metric_include = []
  host_metric_include = []
  cluster_metric_exclude = ["*"]
  datastore_metric_exclude = ["*"]
  datacenter_metric_exclude = [ "*" ]
  collect_concurrency = 10
  discover_concurrency = 4
  object_discovery_interval = "300s"
  insecure_skip_verify = true

GO DUMP

goroutine profile: total 37
10 @ 0x42e14b 0x43e12d 0x17ec6dc 0x17ee29d 0x45c551
#       0x17ec6db       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).pushOut+0xeb        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:56
#       0x17ee29c       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run.func1.1+0xcc    /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:80

8 @ 0x42e14b 0x43e12d 0x8893ce 0x88c330 0x45c551
#       0x8893cd        github.com/influxdata/telegraf/agent.(*Agent).gatherOnInterval+0x1cd    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:262
#       0x88c32f        github.com/influxdata/telegraf/agent.(*Agent).runInputs.func1+0xbf      /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:229

1 @ 0x40ae87 0x4431dc 0x737382 0x45c551
#       0x4431db        os/signal.signal_recv+0x9b      /usr/local/Cellar/go/1.11/libexec/src/runtime/sigqueue.go:139
#       0x737381        os/signal.loop+0x21             /usr/local/Cellar/go/1.11/libexec/src/os/signal/signal_unix.go:23

1 @ 0x42e14b 0x429489 0x428b36 0x49818a 0x49829d 0x498fe9 0x5a143f 0x5b5648 0x69da8a 0x45c551
#       0x428b35        internal/poll.runtime_pollWait+0x65             /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173
#       0x498189        internal/poll.(*pollDesc).wait+0x99             /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85
#       0x49829c        internal/poll.(*pollDesc).waitRead+0x3c         /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90
#       0x498fe8        internal/poll.(*FD).Read+0x178                  /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:169
#       0x5a143e        net.(*netFD).Read+0x4e                          /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:202
#       0x5b5647        net.(*conn).Read+0x67                           /usr/local/Cellar/go/1.11/libexec/src/net/net.go:177
#       0x69da89        net/http.(*connReader).backgroundRead+0x59      /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:676

1 @ 0x42e14b 0x429489 0x428b36 0x49818a 0x49829d 0x498fe9 0x5a143f 0x5b5648 0x6ba7e5 0x559cd6 0x559e2f 0x6bb332 0x45c551
#       0x428b35        internal/poll.runtime_pollWait+0x65     /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173
#       0x498189        internal/poll.(*pollDesc).wait+0x99     /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85
#       0x49829c        internal/poll.(*pollDesc).waitRead+0x3c /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90
#       0x498fe8        internal/poll.(*FD).Read+0x178          /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:169
#       0x5a143e        net.(*netFD).Read+0x4e                  /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:202
#       0x5b5647        net.(*conn).Read+0x67                   /usr/local/Cellar/go/1.11/libexec/src/net/net.go:177
#       0x6ba7e4        net/http.(*persistConn).Read+0x74       /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1497
#       0x559cd5        bufio.(*Reader).fill+0x105              /usr/local/Cellar/go/1.11/libexec/src/bufio/bufio.go:100
#       0x559e2e        bufio.(*Reader).Peek+0x3e               /usr/local/Cellar/go/1.11/libexec/src/bufio/bufio.go:132
#       0x6bb331        net/http.(*persistConn).readLoop+0x1a1  /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1645

1 @ 0x42e14b 0x429489 0x428b36 0x49818a 0x49829d 0x49a590 0x5a1d82 0x5c026e 0x5be797 0x6a819f 0x6c8c7c 0x6a6fcf 0x6a6c86 0x6a7c74 0x1a5ac1f 0x45c551
#       0x428b35        internal/poll.runtime_pollWait+0x65             /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173
#       0x498189        internal/poll.(*pollDesc).wait+0x99             /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85
#       0x49829c        internal/poll.(*pollDesc).waitRead+0x3c         /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90
#       0x49a58f        internal/poll.(*FD).Accept+0x19f                /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:384
#       0x5a1d81        net.(*netFD).accept+0x41                        /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:238
#       0x5c026d        net.(*TCPListener).accept+0x2d                  /usr/local/Cellar/go/1.11/libexec/src/net/tcpsock_posix.go:139
#       0x5be796        net.(*TCPListener).AcceptTCP+0x46               /usr/local/Cellar/go/1.11/libexec/src/net/tcpsock.go:247
#       0x6a819e        net/http.tcpKeepAliveListener.Accept+0x2e       /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:3232
#       0x6a6fce        net/http.(*Server).Serve+0x22e                  /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2826
#       0x6a6c85        net/http.(*Server).ListenAndServe+0xb5          /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2764
#       0x6a7c73        net/http.ListenAndServe+0x73                    /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:3004
#       0x1a5ac1e       main.main.func2+0x17e                           /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:274

1 @ 0x42e14b 0x42e1f3 0x405a8e 0x4057bb 0x88a13c 0x88bde4 0x45c551
#       0x88a13b        github.com/influxdata/telegraf/agent.(*Agent).runOutputs+0x2ab  /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:451
#       0x88bde3        github.com/influxdata/telegraf/agent.(*Agent).Run.func4+0x83    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:123

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x17e6a7c 0x17edd08 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                                            /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                                             /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x17e6a7b       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect+0x2ab         /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:611
#       0x17edd07       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*VSphere).Gather.func1+0x87      /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsphere.go:269

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x17ec378 0x77f66d 0x88c40f 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                                    /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                                     /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x17ec377       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*VSphere).Gather+0x167   /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsphere.go:282
#       0x77f66c        github.com/influxdata/telegraf/internal/models.(*RunningInput).Gather+0x6c      /Users/prydin/go/src/github.com/influxdata/telegraf/internal/models/running_input.go:86
#       0x88c40e        github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1+0x3e             /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:283

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x17ec908 0x17e81de 0x17ed4fe 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                                            /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                                             /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x17ec907       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Drain+0x87          /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:117
#       0x17e81dd       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).collectResource+0x99d /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:758
#       0x17ed4fd       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect.func1+0x9d    /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:604

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x17ee4dd 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                                            /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                                             /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x17ee4dc       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run.func1+0xec      /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:88

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x8887c1 0x1a590a0 0x1a587a8 0x1a59f4a 0x42dd57 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                            /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                             /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x8887c0        github.com/influxdata/telegraf/agent.(*Agent).Run+0x470 /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:129
#       0x1a5909f       main.runAgent+0x85f                                     /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:185
#       0x1a587a7       main.reloadLoop+0x247                                   /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:101
#       0x1a59f49       main.main+0x4b9                                         /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:381
#       0x42dd56        runtime.main+0x206                                      /usr/local/Cellar/go/1.11/libexec/src/runtime/proc.go:201

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ed69 0x474d54 0x8891d8 0x88b9d4 0x45c551
#       0x43ed68        sync.runtime_Semacquire+0x38                                    /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56
#       0x474d53        sync.(*WaitGroup).Wait+0x63                                     /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130
#       0x8891d7        github.com/influxdata/telegraf/agent.(*Agent).runInputs+0x287   /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:232
#       0x88b9d3        github.com/influxdata/telegraf/agent.(*Agent).Run.func1+0xa3    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:69

1 @ 0x42e14b 0x42e1f3 0x43f12c 0x43ee5d 0x4749e4 0x17e4960 0x17ecb93 0x45c551
#       0x43ee5c        sync.runtime_SemacquireMutex+0x3c                                                               /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:71
#       0x4749e3        sync.(*RWMutex).Lock+0x73                                                                       /usr/local/Cellar/go/1.11/libexec/src/sync/rwmutex.go:98
#       0x17e495f       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).discover+0xd8f                /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:452
#       0x17ecb92       github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).startDiscovery.func1+0x112    /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:232

1 @ 0x42e14b 0x43e12d 0x19fa24d 0x45c551
#       0x19fa24c       github.com/influxdata/telegraf/vendor/go.opencensus.io/stats/view.(*worker).start+0xdc  /Users/prydin/go/src/github.com/influxdata/telegraf/vendor/go.opencensus.io/stats/view/worker.go:150

1 @ 0x42e14b 0x43e12d 0x1a5a8cf 0x45c551
#       0x1a5a8ce       main.reloadLoop.func1+0xae      /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:88

1 @ 0x42e14b 0x43e12d 0x6bc8f3 0x45c551
#       0x6bc8f2        net/http.(*persistConn).writeLoop+0x112 /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1885

1 @ 0x42e14b 0x43e12d 0x8896b3 0x889326 0x88c330 0x45c551
#       0x8896b2        github.com/influxdata/telegraf/agent.(*Agent).gatherOnce+0x232          /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:287
#       0x889325        github.com/influxdata/telegraf/agent.(*Agent).gatherOnInterval+0x125    /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:257
#       0x88c32f        github.com/influxdata/telegraf/agent.(*Agent).runInputs.func1+0xbf      /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:229

1 @ 0x42e14b 0x43e12d 0x88a3d2 0x88c8c3 0x45c551
#       0x88a3d1        github.com/influxdata/telegraf/agent.(*Agent).flush+0x1a1               /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:496
#       0x88c8c2        github.com/influxdata/telegraf/agent.(*Agent).runOutputs.func1+0xa2     /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:447

1 @ 0x42e14b 0x43e12d 0x88b82b 0x45c551
#       0x88b82a        github.com/influxdata/telegraf/agent.(*Ticker).relayTime+0x12a  /Users/prydin/go/src/github.com/influxdata/telegraf/agent/tick.go:46

1 @ 0x72d048 0x72ce50 0x7298b4 0x735cf0 0x7365c3 0x6a3f24 0x6a5bb7 0x6a6b5b 0x6a2f86 0x45c551
#       0x72d047        runtime/pprof.writeRuntimeProfile+0x97  /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:707
#       0x72ce4f        runtime/pprof.writeGoroutine+0x9f       /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:669
#       0x7298b3        runtime/pprof.(*Profile).WriteTo+0x3e3  /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:328
#       0x735cef        net/http/pprof.handler.ServeHTTP+0x20f  /usr/local/Cellar/go/1.11/libexec/src/net/http/pprof/pprof.go:245
#       0x7365c2        net/http/pprof.Index+0x722              /usr/local/Cellar/go/1.11/libexec/src/net/http/pprof/pprof.go:268
#       0x6a3f23        net/http.HandlerFunc.ServeHTTP+0x43     /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:1964
#       0x6a5bb6        net/http.(*ServeMux).ServeHTTP+0x126    /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2361
#       0x6a6b5a        net/http.serverHandler.ServeHTTP+0xaa   /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2741
#       0x6a2f85        net/http.(*conn).serve+0x645            /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:1847
danielnelson commented 5 years ago

Can you grab the full goroutine stack dump from here: http://localhost:6060/debug/pprof/goroutine?debug=2

ghost commented 5 years ago

TELEGRAF VERSION

~# /usr/bin/telegraf --version
Telegraf unknown (git: prydin-scale-improvement aaa67547)

CONTEXT

I am trying to collect simple VM metrics from a vCenter that manages:

259 hosts and 8129 VMs

Telegraf stops working 20 to 30 minutes after it starts.

2018-12-14T12:50:30Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-14T12:50:40Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-14T12:50:50Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-14T12:51:00Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-14T12:51:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
2018-12-14T12:51:10Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-14T12:51:20Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-14T12:51:30Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-14T12:51:40Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-14T12:51:50Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-14T12:52:00Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
2018-12-14T12:52:00Z W! [agent] input "inputs.vsphere" did not complete within its interval

CONFIG

[global_tags]

[agent]

interval = "60s"
round_interval = true
metric_batch_size = 10000
metric_buffer_limit = 100000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = true
quiet = false
logfile = "/var/log/telegraf/telegraf.log"
hostname = ""
omit_hostname = false

[[outputs.influxdb]]

urls = ["http://10.x.x.x:8086"]
database = "vcenter"

[[inputs.vsphere]]
  vcenters = [ 'https://foo.bar/sdk' ]
  username = 'ADUSER'
  password = "SUPERSTRONGPASSWORD"
  vm_metric_include = []
  host_metric_include = []
  cluster_metric_exclude = ["*"] 
  datastore_metric_exclude = ["*"]
  datacenter_metric_exclude = [ "*" ]
  collect_concurrency = 2
  discover_concurrency = 2
  object_discovery_interval = "600s"
  insecure_skip_verify = true

GO DUMP LEVEL 2

goroutine 12632 [running]:
runtime/pprof.writeGoroutineStacks(0x24e9000, 0xc01931c0e0, 0x40be5f, 0xc022c4e240)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:678 +0xa7
runtime/pprof.writeGoroutine(0x24e9000, 0xc01931c0e0, 0x2, 0xc0004e4700, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:667 +0x44
runtime/pprof.(*Profile).WriteTo(0x3ca45e0, 0x24e9000, 0xc01931c0e0, 0x2, 0xc01931c0e0, 0x21dec75)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/pprof/pprof.go:328 +0x3e4
net/http/pprof.handler.ServeHTTP(0xc0102a4011, 0x9, 0x2502020, 0xc01931c0e0, 0xc000128100)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/pprof/pprof.go:245 +0x210
net/http/pprof.Index(0x2502020, 0xc01931c0e0, 0xc000128100)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/pprof/pprof.go:268 +0x723
net/http.HandlerFunc.ServeHTTP(0x22cf9c0, 0x2502020, 0xc01931c0e0, 0xc000128100)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:1964 +0x44
net/http.(*ServeMux).ServeHTTP(0x3cd89a0, 0x2502020, 0xc01931c0e0, 0xc000128100)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2361 +0x127
net/http.serverHandler.ServeHTTP(0xc0000a6c30, 0x2502020, 0xc01931c0e0, 0xc000128100)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2741 +0xab
net/http.(*conn).serve(0xc00787c500, 0x2505160, 0xc02047a000)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:1847 +0x646
created by net/http.(*Server).Serve
        /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2851 +0x2f5

goroutine 1 [semacquire, 101 minutes]:
sync.runtime_Semacquire(0xc00053aea8)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc00053aea0)
        /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130 +0x64
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc0002025f0, 0x2505160, 0xc000042d00, 0x1, 0x1)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:129 +0x471
main.runAgent(0x2505160, 0xc000042d00, 0x3cfde20, 0x0, 0x0, 0x3cfde20, 0x0, 0x0, 0x0, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:185 +0x860
main.reloadLoop(0xc0002e0120, 0x3cfde20, 0x0, 0x0, 0x3cfde20, 0x0, 0x0, 0xc0007add58, 0x0, 0x0, ...)
        /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:101 +0x248
main.main()
        /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:381 +0x4ba

goroutine 17 [syscall, 101 minutes]:
os/signal.signal_recv(0x0)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/sigqueue.go:139 +0x9c
os/signal.loop()
        /usr/local/Cellar/go/1.11/libexec/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
        /usr/local/Cellar/go/1.11/libexec/src/os/signal/signal_unix.go:29 +0x41

goroutine 13 [select]:
github.com/influxdata/telegraf/vendor/go.opencensus.io/stats/view.(*worker).start(0xc000133b80)
        /Users/prydin/go/src/github.com/influxdata/telegraf/vendor/go.opencensus.io/stats/view/worker.go:150 +0xdd
created by github.com/influxdata/telegraf/vendor/go.opencensus.io/stats/view.init.0
        /Users/prydin/go/src/github.com/influxdata/telegraf/vendor/go.opencensus.io/stats/view/worker.go:29 +0x57

goroutine 14 [IO wait]:
internal/poll.runtime_pollWait(0x7f074bd93f00, 0x72, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc00020c018, 0x72, 0xc0002a4200, 0x0, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85 +0x9a
internal/poll.(*pollDesc).waitRead(0xc00020c018, 0xffffffffffffff00, 0x0, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc00020c000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:384 +0x1a0
net.(*netFD).accept(0xc00020c000, 0x50, 0x1fa58e0, 0xc0004bfd01)
        /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc00013e018, 0xc0004bfd88, 0xc009ec73b0, 0xe25aac92949344a6)
        /usr/local/Cellar/go/1.11/libexec/src/net/tcpsock_posix.go:139 +0x2e
net.(*TCPListener).AcceptTCP(0xc00013e018, 0xc0004bfdb0, 0x48f726, 0x5c13a732)
        /usr/local/Cellar/go/1.11/libexec/src/net/tcpsock.go:247 +0x47
net/http.tcpKeepAliveListener.Accept(0xc00013e018, 0xc0004bfe00, 0x18, 0xc0001ee600, 0x6a7095)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:3232 +0x2f
net/http.(*Server).Serve(0xc0000a6c30, 0x2503060, 0xc00013e018, 0x0, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2826 +0x22f
net/http.(*Server).ListenAndServe(0xc0000a6c30, 0xc0000a6c30, 0x41)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:2764 +0xb6
net/http.ListenAndServe(0x7ffcb30abf4e, 0xc, 0x0, 0x0, 0x1, 0x21dc8c0)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:3004 +0x74
main.main.func2()
        /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:274 +0x17f
created by main.main
        /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:264 +0xa1b

goroutine 37 [select, 101 minutes]:
main.reloadLoop.func1(0xc0002e02a0, 0xc0003042a0, 0xc00007b650, 0xc0002e0120)
        /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:88 +0xaf
created by main.reloadLoop
        /Users/prydin/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:87 +0x1e2

goroutine 82 [IO wait, 82 minutes]:
internal/poll.runtime_pollWait(0x7f074bd93e30, 0x72, 0xc000414a88)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc00020c398, 0x72, 0xffffffffffffff00, 0x24eb300, 0x3bc37f0)
        /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85 +0x9a
internal/poll.(*pollDesc).waitRead(0xc00020c398, 0xc00042b000, 0x1000, 0x1000)
        /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc00020c380, 0xc00042b000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:169 +0x179
net.(*netFD).Read(0xc00020c380, 0xc00042b000, 0x1000, 0x1000, 0x1, 0x0, 0xc0002b2ce0)
        /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc00013e058, 0xc00042b000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/net/net.go:177 +0x68
net/http.(*persistConn).Read(0xc0000ba6c0, 0xc00042b000, 0x1000, 0x1000, 0xc0000ba480, 0xc0000ba6c0, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1497 +0x75
bufio.(*Reader).fill(0xc000134420)
        /usr/local/Cellar/go/1.11/libexec/src/bufio/bufio.go:100 +0x106
bufio.(*Reader).Peek(0xc000134420, 0x1, 0x2, 0x0, 0x0, 0xc0002e1ec0, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/bufio/bufio.go:132 +0x3f
net/http.(*persistConn).readLoop(0xc0000ba6c0)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1645 +0x1a2
created by net/http.(*Transport).dialConn
        /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1338 +0x941

goroutine 83 [select, 82 minutes]:
net/http.(*persistConn).writeLoop(0xc0000ba6c0)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1885 +0x113
created by net/http.(*Transport).dialConn
        /usr/local/Cellar/go/1.11/libexec/src/net/http/transport.go:1339 +0x966

goroutine 20 [semacquire, 101 minutes]:
sync.runtime_Semacquire(0xc00053b938)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc00053b930)
        /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130 +0x64
github.com/influxdata/telegraf/agent.(*Agent).runInputs(0xc0002025f0, 0x2505160, 0xc000042d00, 0xbefd01b4aa3b9c9f, 0x22803fa, 0x3cd91c0, 0xc00003a5a0, 0x0, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:232 +0x288
github.com/influxdata/telegraf/agent.(*Agent).Run.func1(0xc00053aea0, 0xc0002025f0, 0x2505160, 0xc000042d00, 0xbefd01b4aa3b9c9f, 0x22803fa, 0x3cd91c0, 0xc00003a5a0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:69 +0xa4
created by github.com/influxdata/telegraf/agent.(*Agent).Run
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:66 +0x3bb

goroutine 21 [chan receive, 82 minutes]:
github.com/influxdata/telegraf/agent.(*Agent).runOutputs(0xc0002025f0, 0xbefd01b4aa3b9c9f, 0x22803fa, 0x3cd91c0, 0xc00003a5a0, 0x4500000000, 0x201)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:451 +0x2ac
github.com/influxdata/telegraf/agent.(*Agent).Run.func4(0xc00053aea0, 0xc0002025f0, 0xbefd01b4aa3b9c9f, 0x22803fa, 0x3cd91c0, 0xc00003a5a0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:123 +0x84
created by github.com/influxdata/telegraf/agent.(*Agent).Run
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:120 +0x460

goroutine 22 [select]:
github.com/influxdata/telegraf/agent.(*Agent).flush(0xc0002025f0, 0x2505160, 0xc0002a4900, 0xc000483290, 0x2540be400, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:496 +0x1a2
github.com/influxdata/telegraf/agent.(*Agent).runOutputs.func1(0xc00053af30, 0xc0002025f0, 0x2505160, 0xc0002a4900, 0xbefd01b4aa3b9c9f, 0x22803fa, 0x3cd91c0, 0x2540be400, 0x0, 0xc000483290)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:447 +0xa3
created by github.com/influxdata/telegraf/agent.(*Agent).runOutputs
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:436 +0x1b9

goroutine 27 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce(0xc0002025f0, 0x2512c40, 0xc0002ebd80, 0xc000043240, 0xdf8475800, 0x0, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:287 +0x233
github.com/influxdata/telegraf/agent.(*Agent).gatherOnInterval(0xc0002025f0, 0x2505160, 0xc000042d00, 0x2512c40, 0xc0002ebd80, 0xc000043240, 0xdf8475800, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:257 +0x126
github.com/influxdata/telegraf/agent.(*Agent).runInputs.func1(0xc00053b930, 0xc0002025f0, 0x2505160, 0xc000042d00, 0xbefd01b4aa3b9c9f, 0x22803fa, 0x3cd91c0, 0xdf8475800, 0x2512c40, 0xc0002ebd80, ...)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:229 +0xc0
created by github.com/influxdata/telegraf/agent.(*Agent).runInputs
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:218 +0x171

goroutine 39 [select]:
github.com/influxdata/telegraf/agent.(*Ticker).relayTime(0xc000bec000, 0x2505160, 0xc000be8000)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/tick.go:46 +0x12b
created by github.com/influxdata/telegraf/agent.NewTicker
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/tick.go:33 +0x135

goroutine 6098 [select, 82 minutes]:
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).pushOut(0xc02253d640, 0x25051a0, 0xc00003c048, 0x0, 0x0, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:56 +0xec
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run.func1.1(0xc002a51520, 0xc02253d640, 0x25051a0, 0xc00003c048, 0xc001312f00)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:80 +0xcd
created by github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run.func1
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:72 +0xcd

goroutine 6081 [semacquire, 83 minutes]:
sync.runtime_Semacquire(0xc02253d648)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc02253d640)
        /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130 +0x64
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Drain(0xc02253d640, 0x25051a0, 0xc00003c048, 0xc02253dcc0, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:117 +0x88
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).collectResource(0xc000146380, 0x25051a0, 0xc00003c048, 0x21ccf4e, 0x2, 0x2512c40, 0xc0002ebd80, 0x36cc0d000, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:758 +0x99e
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect.func1(0xc01b1f5170, 0xc000146380, 0x25051a0, 0xc00003c048, 0x2512c40, 0xc0002ebd80, 0x21ccf4e, 0x2)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:604 +0x9e
created by github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:602 +0x299

goroutine 6079 [select, 82 minutes]:
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).push(0xc02253d640, 0x25051a0, 0xc00003c048, 0x1c50cc0, 0xc00ef1ca20, 0xc00ef1ca20)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:47 +0xec
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).push-fm(0x25051a0, 0xc00003c048, 0x1c50cc0, 0xc00ef1ca20, 0x7)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:100 +0x52
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).chunker(0xc000146380, 0x25051a0, 0xc00003c048, 0xc0159054d0, 0xc00f5a79e0, 0x81d260, 0xed3a58ac0, 0x0, 0x0, 0xed3a58a84, ...)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:677 +0x708
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).collectResource.func2(0x25051a0, 0xc00003c048, 0xc0159054d0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:752 +0x8b
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Fill.func1(0xc02253d640, 0xc001312f50, 0x25051a0, 0xc00003c048)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:100 +0xa0
created by github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Fill
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:98 +0x76

goroutine 1177 [semacquire, 81 minutes]:
sync.runtime_SemacquireMutex(0xc0001463d8, 0xc00ca49100)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:71 +0x3d
sync.(*RWMutex).Lock(0xc0001463d0)
        /usr/local/Cellar/go/1.11/libexec/src/sync/rwmutex.go:98 +0x74
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).discover(0xc000146380, 0x2505160, 0xc0002a4740, 0x0, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:452 +0xd90
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).startDiscovery.func1(0xc000146380, 0x2505160, 0xc0002a4740)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:232 +0x113
created by github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).startDiscovery
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:228 +0x81

goroutine 12633 [IO wait]:
internal/poll.runtime_pollWait(0x7f074bd93bc0, 0x72, 0xc000c2de58)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc0020de098, 0x72, 0xffffffffffffff00, 0x24eb300, 0x3bc37f0)
        /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:85 +0x9a
internal/poll.(*pollDesc).waitRead(0xc0020de098, 0xc022c4e000, 0x1, 0x1)
        /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc0020de080, 0xc022c4e0d1, 0x1, 0x1, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/internal/poll/fd_unix.go:169 +0x179
net.(*netFD).Read(0xc0020de080, 0xc022c4e0d1, 0x1, 0x1, 0x0, 0x24ad6, 0x259a9)
        /usr/local/Cellar/go/1.11/libexec/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc000202aa0, 0xc022c4e0d1, 0x1, 0x1, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.11/libexec/src/net/net.go:177 +0x68
net/http.(*connReader).backgroundRead(0xc022c4e0c0)
        /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:676 +0x5a
created by net/http.(*connReader).startBackgroundRead
        /usr/local/Cellar/go/1.11/libexec/src/net/http/server.go:672 +0xd2

goroutine 6076 [semacquire, 83 minutes]:
sync.runtime_Semacquire(0xc01b1f5178)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc01b1f5170)
        /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130 +0x64
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect(0xc000146380, 0x25051a0, 0xc00003c048, 0x2512c40, 0xc0002ebd80, 0x0, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:611 +0x2ac
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*VSphere).Gather.func1(0xc00dc1bf90, 0x2512c40, 0xc0002ebd80, 0xc009476cc0, 0xc000146380)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsphere.go:269 +0x88
created by github.com/influxdata/telegraf/plugins/inputs/vsphere.(*VSphere).Gather
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsphere.go:267 +0x13e

goroutine 6078 [semacquire, 83 minutes]:
sync.runtime_Semacquire(0xc002a51528)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc002a51520)
        /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130 +0x64
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run.func1(0xc02253d640, 0x2, 0x25051a0, 0xc00003c048, 0xc001312f00)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:88 +0xed
created by github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:67 +0x84

goroutine 6075 [semacquire, 83 minutes]:
sync.runtime_Semacquire(0xc00dc1bf98)
        /usr/local/Cellar/go/1.11/libexec/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc00dc1bf90)
        /usr/local/Cellar/go/1.11/libexec/src/sync/waitgroup.go:130 +0x64
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*VSphere).Gather(0xc0002aefc0, 0x2512c40, 0xc0002ebd80, 0x2710, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsphere.go:282 +0x168
github.com/influxdata/telegraf/internal/models.(*RunningInput).Gather(0xc000043240, 0x2512c40, 0xc0002ebd80, 0xc001637fc0, 0x88ca67)
        /Users/prydin/go/src/github.com/influxdata/telegraf/internal/models/running_input.go:86 +0x6d
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1(0xc0001786c0, 0xc000043240, 0x2512c40, 0xc0002ebd80)
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:283 +0x3f
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce
        /Users/prydin/go/src/github.com/influxdata/telegraf/agent/agent.go:282 +0xdc

goroutine 6097 [select, 82 minutes]:
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).pushOut(0xc02253d640, 0x25051a0, 0xc00003c048, 0x0, 0x0, 0x0)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:56 +0xec
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run.func1.1(0xc002a51520, 0xc02253d640, 0x25051a0, 0xc00003c048, 0xc001312f00)
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:80 +0xcd
created by github.com/influxdata/telegraf/plugins/inputs/vsphere.(*WorkerPool).Run.func1
        /Users/prydin/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/workerpool.go:72 +0xcd

NOTE

I just noticed that when I run Telegraf as a systemd unit, it fails as described above, but when I run it as a Linux job with the same parameters as the systemd unit, it works properly for more than 2 hours. I really don't get it. Right now I'm trying to set up a proper InfluxDB Enterprise cluster to check whether this collection failure is caused by using a standalone InfluxDB.

ghost commented 5 years ago

Update

My bad. The plugin works fine in release Telegraf unknown (git: prydin-scale-improvement 646c5960). I just forgot to tell Grafana to connect the metric points over an interval longer than 1 min. I apologize for my huge mistake.

I have increased my interval to 120s and it's working like a charm with all my vCenters.
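
For anyone hitting the same "did not complete within its interval" warning, here is a minimal sketch of that kind of change. It assumes the interval was raised at the plugin level. The comment above does not say whether the agent-wide `interval` under `[agent]` or a per-plugin override was edited, so treat the placement as an assumption. The vCenter URL is the placeholder from the config posted earlier, and credentials are omitted.

```toml
[agent]
  # Agent-wide default from the config posted above; left unchanged in this sketch.
  interval = "60s"

[[inputs.vsphere]]
  vcenters = [ "https://foo.bar/sdk" ]
  # Per-plugin override (assumed placement): give the vSphere collection
  # 120s per cycle instead of the 60s agent default.
  interval = "120s"
```

Setting `interval = "120s"` under `[agent]` instead would apply the longer cycle to every input, not just vSphere.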

danielnelson commented 5 years ago

I believe this is working, and now available, in 1.10.0