influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Telegraf vSphere Plugin not collecting Datastore Metrics after initial discovery #6841

Closed paladin245 closed 4 years ago

paladin245 commented 4 years ago

Relevant telegraf.conf:

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

## Historical instance
[[inputs.vsphere]]

  # Connection
  interval = "300s"
  timeout = "300s"
  separator = "_"
  vcenters = [ "redacted" ]
  username = "redacted"
  password = "redacted"
  insecure_skip_verify = true

  # Discovery
  force_discover_on_init = true
  discover_concurrency = 1
  object_discovery_interval = "3600s"

  # Query
  max_query_objects = 256
  max_query_metrics = 256
  collect_concurrency = 1

  # Include all historical metrics
  datastore_metric_include = []
  # cluster_metric_include = []
  # datacenter_metric_include = []

  # Exclude all real time metrics
  host_metric_exclude = ["*"]
  vm_metric_exclude = ["*"]
  cluster_metric_exclude = ["*"]
  datacenter_metric_exclude = ["*"]

System info:

App Server VM: Windows Server 2016

TIG Stack: Telegraf v1.13.0, InfluxDB v1.7.9, Grafana v6.5.2

VMware Environment: vCenter 6.7.0 (Build 15129938), ESXi 6.5.0 (Build 15177306)

Steps to reproduce:

  1. run Telegraf
  2. Wait for discovery to complete and initial write to complete
  3. Check grafana dashboard for new metric data points
  4. Observe no update to metrics occurring on a time series graph (no further data points are recorded)

Expected behavior:

Additional data points should appear every 5 minutes

Actual behavior:

Initial data point is recorded and no further data points; once the last recorded data point is more than 3 hours old, my Grafana dashboard shows no further capacity information. You can see in the log that it finds 132 metrics, but then only consistently writes 6 metrics. It's almost as if it's not writing the others simply because their values have not changed. But the values remaining constant is to be expected, as datastore capacity may not change for days, or even weeks.

Sometimes, even when I stop/restart the Telegraf instance, no new capacity metrics are written.

Additional info:

I'm using the following query within Grafana to plot the data on a graph that shows me each recorded data point in the last 3 hours:

SELECT mean("used_latest") * (100 / mean("capacity_latest"))  FROM "vsphere_datastore_disk" WHERE ("source" =~ /^$datastore$/) AND $timeFilter GROUP BY time($__interval) , "source"  fill(none)

You will also note that I have commented out the other metric gathering in my config file and added exclusions for cluster/datacenter metrics to ensure they weren't causing any issues. To isolate the problem, I'm running this config as its own instance, writing to its own log file so that I can capture the issue.

Any help diagnosing why this is happening is appreciated as I have tried many suggestions found in other reported issue threads, on the forums and on reddit with no success so far.

2020-01-01T05:52:57Z I! Loaded inputs: vsphere
2020-01-01T05:52:57Z I! Loaded aggregators: 
2020-01-01T05:52:57Z I! Loaded processors: 
2020-01-01T05:52:57Z I! Loaded outputs: influxdb
2020-01-01T05:52:57Z I! Tags enabled: host=CRUSADER
2020-01-01T05:52:57Z I! [agent] Config: Interval:5m0s, Quiet:false, Hostname:"CRUSADER", Flush Interval:5m0s
2020-01-01T05:52:57Z D! [agent] Initializing plugins
2020-01-01T05:52:57Z D! [agent] Connecting outputs
2020-01-01T05:52:57Z D! [agent] Attempting connection to [outputs.influxdb]
2020-01-01T05:52:57Z D! [agent] Successfully connected to outputs.influxdb
2020-01-01T05:52:57Z D! [agent] Starting service inputs
2020-01-01T05:52:57Z I! [inputs.vsphere] Starting plugin
2020-01-01T05:52:57Z D! [inputs.vsphere] Creating client: VC1.RFHN.local
2020-01-01T05:52:57Z D! [inputs.vsphere] Option query for maxQueryMetrics failed. Using default
2020-01-01T05:52:57Z D! [inputs.vsphere] vCenter version is: 6.7.0
2020-01-01T05:52:57Z D! [inputs.vsphere] vCenter says max_query_metrics should be 256
2020-01-01T05:52:57Z D! [inputs.vsphere] Running initial discovery and waiting for it to finish
2020-01-01T05:52:57Z D! [inputs.vsphere] Discover new objects for VC1.RFHN.local
2020-01-01T05:52:57Z D! [inputs.vsphere] Discovering resources for datacenter
2020-01-01T05:52:57Z D! [inputs.vsphere] Find(Datacenter, /*) returned 1 objects
2020-01-01T05:52:57Z D! [inputs.vsphere] No parent found for Folder:group-d1 (ascending from Folder:group-d1)
2020-01-01T05:52:57Z D! [inputs.vsphere] Discovering resources for cluster
2020-01-01T05:52:57Z D! [inputs.vsphere] Find(ClusterComputeResource, /*/host/**) returned 2 objects
2020-01-01T05:52:57Z D! [inputs.vsphere] Discovering resources for host
2020-01-01T05:52:57Z D! [inputs.vsphere] Find(HostSystem, /*/host/**) returned 2 objects
2020-01-01T05:52:57Z D! [inputs.vsphere] Discovering resources for vm
2020-01-01T05:52:57Z D! [inputs.vsphere] Discovering resources for datastore
2020-01-01T05:52:57Z D! [inputs.vsphere] Find(Datastore, /*/datastore/**) returned 6 objects
2020-01-01T05:52:58Z D! [inputs.vsphere] Found 22 metrics for VS2-NVMe
2020-01-01T05:52:59Z D! [inputs.vsphere] Found 22 metrics for VS1-HDD
2020-01-01T05:52:59Z D! [inputs.vsphere] Found 22 metrics for VS1-NVMe
2020-01-01T05:52:59Z D! [inputs.vsphere] Found 22 metrics for VS2-HDD
2020-01-01T05:53:00Z D! [inputs.vsphere] Found 22 metrics for SW-HDD
2020-01-01T05:53:00Z D! [inputs.vsphere] Found 22 metrics for SW-SSD
2020-01-01T05:55:06Z D! [inputs.vsphere] Interval estimated to 1m0s
2020-01-01T05:55:06Z D! [inputs.vsphere] Collecting metrics for 6 objects of type datastore for VC1.RFHN.local
2020-01-01T05:55:06Z D! [inputs.vsphere] Queuing query: 6 objects, 132 metrics (0 remaining) of type datastore for VC1.RFHN.local. Total objects 6 (final chunk)
2020-01-01T05:55:06Z D! [inputs.vsphere] Query for datastore has 6 QuerySpecs
2020-01-01T05:55:07Z D! [inputs.vsphere] Query for datastore returned metrics for 6 objects
2020-01-01T05:55:07Z D! [inputs.vsphere] CollectChunk for datastore returned 108 metrics
2020-01-01T05:55:07Z D! [inputs.vsphere] Latest sample for datastore set to 2020-01-01 05:55:00 +0000 UTC
2020-01-01T05:55:07Z D! [inputs.vsphere] purged timestamp cache. 0 deleted with 6 remaining
2020-01-01T06:00:05Z D! [outputs.influxdb] Wrote batch of 18 metrics in 12.0072ms
2020-01-01T06:00:05Z D! [outputs.influxdb] Buffer fullness: 0 / 1000 metrics
2020-01-01T06:00:07Z D! [inputs.vsphere] Raw interval 5m0.6380855s, padded: 7m30.6380855s, estimated: 5m0s
2020-01-01T06:00:07Z D! [inputs.vsphere] Interval estimated to 5m0s
2020-01-01T06:00:07Z D! [inputs.vsphere] Latest: 2020-01-01 05:55:00 +0000 UTC, elapsed: 312.307132, resource: datastore
2020-01-01T06:00:07Z D! [inputs.vsphere] Collecting metrics for 6 objects of type datastore for VC1.RFHN.local
2020-01-01T06:00:07Z D! [inputs.vsphere] Queuing query: 6 objects, 132 metrics (0 remaining) of type datastore for VC1.RFHN.local. Total objects 6 (final chunk)
2020-01-01T06:00:07Z D! [inputs.vsphere] Query for datastore has 6 QuerySpecs
2020-01-01T06:00:08Z D! [inputs.vsphere] Query for datastore returned metrics for 6 objects
2020-01-01T06:00:08Z D! [inputs.vsphere] CollectChunk for datastore returned 36 metrics
2020-01-01T06:00:08Z D! [inputs.vsphere] Latest sample for datastore set to 2020-01-01 06:00:00 +0000 UTC
2020-01-01T06:00:08Z D! [inputs.vsphere] purged timestamp cache. 0 deleted with 6 remaining
2020-01-01T06:05:04Z D! [inputs.vsphere] Raw interval 4m56.6905435s, padded: 7m26.6905435s, estimated: 5m0s
2020-01-01T06:05:04Z D! [inputs.vsphere] Interval estimated to 5m0s
2020-01-01T06:05:04Z D! [inputs.vsphere] Latest: 2020-01-01 06:00:00 +0000 UTC, elapsed: 309.004530, resource: datastore
2020-01-01T06:05:04Z D! [inputs.vsphere] Collecting metrics for 6 objects of type datastore for VC1.RFHN.local
2020-01-01T06:05:04Z D! [inputs.vsphere] Queuing query: 6 objects, 132 metrics (0 remaining) of type datastore for VC1.RFHN.local. Total objects 6 (final chunk)
2020-01-01T06:05:04Z D! [inputs.vsphere] Query for datastore has 6 QuerySpecs
2020-01-01T06:05:04Z D! [outputs.influxdb] Wrote batch of 6 metrics in 11.0005ms
2020-01-01T06:05:04Z D! [outputs.influxdb] Buffer fullness: 0 / 1000 metrics
2020-01-01T06:05:05Z D! [inputs.vsphere] Query for datastore returned metrics for 6 objects
2020-01-01T06:05:05Z D! [inputs.vsphere] CollectChunk for datastore returned 36 metrics
2020-01-01T06:05:05Z D! [inputs.vsphere] Latest sample for datastore set to 2020-01-01 06:05:00 +0000 UTC
2020-01-01T06:05:05Z D! [inputs.vsphere] purged timestamp cache. 0 deleted with 6 remaining

Please see the attached files for full log and config file (I've added .txt to the config file to allow upload) telegraf-historical.conf.txt telegraf-historical.log

danielnelson commented 4 years ago

Can you also attach the output (redacted as needed) from:

telegraf --input-filter vsphere --test
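
If Telegraf is installed as a Windows service, point the test run at the historical config explicitly; the path below is only illustrative:

telegraf --config "C:\Program Files\telegraf\telegraf-historical.conf" --input-filter vsphere --test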

You can see in the log that it finds 132 metrics, but then only consistently writes 6 metrics.

This is a little misleading on our part because there are two types of metrics involved, vSphere and Telegraf metrics, and they aren't one to one. A Telegraf metric can consist of multiple vSphere metrics.
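
For illustration, one Telegraf metric here is a single line-protocol point whose fields each carry one vSphere counter; the line below is made up (values and tag set are illustrative, only the measurement and field names follow the query and metric list you posted):

vsphere_datastore_disk,source=VS1-NVMe,vcenter=VC1.RFHN.local used_latest=43979106032,capacity_latest=943718400000,provisioned_latest=612368384000 1577858100000000000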

prydin commented 4 years ago

It's collecting 108 metrics according to the log:

CollectChunk for datastore returned 108 metrics

The discrepancy may be due to some metrics not being populated at all times. This is normal.

As @danielnelson points out, every data point in vSphere counts as a metric in the vSphere API. That's what's reflected in the log. InfluxDB counts metrics differently.

Are you seeing the metrics you expect on the InfluxDB side?

paladin245 commented 4 years ago

This is a little misleading on our part because there are two types of metrics involved, vSphere and Telegraf metrics, and they aren't one to one. A Telegraf metric can consist of multiple vSphere metrics.

I was not aware that this was the case. I have tried my best to understand how Telegraf works, but I'm very much still wrapping my head around it, so thank you for clarifying. See the attached text file with the output you requested.

output.txt

paladin245 commented 4 years ago

The discrepancy may be due to some metrics not being populated at all time. This is normal.

As @danielnelson points out, every data point in vSphere counts as a metric in the vSphere API. That's what's reflected in the log. InfluxDB counts metrics differently.

Are you seeing the metrics you expect on the InfluxDB side?

I was curious if this might be the case. Yes, the metrics seem to be recorded correctly on the initial discovery and I can work with them in Grafana. However, after these initial data points, I only occasionally (maybe 2-3 times a day at most) get new data points. Is this simply because no data has changed, so Telegraf doesn't collect it? Or is this an issue on the VMware side?

I also wanted to quickly advise you of the statistics settings in my vCenter Server (these are all defaults, except that I adjusted the statistics level to the highest for each interval). I know Telegraf only uses the 5m data, so the other intervals can probably remain at Level 1, but I put them all up temporarily to see if it had any effect on the issue. This was done prior to raising the issue here on GitHub.

Enabled - Interval Duration - Save For - Statistics Level
-------------------------------------------------------------------
Yes -  5 minutes - 1 day   - Level 4
Yes - 30 minutes - 1 week  - Level 3
Yes -  2 hours   - 1 month - Level 3
Yes -  1 day     - 1 year  - Level 3

One of the suggestions from another issue (#6580) was to collect specific metrics only, which I have had running over the last 24 hours, and it started consistently reporting data points last night. I adjusted the config file as follows:

From:

  datastore_metric_include = []

To:

  datastore_metric_include = [
    "disk.used.latest",
    "disk.provisioned.latest",
    "disk.capacity.latest",
    "disk.capacity.provisioned.average",
    "disk.capacity.usage.average",
    ]

And these were the resulting log entries:

2020-01-04T00:55:04Z D! [outputs.influxdb] Buffer fullness: 0 / 1000 metrics
2020-01-04T00:55:07Z D! [inputs.vsphere] Raw interval 5m1.1333895s, padded: 7m31.1333895s, estimated: 5m0s
2020-01-04T00:55:07Z D! [inputs.vsphere] Interval estimated to 5m0s
2020-01-04T00:55:07Z D! [inputs.vsphere] Latest: 2020-01-04 00:25:00 +0000 UTC, elapsed: 1812.696243, resource: datastore
2020-01-04T00:55:07Z D! [inputs.vsphere] Collecting metrics for 6 objects of type datastore for VC1.RFHN.local
2020-01-04T00:55:07Z D! [inputs.vsphere] Queuing query: 6 objects, 30 metrics (0 remaining) of type datastore for VC1.RFHN.local. Total objects 6 (final chunk)
2020-01-04T00:55:07Z D! [inputs.vsphere] Query for datastore has 6 QuerySpecs
2020-01-04T00:55:08Z D! [inputs.vsphere] Query for datastore returned metrics for 6 objects
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.used.latest, SW-SSD
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.provisioned.latest, SW-SSD
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.capacity.latest, SW-SSD
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.used.latest, SW-HDD
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.provisioned.latest, SW-HDD
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.capacity.latest, SW-HDD
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.used.latest, VS2-NVMe
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.provisioned.latest, VS2-NVMe
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.capacity.latest, VS2-NVMe
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.used.latest, VS1-NVMe
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.provisioned.latest, VS1-NVMe
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.capacity.latest, VS1-NVMe
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.used.latest, VS2-HDD
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.provisioned.latest, VS2-HDD
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.capacity.latest, VS2-HDD
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.used.latest, VS1-HDD
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.provisioned.latest, VS1-HDD
2020-01-04T00:55:08Z D! [inputs.vsphere] Missing value for: disk.capacity.latest, VS1-HDD
2020-01-04T00:55:08Z D! [inputs.vsphere] CollectChunk for datastore returned 0 metrics
2020-01-04T00:55:08Z D! [inputs.vsphere] Latest sample for datastore set to 0001-01-01 00:00:00 +0000 UTC
2020-01-04T00:55:08Z D! [inputs.vsphere] purged timestamp cache. 0 deleted with 6 remaining

This wasn't showing up in my original log. It would have been helpful to get these errors when running with the "all" ([]) value for datastore_metric_include, but perhaps this is something unique to my environment.

Again any and all help with this is appreciated!

prydin commented 4 years ago

It's difficult to give you a good answer on the first portion of your question without knowing specifically which metrics are only reported a couple of times per day. Some metrics are generated at a much lower frequency than 5m, but I've never seen them reported as infrequently as you describe. In general, storage metrics in vSphere are notorious for being a bit unpredictable. Depending on what the underlying storage architecture is, you're going to see different sets of metrics.

The second part is easy to address. When you're specifying metrics using wildcards (remember that [] is really interpreted as ["*"]), the vSphere plugin goes out to discover available metrics. If some are missing, they simply won't be included in the discovered set. But when you explicitly specify metric names like you did, you're forcing the metric collection logic to include them in the query. If they're missing, you'll get an empty result set. This is how the plugin is designed to operate. It turns out that querying for specific names when they're available (i.e. when you're not using wildcards) is a bit faster, so we do that whenever possible.
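
In config terms, the two modes look like this (a minimal sketch; the metric names are just the ones already used in this thread):

  # Wildcard form: the plugin discovers what's available and silently skips anything missing
  datastore_metric_include = []            # treated the same as ["*"]

  # Explicit form: these names are forced into the query; if vCenter has no value for one,
  # you get the "Missing value for:" debug lines and a smaller (or empty) result chunk
  datastore_metric_include = [
    "disk.used.latest",
    "disk.capacity.latest",
  ]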

In general, you should be a bit careful when reading debug logs. They're designed for debugging of the internals of the code, not for "public consumption". The general guideline is that a line starting with "D!" represents a benign behavior.

prydin commented 4 years ago

@paladin245 Do you have access to the govc tool? It's available here: https://github.com/vmware/govmomi/tree/master/govc

Using that tool, you can check what metrics are available from the vSphere API. I just had to use that tool since I was experiencing an issue similar to what you described and it turned out that the data that Telegraf was querying really was missing. Not sure why, but it would point to an issue with vCenter rather than Telegraf.

Try something like this:

prydin-a02:telegraf prydin$ govc metric.sample -n=100 -t=true /wavefrontDC/datastore/somedatastore disk.used.latest
somedatastore  -  disk.used.latest  2020-01-07T10:00:00Z,300,2020-01-07T10:30:00Z,300,2020-01-07T11:00:00Z,300,2020-01-07T11:30:00Z,300,2020-01-07T12:00:00Z,300,2020-01-07T12:30:00Z,300,2020-01-07T13:00:00Z,300,2020-01-07T13:30:00Z,300,2020-01-07T14:00:00Z,300,2020-01-07T14:30:00Z,300,2020-01-07T15:00:00Z,300,2020-01-07T15:30:00Z,300,2020-01-07T16:00:00Z,300,2020-01-07T16:30:00Z,300,2020-01-07T17:00:00Z,300,2020-01-07T17:30:00Z,300,2020-01-07T18:00:00Z,300  45237650848,44494183928,43970221320,43977077672,43979106032,43979221264,43979328096,43979464656,43980064872,43980207696,43980581144,43981371160,43981951248,43982271904,43982369560,43982686872,43983162104  KB

That will give you the samples and timestamps (in UTC) of the data that's available in vCenter. If that matches what you're seeing, you should turn troubleshooting towards vCenter rather than telegraf.
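
govc reads its connection details from environment variables; a minimal setup might look like the sketch below (credentials and the datacenter name are placeholders, and metric.ls is handy for seeing which metric names exist before sampling them):

export GOVC_URL='https://VC1.RFHN.local/sdk'
export GOVC_USERNAME='administrator@vcenter.local'
export GOVC_PASSWORD='redacted'
export GOVC_INSECURE=true
govc metric.ls /YourDC/datastore/somedatastore
govc metric.sample -n=100 -t=true /YourDC/datastore/somedatastore disk.used.latest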

paladin245 commented 4 years ago

@prydin I haven't tried govc yet; I am trying to stand up a collectd instance, as I figured if it also fails to capture data points, that would suggest the data really is missing on the vCenter side.

I've just left Telegraf running over the last 4 days with the specific metric collection from my last post turned on. It was capturing data points every 30 minutes, but then had stretches of 4-5 hours where it just didn't capture anything again.

I'll try your suggestion and report back with my findings.

prydin commented 4 years ago

You could also look in the vCenter Web UI to check if it is missing data points.

prydin commented 4 years ago

@paladin245 Thank you for reporting this issue! I was able to reproduce it and it turns out to be an issue with Telegraf and vSphere 6.7.

For some reason that I haven't been able to track down yet, there can sometimes be a significant delay for non-realtime metrics to become available in vCenter. I've seen delays as long as an hour. The vSphere Telegraf plugin tries to handle delayed metrics by always looking three sample periods back in time. If the metric is at 5 minute granularity, that works out to 15 minutes, which isn't enough to catch extremely delayed metrics.

This may be a regression in vSphere 6.7, because I haven't seen this until very recently.

The way I want to solve this is to make the lookback period configurable. I'm reluctant to make the default longer than 3 periods back, since it would cause some overhead. But if one is experiencing issues with dropped metrics, the solution would be to increase the lookback value.
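
If that lands, the fix on the user side would be a single extra line in the plugin config, roughly like the sketch below; the option name here is an assumption, so check the vsphere plugin README of whatever release ships the change:

[[inputs.vsphere]]
  # ... existing historical settings ...
  # assumed option: look further back than the default 3 sample periods
  # (15 minutes at 5m granularity) to catch heavily delayed datastore metrics
  metric_lookback = 6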

Again, thanks for the detailed issue report. It helps us make this plugin better!

prydin commented 4 years ago

@paladin245 Just wanted to give you an update on this.

It turned out this was a bit harder than I thought to fix. Here's how it works today: We have to assume that metrics can be significantly delayed. So this means we can't just look 5 minutes back for a 5 minute granularity metric, but need to look back to the time just after the last metric we were successfully able to retrieve.

We do this by keeping a table of the last timestamp we've seen for a particular resource. This worked fine, since vCenter seemed to delay all metrics by about the same amount of time. However, this doesn't seem to be the case with 6.7. So I could have one metric for resource A that's running on time and one that's 30 minutes late. This means that the algorithm of keeping track of the latest timestamp per resource no longer works. We'd have to keep track of it per resource AND metric. This makes the code a lot more complicated and causes potential memory consumption issues.

I'm currently looking into how to solve this. Most likely, I'll just keep timestamps per metric and resource type. Since we can omit VMs (they don't get delayed in the same way), the number of timestamps to keep around would probably be manageable (less than a million, probably a lot less).

paladin245 commented 4 years ago

@prydin In vCenter I can see history in 5-minute intervals ever since I turned up the statistics level (for example, I can see week/month views of datastore capacity, and metrics are showing in vCenter).

I ran govc as requested and this is a small sample of the output:

SW-SSD  77    disk.used.latest  2020-01-09T17:25:00Z,300,2020-01-09T17:55:00Z,300,2020-01-09T18:25:00Z,300,2020-01-09T18:55:00Z,300,2020-01-09T19:25:00Z,300,2020-01-09T19:55:00Z,300,2020-01-09T20:25:00Z,300,2020-01-09T20:55:00Z,300,2020-01-09T21:25:00Z,300,2020-01-09T21:55:00Z,300,2020-01-09T22:25:00Z,300,2020-01-09T22:55:00Z,300,2020-01-09T23:25:00Z,300,2020-01-09T23:55:00Z,300,2020-01-10T00:25:00Z,300,2020-01-10T00:55:00Z,300,2020-01-10T01:25:00Z,300  58834951,58834951,58834951,58834951,58834951,58834951,58834951,58834951,58834951,58834951,58834951,58834951,58834951,58834951,58834951,58834951,58834951  KB

I assume this indicates that there is data present on vCenter and it is possible to query that data. Any further troubleshooting suggestions?

paladin245 commented 4 years ago

@prydin Sorry, I just read your messages above. Thank you for getting back to me on this, as it has been really frustrating. I'm doing this in our test environment and trying to get it set up for roll-out to the production site. If I'm going to convince my executive rep and my manager that we should pay for active support, I need to be able to demonstrate that the software works first!

I thought this might be something to do with vCenter 6.7, as I have today stood up a 6.5 instance and temporarily added my test hosts to it. Interestingly, metric collection appears to work perfectly fine and my dashboard works, so the issue does indeed seem to be specific to 6.7 and the way VMware has changed metric collection and storage.

If there's anything else I can do to assist with troubleshooting this further please let me know, I'm happy to help and run any tests you would like.

prydin commented 4 years ago

@paladin245 Sorry for the delay. It turned out that I had to rewrite part of the metric collection code to fix this. A lot of the code was based on the assumption that if one metric on a resource was delayed, so was everything else. That's no longer true in 6.7, which is probably a good thing. But it broke my code. Badly.

Anyway, here's a prototype of the new code: https://github.com/wavefrontHQ/telegraf/tree/prydin-lookbackfix

Feel free to test it! I've also added a Linux binary if you don't want to build the code yourself. It's here: https://github.com/wavefrontHQ/telegraf/releases/tag/PRYDIN-LOOKBACKFIX USE AT YOUR OWN RISK! THIS HAS ONLY UNDERGONE BASIC TESTING! I'd love some feedback, though!

paladin245 commented 4 years ago

@prydin Are you able to build a Windows binary for testing? We are running 95% Windows platforms, so all my testing boxes (and therefore the one this is currently set up on) are Windows Server machines. Cheers!

danielnelson commented 4 years ago

@prydin Feel free to open a "Draft Pull Request" which will trigger a build by CircleCI, then I can add a link to the build.

keliansb commented 4 years ago

Just tried the Linux binary you provided @prydin and it seems to be working perfectly! I was having the same issues as @paladin245: the datastore metrics were not collected after a while (not sure if it was after the initial discovery or not, by the way). I've been running your modified version for 2 days now and the datastore metrics are still being collected and displayed in Grafana. Let me know if you need additional logs or tests, I would be happy to help.

prydin commented 4 years ago

Yeah, I've seen much better results in my lab too after the fix. Let me run it for another day or two and then I'll cut a PR!

paladin245 commented 4 years ago

I've managed to get mine running pretty consistently now, but it's still not stable enough to put into full-time production use, so I look forward to re-testing when the new version is out. How are you going with it, @prydin?

prydin commented 4 years ago

I've just submitted a PR with the latest changes. However, I'd be very interested in knowing what instabilities you're still experiencing.

prydin commented 4 years ago

@danielnelson I think we can close this. I reworked the way metrics are collected to account for some new behavior introduced by vSphere 6.7. Since I never heard back from the reporter on any additional issues, I think we can consider this as solved.

BuzzITServices commented 4 years ago

@prydin - Hi, I am so new to all this, but learning!!

I am experiencing the same issue described here and just wanted to understand how I can run the new code too. Sorry for a stupidly easy question, but I am still learning!

Thanks!!

paladin245 commented 4 years ago

@prydin My apologies for not responding to you; I've taken on a new role where I work and honestly haven't had time to come back to my Grafana test lab. Unfortunately the test version you provided, which @CaptAintHere used, didn't work for me, as I'm running it in a Windows Server 2016 lab (I'm assuming it has to be compiled differently for use on Windows).

Have the changes you made been included in the latest Windows-supported build?

paladin245 commented 4 years ago

@BuzzIT-Roadshow

I ended up splitting my queries into several instances of Telegraf using the following settings for each:

telegraf-realtime.conf

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

## Realtime instance
[[inputs.vsphere]]

  # Connection
  interval = "30s"
  timeout = "30s"
  separator = "_"
  vcenters = [ "https://VC1.RFHN.local/sdk" ]
  username = "administrator@vcenter.local"
  password = "a4ntHNdsxq@LgPk%"
  insecure_skip_verify = true

  # Discovery
  force_discover_on_init = true
  discover_concurrency = 1
  object_discovery_interval = "360s"

  # Query
  max_query_objects = 256
  max_query_metrics = 256
  collect_concurrency = 1

  # Include all real time metrics
  host_metric_include = []
  vm_metric_include = []

  # Exclude all historical metrics
  datastore_metric_exclude = ["*"]
  cluster_metric_exclude = ["*"]
  datacenter_metric_exclude = ["*"]

telegraf-historical.conf

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

## Historical instance
[[inputs.vsphere]]

  # Connection
  interval = "300s"
  timeout = "300s"
  separator = "_"
  vcenters = [ "https://VC1.RFHN.local/sdk" ]
  username = "administrator@vcenter.local"
  password = "a4ntHNdsxq@LgPk%"
  insecure_skip_verify = true

  # Discovery
  force_discover_on_init = true
  discover_concurrency = 1
  object_discovery_interval = "3600s"

  # Query
  max_query_objects = 256
  max_query_metrics = 256
  collect_concurrency = 1

  # Include all historical metrics
  datastore_metric_include = []
  cluster_metric_include = []
  datacenter_metric_include = []

  # Exclude all real time metrics
  host_metric_exclude = ["*"]
  vm_metric_exclude = ["*"]
  cluster_metric_exclude = ["*"]
  datacenter_metric_exclude = ["*"]

I found that doing it this way seems to allow it to capture all of the information, with fewer of the timeouts I was seeing in the debug log. You will need to read through and configure the other settings above the Input Plugins header block yourself, as these will be specific to your environment.
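
On Windows, one way to run the two configs side by side is to register each one as its own service; in recent Telegraf versions that looks roughly like the commands below (service names and paths are only an example of the approach):

telegraf.exe --service install --service-name telegraf-realtime --config "C:\Program Files\telegraf\telegraf-realtime.conf"
telegraf.exe --service install --service-name telegraf-historical --config "C:\Program Files\telegraf\telegraf-historical.conf"
net start telegraf-realtime
net start telegraf-historical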

Hopefully this helps, otherwise @prydin usually responds quickly :)

BuzzITServices commented 4 years ago

@prydin My apologies for not responding to you; I've taken on a new role where I work and honestly haven't had time to come back to my Grafana test lab. Unfortunately the test version you provided, which @CaptAintHere used, didn't work for me, as I'm running it in a Windows Server 2016 lab (I'm assuming it has to be compiled differently for use on Windows).

Have the changes you made been included in the latest Windows-supported build?

No need to apologize! All is working perfectly after splitting the queries up, and I have had around 15 hours' worth of metrics delivered with no drops (it was dropping the datastore metrics after 3 hours).

Thank you again!

paladin245 commented 4 years ago

@BuzzIT-Roadshow that is fantastic to hear!

Note that even with this split, I still find that over the course of a week you will experience some gaps in the metrics, but for the most part splitting does seem to resolve it for any user who is working on Windows.

If you have a Unix-based system then the latest release build should have resolved the issue for you. I'm just waiting to hear from @prydin whether the fix was incorporated into a new Windows-compatible release.

aagrou commented 4 years ago

Hello, I hope this message finds you well. All my 6.5 vCenters work perfectly, but the 6.7 ones do not, even after updating Telegraf to the latest version.

I am getting these messages in telegraf.log:

2020-06-13T01:21:59Z I! [inputs.vsphere] Starting plugin
2020-06-13T01:21:59Z D! [inputs.vsphere] Creating client: 172.29.161.30
2020-06-13T01:21:59Z D! [inputs.vsphere] Option query for maxQueryMetrics failed. Using default
2020-06-13T01:21:59Z D! [inputs.vsphere] vCenter version is: 6.7.0
2020-06-13T01:21:59Z D! [inputs.vsphere] vCenter says max_query_metrics should be 256
2020-06-13T01:21:59Z D! [inputs.vsphere] Running initial discovery
2020-06-13T01:21:59Z D! [inputs.vsphere] Discover new objects for 172.29.161.30
2020-06-13T01:21:59Z D! [inputs.vsphere] Discovering resources for datacenter
2020-06-13T01:21:59Z D! [inputs.vsphere] Find(Datacenter, /*) returned 0 objects
2020-06-13T01:21:59Z D! [inputs.vsphere] Discovering resources for cluster
2020-06-13T01:21:59Z D! [inputs.vsphere] Find(ClusterComputeResource, /*/host/**) returned 0 objects
2020-06-13T01:21:59Z D! [inputs.vsphere] Discovering resources for host
2020-06-13T01:21:59Z D! [inputs.vsphere] Find(HostSystem, /*/host/**) returned 0 objects
2020-06-13T01:21:59Z D! [inputs.vsphere] Discovering resources for vm
2020-06-13T01:21:59Z D! [inputs.vsphere] Discovering resources for datastore
2020-06-13T01:21:59Z D! [inputs.vsphere] Find(Datastore, /*/datastore/**) returned 0 objects
2020-06-13T01:21:59Z D! [inputs.vsphere] Using fast metric metadata selection for datastore
2020-06-13T01:22:00Z D! [inputs.vsphere] Interval estimated to 1m0s
2020-06-13T01:22:00Z D! [inputs.vsphere] Collecting metrics for 0 objects of type vm for 172.29.161.30
2020-06-13T01:22:00Z D! [inputs.vsphere] Latest sample for vm set to 0001-01-01 00:00:00 +0000 UTC
2020-06-13T01:22:00Z D! [inputs.vsphere] Interval estimated to 1m0s
2020-06-13T01:22:00Z D! [inputs.vsphere] Collecting metrics for 0 objects of type datacenter for 172.29.161.30
2020-06-13T01:22:00Z D! [inputs.vsphere] Latest sample for datacenter set to 0001-01-01 00:00:00 +0000 UTC
2020-06-13T01:22:00Z D! [inputs.vsphere] Interval estimated to 1m0s
2020-06-13T01:22:00Z D! [inputs.vsphere] Collecting metrics for 0 objects of type cluster for 172.29.161.30
2020-06-13T01:22:00Z D! [inputs.vsphere] Latest sample for cluster set to 0001-01-01 00:00:00 +0000 UTC
2020-06-13T01:22:00Z D! [inputs.vsphere] Interval estimated to 1m0s
2020-06-13T01:22:00Z D! [inputs.vsphere] Collecting metrics for 0 objects of type host for 172.29.161.30
2020-06-13T01:22:00Z D! [inputs.vsphere] Latest sample for host set to 0001-01-01 00:00:00 +0000 UTC
2020-06-13T01:22:00Z D! [inputs.vsphere] purged timestamp cache. 0 deleted with 0 remaining

I think Telegraf couldn't discover any objects from the vCenter. Could you please help me?

Many thanks, Abdelkrime

prydin commented 4 years ago

@aagrou Can you please share your config file?

aagrou commented 4 years ago

Thanks @prydin. I'm using the latest versions: Grafana v7.0.3 (00ee734baf) and InfluxDB 1.8.0.

Below is my config file. It contains some 6.5 vCenters and some 6.7 ones; the 6.5 ones work perfectly but the 6.7 ones are not working, as shown in the previous logs. The config file is:

## Realtime instance
[[inputs.vsphere]]
  ## List of vCenter URLs to be monitored. These three lines must be uncommented
  ## and edited for the plugin to work.
  interval = "20s"
  vcenters = [ "https://vcenter6.5/sdk", "https://vcenter6.5/sdk", "https://vcenter6.7/sdk", "https://vcenter6.7/sdk" ]
  username = ""
  password = ""

  vm_metric_include = []
  host_metric_include = []
  cluster_metric_include = []
  datastore_metric_exclude = ["*"]

  max_query_metrics = 256
  timeout = "60s"
  insecure_skip_verify = true

## Historical instance
[[inputs.vsphere]]
  interval = "300s"
  vcenters = [ "https://vcenter6.5/sdk", "https://vcenter6.5/sdk", "https://vcenter6.7/sdk", "https://vcenter6.7/sdk" ]
  username = "****"
  password = "****"

  datastore_metric_include = [ "disk.capacity.latest", "disk.used.latest", "disk.provisioned.latest" ]
  insecure_skip_verify = true
  force_discover_on_init = true
  host_metric_exclude = ["*"]    # Exclude realtime metrics
  vm_metric_exclude = ["*"]      # Exclude realtime metrics
  max_query_metrics = 256
  collect_concurrency = 3

Many thanks for your collaboration,

Regards, Abdelkrime

aagrou commented 4 years ago

Hello,

Could you please help me?

Many thanks, Abdelkrime