influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

vSphere Plugin - VSAN extension not working #13519

Closed: jaymzmac closed this issue 1 year ago

jaymzmac commented 1 year ago

Relevant telegraf.conf

[[inputs.vsphere]]
    interval = "300s"
    vcenters = [ "https://<vcenter-fqdn>/sdk" ]
    username = "user"
    password = "password"
    insecure_skip_verify = true
    collect_concurrency = 1
    discover_concurrency = 1

    # Exclude all other metrics
    host_metric_exclude = ["*"]
    vm_metric_exclude = ["*"]
    cluster_metric_exclude = [ "*" ]
    datastore_metric_exclude = [ "*" ]
    datacenter_metric_exclude = [ "*" ]
    resourcepool_metric_exclude = [ "*" ]

    vsan_metric_include = [ "summary.*" ]
    vsan_metric_skip_verify = true

Logs from Telegraf

root@vsphere-telegraf-01:/etc/telegraf/telegraf.d# telegraf --config vsphere-vsan.conf --test --debug
2023-06-29T09:38:08Z I! Loading config: vsphere-vsan.conf
2023-06-29T09:38:08Z W! DeprecationWarning: Option "force_discover_on_init" of plugin "inputs.vsphere" deprecated since version 1.14.0 and will be removed in 2.0.0: option is ignored
2023-06-29T09:38:08Z I! Starting Telegraf 1.27.1
2023-06-29T09:38:08Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
2023-06-29T09:38:08Z I! Loaded inputs: vsphere
2023-06-29T09:38:08Z I! Loaded aggregators:
2023-06-29T09:38:08Z I! Loaded processors:
2023-06-29T09:38:08Z I! Loaded secretstores:
2023-06-29T09:38:08Z W! Outputs are not used in testing mode!
2023-06-29T09:38:08Z I! Tags enabled: host=vsphere-telegraf-01
2023-06-29T09:38:08Z W! Deprecated inputs: 0 and 1 options
2023-06-29T09:38:08Z D! [agent] Initializing plugins
2023-06-29T09:38:08Z D! [agent] Starting service inputs
2023-06-29T09:38:08Z I! [inputs.vsphere] Starting plugin
2023-06-29T09:38:08Z D! [inputs.vsphere] Creating client: <vcenter-fqdn>
2023-06-29T09:38:09Z D! [inputs.vsphere] Option query for maxQueryMetrics failed. Using default
2023-06-29T09:38:09Z D! [inputs.vsphere] vCenter version is: 7.0.3
2023-06-29T09:38:09Z D! [inputs.vsphere] vCenter says max_query_metrics should be 256
2023-06-29T09:38:09Z D! [inputs.vsphere] Running initial discovery
2023-06-29T09:38:09Z D! [inputs.vsphere] Discover new objects for <vcenter-fqdn>
2023-06-29T09:38:09Z D! [inputs.vsphere] Discovering resources for vsan
2023-06-29T09:38:09Z D! [inputs.vsphere] Discovering resources for datacenter
2023-06-29T09:38:09Z D! [inputs.vsphere] Find(Datacenter, /*) returned 2 objects
2023-06-29T09:38:09Z D! [inputs.vsphere] Discovering resources for cluster
2023-06-29T09:38:09Z D! [inputs.vsphere] Find(ClusterComputeResource, /*/host/**) returned 1 objects
2023-06-29T09:38:09Z D! [inputs.vsphere] Discovering resources for resourcepool
2023-06-29T09:38:09Z D! [inputs.vsphere] Find(ResourcePool, /*/host/**) returned 4 objects
2023-06-29T09:38:09Z D! [inputs.vsphere] Discovering resources for host
2023-06-29T09:38:10Z D! [inputs.vsphere] Find(HostSystem, /*/host/**) returned 3 objects
2023-06-29T09:38:10Z D! [inputs.vsphere] Discovering resources for vm
2023-06-29T09:38:10Z D! [inputs.vsphere] Discovering resources for datastore
2023-06-29T09:38:10Z D! [inputs.vsphere] Find(Datastore, /*/datastore/**) returned 2 objects
2023-06-29T09:38:10Z D! [inputs.vsphere] purged timestamp cache. 0 deleted with 0 remaining
2023-06-29T09:38:10Z D! [agent] Stopping service inputs
2023-06-29T09:38:10Z I! [inputs.vsphere] Stopping plugin
2023-06-29T09:38:10Z D! [inputs.vsphere] Waiting for endpoint "<vcenter-fqdn>" to finish
2023-06-29T09:38:10Z D! [inputs.vsphere] Exiting discovery goroutine for <vcenter-fqdn>
2023-06-29T09:38:10Z D! [agent] Input channel closed
2023-06-29T09:38:10Z D! [agent] Stopped Successfully

System info

Telegraf 1.27.1, Debian 11, vSphere 7.0U3

Docker

No response

Steps to reproduce

  1. Install Telegraf 1.27+ (which has the vsan extension feature) on a system
  2. Using the configuration file attached to this issue, attempt to collect vsan metrics from a vCenter server with a VSAN cluster

Expected behavior

VSAN metrics are collected from vCenter.

Actual behavior

No VSAN metrics are collected from vCenter.

Additional info

I have verified that the performance metrics service is enabled for the VSAN cluster and that I can view the VSAN performance metrics directly in vCenter.

I have also tried configuring the telegraf plugin with a user that has admin rights in vCenter, but that didn't help.

powersj commented 1 year ago

--test

You will note from your output that you got no metrics at all, including metrics from other resources. Running with --test executes a single collection interval and does not wait. As the vsphere plugin readme documents, not all resources produce real-time metrics; some produce historical metrics, which may be delayed.

My suggestion is to run the plugin for up to 35 or so minutes and then see what you get.

jaymzmac commented 1 year ago

If you look at my configuration you will notice that I am excluding all other resource metrics (vm, host, datastore, etc), so it is not surprising that the output is not showing those.

Also, according to the documentation, the vsan summary.* metrics which I have configured the plugin to collect are real time, so I would expect those to work even if using the --test option.

Anyway, I will run the plugin for a while as you suggest and let you know whether any vsan metrics appear.

jaymzmac commented 1 year ago

Still no VSAN metrics being collected after leaving the plugin running for 12 hours.

powersj commented 1 year ago

@gangadharaswamy,

What is missing in this scenario? Does @jaymzmac need to explicitly include the following to collect metrics?

vsan_metric_exclude = [ "" ]

jaymzmac commented 1 year ago

The telegraf process crashes if I include:

vsan_metric_exclude = [ "" ]

or

vsan_metric_exclude = [ ]

Logs:

2023-07-03T09:11:15Z I! [inputs.vsphere] Starting plugin
panic: strconv.ParseInt: parsing "3.0": invalid syntax

goroutine 37 [running]:
github.com/coreos/go-semver/semver.Must(...)
        /go/pkg/mod/github.com/coreos/go-semver@v0.3.1/semver/semver.go:65
github.com/coreos/go-semver/semver.New({0xc000b7ca38?, 0x0?})
        /go/pkg/mod/github.com/coreos/go-semver@v0.3.1/semver/semver.go:49 +0x45
github.com/influxdata/telegraf/plugins/inputs/vsphere.versionLowerThan({0xc000b7ca38?, 0x100000000000000?}, {0x6e86677, 0x3})
        /go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsan.go:499 +0x45
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).collectVsan(0xc00097b200, {0x7b527d8?, 0xc00011c050}, {0x7b77360?, 0xc002649b00})
        /go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsan.go:44 +0x85
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect.func1({0x6e89b16?, 0x0?})
        /go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:962 +0x95
created by github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect
        /go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:958 +0x44a

jaymzmac commented 1 year ago

vCenter API version in my environment is 7.0.3.0. Maybe it's not getting parsed correctly when checking the minimum API version.
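
To illustrate why a four-part version could produce exactly this panic message, here is a minimal sketch using only the Go standard library. It assumes a strict MAJOR.MINOR.PATCH parser that splits on the first two dots only, so the extra ".0" stays attached to the patch field. The function name parseThreePart is hypothetical and is not taken from Telegraf or go-semver.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseThreePart mimics a strict MAJOR.MINOR.PATCH parser (an assumption for
// illustration). It splits on the first two dots only, so any extra
// component stays glued to the patch field.
func parseThreePart(version string) ([3]int64, error) {
	var out [3]int64
	parts := strings.SplitN(version, ".", 3)
	if len(parts) != 3 {
		return out, fmt.Errorf("%q is not in MAJOR.MINOR.PATCH form", version)
	}
	for i, p := range parts {
		n, err := strconv.ParseInt(p, 10, 64)
		if err != nil {
			// For "7.0.3.0" the third field is "3.0", which is not an integer.
			return out, err
		}
		out[i] = n
	}
	return out, nil
}

func main() {
	// The two version strings that appear in this thread.
	for _, v := range []string{"7.0.3", "7.0.3.0"} {
		parsed, err := parseThreePart(v)
		fmt.Println(v, parsed, err)
	}
}
```

With this sketch, "7.0.3" parses cleanly, while "7.0.3.0" returns an error instead of panicking, which is the safer behaviour for input that comes from a remote server.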

powersj commented 1 year ago

vCenter API version in my environment is 7.0.3.0

Correct, this is not a valid semantic version. I have put up https://github.com/influxdata/telegraf/pull/13557 which should handle these types of versions. Can you please download the artifacts attached to that pull request and try them together with vsan_metric_exclude = [ "" ]?
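
One way to compare dotted versions of any length is sketched below. It is only an illustration of the general idea and is not the code from the pull request. The names parseDotted and lowerThan are hypothetical, and "6.5" is an example minimum rather than a value taken from the plugin.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDotted converts a dotted numeric version such as "7.0.3.0" into its
// integer components, accepting any number of components.
func parseDotted(version string) ([]int64, error) {
	fields := strings.Split(version, ".")
	out := make([]int64, 0, len(fields))
	for _, f := range fields {
		n, err := strconv.ParseInt(f, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %w", version, err)
		}
		out = append(out, n)
	}
	return out, nil
}

// lowerThan reports whether current < minimum. It compares component by
// component and treats missing trailing components as zero, so "7.0.3" and
// "7.0.3.0" compare as equal.
func lowerThan(current, minimum string) (bool, error) {
	c, err := parseDotted(current)
	if err != nil {
		return false, err
	}
	m, err := parseDotted(minimum)
	if err != nil {
		return false, err
	}
	for i := 0; i < len(c) || i < len(m); i++ {
		var a, b int64
		if i < len(c) {
			a = c[i]
		}
		if i < len(m) {
			b = m[i]
		}
		if a != b {
			return a < b, nil
		}
	}
	return false, nil
}

func main() {
	// "6.5" is an illustrative minimum, not a value taken from the plugin.
	low, err := lowerThan("7.0.3.0", "6.5")
	fmt.Println(low, err)
}
```

Returning an error rather than panicking lets the caller log the unexpected version string and skip the collection instead of crashing the whole agent.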

jaymzmac commented 1 year ago

I've tested the artifacts in https://github.com/influxdata/telegraf/pull/13557 and confirmed this fixes the issue and we are able to see the vsan metrics.

Hipska commented 1 year ago

Strange thing here is that the log says this:

2023-06-29T09:38:09Z D! [inputs.vsphere] vCenter version is: 7.0.3

So when does that ".0" get added?

jaymzmac commented 1 year ago

7.0.3 is Client.ServiceContent.About.Version (https://github.com/influxdata/telegraf/blob/v1.27.1/plugins/inputs/vsphere/client.go#L295)

7.0.3.0 is Client.ServiceContent.About.ApiVersion (https://github.com/influxdata/telegraf/blob/v1.27.1/plugins/inputs/vsphere/endpoint.go#L467)
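
The difference between the two fields can be shown with a small sketch. The aboutInfo struct below is a local stand-in that holds only the two fields named above (it is not the govmomi type), and hasThreeNumericParts is a hypothetical helper that checks whether a string has the plain MAJOR.MINOR.PATCH shape a semver parser expects.

```go
package main

import (
	"fmt"
	"strings"
)

// aboutInfo is a local stand-in holding only the two fields discussed above.
// It is an assumption for illustration, not the govmomi type.
type aboutInfo struct {
	Version    string
	ApiVersion string
}

// hasThreeNumericParts reports whether s consists of exactly three
// dot-separated, non-empty, digits-only components.
func hasThreeNumericParts(s string) bool {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return false
	}
	for _, p := range parts {
		// Trimming all digits from both ends leaves "" only if p is all digits.
		if p == "" || strings.Trim(p, "0123456789") != "" {
			return false
		}
	}
	return true
}

func main() {
	// Values as reported in this thread.
	about := aboutInfo{Version: "7.0.3", ApiVersion: "7.0.3.0"}
	fmt.Println("Version:", hasThreeNumericParts(about.Version))
	fmt.Println("ApiVersion:", hasThreeNumericParts(about.ApiVersion))
}
```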

Hipska commented 1 year ago

Oh, subtle difference indeed ..

Hipska commented 1 year ago

@powersj is it really needed to use ApiVersion instead of Version for this check? (then we could still use semver...