influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

vSphere Cluster CPU - Usage Average not collecting info #8316

Open jorgedlcruz opened 3 years ago

jorgedlcruz commented 3 years ago

Hello, I have been running the vSphere plugin since day one. In the past the vsphere_cluster_cpu - usage_average metric was working perfectly fine, giving us the CPU usage per cluster (every 5 minutes, yes).

But for a few months now it has not been working anymore: the field still exists in InfluxDB, but no new points are coming in.

On the same measurement, vsphere_cluster_cpu, the totalmhz_average field works, so something somewhere is preventing usage_average from being collected.

The vsphere_cluster_mem - usage_average works perfectly as well.

ssoroka commented 3 years ago

@prydin any insight here?

sjwang90 commented 3 years ago

Hey @prydin, checking in to see whether you've heard of this bug at all.

jorgedlcruz commented 3 years ago

I have deployed a vanilla Telegraf + InfluxDB on Ubuntu 20.04 and it behaves the same, so this is easily reproducible: I get vsphere_cluster_cpu - totalmhz_average but no usage_average. Can somebody please take a look? Thanks!

prydin commented 3 years ago

Could you please send me some logs? A common problem is that the vpxd.stats.maxQueryMetrics in vCenter is set too low. vCenter is a bit quirky in the way it counts rows in cluster performance queries, so even if it looks like this value is set very high, you may still run into problems. Try setting it to -1 for unlimited query complexity.
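
Not from this thread, but for reference: if you prefer to script that change instead of using the vSphere UI, a minimal govmomi sketch along these lines should work (the vCenter URL and credentials below are placeholders, and the option's value type can vary by vCenter version):

package main

import (
	"context"
	"fmt"
	"net/url"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25/types"
)

func main() {
	ctx := context.Background()

	// Placeholder vCenter URL and credentials.
	u, err := url.Parse("https://vcenter.example.com/sdk")
	if err != nil {
		panic(err)
	}
	u.User = url.UserPassword("administrator@vsphere.local", "secret")

	c, err := govmomi.NewClient(ctx, u, true) // true skips TLS verification
	if err != nil {
		panic(err)
	}
	defer c.Logout(ctx)

	// vCenter-wide advanced settings live behind ServiceContent.Setting.
	om := object.NewOptionManager(c.Client, *c.ServiceContent.Setting)

	// Read the current value; the key may be absent if it was never set explicitly.
	if opts, err := om.Query(ctx, "config.vpxd.stats.maxQueryMetrics"); err == nil {
		for _, o := range opts {
			fmt.Printf("current: %+v\n", o.GetOptionValue())
		}
	}

	// Set it to -1 (unlimited query complexity). Some vCenter versions store this
	// as a string, others as an integer; adjust the value type if the update is rejected.
	err = om.Update(ctx, []types.BaseOptionValue{
		&types.OptionValue{Key: "config.vpxd.stats.maxQueryMetrics", Value: "-1"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("config.vpxd.stats.maxQueryMetrics set to -1")
}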

jorgedlcruz commented 3 years ago

Hello, sure. I enabled the parameter on vCenter as you mentioned, following the VMware instructions, and restarted Telegraf. Note that this used to work a few Telegraf releases ago without me editing vCenter at all. Log:

prydin commented 3 years ago

It could be that the number of VMs has grown. Some background: when you ask for cluster metrics, vSphere actually constructs them from VM and host metrics. That query can be very complex, and it appears that vCenter counts the total number of rows queried from VMs and hosts against the limit.

Is the log you posted from after you made the change? Is it working now?

jorgedlcruz commented 3 years ago

The number of VMs in my environment is very low, just 32 powered on, but I get the point :) Yes, the log is from after the parameter was added to vCenter.

Nope, still no usage_average at all. I could always get vsphere_cluster_mem - usage_average without a problem, but not the CPU one. Really strange; I am not alone, and this is a fresh vanilla Telegraf + InfluxDB install.

prydin commented 3 years ago

Sorry for the delay. I've been able to reproduce this in a lab. I will provide an update once I understand the root cause a bit better.

prydin commented 3 years ago

@jorgedlcruz I think I have found the issue. This is caused by vCenter delaying data by about 30 minutes. When Telegraf looks for metrics in vCenter, it checks 3 sample periods back in time. For cluster metrics, the minimum sample period is 5 minutes, so the lookback is 15 minutes. Since the latest datapoint is 30 minutes old, it will not be included in this query. We could of course increase the lookback, but that may cause performance issues.

Here's an example of some very delayed data:

prydin-a02:govc prydin$ govc metric.sample -n=12 -t /wavefrontDC/host/wfvsancluster cpu.usage.average
wfvsancluster  -  cpu.usage.average  2020-12-03T17:10:00Z,300,2020-12-03T17:15:00Z,300,2020-12-03T17:20:00Z,300,2020-12-03T17:25:00Z,300,2020-12-03T17:30:00Z,300,2020-12-03T17:35:00Z,300,2020-12-03T17:40:00Z,300  11.30,10.78,11.09,10.93,11.27,11.73,10.90  %
prydin-a02:govc prydin$ date -u
Thu Dec  3 18:09:57 UTC 2020
prydin-a02:govc prydin$ 
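
To make the window arithmetic concrete, here is a minimal Go sketch (not the plugin's actual code, just the arithmetic described above):

package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		samplePeriod = 5 * time.Minute // minimum sample period for cluster-level metrics
		lookback     = 3               // sample periods the plugin looks back
	)

	now := time.Now().UTC()
	windowStart := now.Add(-lookback * samplePeriod) // now - 15 minutes

	// vCenter's newest cluster datapoint in the scenario above is ~30 minutes old.
	latestSample := now.Add(-30 * time.Minute)

	fmt.Println("query window start:  ", windowStart.Format(time.RFC3339))
	fmt.Println("latest cluster sample:", latestSample.Format(time.RFC3339))
	// Prints false: the sample falls outside the window, so no usage_average point is written.
	fmt.Println("included in query:", latestSample.After(windowStart))
}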

In general, host data is of better quality, cheaper to collect, and available as a 20-second average. As a workaround, I would recommend replacing this metric with a query that aggregates host CPU metrics instead.
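
As an illustration of that workaround (not from this thread), a hedged InfluxQL sketch for InfluxDB 1.x; it assumes the plugin's default naming, i.e. a vsphere_host_cpu measurement with a clustername tag and cpu='instance-total' for the aggregate instance, so adjust the names to your schema:

-- Approximate per-cluster CPU usage from the host metrics Telegraf already collects.
SELECT mean("usage_average") AS "usage_average"
FROM "vsphere_host_cpu"
WHERE "cpu" = 'instance-total' AND time > now() - 1h
GROUP BY time(5m), "clustername"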

jorgedlcruz commented 3 years ago

Hello, thank you for the answer, but that does not explain why this worked fine for several years and only now shows these delays. I am not even sure whether this would happen with an older Telegraf release, like 1.14 or below.

And if it works on older Telegraf releases, I guess it might not be VMware related. Could you try with Telegraf 1.14?

kkruzich commented 3 years ago

@prydin My attention was called over here from https://github.com/influxdata/telegraf/issues/8681. I'm experiencing the same issue. For me, it looks like cluster metrics stopped sometime in July-August 2020. I don't know whether this is the result of an update to vCenter or to Telegraf. Nonetheless, I have a few projects that have converged on needing these metrics, and I'm in a difficult spot. I'd like to find the best possible solution for this.

When you say 'This is caused by vCenter delaying data by about 30 minutes,' has this always been the case? Is this an issue since 6.7? How did previous versions of Telegraf deal with this? In any case, with that 30-minute delay in place, is there anything I can do about it in my configuration? I would almost prefer delayed, or "poorer quality," metrics to none at all.

Is this the only workaround at the moment? "I would recommend replacing this metric with a query that aggregates host CPU metrics instead."

FWIW, I'd like to provide some examples from my environment that support what you've noted above:

Name:         VMware vCenter Server
Vendor:       VMware, Inc.
Version:      6.7.0
Build:        16046713
OS type:      linux-x64
API type:     VirtualCenter
API version:  6.7.3
Product ID:   vpx
UUID:         c531e2a4-95de-470d-b8c5-9a5ada7a865b

# govc metric.sample -n=3 -t /SantaClara/host/SCLCLD100 cpu.usage.average
  [ nothing returned ]

# govc metric.sample -n=6 -t /SantaClara/host/SCLCLD100 cpu.usage.average
SCLCLD100  -  cpu.usage.average  2021-01-13T14:20:00Z,300  31.01  %

# govc metric.sample -n=12 -t /SantaClara/host/SCLCLD100 cpu.usage.average
SCLCLD100  -  cpu.usage.average  2021-01-13T13:50:00Z,300,2021-01-13T13:55:00Z,300,2021-01-13T14:00:00Z,300,2021-01-13T14:05:00Z,300,2021-01-13T14:10:00Z,300,2021-01-13T14:15:00Z,300,2021-01-13T14:20:00Z,300  34.40,35.87,32.15,31.57,30.94,32.00,31.01  %

# govc metric.sample -n=24 -t /SantaClara/host/SCLCLD100 cpu.usage.average
SCLCLD100  -  cpu.usage.average  2021-01-13T12:50:00Z,300,2021-01-13T12:55:00Z,300,2021-01-13T13:00:00Z,300,2021-01-13T13:05:00Z,300,2021-01-13T13:10:00Z,300,2021-01-13T13:15:00Z,300,2021-01-13T13:20:00Z,300,2021-01-13T13:25:00Z,300,2021-01-13T13:30:00Z,300,2021-01-13T13:35:00Z,300,2021-01-13T13:40:00Z,300,2021-01-13T13:45:00Z,300,2021-01-13T13:50:00Z,300,2021-01-13T13:55:00Z,300,2021-01-13T14:00:00Z,300,2021-01-13T14:05:00Z,300,2021-01-13T14:10:00Z,300,2021-01-13T14:15:00Z,300,2021-01-13T14:20:00Z,300  29.63,28.62,30.04,30.24,31.73,36.87,42.54,40.23,38.85,36.90,37.42,36.60,34.40,35.87,32.15,31.57,30.94,32.00,31.01  %

# date -u
Wed Jan 13 14:48:08 UTC 2021

sjwang90 commented 3 years ago

@prydin Could you provide any insight on this issue? Seems like a pretty big issue with Telegraf 1.14+

prydin commented 3 years ago

@sjwang90 Investigating now.

prydin commented 3 years ago

@sjwang90 et al:

1) I am initiating a discussion with the vCenter team to try to understand the reason for the delayed data.

2) As I stated above, it is usually better to synthesize cluster data from host metrics using queries. Host data is "real time" and should not suffer from these delays.

3) I can introduce a lookback parameter. As I said, we currently look 3 periods (15 minutes) back. I'm reluctant to increase the default, as it could affect performance, but we could make it configurable. It would still mean you'd be seeing fairly old data, but at least it wouldn't be missing.

Thoughts?