influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

inputs.cpu errors or returns potentially wrong values on Windows #4269

back2root closed this issue 3 years ago

back2root commented 6 years ago

Relevant telegraf.conf:

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states.
  report_active = false

Not related to the error but only for ease of reproduction:

[[outputs.file]]
#   Write output to "stdout"
files = ["stdout"]
data_format = "influx"

System info:

Affected operating System: Windows

Tested on:

Steps to reproduce:

  1. Download Telegraf
  2. Ensure that inputs.cpu is enabled
  3. I suggest enabling outputs.file with files = ["stdout"] as the only output, although the choice of output should not matter
  4. Run Telegraf. E.g.: telegraf -config telegraf.conf (--test)

Expected behavior:

Telegraf runs smoothly and collects CPU metrics once per configured interval.

Actual behavior:

From time to time, Telegraf throws an error and does not report CPU metrics for all CPU cores:

E! Error: current total CPU time is less than previous total CPU time

The more CPU cores you have (with percpu = true), the more likely the error is to be thrown.

Additional info:

Telegraf uses github.com/shirou/gopsutil/cpu to gather CPU metrics and expects the returned values to be cumulative CPU time used. However, on the Windows platform the library already returns percentage values, which it gathers itself via WMI. As a result, Telegraf's later checks on the returned values fail, because they assume that CPU time used can only increase under normal conditions. The returned CPU percentages do not follow that assumption. In addition, the later calculation of CPU percent used makes no sense when applied to values that are already percentages.

So on the Windows platform, all the checks and calculations that Telegraf performs using the variable lastStats are unnecessary or problematic.
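
To illustrate the failure mode described above, here is a minimal Go sketch. It is not Telegraf's or gopsutil's actual code: the `sample` type, its fields, and the `usagePercent` function are hypothetical names made up for this example, and the numbers are invented. It only assumes what is stated above, namely that the usage calculation compares the current reading against the previous one and rejects a total that has decreased.

```go
package main

import (
	"errors"
	"fmt"
)

// sample is a hypothetical, simplified stand-in for one CPU reading.
// The calculation below expects cumulative CPU seconds in these fields;
// the issue reports that on Windows they held percentages instead.
type sample struct {
	User, System, Idle float64
}

// total sums all states of one reading.
func (s sample) total() float64 { return s.User + s.System + s.Idle }

// The message text is taken from the error quoted in this issue.
var errTimeWentBackwards = errors.New("current total CPU time is less than previous total CPU time")

// usagePercent derives the non-idle percentage from two readings.
// It is only meaningful when the inputs are ever-growing counters.
func usagePercent(prev, cur sample) (float64, error) {
	delta := cur.total() - prev.total()
	if delta < 0 {
		return 0, errTimeWentBackwards
	}
	if delta == 0 {
		// No time elapsed between readings; nothing to compute.
		return 0, nil
	}
	busy := (cur.User - prev.User) + (cur.System - prev.System)
	return 100 * busy / delta, nil
}

func main() {
	// Cumulative CPU seconds: the total only grows, so the check passes.
	p, err := usagePercent(sample{10, 5, 85}, sample{12, 6, 92})
	fmt.Println(p, err)

	// Percentage readings: every total is roughly 100, so the "total"
	// can shrink from one interval to the next and the check fails.
	_, err = usagePercent(sample{20, 10, 70}, sample{15, 5, 79.5})
	fmt.Println(err)
}
```

With percentage inputs, each reading's total hovers around 100, so small fluctuations are enough to make the current total smaller than the previous one. Even when the check happens to pass, the result is a ratio of differences between percentages, which is not a meaningful usage figure.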

argerus commented 6 years ago

Probably caused by this bug in github.com/shirou/gopsutil/cpu

zak-pawel commented 3 years ago

I easily reproduced it using Telegraf 1.6.4. I could not reproduce it using Telegraf 1.16.2; it seems that the mentioned bug in github.com/shirou/gopsutil/cpu was fixed here, in v2.18.12. Telegraf currently uses v2.20.9.