influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Make the output of [[inputs.win_perf_counters]] compatible with their linux counterparts #7138

Closed FrankyBoy closed 11 months ago

FrankyBoy commented 4 years ago

Proposal:

Make [[inputs.win_perf_counters]] generate the same output as [[inputs.cpu]], [[inputs.mem]], etc.

Current behavior:

[[inputs.win_perf_counters]] generates (for example) "win_cpu" with fields like "Percent_Idle_Time", while [[inputs.cpu]] generates "cpu" with fields like "usage_idle". For all intents and purposes these fields are identical though.

Desired behavior:

Remove these needless differences so that both plugins emit the same measurement and field names.

Use case:

This discrepancy makes it impossible to share Windows and Linux usage numbers on one dashboard, meaning I can never monitor my whole system at a glance but have to switch between two views and duplicate work for no good reason.

danielnelson commented 4 years ago

The first thing we should do is document a set of plugin configurations that would match. Is this something you could help with?

I can see a few issues coming up such as needing a processor to rename fields and float vs integer fields as well as counters vs rates. The transition to compatible metrics would be a little tricky to pull off as well.
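To make the renaming concern concrete, here is a minimal sketch using the `processors.rename` plugin. It assumes the `win_cpu` measurement and the `Percent_*` field names quoted earlier in this thread; treating `Percent_Privileged_Time` as the counterpart of `usage_system` is an assumption, not a confirmed equivalence.

```toml
# Sketch only: rename win_perf_counters CPU output toward the inputs.cpu schema.
[[processors.rename]]
  # Apply only to the Windows CPU measurement.
  namepass = ["win_cpu"]

  [[processors.rename.replace]]
    measurement = "win_cpu"
    dest = "cpu"

  [[processors.rename.replace]]
    field = "Percent_Idle_Time"
    dest = "usage_idle"

  [[processors.rename.replace]]
    field = "Percent_User_Time"
    dest = "usage_user"

  # Assumption: "% Privileged Time" is the closest match for usage_system.
  [[processors.rename.replace]]
    field = "Percent_Privileged_Time"
    dest = "usage_system"
```

This only covers names. As far as I know, win_perf_counters tags each series by `instance` while inputs.cpu uses a `cpu` tag with values such as `cpu0` and `cpu-total`, so the tag values would still differ, and the float vs integer and counter vs rate questions remain open.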

Also, consider trying the regular plugin set: mem, cpu, net, etc. These should work on Windows but aren't as well tested.

FrankyBoy commented 4 years ago

I can try but I am really new to this whole topic, so idk how correct my results are gonna be ;)

Another option could also be reimplementing inputs.cpu to actually be based on performance counters, as the whole reason that inputs.cpu is discouraged is claimed performance problems from the WMI it uses. Then you could get rid of having a separate plugin for the same thing (though obviously you'd still have two implementations). While that might not be an advantage on the development-effort side, it is definitely much nicer for users (one could probably even roll out the same config everywhere).

danielnelson commented 4 years ago

> Another option could also be reimplementing inputs.cpu to actually be based on performance counters, as the whole reason that inputs.cpu is discouraged is claimed performance problems from the WMI it uses.

I think this is the approach we should take: identify the current issues with the default enabled plugins on Windows with the goal to switch to using the same set. Almost all of these plugins use the implementation from gopsutil which has improved its Windows support quite a bit over the last few years and I think we may be closer than we think.

Do you think you could take the default Linux config file, run it on Windows, and report back on any major performance or memory issues, or any major missing metrics?

M0rdecay commented 4 years ago

Let's try. Telegraf version: 1.15.3. Configuration tested:

[[inputs.cpu]]

[[inputs.mem]]

[[inputs.disk]]

[[inputs.diskio]]

[[inputs.net]]

[[inputs.system]]

[[inputs.processes]]

[[inputs.kernel]]

[[inputs.linux_sysctl_fs]]

[[inputs.swap]]

[[inputs.procstat]]
  pattern = ".*"
  pid_tag = true
  fielddrop = [ "rlimit_*" ]
  namepass = [ "procstat" ]
  pid_finder = "native" # Only in Windows

[[inputs.procstat]]
  pattern = "telegraf.exe" # "telegraf" on Linux
  namepass = [ "procstat_lookup" ]
  pid_finder = "native" # Only in Windows
  [inputs.procstat.tags]
    appl = "telegraf"

inputs.cpu

Windows

> cpu,cpu=cpu0,host=host.local,stand_name=stand_windows usage_guest=0,usage_guest_nice=0,usage_idle=93.93939393939394,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=6.0606060606060606,usage_user=0 1600862963000000000
> cpu,cpu=cpu1,host=host.local,stand_name=stand_windows usage_guest=0,usage_guest_nice=0,usage_idle=90.9090909090909,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=6.0606060606060606,usage_user=3.0303030303030303 1600862963000000000
> cpu,cpu=cpu2,host=host.local,stand_name=stand_windows usage_guest=0,usage_guest_nice=0,usage_idle=84.84848484848484,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=15.151515151515152,usage_user=0 1600862963000000000
> cpu,cpu=cpu3,host=host.local,stand_name=stand_windows usage_guest=0,usage_guest_nice=0,usage_idle=90.9090909090909,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=9.090909090909092,usage_user=0 1600862963000000000
> cpu,cpu=cpu-total,host=host.local,stand_name=stand_windows usage_guest=0,usage_guest_nice=0,usage_idle=92.1874996565748,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=7.0312503463355815,usage_user=0.781249997089617 1600862963000000000

Linux

> cpu,cpu=cpu0,host=host.local,stand_name=stand_linux usage_guest=0,usage_guest_nice=0,usage_idle=60.869565358234034,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=2.1739130396958632,usage_steal=0,usage_system=23.913043431909305,usage_user=13.04347824133864 1600862929000000000
> cpu,cpu=cpu1,host=host.local,stand_name=stand_linux usage_guest=0,usage_guest_nice=0,usage_idle=55.10204090942,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=30.612244999174088,usage_user=14.285714328988202 1600862929000000000
> cpu,cpu=cpu2,host=host.local,stand_name=stand_linux usage_guest=0,usage_guest_nice=0,usage_idle=69.56521732088298,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=19.56521735568104,usage_user=10.869565196897586 1600862929000000000
> cpu,cpu=cpu3,host=host.local,stand_name=stand_linux usage_guest=0,usage_guest_nice=0,usage_idle=59.18367348490333,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=22.4489796700207,usage_user=18.367346999504452 1600862929000000000
> cpu,cpu=cpu-total,host=host.local,stand_name=stand_linux usage_guest=0,usage_guest_nice=0,usage_idle=60.732984219670264,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0.5235602105110597,usage_steal=0.5235602093682391,usage_system=23.560209471854865,usage_user=14.65968589278591 1600862929000000000

Conclusion - the metrics are identical.

Update: I think the CPU load on Windows should be gathered in a slightly different way, because the types of load are different:

  [[inputs.win_perf_counters.object]]
    # CPU metrics
    ObjectName = "Processor"
    Instances = ["*"]
    Counters = [ "% Idle Time", 
                 "% Interrupt Time", 
                 "% DPC Time", 
                 "% User Time", 
                 "% Privileged Time", 
                 "% Processor Time" ]
    Measurement = "win_cpu"
    WarnOnMissing = false
    IncludeTotal = true

inputs.mem

Windows

> mem,host=host.local,stand_name=stand_windows available=3073785856i,available_percent=71.71405963907694,total=4286169088i,used=1212383232i,used_percent=28.285940360923064 1600862962000000000

Linux

> mem,host=host.local,stand_name=stand_linux active=4997435392i,available=2703466496i,available_percent=33.07635023869159,buffered=0i,cached=2495905792i,commit_limit=5160443904i,committed_as=6686736384i,dirty=172032i,free=620630016i,high_free=0i,high_total=0i,huge_page_size=2097152i,huge_pages_free=0i,huge_pages_total=0i,inactive=1895534592i,low_free=0i,low_total=0i,mapped=54816768i,page_tables=17313792i,shared=24346624i,slab=405049344i,sreclaimable=245194752i,sunreclaim=159854592i,swap_cached=2555904i,swap_free=780681216i,swap_total=1073737728i,total=8173412352i,used=5056876544i,used_percent=61.86983265028349,vmalloc_chunk=35180028882944i,vmalloc_total=35184372087808i,vmalloc_used=68923392i,write_back=0i,write_back_tmp=0i 1600862928000000000

Conclusion - Windows returns only a minimal set of fields. At the very least, one would want to see information about allocated memory pages.
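Until the mem plugin reports more on Windows, the missing paging and commit information could perhaps be collected through performance counters. A minimal sketch, assuming the standard "Memory" performance object; the measurement name `win_mem` is an arbitrary choice, and the resulting field names will not match the Linux ones:

```toml
[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # The Memory object has no instances, hence the "------" placeholder.
    ObjectName = "Memory"
    Instances = ["------"]
    Counters = [
      "Available Bytes",
      "Committed Bytes",
      "Commit Limit",
      "Cache Bytes",
      "Pool Paged Bytes",
      "Pool Nonpaged Bytes",
      "Pages/sec",
      "Page Faults/sec"
    ]
    # Assumption: any measurement name works here.
    Measurement = "win_mem"
```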

inputs.disk

Windows

> disk,device=C:,fstype=NTFS,host=host.local,mode=unknown,path=\C:,stand_name=stand_windows free=26673659904i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=53317988352i,used=26644328448i,used_percent=49.972493845973375 1600862962000000000

Linux

> disk,device=dm-0,fstype=xfs,host=host.local,mode=rw,path=/,stand_name=stand_linux free=4578582528i,inodes_free=4109389i,inodes_total=4192256i,inodes_used=82867i,total=8575254528i,used=3996672000i,used_percent=46.60703640865737 1600862928000000000

Conclusion - the metrics are identical.

inputs.diskio

Windows

No metrics.

Linux

> diskio,host=host.local,name=dm-0,stand_name=stand_linux io_time=2297117i,iops_in_progress=0i,merged_reads=0i,merged_writes=0i,read_bytes=3565867520i,read_time=1080292i,reads=143561i,weighted_io_time=6325809i,write_bytes=26874711552i,write_time=5244870i,writes=3556438i 1600862928000000000

Conclusion - this plugin does not work on Windows. It can be replaced by something like this:

  [[inputs.win_perf_counters.object]]
    # Disks info physical
    ObjectName = "PhysicalDisk"
    Instances = ["*"]
    Counters = ["% Idle Time", "% Disk Time","% Disk Read Time", "% Disk Write Time", "% User Time", "Avg. Disk Queue Length", "Current Disk Queue Length", 
    "Avg. Disk sec/Read", "Avg. Disk sec/Write", "% Free Space", "Free Megabytes", "Disk Reads/sec", "Disk Writes/sec", "Disk Transfers/sec", "Avg. Disk Bytes/Transfer"]
    Measurement = "win_disk_physical"

inputs.net

Windows

> net,host=host.local,interface=Local\ Area\ Connection,stand_name=stand_windows bytes_recv=4076673423i,bytes_sent=3643414976i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=25145684i,packets_sent=22805918i 1600862962000000000

Linux

> net,host=host.local,interface=eth0,stand_name=stand_linux bytes_recv=2111338081142i,bytes_sent=3239348395418i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=15976633797i,packets_sent=16217086767i 1600862928000000000
> net,host=host.local,interface=all,stand_name=stand_linux icmp_inaddrmaskreps=0i,icmp_inaddrmasks=0i,icmp_incsumerrors=0i,icmp_indestunreachs=3715041i,icmp_inechoreps=0i,icmp_inechos=10i,icmp_inerrors=160i,icmp_inmsgs=3715051i,icmp_inparmprobs=0i,icmp_inredirects=0i,icmp_insrcquenchs=0i,icmp_intimeexcds=0i,icmp_intimestampreps=0i,icmp_intimestamps=0i,icmp_outaddrmaskreps=0i,icmp_outaddrmasks=0i,icmp_outdestunreachs=3714636i,icmp_outechoreps=10i,icmp_outechos=0i,icmp_outerrors=0i,icmp_outmsgs=3714646i,icmp_outparmprobs=0i,icmp_outredirects=0i,icmp_outsrcquenchs=0i,icmp_outtimeexcds=0i,icmp_outtimestampreps=0i,icmp_outtimestamps=0i,icmpmsg_intype3=3715041i,icmpmsg_intype8=10i,icmpmsg_outtype0=10i,icmpmsg_outtype3=3714636i,ip_defaultttl=64i,ip_forwarding=2i,ip_forwdatagrams=0i,ip_fragcreates=0i,ip_fragfails=0i,ip_fragoks=0i,ip_inaddrerrors=0i,ip_indelivers=15981074435i,ip_indiscards=0i,ip_inhdrerrors=0i,ip_inreceives=15982811257i,ip_inunknownprotos=0i,ip_outdiscards=1851421i,ip_outnoroutes=0i,ip_outrequests=16224615579i,ip_reasmfails=0i,ip_reasmoks=0i,ip_reasmreqds=0i,ip_reasmtimeout=0i,tcp_activeopens=4682488i,tcp_attemptfails=3575308i,tcp_currestab=75i,tcp_estabresets=3074i,tcp_incsumerrors=2i,tcp_inerrs=3646i,tcp_insegs=15976664856i,tcp_maxconn=-1i,tcp_outrsts=102549i,tcp_outsegs=16243509336i,tcp_passiveopens=31709718i,tcp_retranssegs=54709i,tcp_rtoalgorithm=1i,tcp_rtomax=120000i,tcp_rtomin=200i,udp_incsumerrors=0i,udp_indatagrams=35989i,udp_inerrors=0i,udp_noports=3714307i,udp_outdatagrams=3750296i,udp_rcvbuferrors=0i,udp_sndbuferrors=0i,udplite_incsumerrors=0i,udplite_indatagrams=0i,udplite_inerrors=0i,udplite_noports=0i,udplite_outdatagrams=0i,udplite_rcvbuferrors=0i,udplite_sndbuferrors=0i 1600862928000000000

Conclusion - the basic indicators are identical. Windows does not have an extended set of metrics. Unfortunately, I don't know how it can be gathered in Windows. And is it necessary at all?
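Two possible ways to narrow the gap, sketched under assumptions. On Linux, the `ignore_protocol_stats` option of inputs.net (if your Telegraf version supports it) drops the `interface=all` protocol statistics, so both platforms report the same fields. On Windows, some TCP statistics are exposed through the "TCPv4" performance object; the measurement name `win_net_tcp` is made up for this example.

```toml
# Linux side: skip the protocol statistics that Windows does not report.
[[inputs.net]]
  ignore_protocol_stats = true

# Windows side: collect a few TCP counters through performance counters.
[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # TCPv4 has no instances, hence the "------" placeholder.
    ObjectName = "TCPv4"
    Instances = ["------"]
    Counters = [
      "Connections Established",
      "Connections Active",
      "Connections Passive",
      "Connection Failures",
      "Connections Reset",
      "Segments Received/sec",
      "Segments Sent/sec",
      "Segments Retransmitted/sec"
    ]
    # Assumption: arbitrary measurement name for this sketch.
    Measurement = "win_net_tcp"
```

Several of the Windows counters are per-second rates, while the Linux fields are cumulative counters, so the values are not directly comparable without further processing.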

inputs.system

Windows

> system,host=host.local,stand_name=stand_windows load1=0,load15=0,load5=0,n_cpus=4i,n_users=0i 1600862962000000000
> system,host=host.local,stand_name=stand_windows uptime=13393443i 1600862962000000000
> system,host=host.local,stand_name=stand_windows uptime_format="155 days,  0:24" 1600862962000000000

Linux

> system,host=host.local,stand_name=stand_linux load1=0.15,load15=0.16,load5=0.09,n_cpus=4i,n_users=1i 1600862928000000000
> system,host=host.local,stand_name=stand_linux uptime=17623474i 1600862928000000000
> system,host=host.local,stand_name=stand_linux uptime_format="203 days, 23:24" 1600862928000000000

Conclusions: Since Windows has no notion of load average, these fields are always zero. I don't think the concept applies to Windows at all; there we look at the Processor Queue Length instead. The number of users is incorrect on Windows: it appears to always be zero, or RDP sessions are not counted.

inputs.processes

Windows

No metrics.

Linux

> processes,host=host.local,stand_name=stand_linux blocked=0i,dead=0i,idle=0i,paging=0i,running=1i,sleeping=154i,stopped=0i,total=155i,total_threads=685i,unknown=0i,zombies=0i 1600866201000000000

Conclusion - this plugin does not work on Windows. Unfortunately, I don't know how it can be gathered.
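One partial substitute, as a sketch: the "System" performance object exposes total process and thread counts, which roughly correspond to the `total` and `total_threads` fields on Linux. The per-state breakdown (running, sleeping, zombies, and so on) has no counterpart here, and the measurement name `win_processes` is an assumption for illustration.

```toml
[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # The System object has no instances, hence the "------" placeholder.
    ObjectName = "System"
    Instances = ["------"]
    Counters = ["Processes", "Threads"]
    # Assumption: arbitrary measurement name for this sketch.
    Measurement = "win_processes"
```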

inputs.kernel

Windows

No metrics.

Linux

> kernel,host=host.local,stand_name=stand_linux boot_time=1583239454i,context_switches=86505042818i,entropy_avail=3452i,interrupts=27772527413i,processes_forked=2455215206i 1600862928000000000

Conclusion - this plugin does not work on Windows. It can be replaced by something like this:

  [[inputs.win_perf_counters.object]]
    # System counters
    ObjectName = "System"
    Counters = ["Context Switches/sec", "Processor Queue Length", "Processes"]
    Instances = ["*"]
    Measurement = "win_sys"

  [[inputs.win_perf_counters.object]]
    # System counters
    ObjectName = "Processor"
    Counters = ["Interrupts/sec"]
    Instances = ["_Total"]
    Measurement = "win_sys"

inputs.swap

Windows

> swap,host=host.local,stand_name=stand_windows free=3537133568i,total=5024366592i,used=1487233024i,used_percent=29.600408265751003 1600862962000000000
> swap,host=host.local,stand_name=stand_windows in=0i,out=0i 1600862962000000000

Linux

> swap,host=host.local,stand_name=stand_linux free=780681216i,total=1073737728i,used=293056512i,used_percent=27.293118641352237 1600862928000000000
> swap,host=host.local,stand_name=stand_linux in=976695296i,out=2823151616i 1600862928000000000

Conclusion - the basic indicators are identical. This is almost always enough.

inputs.procstat (lookup section)

Works well everywhere.

inputs.procstat

Windows

> procstat,host=host.local,pattern=.*,pid=3896,process_name=telegraf.exe,stand_name=stand_windows,user=host.local\Administrator cpu_time_guest=0,cpu_time_guest_nice=0,cpu_time_idle=0,cpu_time_iowait=0,cpu_time_irq=0,cpu_time_nice=0,cpu_time_soft_irq=0,cpu_time_steal=0,cpu_time_system=0.265625,cpu_time_user=0.046875,cpu_usage=10.32258320249803,created_at=1600862961511000000i,memory_data=0i,memory_locked=0i,memory_rss=35663872i,memory_stack=0i,memory_swap=0i,memory_usage=0.8320687413215637,memory_vms=31064064i,num_threads=9i,read_bytes=1456i,read_count=8i,write_bytes=24343i,write_count=54i 1600862963000000000

Linux

> procstat,host=host.local,pattern=.*,pid=97608,process_name=telegraf,stand_name=stand_linux,user=telegraf child_major_faults=0i,child_minor_faults=3049i,cpu_time=1i,cpu_time_guest=0,cpu_time_guest_nice=0,cpu_time_idle=0,cpu_time_iowait=0.01,cpu_time_irq=0,cpu_time_nice=0,cpu_time_soft_irq=0,cpu_time_steal=0,cpu_time_system=1.12,cpu_time_user=0.66,cpu_usage=80.07258072561304,created_at=1600867489000000000i,involuntary_context_switches=2i,major_faults=1i,memory_data=1199177728i,memory_locked=0i,memory_rss=37597184i,memory_stack=135168i,memory_swap=0i,memory_usage=0.45999372005462646,memory_vms=1277517824i,minor_faults=10583i,nice_priority=20i,num_fds=13i,num_threads=10i,read_bytes=1921024i,read_count=31593i,realtime_priority=0i,signals_pending=0i,voluntary_context_switches=561i,write_bytes=4096i,write_count=946i 1600867491000000000

Conclusion - the main metrics (CPU usage, memory usage, I/O, threads) are preserved. This is almost always enough. Instead of the number of used file descriptors, you need to collect the Handle Count:

  [[inputs.win_perf_counters.object]]
    # Processes info
    ObjectName = "Process"
    Counters = ["Handle Count"]
    Instances = ["*"]

M0rdecay commented 4 years ago

@ssoroka, please take a look when you have a little time. Daniel previously led this topic, and I would not want it to be abandoned. The idea is quite interesting.

ssoroka commented 3 years ago

Thanks for your work here, @M0rdecay. One of my concerns is that changing the names of the fields here would be a breaking change, so we'd have to do that carefully.

M0rdecay commented 3 years ago

I think we don't need to change the field names - we can try to implement different methods for collecting the same metrics depending on the platform.

It seems like the right approach is not to change the field names in [[inputs.win_perf_counters]], since they are directly inherited from the counter names, but to maintain compatibility in the other input plugins.

When I find some time, I'll look through the input plugins; where gopsutil is used, it will be possible to open an issue against the library.