Closed FrankyBoy closed 11 months ago
The first thing we should do is document a set of plugin configurations that would match. Is this something you could help with?
I can see a few issues coming up such as needing a processor to rename fields and float vs integer fields as well as counters vs rates. The transition to compatible metrics would be a little tricky to pull off as well.
Also, consider trying the regular plugin set: mem, cpu, net, etc. These should work on Windows but aren't as well tested.
I can try but I am really new to this whole topic, so idk how correct my results are gonna be ;)
Another option could also be reimplementing inputs.cpu to actually be based on performance counters, as the whole reason that inputs.cpu is discouraged is claimed performance problems from the WMI it uses. Because then you can get rid of even having a separate plugin for the same thing (though obviously you'd still have two implementations). While that might not be any advantage from the development effort side, it definitely is way nicer to use for users (one could probably even roll the same config everywhere then).
> Another option could also be reimplementing inputs.cpu to actually be based on performance counters, as the whole reason that inputs.cpu is discouraged is claimed performance problems from the WMI it uses.
I think this is the approach we should take: identify the current issues with the default enabled plugins on Windows with the goal to switch to using the same set. Almost all of these plugins use the implementation from gopsutil which has improved its Windows support quite a bit over the last few years and I think we may be closer than we think.
Do you think you could take the default Linux config file, run it on Windows, and report back on any major performance or memory problems and any major missing metrics?
Let's try. Telegraf version: 1.15.3. Configuration tested:
[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]
[[inputs.diskio]]
[[inputs.net]]
[[inputs.system]]
[[inputs.processes]]
[[inputs.kernel]]
[[inputs.linux_sysctl_fs]]
[[inputs.swap]]
[[inputs.procstat]]
pattern = ".*"
pid_tag = true
fielddrop = [ "rlimit_*" ]
namepass = [ "procstat" ]
pid_finder = "native" # Only in Windows
[[inputs.procstat]]
pattern = "telegraf.exe" # "telegraf" on Linux
namepass = [ "procstat_lookup" ]
pid_finder = "native" # Only in Windows
[inputs.procstat.tags]
appl = "telegraf"
> cpu,cpu=cpu0,host=host.local,stand_name=stand_windows usage_guest=0,usage_guest_nice=0,usage_idle=93.93939393939394,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=6.0606060606060606,usage_user=0 1600862963000000000
> cpu,cpu=cpu1,host=host.local,stand_name=stand_windows usage_guest=0,usage_guest_nice=0,usage_idle=90.9090909090909,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=6.0606060606060606,usage_user=3.0303030303030303 1600862963000000000
> cpu,cpu=cpu2,host=host.local,stand_name=stand_windows usage_guest=0,usage_guest_nice=0,usage_idle=84.84848484848484,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=15.151515151515152,usage_user=0 1600862963000000000
> cpu,cpu=cpu3,host=host.local,stand_name=stand_windows usage_guest=0,usage_guest_nice=0,usage_idle=90.9090909090909,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=9.090909090909092,usage_user=0 1600862963000000000
> cpu,cpu=cpu-total,host=host.local,stand_name=stand_windows usage_guest=0,usage_guest_nice=0,usage_idle=92.1874996565748,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=7.0312503463355815,usage_user=0.781249997089617 1600862963000000000
> cpu,cpu=cpu0,host=host.local,stand_name=stand_linux usage_guest=0,usage_guest_nice=0,usage_idle=60.869565358234034,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=2.1739130396958632,usage_steal=0,usage_system=23.913043431909305,usage_user=13.04347824133864 1600862929000000000
> cpu,cpu=cpu1,host=host.local,stand_name=stand_linux usage_guest=0,usage_guest_nice=0,usage_idle=55.10204090942,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=30.612244999174088,usage_user=14.285714328988202 1600862929000000000
> cpu,cpu=cpu2,host=host.local,stand_name=stand_linux usage_guest=0,usage_guest_nice=0,usage_idle=69.56521732088298,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=19.56521735568104,usage_user=10.869565196897586 1600862929000000000
> cpu,cpu=cpu3,host=host.local,stand_name=stand_linux usage_guest=0,usage_guest_nice=0,usage_idle=59.18367348490333,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=22.4489796700207,usage_user=18.367346999504452 1600862929000000000
> cpu,cpu=cpu-total,host=host.local,stand_name=stand_linux usage_guest=0,usage_guest_nice=0,usage_idle=60.732984219670264,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0.5235602105110597,usage_steal=0.5235602093682391,usage_system=23.560209471854865,usage_user=14.65968589278591 1600862929000000000
Conclusion - the metrics are identical.
Update: I think CPU load on Windows should be gathered in a slightly different way, because the types of load are different:
[[inputs.win_perf_counters.object]]
# CPU metrics
ObjectName = "Processor"
Instances = ["*"]
Counters = [ "% Idle Time",
"% Interrupt Time",
"% DPC Time",
"% User Time",
"% Privileged Time",
"% Processor Time" ]
Measurement = "win_cpu"
WarnOnMissing = false
IncludeTotal = true
> mem,host=host.local,stand_name=stand_windows available=3073785856i,available_percent=71.71405963907694,total=4286169088i,used=1212383232i,used_percent=28.285940360923064 1600862962000000000
> mem,host=host.local,stand_name=stand_linux active=4997435392i,available=2703466496i,available_percent=33.07635023869159,buffered=0i,cached=2495905792i,commit_limit=5160443904i,committed_as=6686736384i,dirty=172032i,free=620630016i,high_free=0i,high_total=0i,huge_page_size=2097152i,huge_pages_free=0i,huge_pages_total=0i,inactive=1895534592i,low_free=0i,low_total=0i,mapped=54816768i,page_tables=17313792i,shared=24346624i,slab=405049344i,sreclaimable=245194752i,sunreclaim=159854592i,swap_cached=2555904i,swap_free=780681216i,swap_total=1073737728i,total=8173412352i,used=5056876544i,used_percent=61.86983265028349,vmalloc_chunk=35180028882944i,vmalloc_total=35184372087808i,vmalloc_used=68923392i,write_back=0i,write_back_tmp=0i 1600862928000000000
Conclusion - only a minimal set of fields is returned on Windows. At the very least, you would want to see information about the allocated memory pages.
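One possible way to fill that gap on Windows is to collect the `Memory` performance object alongside `inputs.mem`. The fragment below is only a sketch: the measurement name `win_mem` and the choice of counters are assumptions, and the resulting field names will not match the Linux `mem` fields.

```toml
[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # Memory counters beyond what inputs.mem reports on Windows
    ObjectName = "Memory"
    # The Memory object has no instances
    Instances = ["------"]
    Counters = [
      "Available Bytes",
      "Committed Bytes",
      "Commit Limit",
      "Cache Bytes",
      "Pool Paged Bytes",
      "Pool Nonpaged Bytes",
      "Pages/sec",
    ]
    # Hypothetical measurement name, chosen for illustration
    Measurement = "win_mem"
```

`Committed Bytes` and `Commit Limit` are the closest counterparts to `committed_as` and `commit_limit` in the Linux output, though the semantics are not guaranteed to be identical.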
> disk,device=C:,fstype=NTFS,host=host.local,mode=unknown,path=\C:,stand_name=stand_windows free=26673659904i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=53317988352i,used=26644328448i,used_percent=49.972493845973375 1600862962000000000
> disk,device=dm-0,fstype=xfs,host=host.local,mode=rw,path=/,stand_name=stand_linux free=4578582528i,inodes_free=4109389i,inodes_total=4192256i,inodes_used=82867i,total=8575254528i,used=3996672000i,used_percent=46.60703640865737 1600862928000000000
Conclusion - the metrics are identical.
No metrics on Windows.
> diskio,host=host.local,name=dm-0,stand_name=stand_linux io_time=2297117i,iops_in_progress=0i,merged_reads=0i,merged_writes=0i,read_bytes=3565867520i,read_time=1080292i,reads=143561i,weighted_io_time=6325809i,write_bytes=26874711552i,write_time=5244870i,writes=3556438i 1600862928000000000
Conclusion - this plugin does not work on Windows. It can be replaced by something like this:
[[inputs.win_perf_counters.object]]
# Disks info physical
ObjectName = "PhysicalDisk"
Instances = ["*"]
Counters = ["% Idle Time", "% Disk Time","% Disk Read Time", "% Disk Write Time", "% User Time", "Avg. Disk Queue Length", "Current Disk Queue Length",
"Avg. Disk sec/Read", "Avg. Disk sec/Write", "% Free Space", "Free Megabytes", "Disk Reads/sec", "Disk Writes/sec", "Disk Transfers/sec", "Avg. Disk Bytes/Transfer"]
Measurement = "win_disk_physical"
> net,host=host.local,interface=Local\ Area\ Connection,stand_name=stand_windows bytes_recv=4076673423i,bytes_sent=3643414976i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=25145684i,packets_sent=22805918i 1600862962000000000
> net,host=host.local,interface=eth0,stand_name=stand_linux bytes_recv=2111338081142i,bytes_sent=3239348395418i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=15976633797i,packets_sent=16217086767i 1600862928000000000
> net,host=host.local,interface=all,stand_name=stand_linux icmp_inaddrmaskreps=0i,icmp_inaddrmasks=0i,icmp_incsumerrors=0i,icmp_indestunreachs=3715041i,icmp_inechoreps=0i,icmp_inechos=10i,icmp_inerrors=160i,icmp_inmsgs=3715051i,icmp_inparmprobs=0i,icmp_inredirects=0i,icmp_insrcquenchs=0i,icmp_intimeexcds=0i,icmp_intimestampreps=0i,icmp_intimestamps=0i,icmp_outaddrmaskreps=0i,icmp_outaddrmasks=0i,icmp_outdestunreachs=3714636i,icmp_outechoreps=10i,icmp_outechos=0i,icmp_outerrors=0i,icmp_outmsgs=3714646i,icmp_outparmprobs=0i,icmp_outredirects=0i,icmp_outsrcquenchs=0i,icmp_outtimeexcds=0i,icmp_outtimestampreps=0i,icmp_outtimestamps=0i,icmpmsg_intype3=3715041i,icmpmsg_intype8=10i,icmpmsg_outtype0=10i,icmpmsg_outtype3=3714636i,ip_defaultttl=64i,ip_forwarding=2i,ip_forwdatagrams=0i,ip_fragcreates=0i,ip_fragfails=0i,ip_fragoks=0i,ip_inaddrerrors=0i,ip_indelivers=15981074435i,ip_indiscards=0i,ip_inhdrerrors=0i,ip_inreceives=15982811257i,ip_inunknownprotos=0i,ip_outdiscards=1851421i,ip_outnoroutes=0i,ip_outrequests=16224615579i,ip_reasmfails=0i,ip_reasmoks=0i,ip_reasmreqds=0i,ip_reasmtimeout=0i,tcp_activeopens=4682488i,tcp_attemptfails=3575308i,tcp_currestab=75i,tcp_estabresets=3074i,tcp_incsumerrors=2i,tcp_inerrs=3646i,tcp_insegs=15976664856i,tcp_maxconn=-1i,tcp_outrsts=102549i,tcp_outsegs=16243509336i,tcp_passiveopens=31709718i,tcp_retranssegs=54709i,tcp_rtoalgorithm=1i,tcp_rtomax=120000i,tcp_rtomin=200i,udp_incsumerrors=0i,udp_indatagrams=35989i,udp_inerrors=0i,udp_noports=3714307i,udp_outdatagrams=3750296i,udp_rcvbuferrors=0i,udp_sndbuferrors=0i,udplite_incsumerrors=0i,udplite_indatagrams=0i,udplite_inerrors=0i,udplite_noports=0i,udplite_outdatagrams=0i,udplite_rcvbuferrors=0i,udplite_sndbuferrors=0i 1600862928000000000
Conclusion - the basic per-interface indicators are identical. Windows does not return the extended set of protocol metrics (ICMP, IP, TCP, UDP). Unfortunately, I don't know how it can be gathered on Windows. And is it necessary at all?
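If the protocol-level numbers are needed, some of them appear to be available through the `TCPv4` and `UDPv4` performance objects. This is a sketch under that assumption; the measurement names are made up for illustration, and several of these counters are per-second rates, whereas the Linux fields are cumulative counters, so they are not directly comparable.

```toml
[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # Rough counterpart of the tcp_* fields reported on Linux
    ObjectName = "TCPv4"
    Instances = ["------"]
    Counters = [
      "Connections Established",
      "Connections Active",
      "Connections Passive",
      "Connection Failures",
      "Connections Reset",
      "Segments Retransmitted/sec",
    ]
    # Hypothetical measurement name
    Measurement = "win_net_tcp"

  [[inputs.win_perf_counters.object]]
    # Rough counterpart of the udp_* fields reported on Linux
    ObjectName = "UDPv4"
    Instances = ["------"]
    Counters = [
      "Datagrams Received/sec",
      "Datagrams Sent/sec",
      "Datagrams No Port/sec",
      "Datagrams Received Errors",
    ]
    # Hypothetical measurement name
    Measurement = "win_net_udp"
```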
> system,host=host.local,stand_name=stand_windows load1=0,load15=0,load5=0,n_cpus=4i,n_users=0i 1600862962000000000
> system,host=host.local,stand_name=stand_windows uptime=13393443i 1600862962000000000
> system,host=host.local,stand_name=stand_windows uptime_format="155 days, 0:24" 1600862962000000000
> system,host=host.local,stand_name=stand_linux load1=0.15,load15=0.16,load5=0.09,n_cpus=4i,n_users=1i 1600862928000000000
> system,host=host.local,stand_name=stand_linux uptime=17623474i 1600862928000000000
> system,host=host.local,stand_name=stand_linux uptime_format="203 days, 23:24" 1600862928000000000
Conclusions:
Since Windows has no concept of load average, these fields are always zero. I don't think load average applies to Windows at all; there we look at Processor Queue Length instead.
The number of users for Windows is incorrect - it looks like it is always zero, or RDP sessions are not counted.
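For the user count, one option might be the `Terminal Services` performance object, which reports session counts. This is an untested sketch: it assumes that object is present on the host, and the measurement name `win_sessions` is hypothetical. A session count is also not exactly the same thing as `n_users`.

```toml
[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # Session counts, as a possible stand-in for n_users
    ObjectName = "Terminal Services"
    Instances = ["------"]
    Counters = ["Active Sessions", "Inactive Sessions", "Total Sessions"]
    # Hypothetical measurement name
    Measurement = "win_sessions"
    # Stay quiet on hosts where the object is not available
    WarnOnMissing = false
```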
No metrics on Windows.
> processes,host=host.local,stand_name=stand_linux blocked=0i,dead=0i,idle=0i,paging=0i,running=1i,sleeping=154i,stopped=0i,total=155i,total_threads=685i,unknown=0i,zombies=0i 1600866201000000000
Conclusion - this plugin does not work on Windows. Unfortunately, I don't know how it can be gathered.
No metrics on Windows.
> kernel,host=host.local,stand_name=stand_linux boot_time=1583239454i,context_switches=86505042818i,entropy_avail=3452i,interrupts=27772527413i,processes_forked=2455215206i 1600862928000000000
Conclusion - this plugin does not work on Windows. It can be replaced by something like this:
[[inputs.win_perf_counters.object]]
# System counters
ObjectName = "System"
Counters = ["Context Switches/sec", "Processor Queue Length", "Processes"]
Instances = ["*"]
Measurement = "win_sys"
[[inputs.win_perf_counters.object]]
# System counters
ObjectName = "Processor"
Counters = ["Interrupts/sec"]
Instances = ["_Total"]
Measurement = "win_sys"
> swap,host=host.local,stand_name=stand_linux free=780681216i,total=1073737728i,used=293056512i,used_percent=27.293118641352237 1600862928000000000
> swap,host=host.local,stand_name=stand_linux in=976695296i,out=2823151616i 1600862928000000000
> swap,host=host.local,stand_name=stand_windows free=3537133568i,total=5024366592i,used=1487233024i,used_percent=29.600408265751003 1600862962000000000
> swap,host=host.local,stand_name=stand_windows in=0i,out=0i 1600862962000000000
Conclusion - the basic indicators are identical. This is almost always enough.
Works well everywhere.
> procstat,host=host.local,pattern=.*,pid=3896,process_name=telegraf.exe,stand_name=stand_windows,user=host.local\Administrator cpu_time_guest=0,cpu_time_guest_nice=0,cpu_time_idle=0,cpu_time_iowait=0,cpu_time_irq=0,cpu_time_nice=0,cpu_time_soft_irq=0,cpu_time_steal=0,cpu_time_system=0.265625,cpu_time_user=0.046875,cpu_usage=10.32258320249803,created_at=1600862961511000000i,memory_data=0i,memory_locked=0i,memory_rss=35663872i,memory_stack=0i,memory_swap=0i,memory_usage=0.8320687413215637,memory_vms=31064064i,num_threads=9i,read_bytes=1456i,read_count=8i,write_bytes=24343i,write_count=54i 1600862963000000000
> procstat,host=host.local,pattern=.*,pid=97608,process_name=telegraf,stand_name=stand_linux,user=telegraf child_major_faults=0i,child_minor_faults=3049i,cpu_time=1i,cpu_time_guest=0,cpu_time_guest_nice=0,cpu_time_idle=0,cpu_time_iowait=0.01,cpu_time_irq=0,cpu_time_nice=0,cpu_time_soft_irq=0,cpu_time_steal=0,cpu_time_system=1.12,cpu_time_user=0.66,cpu_usage=80.07258072561304,created_at=1600867489000000000i,involuntary_context_switches=2i,major_faults=1i,memory_data=1199177728i,memory_locked=0i,memory_rss=37597184i,memory_stack=135168i,memory_swap=0i,memory_usage=0.45999372005462646,memory_vms=1277517824i,minor_faults=10583i,nice_priority=20i,num_fds=13i,num_threads=10i,read_bytes=1921024i,read_count=31593i,realtime_priority=0i,signals_pending=0i,voluntary_context_switches=561i,write_bytes=4096i,write_count=946i 1600867491000000000
Conclusion - the main metrics (cpu usage, memory usage, i/o, threads) are present on both platforms. This is almost always enough.
Instead of the number of used descriptors, you need to collect Handle Count:
[[inputs.win_perf_counters.object]]
# Processes info
ObjectName = "Process"
Counters = ["Handle Count"]
Instances = ["*"]
@ssoroka, please take a look when you have a little time. Previously, this topic was led by Daniel, and I would not want it to be abandoned. The idea is quite interesting.
Thanks for your work here, @M0rdecay. One of my concerns is that changing the names of the fields here would be a breaking change, so we'd have to do that carefully.
I think we don't need to change the field names - we can try to implement different methods for collecting the same metrics depending on the platform.
It seems like the right approach is not to change the field names in [[inputs.win_perf_counters]], since they are directly inherited from the counter names, but to maintain compatibility in other input plugins.
A little later I'll find some time to look at the input plugins. Where gopsutil is used, it will be possible to open an issue against the library.
Proposal:
Make [[inputs.win_perf_counters]] generate the same output as [[inputs.cpu]], [[inputs.mem]], etc.
Current behavior:
[[inputs.win_perf_counters]] generates (for example) "win_cpu" with fields like "Percent_Idle_Time", while [[inputs.cpu]] generates "cpu" with fields like "usage_idle". For all intents and purposes these fields are identical though.
Desired behavior:
Eliminate these pointless differences so that both plugins emit the same measurement and field names.
Use case:
This discrepancy makes it impossible to share Windows and Linux usage numbers on one dashboard, meaning I can never monitor my whole system at a glance but have to switch between two views and duplicate work for no good reason.
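Until the plugins are aligned, a stopgap for the dashboard case might be a rename processor that maps the `win_cpu` names onto the `cpu` ones. The sketch below assumes the `win_cpu` measurement and `Percent_*` field names described above; the pairing of Windows counters with Linux fields is my own guess at the closest equivalents, not an official mapping.

```toml
[[processors.rename]]
  # Only touch metrics coming from the win_cpu measurement
  namepass = ["win_cpu"]

  [[processors.rename.replace]]
    measurement = "win_cpu"
    dest = "cpu"

  # Assumed counterparts; verify against your own data before relying on them
  [[processors.rename.replace]]
    field = "Percent_Idle_Time"
    dest = "usage_idle"

  [[processors.rename.replace]]
    field = "Percent_User_Time"
    dest = "usage_user"

  [[processors.rename.replace]]
    field = "Percent_Privileged_Time"
    dest = "usage_system"

  [[processors.rename.replace]]
    field = "Percent_Interrupt_Time"
    dest = "usage_irq"

  # win_perf_counters tags each series with "instance" rather than "cpu"
  [[processors.rename.replace]]
    tag = "instance"
    dest = "cpu"
```

This only renames keys. The tag values still differ (for example `_Total` on Windows versus `cpu-total` from inputs.cpu), and fields with no Windows counterpart, such as `usage_iowait`, would simply be absent, so dashboards would still need to tolerate gaps.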