prometheus-community / windows_exporter

Prometheus exporter for Windows machines
MIT License

Use PDH native functions as alternative to Registry Calls #1350

Open jkroepke opened 11 months ago

jkroepke commented 11 months ago

Current Progress:


Reading the documentation about PerfCounter

Microsoft strongly recommends fetching performance data via the PDH functions instead of reading the registry directly. The PDH functions also appear to perform better.

We have issues like #724, and the exporter does not work well under load. The Zabbix exporter mentioned there uses PDH functions, too, and Datadog and Telegraf use the PDH library as well.

Ref:

We should look into it, since the Registry supports V1 PerfCounter only, while the PDH functions support both V1 and V2.

Since Telegraf uses the same OSS license, we should consider using the Telegraf libraries as a base instead of starting from scratch. The OTEL collector does the same.


breed808 commented 11 months ago

I'm very much in favour of moving to PDH, though I'm not currently in a position to implement and test this myself. Are you comfortable implementing this, if you have the time?

jkroepke commented 11 months ago

I played with the PDH functions today. However, I got different values back. It seems that the exporter's perflib code mutates the values (e.g. from 100 ns units to seconds).
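To illustrate the kind of scaling described above, here is a minimal sketch that converts a raw timer value counted in 100 ns ticks into seconds. The function name and constant are hypothetical and not taken from the exporter's code; the sketch only assumes that the raw counter counts 100 ns intervals.

```go
package main

import "fmt"

// ticksPerSecond is the number of 100 ns intervals in one second.
const ticksPerSecond = 1e7

// ticksToSeconds converts a raw timer value counted in 100 ns ticks
// into seconds. The name is hypothetical, chosen for this sketch only.
func ticksToSeconds(ticks float64) float64 {
	return ticks / ticksPerSecond
}

func main() {
	// An illustrative raw value of the magnitude a CPU time counter reports.
	raw := 3.47984140625e+12
	fmt.Printf("%.6f\n", ticksToSeconds(raw)) // prints 347984.140625
}
```

A value read through PDH without such scaling would differ from the registry-based metric by a factor of 10^7, which would be consistent with the mismatch reported here.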

ref: https://github.com/prometheus-community/windows_exporter/blob/470f5d58522fda17a9045c517f99654f29d55de5/pkg/perflib/unmarshal.go#L87-L94

Additionally, I have no idea what the "secondValue" is: https://github.com/prometheus-community/windows_exporter/blob/470f5d58522fda17a9045c517f99654f29d55de5/pkg/perflib/perflib.go#L348

I can't find anything in the Microsoft documentation, and I have no idea what the source of windows_cpu_processor_rtc_total is.

github-actions[bot] commented 8 months ago

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

jkroepke commented 6 months ago

@breed808 I have been thinking about the future use of PDH in place of the registry-based collectors.

Since we can't guarantee identical metric values between the registry-based collectors and PDH, I had the idea to offer both sources at once.

What do you think would be the best possible approach here?


Everyone else in the community is invited to provide feedback here.

DiniFarb commented 2 months ago

I like the global switch option - something like --config.usePDH=true which defaults to false. And if the switch is enabled the same metric-names and collectors are used but of course with PDH native functions instead of the reg calls.
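A minimal sketch of how such a global switch could be wired, assuming a hypothetical `CounterSource` interface with one implementation per backend. It uses the standard library `flag` package for brevity, which is not necessarily what the exporter uses; only the flag name `config.usePDH` comes from the suggestion above, everything else is invented for illustration.

```go
package main

import (
	"flag"
	"fmt"
)

// CounterSource is a hypothetical abstraction over the two backends.
// Collectors would depend on this interface only, so metric names stay the same.
type CounterSource interface {
	Name() string
}

// registrySource stands in for the existing registry-based implementation.
type registrySource struct{}

func (registrySource) Name() string { return "registry" }

// pdhSource stands in for an implementation based on PDH native functions.
type pdhSource struct{}

func (pdhSource) Name() string { return "pdh" }

// newCounterSource picks the backend once, at startup.
func newCounterSource(usePDH bool) CounterSource {
	if usePDH {
		return pdhSource{}
	}
	return registrySource{}
}

func main() {
	// Defaults to false, as proposed in the comment above.
	usePDH := flag.Bool("config.usePDH", false, "read performance counters via PDH instead of the registry")
	flag.Parse()

	fmt.Println("counter source:", newCounterSource(*usePDH).Name())
}
```

Running the program with `--config.usePDH=true` selects the PDH stand-in; without the flag, the registry stand-in is used.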

jkroepke commented 2 months ago

At the moment, I do not really have an idea what the "best" way of implementing the PDH native functions is.

Currently, the exporter invokes perfdata once and each collector can consume data from it. From a code point of view, this is quite complex.

Alternatively, each collector could invoke PDH calls on its own, but again, I have no idea whether there are any downsides if the exporter holds multiple handles to the Windows API.

In the end, invoking the PDH calls is really complex, and it looks like there is no Go library that makes everything developer-friendly. The code in Telegraf looks very complex (https://github.com/influxdata/telegraf/blob/master/plugins/inputs/win_perf_counters/win_perf_counters.go), and other implementations still need to be evaluated, e.g. https://github.com/elastic/beats/blob/main/metricbeat/helper/windows/pdh/pdh_windows.go

Overall, it's a really, really time-consuming task. It's hard to find a starting point here.

DiniFarb commented 2 months ago

Yes I see, this is far from easy! You have already done a lot of work in the PR https://github.com/prometheus-community/windows_exporter/pull/1459 and it looks very good :)

I have tried it out and ran into an issue (German system, hehe).

It seems that Windows creates the object names in the system's language.

panic: "windows_perfdata_prozessorinformationen_c3-übergänge_s" is not a valid metric name

goroutine 58 [running]:
github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
        C:/Users/user01/go/pkg/mod/github.com/prometheus/client_golang@v1.19.1/prometheus/value.go:129
github.com/prometheus-community/windows_exporter/pkg/collector/perfdata.(*collector).collect(0xc0000ac400, 0xc0005903c0)
        C:/Users/user01/source/repos/github/jk/windows_exporter/pkg/collector/perfdata/perfdata.go:199 +0x385
github.com/prometheus-community/windows_exporter/pkg/collector/perfdata.(*collector).Collect(0xc0000ac400, 0x0?, 0x0?)
        C:/Users/user01/source/repos/github/jk/windows_exporter/pkg/collector/perfdata/perfdata.go:178 +0x1f
github.com/prometheus-community/windows_exporter/pkg/collector.(*Prometheus).execute(0xc0000942c0, {0xdf59e3, 0x8}, {0xf29e20, 0xc0000ac400}, 0xc000192238, 0xc0005903c0)
        C:/Users/user01/source/repos/github/jk/windows_exporter/pkg/collector/prometheus.go:176 +0x8f
github.com/prometheus-community/windows_exporter/pkg/collector.(*Prometheus).Collect.func2({0xdf59e3, 0x8}, {0xf29e20?, 0xc0000ac400?})
        C:/Users/user01/source/repos/github/jk/windows_exporter/pkg/collector/prometheus.go:117 +0xa5
created by github.com/prometheus-community/windows_exporter/pkg/collector.(*Prometheus).Collect in goroutine 56
        C:/Users/user01/source/repos/github/jk/windows_exporter/pkg/collector/prometheus.go:115 +0x470
exit status 2

jkroepke commented 2 months ago

I guess the hyphen is the issue here; Prometheus has UTF-8 support. But good catch. I may have to look into keeping the data non-localized.
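For reference, the classic Prometheus metric-name pattern is `[a-zA-Z_:][a-zA-Z0-9_:]*`, so under that pattern both the hyphen and the umlauts in the name above are rejected. The following is a minimal sketch of a sanitizer that maps every other character to an underscore; the function name and the replacement strategy are assumptions for illustration, not the approach taken in the PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// validMetricName is the classic Prometheus metric-name pattern.
var validMetricName = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

// invalidChars matches every character this sketch does not keep.
var invalidChars = regexp.MustCompile(`[^a-z0-9_]`)

// sanitize lowercases the name and replaces each disallowed character
// (one rune at a time) with an underscore. Hypothetical helper.
func sanitize(name string) string {
	return invalidChars.ReplaceAllString(strings.ToLower(name), "_")
}

func main() {
	raw := "windows_perfdata_prozessorinformationen_c3-übergänge_s"
	clean := sanitize(raw)

	fmt.Println(validMetricName.MatchString(raw))   // false
	fmt.Println(validMetricName.MatchString(clean)) // true
	fmt.Println(clean)                              // windows_perfdata_prozessorinformationen_c3__berg_nge_s
}
```

Sanitizing only avoids the panic. The resulting names would still differ between system languages, which is why looking up counters by their non-localized names remains the more robust direction.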

jkroepke commented 2 months ago

@DiniFarb could you please try out the latest version of #1459? It's available here: https://github.com/prometheus-community/windows_exporter/actions/runs/10658016872/artifacts/1880022574

The functionality and options have been reduced. Instead of using the implementation from Telegraf, I built my own.

While it supports less, it is a bit easier to debug in case of issues. For example, wildcards in counters have been removed.

The counter values have been compared with the values from the other collectors, and they are equal. I would like to hear your feedback.

DiniFarb commented 2 months ago

I did a quick smoke test on a German Windows 11 system with:

PS C:\Users\*******\windows_exporter_binaries> .\windows_exporter-0.28.1-4-gdfc8d37-amd64.exe --log.level=debug --collectors.enabled="perfdata" --collector.perfdata.objects='[{"object":"Processor Information","instances":["*"],"counters": {"% Processor Time": {}}},{"object":"Memory","counters": {"Cache Faults/sec": {"type": "counter"}}}]'
ts=2024-09-02T07:47:21.713Z caller=exporter.go:147 level=debug msg="Logging has Started"
ts=2024-09-02T07:47:21.728Z caller=perfdata.go:97 level=warn msg="The perfdata collector is in an experimental state! The configuration may change in future. Please report any issues."
ts=2024-09-02T07:47:22.758Z caller=exporter.go:216 level=info msg="Running as *********"
ts=2024-09-02T07:47:22.759Z caller=exporter.go:223 level=info msg="Enabled collectors: perfdata"
ts=2024-09-02T07:47:22.759Z caller=exporter.go:258 level=info msg="Starting windows_exporter" version="(version=0.28.1-4-gdfc8d37, branch=HEAD, revision=dfc8d37dae0311bd2e2de503ed5c9efdd13069c4)"
ts=2024-09-02T07:47:22.759Z caller=exporter.go:259 level=info msg="Build context" build_context="(go=go1.22.6, platform=windows/amd64, user=runneradmin@fv-az1390-362, date=20240901-23:04:09, tags=unknown)"
ts=2024-09-02T07:47:22.759Z caller=exporter.go:260 level=debug msg="Go MAXPROCS" procs=16
ts=2024-09-02T07:47:22.759Z caller=tls_config.go:313 level=info msg="Listening on" address=[::]:9182
ts=2024-09-02T07:47:22.759Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=[::]:9182
ts=2024-09-02T07:47:47.533Z caller=prometheus.go:191 level=debug msg="collector perfdata succeeded after 0.000000s."
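The JSON passed to `--collector.perfdata.objects` in the command above has a simple shape: a list of objects, each with optional instances and a map of counters. As a minimal sketch, it could be decoded in Go as follows. The struct and function names are hypothetical and only mirror the keys visible in the command; the collector's real types may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Object mirrors one entry of the JSON array (hypothetical type).
type Object struct {
	Object    string             `json:"object"`
	Instances []string           `json:"instances"`
	Counters  map[string]Counter `json:"counters"`
}

// Counter holds per-counter options; only "type" appears in the command above.
type Counter struct {
	Type string `json:"type"`
}

// parseObjects decodes the flag value into a slice of Object.
func parseObjects(raw string) ([]Object, error) {
	var objects []Object
	if err := json.Unmarshal([]byte(raw), &objects); err != nil {
		return nil, fmt.Errorf("invalid perfdata objects: %w", err)
	}
	return objects, nil
}

func main() {
	raw := `[{"object":"Processor Information","instances":["*"],"counters": {"% Processor Time": {}}},{"object":"Memory","counters": {"Cache Faults/sec": {"type": "counter"}}}]`

	objects, err := parseObjects(raw)
	if err != nil {
		panic(err)
	}
	for _, o := range objects {
		for name, c := range o.Counters {
			fmt.Printf("object=%q instances=%v counter=%q type=%q\n", o.Object, o.Instances, name, c.Type)
		}
	}
}
```

An empty counter body such as `{}` leaves `Type` as the empty string, so a real implementation would need to decide on a default metric type for that case.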
/metrics result

```
# HELP go_gc_duration_seconds A summary of the wall-time pause (stop-the-world) duration in garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_gc_gogc_percent Heap size target percentage configured by the user, otherwise 100. This value is set by the GOGC environment variable, and the runtime/debug.SetGCPercent function. Sourced from /gc/gogc:percent
# TYPE go_gc_gogc_percent gauge
go_gc_gogc_percent 100
# HELP go_gc_gomemlimit_bytes Go runtime memory limit configured by the user, otherwise math.MaxInt64. This value is set by the GOMEMLIMIT environment variable, and the runtime/debug.SetMemoryLimit function. Sourced from /gc/gomemlimit:bytes
# TYPE go_gc_gomemlimit_bytes gauge
go_gc_gomemlimit_bytes 9.223372036854776e+18
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 12
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.22.6"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated in heap and currently in use. Equals to /memory/classes/heap/objects:bytes.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.910128e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated in heap until now, even if released already. Equals to /gc/heap/allocs:bytes.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 1.910128e+06
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table. Equals to /memory/classes/profiling/buckets:bytes.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.453271e+06
# HELP go_memstats_frees_total Total number of heap objects frees. Equals to /gc/heap/frees:objects + /gc/heap/tiny/allocs:objects.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 0
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata. Equals to /memory/classes/metadata/other:bytes.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 1.49412e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and currently in use, same as go_memstats_alloc_bytes. Equals to /memory/classes/heap/objects:bytes.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.910128e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used. Equals to /memory/classes/heap/released:bytes + /memory/classes/heap/free:bytes.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 3.121152e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use. Equals to /memory/classes/heap/objects:bytes + /memory/classes/heap/unused:bytes
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 4.120576e+06
# HELP go_memstats_heap_objects Number of currently allocated objects. Equals to /gc/heap/objects:objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 3101
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS. Equals to /memory/classes/heap/released:bytes.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 2.809856e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system. Equals to /memory/classes/heap/objects:bytes + /memory/classes/heap/unused:bytes + /memory/classes/heap/released:bytes + /memory/classes/heap/free:bytes.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 7.241728e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_mallocs_total Total number of heap objects allocated, both live and gc-ed. Semantically a counter version for go_memstats_heap_objects gauge. Equals to /gc/heap/allocs:objects + /gc/heap/tiny/allocs:objects.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 3101
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures. Equals to /memory/classes/metadata/mcache/inuse:bytes.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 18688
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system. Equals to /memory/classes/metadata/mcache/inuse:bytes + /memory/classes/metadata/mcache/free:bytes.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 32704
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures. Equals to /memory/classes/metadata/mspan/inuse:bytes.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 75680
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system. Equals to /memory/classes/metadata/mspan/inuse:bytes + /memory/classes/metadata/mspan/free:bytes.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 81600
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place. Equals to /gc/heap/goal:bytes.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.194304e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations. Equals to /memory/classes/other:bytes.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.118521e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes obtained from system for stack allocator in non-CGO environments. Equals to /memory/classes/heap/stacks:bytes.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.114112e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator. Equals to /memory/classes/heap/stacks:bytes + /memory/classes/os-stacks:bytes.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.114112e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system. Equals to /memory/classes/total:byte.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 1.2536056e+07
# HELP go_sched_gomaxprocs_threads The current runtime.GOMAXPROCS setting, or the number of operating system threads that can execute user-level Go code simultaneously. Sourced from /sched/gomaxprocs:threads
# TYPE go_sched_gomaxprocs_threads gauge
go_sched_gomaxprocs_threads 16
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 10
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.234375
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.6777216e+07
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 388
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.9421568e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.725263485e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 2.8868608e+07
# HELP windows_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, goversion from which windows_exporter was built, and the goos and goarch for the build.
# TYPE windows_exporter_build_info gauge
windows_exporter_build_info{branch="HEAD",goarch="amd64",goos="windows",goversion="go1.22.6",revision="dfc8d37dae0311bd2e2de503ed5c9efdd13069c4",tags="unknown",version="0.28.1-4-gdfc8d37"} 1
# HELP windows_exporter_collector_duration_seconds windows_exporter: Duration of a collection.
# TYPE windows_exporter_collector_duration_seconds gauge
windows_exporter_collector_duration_seconds{collector="perfdata"} 0
# HELP windows_exporter_collector_success windows_exporter: Whether the collector was successful.
# TYPE windows_exporter_collector_success gauge
windows_exporter_collector_success{collector="perfdata"} 1
# HELP windows_exporter_collector_timeout windows_exporter: Whether the collector timed out.
# TYPE windows_exporter_collector_timeout gauge
windows_exporter_collector_timeout{collector="perfdata"} 0
# HELP windows_exporter_perflib_snapshot_duration_seconds Duration of perflib snapshot capture
# TYPE windows_exporter_perflib_snapshot_duration_seconds gauge
windows_exporter_perflib_snapshot_duration_seconds 18.1017429
# HELP windows_perfdata_memory_cache_faults_sec Performance data for \\Memory\\Cache Faults/sec
# TYPE windows_perfdata_memory_cache_faults_sec counter
windows_perfdata_memory_cache_faults_sec 1.16080088e+08
# HELP windows_perfdata_processor_information__processor_time Performance data for \\Processor Information\\% Processor Time
# TYPE windows_perfdata_processor_information__processor_time counter
windows_perfdata_processor_information__processor_time{instance="0,0"} 3.47984140625e+12
windows_perfdata_processor_information__processor_time{instance="0,1"} 3.51457234375e+12
windows_perfdata_processor_information__processor_time{instance="0,10"} 3.50648578125e+12
windows_perfdata_processor_information__processor_time{instance="0,11"} 3.5071621875e+12
windows_perfdata_processor_information__processor_time{instance="0,12"} 3.5095271875e+12
windows_perfdata_processor_information__processor_time{instance="0,13"} 3.5146825e+12
windows_perfdata_processor_information__processor_time{instance="0,14"} 3.5042853125e+12
windows_perfdata_processor_information__processor_time{instance="0,15"} 3.51249265625e+12
windows_perfdata_processor_information__processor_time{instance="0,2"} 3.49953671875e+12
windows_perfdata_processor_information__processor_time{instance="0,3"} 3.510353125e+12
windows_perfdata_processor_information__processor_time{instance="0,4"} 3.50246890625e+12
windows_perfdata_processor_information__processor_time{instance="0,5"} 3.5092778125e+12
windows_perfdata_processor_information__processor_time{instance="0,6"} 3.5012315625e+12
windows_perfdata_processor_information__processor_time{instance="0,7"} 3.50659328125e+12
windows_perfdata_processor_information__processor_time{instance="0,8"} 3.50766140625e+12
windows_perfdata_processor_information__processor_time{instance="0,9"} 3.51288828125e+12
```

Looks good 👍 The only thing was that the query took a bit long.

The perfdata collector itself was OK:

windows_exporter_collector_duration_seconds{collector="perfdata"} 0

but the perflib snapshot duration was high, even though only the perfdata collector was active:

windows_exporter_perflib_snapshot_duration_seconds 18.1017429

As soon as I added another "classic" collector, like --collectors.enabled="cpu,perfdata", the responses were as fast as always. I searched a little and saw that if no perflib collector is enabled, this function receives an empty string: https://github.com/jkroepke/windows_exporter/blob/d8f0665bdc3f3c4d6e6119b1d2d7fa78c0931fa3/pkg/collector/collector.go#L209-L216

For testing I changed the function like:

diff --git a/pkg/collector/collector.go b/pkg/collector/collector.go
index e829ed5..90d7d22 100644
--- a/pkg/collector/collector.go
+++ b/pkg/collector/collector.go
@@ -209,6 +209,9 @@ func (c *Collectors) Build(logger log.Logger) error {

 // PrepareScrapeContext creates a ScrapeContext to be used during a single scrape.
 func (c *Collectors) PrepareScrapeContext() (*types.ScrapeContext, error) {
+       if c.perfCounterQuery == "" {
+               return nil, nil
+       }
        objs, err := perflib.GetPerflibSnapshot(c.perfCounterQuery)
        if err != nil {
                return nil, err

and it was as fast as always. This behaviour was of course there already but probably went unnoticed, because why would you run the classic Windows exporter with no collectors? With the new perfdata collector it is different. I think that if only the perfdata collector is active, all perflib functionality should be disabled.

P.S. It is possible that I configured something wrong; I did not have much time to look into it. I can find some more time in the coming days if needed.

jkroepke commented 2 months ago

Normally, I develop this exporter on my Windows 10 machine, and I did not have any issues there. But I remember that other users had the issue as well.

Ref: https://github.com/prometheus-community/windows_exporter/issues/1458

DiniFarb commented 2 months ago

Ah yes, sure, there are non-perflib collectors already. My mistake, what was I thinking. So yes, this is an already existing issue and seems to happen only on Windows 11.

jkroepke commented 2 months ago

I will plan this change for 0.30.

DiniFarb commented 2 months ago

Nice, let me know if I can help in any way :)

jkroepke commented 2 months ago

@DiniFarb The generic perf counter collector will be part of the next release.

After that, current collectors will be switched to the new system.

As for the generic collector: think about your use cases, and then testing, testing, testing.

JDA88 commented 6 days ago

I see that in the v0.30.0-beta.0 release you use an environment variable WINDOWS_EXPORTER_PERF_COUNTERS_ENGINE to enable the feature. Any reason not to use a command line flag like the other features?

jkroepke commented 6 days ago

> I see that in the v0.30.0-beta.0 release you use an environment variable WINDOWS_EXPORTER_PERF_COUNTERS_ENGINE to enable the feature. Any reason not to use a command line flag like the other features?

Good call, added in #1723
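A minimal sketch of how a command line flag and the existing environment variable could coexist, with the flag taking precedence. Only the variable name WINDOWS_EXPORTER_PERF_COUNTERS_ENGINE comes from this thread; the flag name `perf-counters-engine`, the default value `registry`, and the helper `resolveEngine` are hypothetical and are not taken from #1723.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// envEngine is the environment variable mentioned in this thread.
const envEngine = "WINDOWS_EXPORTER_PERF_COUNTERS_ENGINE"

// resolveEngine applies a simple precedence: an explicit flag value wins,
// then the environment variable, then the fallback. Hypothetical helper.
func resolveEngine(flagValue, envValue, fallback string) string {
	if flagValue != "" {
		return flagValue
	}
	if envValue != "" {
		return envValue
	}
	return fallback
}

func main() {
	// Hypothetical flag name, chosen for this sketch only.
	engine := flag.String("perf-counters-engine", "", "performance counter engine to use")
	flag.Parse()

	// "registry" as the fallback is an assumption, not a documented default.
	fmt.Println(resolveEngine(*engine, os.Getenv(envEngine), "registry"))
}
```

Keeping the environment variable as a fallback would let setups that already rely on it continue to work after a flag is introduced.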