prometheus-community / windows_exporter

Prometheus exporter for Windows machines
MIT License

Add HyperV Dynamic Memory Balancer counter to HyperV integration #1709

Open cflanderscpc opened 2 days ago

cflanderscpc commented 2 days ago

Problem Statement

The Hyper-V integration in windows_exporter is missing a key performance counter for memory monitoring, per

https://learn.microsoft.com/en-us/windows-server/administration/performance-tuning/role/hyper-v-server/detecting-virtualized-environment-bottlenecks#memory-bottlenecks

Proposed Solution

Add the performance counter "HyperV DynamicMemoryBalancer\Available Memory" which is available from CIM/MI at Win32_RawPerfData_BalancerStats_HyperVDynamicMemoryBalancer.AvailableMemory

Additional information

This counter is used in conjunction with the Memory\Available MBytes counter to confirm memory bottlenecks on Hyper-V host systems.

Acceptance Criteria

jkroepke commented 1 day ago

Hi @cflanderscpc

As part of general improvements, I would like to reduce the dependency on WMI.

All Win32_RawPerfData data are available via native performance counters as well.

Currently, I do not have a Hyper-V environment.

Could you run

typeperf -qx > counters.txt

on the command line and provide the counters.txt?

cflanderscpc commented 1 day ago

counters.txt As requested :)

This is from a Windows Server 2022 Datacenter - Core, Hyper-V host that is currently in a 2 node Failover Cluster.

jkroepke commented 1 day ago

I found 4 counters:

\Hyper-V Dynamic Memory Balancer(System Balancer)\Available Memory For Balancing
\Hyper-V Dynamic Memory Balancer(System Balancer)\System Current Pressure
\Hyper-V Dynamic Memory Balancer(System Balancer)\Available Memory
\Hyper-V Dynamic Memory Balancer(System Balancer)\Average Pressure

Which of them make sense to expose? All of them?
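To make the counter listing above easier to work with, here is a minimal Python sketch that splits counter paths of the form `\Object(Instance)\Counter` into their parts and collects the counters that belong to one object. The function names `parse_counter_path` and `counters_for_object` are hypothetical helpers for illustration, not part of windows_exporter, and the regular expression only covers local paths as they appear in this thread (no `\\machine\` prefix).

```python
import re

# Matches local counter paths as they appear in this thread, e.g.
#   \Object(Instance)\Counter   or   \Object\Counter
# Assumption: no leading \\machine prefix and no backslash in the counter name.
COUNTER_PATH = re.compile(
    r"^\\(?P<object>[^\\(]+)(?:\((?P<instance>.*)\))?\\(?P<counter>[^\\]+)$"
)


def parse_counter_path(line):
    """Return (object, instance, counter), or None if the line is not a counter path."""
    match = COUNTER_PATH.match(line.strip())
    if match is None:
        return None
    return match.group("object"), match.group("instance"), match.group("counter")


def counters_for_object(lines, object_name):
    """Collect the counter names of all paths that belong to object_name."""
    found = []
    for line in lines:
        parsed = parse_counter_path(line)
        if parsed is not None and parsed[0] == object_name:
            found.append(parsed[2])
    return found


if __name__ == "__main__":
    sample = [
        r"\Hyper-V Dynamic Memory Balancer(System Balancer)\Available Memory For Balancing",
        r"\Hyper-V Dynamic Memory Balancer(System Balancer)\System Current Pressure",
        r"\Hyper-V Dynamic Memory Balancer(System Balancer)\Available Memory",
        r"\Hyper-V Dynamic Memory Balancer(System Balancer)\Average Pressure",
    ]
    for name in counters_for_object(sample, "Hyper-V Dynamic Memory Balancer"):
        print(name)
```

The same helper could be pointed at the lines of counters.txt, provided the file is opened with the encoding the shell used when writing it.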

cflanderscpc commented 1 day ago

For completeness' sake you could offer all - never know who's going to want the other counters. From a MSFT guided troubleshooting perspective, they only reference the Available Memory counter for checking host memory saturation.

Personally, I can see use cases for System Current Pressure and Average Pressure for long term forecasting on hardware upgrades/replacements.

jkroepke commented 1 day ago

Some good and some bad news here:

I was able to get them quickly. However, the Hyper-V collector as a whole is not in a good state and does not follow the Prometheus best practices. Some metrics report values in MB, while they should be reported in bytes. Fixing this can take a while and might result in some breaking changes. However, they are necessary, because we plan to release version 1.0 next year and everything needs to be cleaned up before then.

cflanderscpc commented 23 hours ago

Interesting, but not entirely surprising w.r.t. mixed unit sizes. Usually RawPerfData is in bytes and FormattedPerfData is in whatever unit the OS shows to the end user, but as with most things MSFT that isn't always the case.

My org is all in on Hyper-V as our virtualization solution, so as it comes time for testing changes I'm sure we would be more than happy to update our dev environment for testing purposes :)

jkroepke commented 22 hours ago

Usually RawPerfData is in bytes and FormattedPerfData is in

Even the raw data is in MB:

(Get-Counter -Counter "\Hyper-V Dynamic Memory Balancer(*)\Available Memory For Balancing").CounterSamples | Format-List -Property *

Path             : \\vm-jok-dev\hyper-v dynamic memory balancer(system balancer)\available memory for balancing
InstanceName     : system balancer
CookedValue      : 10166
RawValue         : 10166
SecondValue      : 0
MultipleCount    : 1
CounterType      : NumberOfItems32
Timestamp        : 10/31/2024 3:34:57 PM
Timestamp100NSec : 133748624974314944
Status           : 0
DefaultScale     : 0
TimeBase         : 10000000

But I appreciate your offer to test the collector.

Do you have any numbers for how long it takes to scrape the Hyper-V data?
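The unit problem discussed above (a counter that reports MB where Prometheus conventions call for bytes) can be sketched in Python as follows. This is only an illustration, not the collector's actual code: it assumes that the counter's "MB" means 2**20 bytes, and the metric name `hyperv_available_memory_for_balancing_bytes` and the label `balancer` are hypothetical.

```python
# Assumption: the counter's "MB" means 2**20 bytes. If the counter actually
# used 10**6 bytes, the constant below would have to change.
BYTES_PER_MB = 1024 * 1024


def megabytes_to_bytes(raw_value_mb):
    """Convert a counter value reported in MB to bytes (the Prometheus base unit)."""
    return raw_value_mb * BYTES_PER_MB


def exposition_line(metric_name, balancer, raw_value_mb):
    """Format one sample in the Prometheus text exposition format.

    The label value is inserted as-is; escaping of quotes or backslashes is
    left out to keep the sketch short.
    """
    value = megabytes_to_bytes(raw_value_mb)
    return f'{metric_name}{{balancer="{balancer}"}} {value}'


if __name__ == "__main__":
    # 10166 is the RawValue shown in the Get-Counter output above.
    # The metric name is hypothetical.
    print(exposition_line(
        "hyperv_available_memory_for_balancing_bytes",
        "system balancer",
        10166,
    ))
```

With the sample above, 10166 MB becomes 10659823616 bytes under the 2**20 assumption.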

cflanderscpc commented 22 hours ago

I can't say it's something we have specifically paid attention to. Is there somewhere that type of info is stored, or is it more like something we would have to time externally? Apologies if this is something well known in the community; I'm personally new to the whole Grafana/Prometheus/Alloy tech stack.

jkroepke commented 22 hours ago

It's a metric, windows_exporter_collector_duration_seconds, with a label collector=hyperv, which can be queried via Prometheus/Grafana.
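As a rough sketch of how that metric could be queried, the Python below builds a PromQL expression for the collector duration and the matching URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The helper names and the server address `http://localhost:9090` are assumptions for illustration; only the metric name and the `collector` label come from the comment above.

```python
from urllib.parse import urlencode

METRIC = "windows_exporter_collector_duration_seconds"


def duration_query(collector, window="1h", aggregate="avg"):
    """Build a PromQL expression such as avg_over_time(metric{collector="hyperv"}[1h])."""
    if aggregate not in ("avg", "min", "max"):
        raise ValueError("aggregate must be one of: avg, min, max")
    selector = f'{METRIC}{{collector="{collector}"}}'
    return f"{aggregate}_over_time({selector}[{window}])"


def query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


if __name__ == "__main__":
    # Assumption: a Prometheus server reachable at localhost:9090.
    for aggregate in ("avg", "min", "max"):
        promql = duration_query("hyperv", window="1h", aggregate=aggregate)
        print(promql)
        print(query_url("http://localhost:9090", promql))
```

The printed PromQL expressions can also be pasted directly into the Prometheus expression browser or a Grafana panel instead of calling the API.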

cflanderscpc commented 22 hours ago

Thanks!

Scrape times for the hyperv collector are averaging between 6 and 9 seconds. Min time: 4.8 seconds, max time: 9.22 seconds over the last hour.

If I blow that out to 3 hours the result is pretty close to the same (within about 2%).