prometheus-community / windows_exporter

Prometheus exporter for Windows machines
MIT License

Enabling the [process] collector can make other metrics flap on version v0.29 #1641

Closed. JDA88 closed this issue 1 week ago

JDA88 commented 1 week ago

Current Behavior

Metrics are not stable with v0.29: if I refresh the /metrics page regularly, some metrics are missing. Same behaviour when Prometheus scrapes it.

Sometimes a metric completely disappears, and sometimes the metric is there but one "instance" is missing. The funny thing is that it is ALWAYS the same member that goes missing.

After some more testing, it looks like everything is stable if I remove the process collector. I'm not sure whether it's the root cause or just a coincidence, and I have no idea why it would cause metrics outside its scope to flap.

These are all the metrics where I have observed the issue:

windows_cpu_logical_processor{}
windows_cpu_processor_.*{}
windows_logical_disk_.*{}
windows_memory_physical_.*{}
windows_net_current_bandwidth_bytes{}
windows_net_packets_received_discarded_total{}
windows_net_packets_received_errors_total{}
windows_os_paging_free_bytes{}
windows_os_paging_limit_bytes{}
windows_os_virtual_memory_free_bytes{}
windows_process_cpu_time_total{}
windows_service_state{}
windows_system_processor_queue_length{}
windows_system_threads{}
windows_system_uptime_seconds{}
+metrics from textfile

When this happens, all windows_exporter_collector_success values are 1 and all windows_exporter_collector_timeout values are 0. There is no message in the logs, it is 100% reproducible on multiple servers, and nothing changes after a service or computer restart.

Expected Behavior

As in the previous version we were using (v0.25.0): stable metrics, visible at every scrape.

Steps To Reproduce

Command line used:
"windows_exporter.exe" --collectors.enabled os,system,cpu,logical_disk,memory,net,service,process,textfile --process.priority abovenormal --collector.logical_disk.volume-exclude HarddiskVolume.+ --collector.service.include="LanmanServer|MSiSCSI|W32Time" --collector.process.include="(?i)Ms(MpEng|Sense)|TiWorker|TrustedInstaller" --collector.textfile.directories="C:\Program Files\Custom_Metrics" --log.file eventlog --web.listen-address 10.0.0.1:1111

Environment

windows_exporter logs

Except for the occasional log below, nothing of interest:

Cannot create another system semaphore. 

Even when there are missing metrics, all the collectors report success:

ts=2024-09-25T17:01:58.101+02:00 level=debug caller=textfile.go(textfile.(*Collector).Collect.func1):335 msg="Processing file: C:\\Program Files\\Custom_Metrics\\Metrics.prom" collector=textfile
ts=2024-09-25T17:01:58.102+02:00 level=info caller=prometheus.go(collector.(*Prometheus).execute):238 msg="collector textfile succeeded after 853.5µs"
ts=2024-09-25T17:01:58.104+02:00 level=info caller=prometheus.go(collector.(*Prometheus).execute):238 msg="collector memory succeeded after 3.1741ms"
ts=2024-09-25T17:01:58.104+02:00 level=info caller=prometheus.go(collector.(*Prometheus).execute):238 msg="collector system succeeded after 3.1741ms"
ts=2024-09-25T17:01:58.105+02:00 level=info caller=prometheus.go(collector.(*Prometheus).execute):238 msg="collector logical_disk succeeded after 3.6815ms"
ts=2024-09-25T17:01:58.106+02:00 level=info caller=prometheus.go(collector.(*Prometheus).execute):238 msg="collector process succeeded after 4.6146ms"
ts=2024-09-25T17:01:58.106+02:00 level=info caller=prometheus.go(collector.(*Prometheus).execute):238 msg="collector net succeeded after 4.6146ms"
ts=2024-09-25T17:01:58.106+02:00 level=info caller=prometheus.go(collector.(*Prometheus).execute):238 msg="collector service succeeded after 4.6146ms"
ts=2024-09-25T17:01:58.106+02:00 level=info caller=prometheus.go(collector.(*Prometheus).execute):238 msg="collector cpu succeeded after 4.6146ms"
ts=2024-09-25T17:01:58.108+02:00 level=info caller=prometheus.go(collector.(*Prometheus).execute):238 msg="collector os succeeded after 7.1909ms"

Anything else?

Maybe change the log from msg="collector os succeeded after 1.234ms" to msg="collector os succeeded after 1.234ms, resulting in xx different metrics with a total of yy lines".
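
To illustrate the idea, here is a rough Go sketch (my own mock-up, not the exporter's actual code; countingCollect and the demo collector are made-up names) of counting the series a collector emits before logging the success message:

```go
package main

import (
	"log/slog"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// countingCollect runs collect, counts the emitted series, and logs a success
// message that includes the count, roughly as suggested above.
func countingCollect(name string, out chan<- prometheus.Metric, collect func(chan<- prometheus.Metric) error) {
	counted := make(chan prometheus.Metric)
	done := make(chan int)

	// Forward metrics to the real channel while counting them.
	go func() {
		n := 0
		for m := range counted {
			out <- m
			n++
		}
		done <- n
	}()

	start := time.Now()
	err := collect(counted)
	close(counted)
	n := <-done

	if err != nil {
		slog.Error("collector failed", "collector", name, "err", err)
		return
	}
	slog.Info("collector succeeded",
		"collector", name,
		"duration", time.Since(start).String(),
		"series", n) // the extra information suggested above
}

func main() {
	out := make(chan prometheus.Metric, 16)
	countingCollect("demo", out, func(ch chan<- prometheus.Metric) error {
		ch <- prometheus.MustNewConstMetric(
			prometheus.NewDesc("demo_metric", "example metric for the sketch", nil, nil),
			prometheus.GaugeValue, 1)
		return nil
	})
	close(out)
}
```

Something like this would make it obvious when a collector suddenly returns fewer series than usual, even though collector_success is still 1.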

We had to stop the deployment of 0.29.0 early until we find a solution. It is hard for me to test other versions between 0.25 and 0.29, but tell me if that would help.

jkroepke commented 1 week ago

Hi @JDA88, thanks for reporting this! I have removed 0.29 as the latest release for now.

Could you please check if #1643 solves the issue?

Snapshot builds: https://github.com/prometheus-community/windows_exporter/actions/runs/11059551224/artifacts/1984426396

JDA88 commented 1 week ago

Thx for the fix, testing it right now!

While I'm at it, don't you think the "collector textfile succeeded after" logs should be debug rather than info, considering how verbose they are? We have always had the level set to info and logged to eventlog, but those messages flooded it, so we turned it down to warn.

It's more a reflection than a request; we are fine with the warn level.

jkroepke commented 1 week ago

https://github.com/prometheus/node_exporter/blob/71d9b6c06103a440a6590135467bc4c96174c9a1/collector/collector.go#L172

On node_exporter, it is debug too. I will adjust the level.

JDA88 commented 1 week ago

A little early to be 100% sure, but the metrics look stable now. Thanks for the quick fix.