Hi,
I believe this for loop https://github.com/influxdata/telegraf/blob/master/plugins/inputs/win_perf_counters/win_perf_counters.go#L245 is going empty, not producing any measurements and not logging any "Valid:..."s.
That would explain why there is no error and we simply return so quickly. Looking at what modifies hostCounters: the call to cleanQueries, which will set it to nil. This is called during gather when:
- lastRefreshed is zero
- CountersRefreshInterval is greater than zero (the default is 1 minute)
Could you possibly set CountersRefreshInterval to zero and see if this avoids this refresh and the gaps? Or is that required for the data you are gathering?
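For readers following along, that refresh decision amounts to roughly the following. This is a paraphrased sketch of the logic described above, not the exact Telegraf source; names are simplified.

```go
// needsRefresh paraphrases the refresh check: rebuild the counter list when
// it has never been built, or when CountersRefreshInterval has elapsed since
// the last rebuild. Requires only the standard library "time" package.
func needsRefresh(lastRefreshed time.Time, refreshInterval time.Duration) bool {
	return lastRefreshed.IsZero() ||
		(refreshInterval > 0 && time.Since(lastRefreshed) >= refreshInterval)
}
```

When this evaluates to true, Gather calls cleanQueries (so hostCounters becomes nil) and then rebuilds the list from the config before collecting.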
It's a testconfig, so yes, I could do that. But for the pattern we see here (the gap always having a length of 3), I think it's a given that it is as you state: cleanQueries sets it to nil and the refresh doesn't fill it again... sometimes.
For the affected production systems, I don't think we can do that. It's process monitoring, and if I understand correctly the refresh is necessary to add new processes to the monitoring and remove dead ones.
So I did that and set the CountersRefreshInterval to "0s". There are no gaps anymore.
But now I am getting TONS of "error while getting value for counter "\\myhostname\Process(explorer)\ID Process", instance: explorer, will skip metric: The data is not valid." for different processes and metrics (because it's not refreshed, probably)
So I did that and set the CountersRefreshInterval to "0s". There are no gaps anymore.
Thanks for trying that out! So we have our light switch: the refresh interval is getting hit and clearing out hostCounters.
for different processes and metrics (because it's not refreshed, probably)
Agreed, I propose we look at the refresh logic here. It appears to correctly zero things out, but the subsequent calls seem to indicate that we attempt to reset the host counters, collect data, and sleep. I would want to see what ParseConfig is producing.
Are we getting anything back from that? Or do we have to wait for a second refresh interval to trigger?
How would I check that?
Are you able to build a custom telegraf? Essentially we would need to add some logging, since I don't think debugging this live is a viable option. We would want to focus on the area I linked to above, specifically to see what the state of m.hostCounters is during that process.
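As an illustration, the kind of temporary logging meant here could look like the following. This is a hypothetical sketch, with field names guessed from the debug output quoted further down; it is not the actual diff that gets linked next.

```go
// Hypothetical debug logging dropped in right after the refresh in Gather,
// to show how many counters survived the refresh and what they look like.
for _, hostCounterInfo := range m.hostCounters {
	m.Log.Debugf("REFRESHED = Length of hostCounterSet: %d", len(hostCounterInfo.counters))
	for i, c := range hostCounterInfo.counters {
		m.Log.Debugf("Found counter no %d : %q", i, c.counter)
		m.Log.Debugf(" + counterPath %q", c.counterPath)
		m.Log.Debugf(" + instance %q", c.instance)
		m.Log.Debugf(" + objectName %q", c.objectName)
	}
}
```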
https://github.com/influxdata/telegraf/compare/release-1.27...knollet:telegraf:debug-14097
So, I am running this debug modification now and waiting for gaps.
First, a bunch of "Valid:"s:
2023-10-26T04:20:00Z I! [inputs.win_perf_counters::perf_counters_test] Valid: \\MyHostname\Process(svchost)\IO Write Bytes/sec
2023-10-26T04:20:00Z I! [inputs.win_perf_counters::perf_counters_test] Valid: \\MyHostname\Process(svchost)\IO Data Bytes/sec
2023-10-26T04:20:00Z I! [inputs.win_perf_counters::perf_counters_test] Valid: \\MyHostname\Process(svchost)\IO Other Bytes/sec
2023-10-26T04:20:00Z I! [inputs.win_perf_counters::perf_counters_test] Valid: \\MyHostname\Process(svchost)\Working Set - Private
...LOTS MORE...
Then a bunch of found counters in the refresh:
2023-10-26T04:20:01Z D! [inputs.win_perf_counters::perf_counters_test] REFRESHED = Length of hostCounterSet: 2520
2023-10-26T04:20:01Z D! [inputs.win_perf_counters::perf_counters_test] Found counter no 0 : "Percent_Processor_Time"
2023-10-26T04:20:01Z D! [inputs.win_perf_counters::perf_counters_test] + counterPath "\\\\MyHostname\\Process(Idle)\\% Processor Time"
2023-10-26T04:20:01Z D! [inputs.win_perf_counters::perf_counters_test] + instance "Idle"
2023-10-26T04:20:01Z D! [inputs.win_perf_counters::perf_counters_test] + objectName "Process"
2023-10-26T04:20:01Z D! [inputs.win_perf_counters::perf_counters_test] Found counter no 1 : "Percent_User_Time"
2023-10-26T04:20:01Z D! [inputs.win_perf_counters::perf_counters_test] + counterPath "\\\\MyHostname\\Process(Idle)\\% User Time"
2023-10-26T04:20:01Z D! [inputs.win_perf_counters::perf_counters_test] + instance "Idle"
2023-10-26T04:20:01Z D! [inputs.win_perf_counters::perf_counters_test] + objectName "Process"
...LOTS MORE...
No "Valid:"s at all 'then'
2023-10-26T04:21:31Z D! [inputs.win_perf_counters::perf_counters_test] REFRESHED = Length of hostCounterSet: 0
2023-10-26T04:21:31Z D! [inputs.win_perf_counters::perf_counters_test] Gathering from localhost
2023-10-26T04:21:31Z D! [inputs.win_perf_counters::perf_counters_test] Gathering from localhost finished in 0s
...
Mind that my diff (linked above) is not against master but against the version we were running: tag release-1.27
Thank you for trying this out!
REFRESHED = Length of hostCounterSet: 0
This would seem to indicate that we did not get any counters from Windows? Did you see this same output for 3 collection intervals?
No. I put the "REFRESHED" line into the refresh block, so it's only displayed when a refresh runs.
The empty refresh (Length of hostCounterSet: 0) starts the series of 3 empty gatherings. (After that, the next refresh runs, and it's back to working.)
The usual gather runs (not the refreshes) don't update the counters. Firstly, that would make disabling refreshes useless, but it would also lead to shorter or more irregular gaps.
So it's quite clear, I think. There are two jobs: the refresh, which (re)builds the counter list, and the gather, which reads values from it.
I had some ideas about that:
- A windows bug, where windows doesn't report an error and also doesn't return any counters. (Couldn't find something like this with google, though)
- Somehow the PDH handles are only valid in the context of a specific thread (probably the one that created it) and telegraf starts up goroutines which sometimes run in a different thread and the request fails silently. (Couldn't find a relevant place where telegraf starts up a goroutine between obtaining the handle and requesting the counters, though.)
So I am at a loss...
So it's quite clear, I think. There are two jobs:
I agree in principle with your description of the two phases.
- A windows bug, where windows doesn't report an error and also doesn't return any counters. (Couldn't find something like this with google, though)
This is what I was suggesting in my last comment ;)
During that refresh cycle, the only reason it should still be empty is if we get nothing back from ParseConfig. The call to AddItem specifically is what modifies the hostCounters.
My next step would be to dig into that AddItem and see if AddItem is even called. If it is not called, then I agree with your next thought, that there is some Windows bug or delay. If it is called, then we need to dig into that function in greater detail.
- Somehow the PDH handles are only valid in the context of a specific thread (probably the one that created it) and telegraf starts up goroutines which sometimes run in a different thread and the request fails silently (Couldn't find a relevant place where telegraf starts up a goroutine between obtaining the handle and requesting the counters, though.)
Do you have multiple instances of the win_perf_counters plugin? Inside the plugin, the only goroutine I see kicked off is during the gather, while it loops over the host counters.
So. I modified the debug-prints. Now it looks like this: https://github.com/influxdata/telegraf/compare/release-1.27...knollet:telegraf:debug-14097 (again: mind, this is against release-1.27)
I ran it on a different machine (where I built it), Win11 this time, and I got a gap:
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## AddItem about to be entered
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## AddItem entered
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## PerfomanceQuery open
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## Counter added to Query
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## CounterPath gotten
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## expanded WildCardPath
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## extracted CounterInfoFromCounterPath
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## Added counter to Query: "\\\\MORIARTY\\Prozess(Idle)\\Prozessorzeit (%)"
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] Valid: \\MORIARTY\Prozess(Idle)\Prozessorzeit (%)
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## Added counter to Query: "\\\\MORIARTY\\Prozess(Idle)\\Benutzerzeit (%)"
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] Valid: \\MORIARTY\Prozess(Idle)\Benutzerzeit (%)
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## Added counter to Query: "\\\\MORIARTY\\Prozess(Idle)\\Privilegierte Zeit (%)"
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] Valid: \\MORIARTY\Prozess(Idle)\Privilegierte Zeit (%)
***...LOTS MORE...***
2023-10-26T20:51:30Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## Added counter to Query: "\\\\MORIARTY\\Prozess(_Total)\\E/A-Datenbytes/s"
2023-10-26T20:51:30Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## Added counter to Query: "\\\\MORIARTY\\Prozess(_Total)\\Andere E/A-Bytes/s"
2023-10-26T20:51:30Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## Added counter to Query: "\\\\MORIARTY\\Prozess(_Total)\\Arbeitsseiten - privat"
2023-10-26T20:51:31Z D! [inputs.win_perf_counters::perf_counters_test] REFRESHED = Length of hostCounterSet: 4788
2023-10-26T20:51:31Z D! [inputs.win_perf_counters::perf_counters_test] Gathering from localhost
2023-10-26T20:51:31Z D! [inputs.win_perf_counters::perf_counters_test] Gathering from localhost finished in 21.1186ms
2023-10-26T20:39:30Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## AddItem about to be entered
2023-10-26T20:39:30Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## AddItem entered
2023-10-26T20:39:30Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## PerfomanceQuery open
2023-10-26T20:39:30Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## Counter added to Query
2023-10-26T20:39:30Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## CounterPath gotten
2023-10-26T20:39:31Z D! [inputs.win_perf_counters::perf_counters_test] REFRESHED = Length of hostCounterSet: 0
2023-10-26T20:39:31Z D! [inputs.win_perf_counters::perf_counters_test] Gathering from localhost
2023-10-26T20:39:31Z D! [inputs.win_perf_counters::perf_counters_test] Gathering from localhost finished in 0s
2023-10-26T20:39:31Z D! [outputs.discard] Buffer fullness: 0 / 10000 metrics
2023-10-26T20:39:41Z D! [outputs.discard] Buffer fullness: 0 / 10000 metrics
2023-10-26T20:39:51Z D! [outputs.discard] Buffer fullness: 0 / 10000 metrics
2023-10-26T20:40:00Z D! [inputs.win_perf_counters::perf_counters_test] Gathering from localhost
2023-10-26T20:40:00Z D! [inputs.win_perf_counters::perf_counters_test] Gathering from localhost finished in 0s
2023-10-26T20:40:01Z D! [outputs.discard] Buffer fullness: 0 / 10000 metrics
2023-10-26T20:40:11Z D! [outputs.discard] Buffer fullness: 0 / 10000 metrics
2023-10-26T20:40:21Z D! [outputs.discard] Buffer fullness: 0 / 10000 metrics
2023-10-26T20:40:30Z D! [inputs.win_perf_counters::perf_counters_test] Gathering from localhost
2023-10-26T20:40:30Z D! [inputs.win_perf_counters::perf_counters_test] Gathering from localhost finished in 0s
2023-10-26T20:40:31Z D! [outputs.discard] Buffer fullness: 0 / 10000 metrics
2023-10-26T20:40:41Z D! [outputs.discard] Buffer fullness: 0 / 10000 metrics
2023-10-26T20:40:51Z D! [outputs.discard] Buffer fullness: 0 / 10000 metrics
*** vvv This is already the next refresh vvv ***
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## AddItem about to be entered
2023-10-26T20:41:00Z I! [inputs.win_perf_counters::perf_counters_test] ## DEBUG ## AddItem entered
These two debug prints never show up in the failing refresh (the log above stops after "CounterPath gotten"):
## DEBUG ## expanded WildCardPath
## DEBUG ## extracted CounterInfoFromCounterPath
The relevant piece of my debug build looks like this:
```go
// DEBUG - 14097
if m.PrintValid {
    m.Log.Infof("## DEBUG ## CounterPath gotten") // <--- this is still printed
}
// !DEBUG
counters, err := hostCounter.query.ExpandWildCardPath(counterPath) // <-- wth?!
if err != nil {
    return err // <-- this can only return non-nil, ...
} // which would lead to an error message at the calling site.
// DEBUG - 14097
if m.PrintValid {
    m.Log.Infof("## DEBUG ## expanded WildCardPath") // <--- this is not
}
// !DEBUG
```
Aaannnnd I've got it.
In the linked line the error gets eaten if PerfObject.FailOnMissing is not set.
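The swallowing pattern being described looks roughly like this. It is a paraphrase of the call site in ParseConfig, not the exact Telegraf source; err is the error returned by AddItem for one counterPath.

```go
if err != nil {
	// Only surface the problem if the user opted in; otherwise the error
	// is dropped and the (now empty) counter list is silently kept.
	if PerfObject.FailOnMissing || PerfObject.WarnOnMissing {
		m.Log.Errorf("invalid counterPath %q: %s", counterPath, err.Error())
	}
	if PerfObject.FailOnMissing {
		return err
	}
}
```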
Ok, I set WarnOnMissing for the next run. Let's see what error it gives us. There shouldn't be one; it's not like there are usually >4000 counters and then suddenly none.
WarnOnMissing should maybe be the default. (Feature request ;) )
The documentation states that it only provides warnings when the plugin is loaded, which is incorrect: it also provides information when a refresh job fails (which in our case is every 1m30s).
2023-10-26T21:47:30Z E! [inputs.win_perf_counters::perf_counters_test] Invalid counterPath "\\Process(*)\\*": Der angegebene Puffer reicht für die zurückgegebene Datenmenge nicht aus. Weisen Sie einen größeren Puffer zu, und wiederholen Sie den Vorgang.
Buffer too small. (In English: "The specified buffer is not large enough for the returned data. Allocate a larger buffer and retry the operation.")
This shouldn't be
if ...; ret == PdhMoreData {
but
for...; ret == PdhMoreData {
I would think. Obviously, the required buffer size can change between the first try and the second.
Perhaps with a MaxTries of, idk, 5... and then a warning by default.
If this error occurs, there is a growing number of items (processes, in this case), and telegraf silently reports none of them. I would think that warrants a warning. The documentation recommending not to enable the warnings is odd in this regard.
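A minimal sketch of that if-to-for change with a retry cap, written as a generic helper. Here pdhCall stands in for any of the two-step wrappers in performance_query.go (first call reports the required buffer size via *size and returns PdhMoreData, second call fills the buffer), and the constant/helper names PdhMoreData, ErrorSuccess and NewPdhError are assumed to match the plugin's; treat all of them as illustrative.

```go
const maxTries = 5 // arbitrary cap, as suggested above

// callWithRetries asks PDH for the required buffer size, then retries with a
// freshly sized buffer as long as PDH keeps answering PdhMoreData, because
// the reported size can already be stale by the time the second call is made
// (e.g. new processes appeared in the meantime).
func callWithRetries(pdhCall func(size *uint32, buf *byte) uint32) ([]byte, error) {
	var size uint32
	ret := pdhCall(&size, nil) // first call only reports the required size
	var buf []byte
	for i := 0; ret == PdhMoreData && size > 0 && i < maxTries; i++ {
		buf = make([]byte, size)
		ret = pdhCall(&size, &buf[0])
	}
	if ret != ErrorSuccess {
		return nil, NewPdhError(ret)
	}
	return buf, nil
}
```

Which of the two-call sites actually need the loop comes up again further down.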
https://github.com/knollet/telegraf/blob/dc0835ad80493f5ca20fa72d475e4accf1f4c6d2/plugins/inputs/win_perf_counters/performance_query.go#L130 It seems you're doing that in more than one place.
@knollet thank you for tracking this down and digging in! If you want to put up a PR I'd be happy to review. Otherwise, I will take a deeper look tomorrow as well.
Thanks again!
I won't be able to address this for the next week, I think.
The first relevant thing is, I would say, to set either
The second thing would be to fix this race condition.
I have run this now on my machine for some hours, and got 15 gaps.
I then changed the one relevant if (the one which I highlighted) to a for. (No MaxTries or anything, I am not a Go programmer ;) )
I will let this run for some more hours and report back if this helped.
Then I will be gone for about a week.
Then I will be gone for about a week.
Thanks for the heads up and trying the fix. If you get back and it all looks good, let's make that change.
WarnOnMissing to true by default or
If this issue is resolved, would we still want to make this change? For users using wildcards would they start getting extra warnings if we were to switch the default?
Thanks again for your help on this issue!
So. @powersj, could you have a quick look at the commit above? It compiles and seems to fix the bug (no gaps anymore). But: performance_query.go has multiple functions with this structure: ask for the buffer size, allocate the buffer, make a second call to get the buffer filled...
I think that not all of them need a loop, because I don't think every one of them is susceptible to a race condition. On the other hand, a loop probably wouldn't hurt, so why not put a loop with some limited retries (I now set it to 5) in all of them?
@knollet
Sven and I chatted about this today. He was looking at the Windows docs, which mention not trusting the reported buffer size. It sounds like your retries, while they might be working, are not the ideal fix. Instead, we will look at increasing the buffer each time (i.e. doubling the buffer size) until the value is completely read.
Thank you again for reproducing this and digging into the root cause. It is extremely helpful with these Windows issues :)
Sven will look into a PR soon.
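For comparison, a minimal sketch of that buffer-doubling idea, under the same assumptions as the retry sketch above: pdhCall and the constant names are stand-ins, and the starting size and 64 MiB cap are arbitrary illustrations, not the values chosen in the actual PR.

```go
// callWithDoubling ignores the size reported by PDH (per the Windows docs
// caveat mentioned above) and simply doubles the buffer until the data fits
// or a sanity limit is hit.
func callWithDoubling(pdhCall func(size *uint32, buf *byte) uint32) ([]byte, error) {
	size := uint32(1 << 12) // 4 KiB starting guess
	for {
		buf := make([]byte, size)
		used := size
		ret := pdhCall(&used, &buf[0])
		switch {
		case ret == ErrorSuccess:
			return buf[:used], nil
		case ret == PdhMoreData && size < 1<<26: // stop doubling at 64 MiB
			size *= 2
		default:
			return nil, NewPdhError(ret)
		}
	}
}
```

The compile error quoted a few comments down (math.MaxUint32 overflowing an int on 32-bit builds) comes from this kind of upper-bound check.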
@knollet please test the binary in PR #14241, available as soon as CI has finished the tests. Let me know if this fixes the issue for you!
I'd like to, but it seems you didn't test compile your changes before committing?
# github.com/influxdata/telegraf/plugins/inputs/win_perf_counters
plugins/inputs/win_perf_counters/win_perf_counters.go:591:70: cannot use math.MaxUint32 (untyped int constant 4294967295) as int value in argument to fmt.Errorf (overflows)
make[1]: *** [Makefile:277: build/windows-386/telegraf.exe] Error 1
make: *** [Makefile:343: windows_i386.zip] Error 2
I did, but not on a 32-bit machine... ;-)
well, ok. that makes sense. I wouldn't've either.
Give me some seconds to fix that...
@knollet you should have a binary now...
... testing ...
seems to work. Didn't get any new gaps.
@knollet thank you for your work on narrowing down the issue and testing the fix!
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.27.3 (git: HEAD@afcf0133), Windows Server 2022
Docker
No response
Steps to reproduce
Expected behavior
No gaps
Actual behavior
Gaps
Additional info
It always seems to fail thrice in a row; I suspect it's because of how the "last time refreshed" timestamp is recorded.
6:49:30 - plugin starts, has to refresh
6:49:31 - refreshed and recorded "refreshed - 6:49:31", then doing a gather.
somehow there are maybe no counters? I believe this for loop https://github.com/influxdata/telegraf/blob/master/plugins/inputs/win_perf_counters/win_perf_counters.go#L245 is going empty, not producing any measurements and not logging any "Valid:..."s. Gathering nothing and not looping at all takes time.Duration=000
6:50:00 - No new refresh, and as the refresh at 6:49:30 didn't produce any counters, again no measurements
6:50:30 - The last refresh is not 60s ago, yet, as it is 6:50:30 and last refresh was recorded at 6:49:31. -> still no measurements.
So: it seems apparent why there's always a gap of three and roughly where to look for the error, but I can't find it, unless it is windows itself returning an empty process list and no error (not even PDH_NO_DATA).
We have had this problem for months and many telegraf versions now. It's a "sometimes" problem (which is always bad, it's a bad look for reliability), so we tried gathering information and solving it on our own first.