lxn / win

A Windows API wrapper package for the Go Programming Language

PdhValidatePath unexpected errors #49

Open jwenz723 opened 6 years ago

jwenz723 commented 6 years ago

I am working on building a pdh exporter for prometheus. Currently, when my application starts, it calls the win.PdhValidatePath() function approximately 3000 times. Most of the calls succeed; however, some fail inconsistently.

I've discovered that the errors that are occurring are the following:

For all pdh counters that result in the errors above, if I try to collect them individually then no errors occur. The errors only occur when a large number of pdh counters are passed through the PdhValidatePath() function. I am not calling PdhValidatePath() from inside a goroutine, so concurrency should not be the problem here.

Any ideas on how to fix this? One solution I found online that wasn't officially answered suggested using the PdhValidatePathEx function rather than PdhValidatePath.

Thought I'd ask here before going down that path in case I am doing something wrong. If you want to see my code, it can be seen here.

krpors commented 6 years ago

I'm the original committer of pdh.go, but it has been quite some time since I last used it myself... Anyway, one thing I can say is that everything is obviously just a wrapper around the DLL. Pretty good chance that it's just how the whole thing 'works' :)

Are you able to write up the smallest test case that reproduces it? I could check it tonight (when there's no corporate proxy interfering).

jwenz723 commented 6 years ago

Well I wrote up a small test case and put it in this repo.

What I am finding now is that the PdhCollectQueryData function is actually the problem rather than PdhValidatePath. Like I said before, the issue is very inconsistent.

To run the test program, fill the counters.txt file with a large amount of pdh counter paths (1 per line). I just ran in a command prompt: typeperf -q then copied the output into counters.txt.
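For anyone reproducing this, here is a minimal sketch of how the contents of counters.txt could be turned into a list of counter paths. parseCounterPaths is a hypothetical helper written for illustration, not code from the test repo, and the inline sample stands in for the real file contents (which would normally be read from disk first).

```go
package main

import (
	"fmt"
	"strings"
)

// parseCounterPaths is a hypothetical helper (not taken from the test repo).
// It splits text such as the saved output of `typeperf -q` into one counter
// path per line, dropping blank lines and surrounding whitespace, including
// the "\r" left over from Windows line endings.
func parseCounterPaths(text string) []string {
	var paths []string
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		paths = append(paths, line)
	}
	return paths
}

func main() {
	// Inline sample standing in for the contents of counters.txt.
	sample := "\\Processor(*)\\% Processor Time\r\n\r\n\\Memory\\Available Bytes\r\n"
	for i, p := range parseCounterPaths(sample) {
		fmt.Printf("%d: %s\n", i+1, p)
	}
}
```

Trimming each line matters here because a stray "\r" at the end of a path would make an otherwise valid counter path fail.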

It seems to me that the problem might start occurring when too many PDH queries are open at the same time, but I'm not sure what 'too many' is, or how to catch that. Any insight would be very helpful.

krpors commented 6 years ago

OK I tried something. I used typeperf -q > counters.txt and ran your test program and observed that the PdhCollectQueryData indeed looks to be the culprit here. I got a total of 1877 counters. 456 went into error due to the PdhCollectQueryData and 13 due to PdhAddEnglishCounter.

I took one of the groups which failed (\Distributed Routing Table(*)) and tried adding them all using the Windows Performance Monitor instead (perfmon.exe). They did not show up in the graph, and no error was displayed. I couldn't see anything in the event viewer either. I'm unsure what is going on here.

I found this link but that did not seem to work on my machine (Windows 7 Enterprise).

Edit: I also tried putting a sleep of 10, 50 and 100 ms in the loop, but that did not have any effect. I got deterministic output, so it seems.

jwenz723 commented 6 years ago

That article you found is interesting. I tried running the lodctr /r command, but it also seemed to have no effect. It doesn't seem to me like the errors are occurring because of corruption. Rather it seems like the number of counters being gathered simultaneously has exceeded what pdh is capable of handling.

Wish there was more documentation on the win pdh stuff.

jwenz723 commented 6 years ago

I think I may have found a solution to this problem. My previous method was using 1 PDH Query Handle per PDH Counter. I changed my code so that now only 1 PDH Query Handle is used to contain ALL PDH Counters. This seems to cause all of the errors that were occurring to stop. It appears that PDH has an undocumented limit on how many PDH Queries can be opened simultaneously.
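To make the difference concrete, here is a rough sketch of the two approaches. It deliberately does not call lxn/win: pdhBackend and fakePDH are hypothetical stand-ins (with plain int handles) for PdhOpenQuery, PdhAddEnglishCounter, PdhCollectQueryData and PdhCloseQuery, so the sketch runs anywhere and only shows how many queries each approach keeps open. The counter paths are just examples.

```go
package main

import "fmt"

// pdhBackend is a hypothetical stand-in for the PDH calls named in this
// thread. It is NOT the lxn/win API; handles are plain ints here.
type pdhBackend interface {
	OpenQuery() (int, error)
	AddCounter(query int, path string) (int, error)
	CollectQueryData(query int) error
	CloseQuery(query int) error
}

// fakePDH tracks how many queries are open at once so the two strategies
// can be compared without touching Windows.
type fakePDH struct {
	nextHandle int
	open       map[int]bool
	peakOpen   int
}

func newFakePDH() *fakePDH { return &fakePDH{open: map[int]bool{}} }

func (f *fakePDH) OpenQuery() (int, error) {
	f.nextHandle++
	f.open[f.nextHandle] = true
	if len(f.open) > f.peakOpen {
		f.peakOpen = len(f.open)
	}
	return f.nextHandle, nil
}

func (f *fakePDH) AddCounter(query int, path string) (int, error) {
	if !f.open[query] {
		return 0, fmt.Errorf("add %q: query %d is not open", path, query)
	}
	f.nextHandle++
	return f.nextHandle, nil
}

func (f *fakePDH) CollectQueryData(query int) error {
	if !f.open[query] {
		return fmt.Errorf("collect: query %d is not open", query)
	}
	return nil
}

func (f *fakePDH) CloseQuery(query int) error {
	delete(f.open, query)
	return nil
}

// onePerCounter mirrors the original approach: one query handle per
// counter, all kept open at the same time.
func onePerCounter(b pdhBackend, paths []string) ([]int, error) {
	var queries []int
	for _, p := range paths {
		q, err := b.OpenQuery()
		if err != nil {
			return queries, err
		}
		queries = append(queries, q)
		if _, err := b.AddCounter(q, p); err != nil {
			return queries, err
		}
	}
	return queries, nil
}

// sharedQuery mirrors the fix: a single query handle holds every counter,
// and one CollectQueryData call covers all of them.
func sharedQuery(b pdhBackend, paths []string) (int, error) {
	q, err := b.OpenQuery()
	if err != nil {
		return 0, err
	}
	for _, p := range paths {
		if _, err := b.AddCounter(q, p); err != nil {
			return q, err
		}
	}
	return q, b.CollectQueryData(q)
}

func main() {
	paths := []string{`\Processor(_Total)\% Processor Time`, `\Memory\Available Bytes`, `\System\Processes`}

	before := newFakePDH()
	if _, err := onePerCounter(before, paths); err != nil {
		fmt.Println("error:", err)
	}
	after := newFakePDH()
	if _, err := sharedQuery(after, paths); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println("queries open, one per counter:", before.peakOpen)
	fmt.Println("queries open, shared query:", after.peakOpen)
}
```

With the per-counter approach the number of open queries grows with the number of counters, while the shared approach always holds exactly one, which is why it stays clear of whatever limit PDH seems to have.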

I updated my small test project here to reflect these changes.

So I believe that this repository can remain unchanged and that the errors I was experiencing were just due to my limited understanding of how to properly use PDH.

krpors commented 6 years ago

Awesome!

Perhaps it's a good idea to add this behavior to the documentation of pdh.go. I agree that PDH is rather cryptic, which is why I tried to document its usage as thoroughly as possible back in the day. I suppose your finding is a rather useful addition.