tyoma / micro-profiler

Cross-platform low-footprint realtime C/C++ Profiler
https://visualstudiogallery.msdn.microsoft.com/133d5764-b32b-4ec9-8ee8-5546eca64584
MIT License

Negative execution times #57

Open pk1954 opened 4 years ago

pk1954 commented 4 years ago

I just tortured my application with a long stress test, running unattended for about half an hour. When I came back, I saw, to my astonishment, this: (screenshot: Anmerkung 2020-02-28 012510) MicroProfiler has helped me improve the performance of my app a lot, but negative execution times - wow! Did you implement some kind of time machine? But seriously, there seems to be some overflow condition. I don't think it is necessary to put a lot of effort into handling such extreme situations. A simple marker to indicate overflow would be sufficient, imho.

tyoma commented 4 years ago

Hi Peter!

I wish it was a time machine... Unfortunately, it's not )

The thing is that the profiler's impact, although very small, is still measurable. The whole enter/exit hook adds about 120 CPU cycles (40 ns) on my laptop. Now consider the functions you mentioned: their average call times are about 1 ns or even zero.

How can we accurately measure the timing of such small functions? The right answer is: we cannot. We can only estimate it, by subtracting the profiler's latency (that enter/exit hook time) from the observed time of the function.

Now, how do you get the latency value? That's the tricky part, mainly because this value tends to drift depending on many factors, the most important of which are the cache, memory access shared by different threads/cores, etc. I used to calibrate this value by taking the smallest latency observed during the calibration step (when micro-profiler_*.dll gets loaded). But this led to error accumulation, since the average latency was ~5-15 cycles longer than that. With this change, the average measured value is used.

But again, as I said, this 'average' value is taken during the DLL load, which means the CPU is busy and the average value is slightly greater than the one expected in a 'free' run. So, this becomes visible with tiny functions.

Take SlowMotionDisplay::Notify, for instance - its printed average exclusive/inclusive time equals -3.95 ns. We can easily get such a value if the real time of this function is extremely short (say 0.33 ns - one cycle on a 3 GHz machine) and we overcompensated by 4.2 ns.

So, all in all, there is no overflow here - all the arithmetic is signed. But the outcome is confusing. I knew you'd spot it; I just don't know how to display this without lying to the user.

What I'm thinking about is periodic latency reevaluation - with a kind of negative feedback - for instance, detecting the most negative average value observed and decreasing the calibrated latency by that amount.

pk1954 commented 4 years ago

Hi tyoma!

Thank you for this detailed and open explanation.

I understand the problem with measuring execution times in the nanosecond range. The best method I know of is QueryPerformanceCounter; I found it in the micro-profiler sources as well. As far as I can see, it has an accuracy of about 100 ns. So you have a really hard job measuring the runtime of very small functions, and processor caches make the problem much worse. I once heard - I think it was in a talk by Scott Meyers - that good or bad usage of CPU caches can make a factor of 20 difference in runtime!

I still remember the times when it was possible to determine the runtime of code exactly by adding up the cycle times of the machine instructions. I used to look them up in a light blue book called "The Intel 8086 Instruction Set" or something like that. My boss always told me not to waste time on such useless stuff. Later, when I was a manager myself, I said similar things to my developers. But that's another story.

I'm not sure if it makes sense to mess around with latency reevaluation or things like that. At the end of the day it would only be a more sophisticated way of lying to the user. For me, the benefit of micro-profiler is not diminished now that I better understand the limits of its accuracy. To stick to the example above: the interesting thing is that SlowMotionDisplay::Notify is called more than 150 million times. That is suspicious, no matter how fast it is.

You wrote

… just don't know how to display this without lying to the user

What about asking your marketing manager? The answer will be something like: "Are you scrupulous about lying to your customers? How cute. Forget it! Customers want to be lied to. Print out a small (positive!) random number with 10 digits of accuracy. They will love it." What? You don't have a marketing manager? I almost thought so. Can I apply for the job? No, not really. I'm glad I'm out of business. I just do a little bit of programming for my own pleasure. And I enjoy being in contact with competent, honest people like you.

Best regards Peter

pk1954 commented 4 years ago

Sorry for my long-winded comment. In short: I understand your explanation and will close the issue now.

tyoma commented 4 years ago

Hi Peter! I'm reopening this issue, as I've reimplemented the collection buffer with almost no bus locking (atomic operations). I also improved the calibration logic, so the numbers acquired during calibration stay stable. The new collection enables almost 2x faster operation, so your application may (at best) run 2 times faster when profiled. In reality the improvement is lower, as the analyzer thread may not catch up to that speed :) - I'll be improving that part later. So, if you want to try this version, please get it here: https://github.com/tyoma/micro-profiler/releases/download/v1.7.621.0/micro-profiler.v1.7.621.0.vsix

pk1954 commented 4 years ago

Hi Artem! Some short feedback on v1.7.621.0. It works without problems. I haven't noticed any negative execution times so far, but I have not done exhaustive tests. I cannot verify that the instrumented version runs faster than before. To be honest, I was a bit shocked at how big the difference is between a version compiled with profiling and a version without. I measure the duty cycle of a worker thread using the QueryPerformanceCounter API. Under identical conditions the profiled version had a workload of 30%, the normal version only 10%. I mistrusted my evaluation logic and double-checked with the Windows Resource Monitor - same result. Normally the Resource Monitor display doesn't help, because the scheduler distributes the workload evenly across all 12 cores of my CPU, but for this test I pinned the thread to one core with SetThreadAffinityMask. So I'm pretty sure that MicroProfiler slows down at least this thread by a factor of three. That's a lot. I also tested with v1.6.622.0 and had more or less the same result.

By the way, during the reinstallation of both versions (v1.6.622.0 and v1.7.621.0) I experienced the old problem from issue #51 again. In the meantime I have a certain routine to fix it: copy the path name from the linker error message and rename the MicroProfiler installation directory. It works, but I suppose it's not meant to be that way.

Regards Peter

tyoma commented 4 years ago

Hi Peter! Quick reply: this version should be better.

Long reply: the version you tried (1.7.621) has 2x faster collector code, but its analyzer code was slow, so the collector had to wait until the analyzer crunched everything collected before the data was submitted to the frontend. v1.7.624 has a faster analyzer (~40% better in x64 code), which makes things approximately twice as fast compared to v1.6.

These numbers are all about high load - when the profiled code has lots of functions measured at tens of nanoseconds or less. As a reference, on my machine (i5 3570, 3.4 GHz, 4 cores, which is ancient compared to yours), profiler v1.7 adds approximately 16 ns of overhead. If the profiling load is lower (i.e. tiny functions get inlined), the gain is higher.

Anyway, the profiler shows the lengths of the calls as if no profiler were attached - so if you measure a profiled function making 10^7 calls to spend 1 second (using QueryPerformanceCounter or thread CPU time), the profiler (given the effective 16 ns overhead I mentioned) is going to show you only 0.84 s of inclusive time, because that is how much the function actually weighs without the overhead incurred.

If you're interested, I can advise on how to get more functions inlined - in the end, this is a major aspect of performance improvement.

And on #51 - next time, please try restarting Visual Studio after the first run with a newly installed profiler vsix. The first run after the installation sets up environment variables, but those may not be active yet for the running session.

pk1954 commented 4 years ago

Hi Artem!

v1.7.624 seems to be a bit faster, but not that much in my perception. But it doesn't matter. What I've written about the overhead caused by MicroProfiler, true or not, in no way diminishes the practical value of the tool. It helps me identify the hotspots of my code, and that's what it's for.

The functions I am currently investigating are very small, about 10-20 ns. Considering that MicroProfiler causes about 16 ns of overhead, it is plausible that the overall performance is affected by the order of magnitude I observe. My new CPU (AMD Ryzen 3600, 4.2 GHz, 6(12) cores) is indeed faster than your reference, but the single-thread performance is not that much different, maybe 50% faster. I have tons of cores, but I still can't use them effectively. Still struggling with parallelizing the computation-intensive code.

Yes, I'm interested in learning how to get more functions inlined. Do you need my source code, or is it more general advice?

Regarding #51: You're right. Restart VS and it's OK. I was too impatient :-)

tyoma commented 4 years ago

"Yes, I'm interested to learn on how to get more functions inlined. Do you need my source code or is it more general advice?" totally - I can take a look if you point out the hotspots you currently have. Are you talking about your project here on github? I had and still do some interest in evolutionary models (you can see, I have polyworld forked in my profile - it shattered me when I first watched it's presentation), so it'll be a fun ride.

"Considering that microprofiler causes about 16 ns overhead" "My new CPU (AMD Ryzen 3600 4,2 Ghz, 6(12) cores)" - in seem to be 10ns or less, bc Ryzen has more sophisticated pipeline and bigger cache. The version you have outputs the total per-function overhead to the console of the profiled app. If you don't have a console in your app, you can write a simple one and compile it with a profiler enabled to see how high is it. On my machine it's 58 for x86 and 65 for x64 (although the assembly is pretty much the same, it has to write more data, as pointers are 8-byte long).

pk1954 commented 4 years ago

Hi Artem! The "Evolution" project you found on my GitHub account is on hold. I currently work on another project "NNet" which is inspired by a friend researching neural networks. It is a graphical editor for the interconnection of neurons and dendrites and a a simulation of nerve impulses running through that network. I would be very happy about any advice that will help me make the program faster and more stable. So far I haven't posted it on GitHub, but I'll do it now. On one point I am not quite sure what is the best way. When I started with NNet I wanted to reuse some elements of "Evolution" and former projects and separated them into a common toolbox library. Should this "Toolbox" be a separate GitHub project? From the architectural point of view, it would be logical, but it requires discipline to keep the projects in sync. What is your advice?

The version you have outputs the total per-function overhead to the console of the profiled app.

I have a console, but I don't see any output from MicroProfiler. Maybe the way I start the console conflicts with MicroProfiler?

```cpp
void Util::StdOutConsole( )
{
    FILE * fp;
    BOOL bRes = AllocConsole( );
    errno_t res = _wfreopen_s( &fp, L"CONOUT$", L"w", stdout );
    wcout << L"Console started" << endl;
    HWND hwnd = ::GetConsoleWindow();
    if (hwnd != NULL)
    {
        LONG style = GetWindowLong( hwnd, GWL_STYLE );
        style = style & ~(WS_SYSMENU);
        SetWindowLong( hwnd, GWL_STYLE, style );
    }
}
```

(I turn off the system menu because my "customer" likes to close the console window and then complains that the application exits unexpectedly.)

Best regards Peter

tyoma commented 4 years ago

Hi Peter! The console initialization you have probably takes place after the profiler gets initialized, and the output I mentioned happens at init - that's why you may not see it. Please hit me up on LinkedIn - you can find it in my profile. Thanks!